CN116993762B - Image segmentation method, device, electronic equipment and storage medium - Google Patents

Image segmentation method, device, electronic equipment and storage medium

Info

Publication number
CN116993762B
CN116993762B (application CN202311249539.6A)
Authority
CN
China
Prior art keywords
feature map
image
target object
information
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311249539.6A
Other languages
Chinese (zh)
Other versions
CN116993762A (en)
Inventor
郑昊 (Zheng Hao)
魏东 (Wei Dong)
郑冶枫 (Zheng Yefeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202311249539.6A
Publication of CN116993762A
Application granted
Publication of CN116993762B
Legal status: Active

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                                • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
                            • G06N 3/0464 Convolutional networks [CNN, ConvNet]
                            • G06N 3/048 Activation functions
                        • G06N 3/08 Learning methods
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 7/00 Image analysis
                    • G06T 7/10 Segmentation; Edge detection
                        • G06T 7/11 Region-based segmentation
                        • G06T 7/136 Segmentation; Edge detection involving thresholding
                • G06T 2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T 2207/20 Special algorithmic details
                        • G06T 2207/20081 Training; Learning
                        • G06T 2207/20084 Artificial neural networks [ANN]
                    • G06T 2207/30 Subject of image; Context of image processing
                        • G06T 2207/30004 Biomedical image processing
                            • G06T 2207/30101 Blood vessel; Artery; Vein; Vascular
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/20 Image preprocessing
                        • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
                    • G06V 10/40 Extraction of image or video features
                        • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
                    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                            • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                                • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
                        • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide an image segmentation method, an image segmentation apparatus, an electronic device and a storage medium, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, driver assistance and medical image analysis. The method comprises the following steps: for a first image and a second image obtained by downsampling the first image, performing segmentation of a target object on each image through a first module in a pre-trained image segmentation model to obtain a corresponding first feature map and a corresponding second feature map, the target object being a tubular structure object included in the first image; fusing the first feature map and the second feature map to obtain a fused feature map, the fused feature map characterizing the centerline of the tubular structure of the target object; and, for the first image and the fused feature map, performing segmentation of the target object through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object. The present application can improve the accuracy of tubular structure segmentation.

Description

Image segmentation method, device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an image segmentation method, an image segmentation device, electronic equipment and a storage medium.
Background
Image segmentation refers to the process of dividing an image into regions of similar properties. For the task of segmenting objects with tubular structures, the prior art improves a model's ability to recognize edges by directly supervising the edge portion of the tube. However, this approach struggles to balance the interior and edge classes for objects with small tube diameters, so its segmentation accuracy on such objects is very low.
Disclosure of Invention
The embodiments of the present application provide an image segmentation method, an image segmentation apparatus, an electronic device and a storage medium to solve at least one of the above technical problems. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides an image segmentation method, including:
for a first image and a second image obtained by downsampling the first image, performing, through a first module in a pre-trained image segmentation model, segmentation processing of a target object on each image to obtain a corresponding first feature map and a corresponding second feature map, the target object being a tubular structure object included in the first image;
fusing the first feature map and the second feature map to obtain a fused feature map, the fused feature map being used to characterize the centerline of the tubular structure of the target object; and
for the first image and the fused feature map, performing segmentation processing of the target object through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object.
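For illustration only, the three steps above may be outlined in code as follows. This is a PyTorch-style sketch rather than part of the claimed method; the function names, the 2x downsampling factor and the module interfaces are assumptions of this description.

```python
import torch
import torch.nn.functional as F

def segment_tubular(first_image: torch.Tensor,
                    first_module: torch.nn.Module,
                    second_module: torch.nn.Module,
                    fuse) -> torch.Tensor:
    """Dual-scale first-stage segmentation, centerline-oriented fusion,
    and attention-based second-stage segmentation (illustrative only)."""
    # Second image: the first image downsampled (a factor of 2 is assumed here).
    second_image = F.interpolate(first_image, scale_factor=0.5,
                                 mode="bilinear", align_corners=False)
    # Step 1: the first module segments the target object at both scales.
    first_feat = first_module(first_image)
    second_feat = first_module(second_image)
    # Step 2: fuse the two feature maps into a centerline-aware fused feature map.
    fused_feat = fuse(first_feat, second_feat)
    # Step 3: the attention-based second module refines the segmentation,
    # conditioned on the original image and the fused feature map.
    return second_module(first_image, fused_feat)
```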
In a possible embodiment, the first module includes a first convolution unit and a second convolution unit; the first convolution unit includes a first encoder and a first decoder arranged in a U-shaped network structure, and the output of each convolution layer in the first decoder is connected to the input of the second convolution unit.
Performing, for the first image and the second image obtained by downsampling the first image, segmentation processing of the target object through the first module in the pre-trained image segmentation model to obtain the corresponding first feature map and second feature map includes performing the following operations on the first image and the second image respectively:
performing, by the first encoder, a feature extraction operation with respect to the target object on the input image;
performing, by the first decoder, a feature fusion operation with respect to the target object on the output of the first encoder; and
performing, by the second convolution unit, a convolution operation with respect to the target object on the outputs of the convolution layers in the first decoder to obtain the feature map output by the first module.
In a possible embodiment, fusing the first feature map and the second feature map to obtain the fused feature map includes:
upsampling the second feature map so that the size of the upsampled second feature map is consistent with that of the first feature map;
adding the first feature map and the upsampled second feature map to obtain an overall feature map;
subtracting the upsampled second feature map from the first feature map to obtain a detail feature map; and
performing, based on the overall feature map and the detail feature map, an extraction operation for the centerline of the target object to obtain a fused feature map in mask form.
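As an illustrative, non-limiting sketch of the fusion step (the function names and the centerline-extraction callable are assumptions of this description):

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(first_feat: torch.Tensor,
                      second_feat: torch.Tensor,
                      extract_centerline_mask) -> torch.Tensor:
    """Upsample, add and subtract the two predictions, then build the
    mask-form fused feature map (illustrative only)."""
    # Restore the downsampled prediction to the size of the first feature map.
    second_up = F.interpolate(second_feat, size=first_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
    whole = first_feat + second_up    # overall feature map
    detail = first_feat - second_up   # detail feature map (small-diameter objects)
    return extract_centerline_mask(whole, detail)  # fused feature map in mask form
```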
In a possible embodiment, adding the first feature map and the upsampled second feature map to obtain the overall feature map includes:
performing connected-component analysis on the upsampled second feature map to remove false-positive regions and obtain an analyzed second feature map;
extracting the centerline of the target object in the analyzed second feature map, and performing a dilation operation on the centerline to obtain a processed second feature map; and
adding the first feature map and the processed second feature map to obtain the overall feature map.
In a possible embodiment, the target object includes a first object whose tube diameter is less than or equal to a preset pixel threshold and a second object whose tube diameter is greater than the preset pixel threshold.
Performing, based on the overall feature map and the detail feature map, the extraction operation for the centerline of the target object to obtain the fused feature map in mask form includes:
extracting the centerline of the target object in the overall feature map to obtain first centerline information;
multiplying the first centerline information by the detail feature map to obtain centerline information of the first object;
subtracting the centerline information of the first object from the first centerline information to obtain centerline information of the second object;
performing a dilation operation on the centerline indicated by the centerline information of the first object, and obtaining, based on the dilated centerline information of the first object, the overall feature map and the centerline information of the first object before dilation, first rest information of the first object other than its centerline information;
obtaining, based on the dilated centerline information of the first object, the centerline information of the second object and the overall feature map, second rest information of the second object other than its centerline information;
determining background information in the first image based on the difference between a preset value and the overall feature map; and
obtaining the fused feature map based on the background information, the centerline information of the first object, the first rest information, the centerline information of the second object and the second rest information.
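For illustration only, the construction of the mask-form fused feature map described above may be sketched as follows. The label values, the use of scikit-image morphology and the exact boolean combinations are assumptions consistent with the steps above, not a definitive implementation.

```python
import numpy as np
from skimage.morphology import skeletonize, binary_dilation

def build_fused_mask(whole: np.ndarray, detail: np.ndarray) -> np.ndarray:
    """Hypothetical 5-class mask: 0 background, 1 small-object centerline,
    2 small-object rest, 3 large-object centerline, 4 large-object rest."""
    whole = whole.astype(bool)
    detail = detail.astype(bool)
    center_all = skeletonize(whole)                       # first centerline information
    center_small = center_all & detail                    # centerline of small-diameter objects
    center_large = center_all & ~center_small             # centerline of large-diameter objects
    dilated_small = binary_dilation(center_small)         # dilated small-object centerline
    rest_small = dilated_small & whole & ~center_small    # rest of the small objects
    rest_large = whole & ~dilated_small & ~center_large   # rest of the large objects
    mask = np.zeros(whole.shape, dtype=np.uint8)          # background = preset value - whole
    mask[center_small] = 1
    mask[rest_small] = 2
    mask[center_large] = 3
    mask[rest_large] = 4
    return mask
```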
In a possible embodiment, the second module includes a third convolution unit and a fourth convolution unit; the third convolution unit includes a second encoder and a second decoder arranged in a U-shaped network structure. The second encoder includes at least two feature extraction subunits, which comprise, connected in sequence, at least one block composed of at least one attention layer and a pooling layer connected to the attention layer, and at least one block composed of at least one convolution layer and a pooling layer connected to the convolution layer. The second decoder includes at least two feature fusion subunits, which comprise, connected in sequence, at least one block composed of at least one convolution layer and an upsampling layer connected to the convolution layer, and at least one block composed of at least one attention layer and an upsampling layer connected to the attention layer. Each convolution layer and each attention layer in the second decoder is connected to the fourth convolution unit.
Performing, for the first image and the fused feature map, segmentation processing of the target object through the attention-based second module in the image segmentation model to obtain the segmentation result map corresponding to the target object includes:
performing, by the second encoder and based on an attention mechanism combined with the fused feature map, a feature extraction operation with respect to the target object on the first image;
performing, by the second decoder and based on an attention mechanism combined with the fused feature map, a feature fusion operation with respect to the target object on the output of the second encoder; and
performing, by the fourth convolution unit, a convolution operation with respect to the target object on the outputs of the convolution layers and attention layers in the second decoder to obtain the segmentation result map corresponding to the target object.
In a possible embodiment, the attention layer includes a first branch for performing attention computation based on an input feature map and a second branch for extracting tokens based on the fused feature map. The first branch includes a first convolution network, an attention network and a second convolution network connected in sequence, the output of the first convolution network being connected to the output of the second convolution network via a skip connection; the second branch is connected to the attention network.
Performing, by the attention layer, attention computation on the input feature map and the fused feature map includes:
performing a convolution operation on the input feature map through the first convolution network to obtain a feature map to be processed;
computing, through the attention network, a query for each subgraph output by the second branch based on the feature map to be processed, computing a key and a value for each token in each subgraph output by the second branch, and performing attention computation based on the queries, keys and values to obtain an attention feature map; and
performing a convolution operation on the attention feature map through the second convolution network, and adding the output of the second convolution network to the output of the first convolution network to obtain the feature map output by the current attention layer.
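A rough PyTorch-style sketch of such an attention layer is given below. The layer widths, the use of nn.MultiheadAttention and the omission of the per-subgraph (window) partitioning are assumptions made here for brevity; only the overall structure (conv, cross-attention against tokens, conv, skip connection) follows the description above.

```python
import torch
import torch.nn as nn

class CenterlineAttentionLayer(nn.Module):
    """Illustrative attention layer: queries come from the convolved input
    feature map, keys/values from tokens extracted by the second branch."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # channels is assumed to be divisible by num_heads.
        self.conv_in = nn.Conv2d(channels, channels, 3, padding=1)   # first convolution network
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.conv_out = nn.Conv2d(channels, channels, 3, padding=1)  # second convolution network

    def forward(self, x: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) input feature map; tokens: (B, N, C) from the second branch.
        feat = self.conv_in(x)                                   # feature map to be processed
        b, c, h, w = feat.shape
        queries = feat.flatten(2).transpose(1, 2)                # one query per spatial position
        attended, _ = self.attn(queries, tokens, tokens)         # keys/values from the tokens
        attended = attended.transpose(1, 2).reshape(b, c, h, w)  # attention feature map
        return feat + self.conv_out(attended)                    # skip from the first conv output
```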
In a possible embodiment, extracting, by the second branch, tokens based on the fused feature map includes:
dividing the fused feature map to obtain at least two subgraphs; and
performing an aggregation operation on the mask information included in each subgraph to obtain a token corresponding to each kind of information;
wherein the mask information includes background information, centerline information of a first object, first rest information, centerline information of a second object and second rest information; the first object includes objects in the target object whose tube diameter is less than or equal to a preset pixel threshold; the second object includes objects in the target object whose tube diameter is greater than the preset pixel threshold; the first rest information includes information of the first object other than its centerline information; and the second rest information includes information of the second object other than its centerline information.
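The token extraction could, for example, be realized as masked average pooling per subgraph and per mask class, as in the following sketch. The window size, the choice of averaging as the aggregation operation and the tensor shapes are assumptions.

```python
import torch

def extract_tokens(feature_map: torch.Tensor, fused_mask: torch.Tensor,
                   window: int = 16, num_classes: int = 5) -> torch.Tensor:
    """Split the fused mask into windows ("subgraphs") and average the feature
    vectors of each mask class inside each window (illustrative only)."""
    b, c, h, w = feature_map.shape
    tokens = []
    for y in range(0, h, window):
        for x in range(0, w, window):
            feat = feature_map[:, :, y:y + window, x:x + window]
            mask = fused_mask[:, y:y + window, x:x + window]
            for cls in range(num_classes):  # background, small/large centerline, small/large rest
                sel = (mask == cls).unsqueeze(1).float()
                denom = sel.sum(dim=(2, 3)).clamp(min=1.0)
                tokens.append((feat * sel).sum(dim=(2, 3)) / denom)  # (B, C) aggregated token
    return torch.stack(tokens, dim=1)  # (B, num_windows * num_classes, C)
```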
In a possible embodiment, the image segmentation model is trained as follows:
acquiring training data, the training data including a plurality of training images, each training image carrying real labels corresponding to real points on the centerline of the target object; and
iteratively training the image segmentation model based on the training data until a preset stopping condition is met, to obtain the trained image segmentation model;
wherein, in each training iteration, a target loss value between the predicted value and the true value output by the model is calculated through a preset loss function, based on the real labels and the predicted labels of the predicted points corresponding to the centerline of the target object in the predicted segmentation map output by the model, and the network parameters of the model are adjusted based on the target loss value.
In a possible embodiment, the preset loss function includes a first function that uses a joint loss function.
Calculating, through the preset loss function, the target loss value between the predicted value and the true value output by the model, based on the predicted labels and the real labels of the predicted points corresponding to the centerline of the target object in the predicted segmentation map output by the model, includes:
for each real point in the current training image, determining a weight value based on the distance from the current pixel to the real point; and
calculating, through the first function, a joint loss value between the predicted value and the true value output by the model based on the weight values, the predicted labels and the real labels, and determining the target loss value based on the joint loss value.
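One plausible realization of such a distance-weighted joint loss is sketched below. The choice of cross-entropy plus Dice as the joint loss, the exponential decay of the weight with distance, and the value of sigma are assumptions; the patent text only specifies that the weight depends on the distance from each pixel to the real centerline points.

```python
import torch
import torch.nn.functional as F
from scipy import ndimage

def weighted_joint_loss(logits, target, centerline, sigma=5.0, eps=1e-6):
    """Illustrative joint (cross-entropy + Dice) loss with per-pixel weights
    that decay with the distance to the nearest real centerline point.
    A binary segmentation task is assumed."""
    # logits: (B, 2, H, W); target, centerline: (B, H, W), centerline in {0, 1}.
    dist = torch.stack([
        torch.as_tensor(ndimage.distance_transform_edt(1 - m), dtype=logits.dtype)
        for m in centerline.cpu().numpy()]).to(logits.device)
    weight = torch.exp(-dist / sigma)                       # larger weight near the centerline
    ce = F.cross_entropy(logits, target, reduction="none")  # per-pixel cross-entropy
    ce = (weight * ce).sum() / weight.sum()
    prob = torch.softmax(logits, dim=1)[:, 1]               # foreground probability
    fg = (target == 1).float()
    dice = 1 - (2 * (weight * prob * fg).sum() + eps) / ((weight * (prob + fg)).sum() + eps)
    return ce + dice                                        # joint loss value
```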
In a possible embodiment, the preset loss function further includes a second function, and the second function includes a regularization term based on a shape constraint.
Calculating, through the preset loss function, the target loss value between the predicted value and the true value output by the model, based on the predicted labels and the real labels of the predicted points corresponding to the centerline of the target object in the predicted segmentation map output by the model, includes:
for each first object in the current training image, performing a sampling operation based on the predicted points corresponding to the predicted labels to obtain predicted sampling information, and performing a sampling operation based on the real points corresponding to the real labels to obtain real sampling information, the first object being an object in the target object whose tube diameter is less than or equal to a preset pixel threshold;
calculating, through the second function, a cosine similarity value between the predicted value and the true value output by the model based on the predicted sampling information and the real sampling information; and
determining the sum of the joint loss value and the cosine similarity value as the target loss value.
In a possible embodiment, the sampling operation includes: for each point on the centerline of the first object, sparsely sampling other points on the centerline with that point as the center and half of the average tube diameter of the first object as the sampling interval, to obtain a preset number of sampling points related to that point, and taking the position information corresponding to the sampling points as the sampling information.
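The shape-constraint term could, for instance, compare local centerline directions obtained from the sampled positions, as in the sketch below. The use of direction vectors between consecutive samples and the 1 - cosine convention are assumptions; the description above only states that a cosine-similarity-based value computed from the sampled positions is combined with the joint loss.

```python
import torch
import torch.nn.functional as F

def centerline_direction_term(pred_points: torch.Tensor,
                              true_points: torch.Tensor) -> torch.Tensor:
    """Illustrative shape-constraint term for one small-diameter object."""
    # pred_points / true_points: (N, 2) positions sampled along the centerline,
    # spaced by half of the object's average tube diameter.
    pred_dirs = F.normalize(pred_points[1:] - pred_points[:-1], dim=-1)
    true_dirs = F.normalize(true_points[1:] - true_points[:-1], dim=-1)
    cos = F.cosine_similarity(pred_dirs, true_dirs, dim=-1)
    # One possible convention: penalize direction mismatch as 1 - cos.
    return (1 - cos).mean()
```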
In a possible embodiment, acquiring the training data includes dividing the acquired training data into a training set, a validation set and a test set, the training results corresponding to the validation set being used to determine the hyperparameters of the model;
wherein the training data includes a data set corresponding to at least one of a fundus vessel segmentation task, a coronary artery segmentation task and a three-dimensional tubular structure segmentation task; and
the network parameters of the model are adjusted using stochastic gradient descent, with training for a set number of iteration rounds in an initial stage at a set first learning rate and training for a set number of iteration rounds in a later stage at a set second learning rate, the first learning rate being greater than the second learning rate.
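For illustration, such a two-stage schedule might be set up as follows; the concrete learning rates, momentum, weight decay and epoch counts are assumptions, since the description above does not fix them.

```python
import torch

def make_optimizer_and_scheduler(model, warm_epochs=100,
                                 lr_high=1e-2, lr_low=1e-3):
    """Illustrative SGD setup: a larger learning rate for the initial
    iteration rounds, then a smaller one for the later stage."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_high,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda epoch: 1.0 if epoch < warm_epochs else lr_low / lr_high)
    return optimizer, scheduler
```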
In a second aspect, an embodiment of the present application provides an image segmentation apparatus, including:
a first segmentation module, configured to perform, for a first image and a second image obtained by downsampling the first image, segmentation processing of a target object through a first module in a pre-trained image segmentation model to obtain a corresponding first feature map and a corresponding second feature map, the target object being a tubular structure object included in the first image;
a feature fusion module, configured to fuse the first feature map and the second feature map to obtain a fused feature map, the fused feature map being used to characterize the centerline of the tubular structure of the target object; and
a second segmentation module, configured to perform, for the first image and the fused feature map, segmentation processing of the target object through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the method provided in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided in the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method provided in the first aspect described above.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application are as follows:
The embodiments of the present application provide an image segmentation method. Specifically, when a first image requiring segmentation is obtained, a second image can be obtained by downsampling the first image, and segmentation processing of the target object is performed on the first image and the second image respectively through a first module in a pre-trained image segmentation model, yielding a first feature map corresponding to the first image and a second feature map corresponding to the second image, where the target object is a tubular structure object included in the image. The first feature map and the second feature map can then be fused to obtain a fused feature map characterizing the centerline of the tubular structure of the target object. On this basis, for the first image and the fused feature map, segmentation processing of the target object can be performed through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object. With this implementation, segmentation of the target object is performed on input images of different scales to obtain corresponding feature maps, the feature maps are fused to obtain a centerline representation of the tubular structure, and, on this basis, attention-based segmentation of the first image is performed in combination with the fused feature map. This effectively improves the segmentation accuracy for tubular structures; in particular, performing image segmentation in combination with the extracted centerline improves the ability to segment target objects with small tube diameters, thereby improving network performance on tubular structure segmentation tasks.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a flowchart of an image segmentation method according to an embodiment of the present application;
FIG. 2 is a schematic illustration of a segmentation of a tubular structure according to an embodiment of the present application;
fig. 3 is a schematic view of a fundus blood vessel according to an embodiment of the present application;
FIG. 4 is a schematic illustration of a coronary artery according to an embodiment of the present application;
fig. 5 is a schematic architecture diagram of an image segmentation model according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of feature fusion according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of attention computation according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a sampling point according to an embodiment of the present disclosure;
fig. 9 is a schematic flow chart of an application scenario provided in an embodiment of the present application;
FIG. 10 is a flowchart of another image segmentation method according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of comparison of segmentation results according to an embodiment of the present disclosure;
fig. 12 is a schematic diagram of an image segmentation apparatus according to an embodiment of the present application;
Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application and do not limit those technical solutions.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising", when used in this application, specify the presence of stated features, information, data, steps, operations, elements and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components and/or groups thereof, all of which may be included in the present application. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiments of the present application relate to artificial intelligence (AI). Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
In particular, embodiments of the present application relate to techniques such as computer vision and deep learning.
Computer vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition and measurement on a target, and further performs graphic processing so that the result is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. Deep learning is a branch of machine learning: an algorithm that performs representation learning on data with artificial neural networks as the framework. It can simulate the human brain and perform different degrees of perceptual analysis of the real world.
With the research and advancement of artificial intelligence technology, artificial intelligence is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare and smart customer service. It is believed that, as technology develops, artificial intelligence will be applied in ever more fields and bring increasing value.
In the related art, to accomplish tubular structure segmentation, the recognition capability for edges is improved by directly supervising the edge portion of the tubular structure. However, for small tubes (tubes with small diameters), the weights of the interior and edge classes are difficult to balance. If the weight of the edge portion is too high, most of the interior area is also classified as edge; if the edge weight is too low, some edge pixels are classified as tube interior. The segmentation effect of this method is therefore poor, and this problem prevents it from accurately segmenting small tubes.
In view of at least one technical problem in the prior art, the embodiments of the present application provide a method for correcting features based on the centerline of a tubular structure. Implementing this method can increase the discriminability of features at the edges and improve the accuracy of tubular structure segmentation.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
The following describes an image segmentation method in the embodiment of the present application.
Specifically, the execution subject of the method provided in the embodiment of the present application may be a terminal or a server; the terminal (may also be referred to as a device) may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device (e.g., a smart speaker), a wearable electronic device (e.g., a smart watch), a vehicle-mounted terminal, a smart home appliance (e.g., a smart television), an AR/VR device, and the like. The server may be an independent physical server, a server cluster formed by a plurality of physical servers or a distributed system (such as a distributed cloud storage system), or a cloud server for providing cloud computing and cloud storage services.
In an example, the execution body may be a server. When the application scenario is medical image analysis, a detection apparatus may acquire a first image (for example, the image on the left side of fig. 2) and send it to the server; the server executes the image segmentation method provided in the embodiments of the present application and feeds a segmentation result map (for example, the image on the right side of fig. 2) back to a result display apparatus. As shown in fig. 9, in an online application, a first image may be acquired through front end A, the image segmentation method provided in the embodiments of the present application is then executed at the back end, and the segmentation result map is output to front end B for display.
When the embodiments of the present application are applied to the medical field, it is considered that many structures in the human body are tubular, such as blood vessels, the trachea and nerve fibers. Segmentation of tubular structures can aid in the diagnosis of disease on the one hand, and can be used for preoperative modeling and intraoperative navigation of interventional procedures on the other. For example, intraretinal microvascular abnormalities and neovascularization are important grounds for staging diabetic retinopathy. In percutaneous coronary intervention, real-time coronary artery segmentation can provide navigation to ensure that the catheter safely reaches the intended position. In bronchoscopy, a model of the tracheobronchial tree (for example, determining the position of the trachea by image segmentation) needs to be built from CT images before the operation, and the route of the bronchoscope is planned accordingly.
In an example, the image segmentation method provided in the embodiments of the present application is performed by a neural network, and training the neural network yields an image segmentation model suitable for the embodiments of the present application. As shown in fig. 5, the image segmentation model may include a first module and a second module; the input of the first module is the input of the image segmentation model, and the input of the second module includes the input of the image segmentation model and the output of the first module.
Specifically, as shown in fig. 1, the image segmentation method includes steps S101 to S103:
step S101: aiming at a first image and a second image obtained based on downsampling of the first image, respectively carrying out segmentation processing on a target object through a first module in a pre-trained image segmentation model to obtain a corresponding first feature map and a corresponding second feature map; the target object is a tubular structure object included in the first image.
The first image may be any image including a tubular structural object, for example, when the first image is suitable for the medical field, it may be a fundus image (including a fundus blood vessel as shown in fig. 3), an X-ray angiography image (including a coronary artery as shown in fig. 4), a CT image (including a trachea), or the like; for example, when the circuit is suitable for the field of electronic circuits, the circuit can be a PCB layout schematic diagram (including circuits) and the like. The first image may be a different image with tubular structure objects, adapted to different application fields and application scenarios.
Wherein the second image is an image downsampled based on the first image, which is different from the first image in size. Optionally, considering that the tubular structure object with the smaller pipe diameter may be in an invisible state after downsampling, to ensure the accuracy of segmentation prediction on the object with the smaller pipe diameter to a certain extent, a downsampling operation with a smaller multiple (such as 2 times) is adopted, so that missing of some objects which can be segmented in the original size due to the downsampling operation is avoided.
Optionally, the first image and the second image belong to two mutually independent inputs with respect to a first module of the image segmentation model, the first module processes the first image and the second image respectively, and outputs a first feature map and a second feature map corresponding to each other. It may be appreciated that the segmentation processing of the target object is performed on the input image by the first module, which belongs to the segmentation processing of the first stage in the embodiment of the present application, and the output of the first module may be used as the basic data of the segmentation processing of the subsequent stage to assist the segmentation processing of the subsequent stage.
Step S102: fusing the first feature map and the second feature map to obtain a fused feature map; the fused feature map is used to characterize the centerline of the tubular structure of the target object.
Since the first image and the second image are images of different sizes, the corresponding first feature map and second feature map are also feature maps of different sizes, and the processing in step S102 fuses features of different sizes.
The first feature map and the second feature map are image segmentation results, so the centerline of the tubular structure of the target object (also called the skeleton line in the embodiments of the present application; the width of the skeleton line may be 1 pixel) can be extracted from them. Accordingly, performing segmentation on images of different sizes and then fusing the segmentation results can effectively enrich the features learned by the model and improve the network's performance on the segmentation task.
Optionally, the centerline of the target object may be obtained by marking points along the tube of the target object (the midpoint of the tube may be marked) and connecting them; that is, when the centerline is referred to as the skeleton line of the tube, each point on the centerline may be referred to as a skeleton point.
Step S103: for the first image and the fused feature map, performing segmentation processing of the target object through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object.
The input of the second module includes the first image input to the image segmentation model and the fused feature map obtained by fusing the outputs of the first module. That is, the second module can perform segmentation processing of the target object on the first image in combination with the segmentation results of the first module, so that feature extraction and correction along the direction of the tube centerline are performed on the basis of the prior knowledge of tubular structures (a local tubular structure in the image has strong consistency along the direction of the skeleton line). This increases the discriminability of features at the tube edges, further improves the accuracy of the model's segmentation of the tubular structure, and effectively improves network performance.
The second module processes its input based on an attention mechanism, so that similar pixels obtain higher similarity, improving the accuracy of segmenting the tubular structure. For example, if a pixel at the tube edge is more similar to the representation at the skeleton line, its similarity to that representation is further increased after the attention processing; conversely, if its representation is more similar to the background, the similarity between the two is increased instead.
The following describes the specific operation steps of the first module of the image segmentation model in the embodiments of the present application.
In a possible embodiment, the first module comprises a first convolution unit and a second convolution unit; the first convolution unit includes a first encoder and a first decoder.
When the first module is provided with 8 blocks, the first 7 blocks are connected to form the first convolution unit, and the 8th block is the second convolution unit, as shown in fig. 5. Optionally, the first convolution unit may be divided into an encoding part and a decoding part.
For the encoding part, the first encoder may include at least two feature extraction subunits, each comprising at least one convolution layer and a pooling layer connected to the convolution layer. That is, each feature extraction subunit corresponds to one block of the first module. Optionally, as shown in fig. 5, a feature extraction subunit may be laid out with 3 convolution layers (CONV) and 1 pooling layer (POOL) connected in sequence. Here a "CONV" layer represents a convolution block consisting of a 3x3 convolution layer, an instance normalization (IN) layer and an activation layer (e.g., using a ReLU activation function). Optionally, the number of convolution layers in a feature extraction subunit may be adjusted as required, which is not limited in this application. A "POOL" layer represents a max pooling (MaxPooling) layer; in one example, the pooling layer may be a downsampling layer.
For the decoding part, the first decoder may include at least two feature fusion subunits, each comprising at least one convolution layer and an upsampling layer connected to the convolution layer. That is, each feature fusion subunit corresponds to one block of the first module. Optionally, as shown in fig. 5, a feature fusion subunit may be laid out with a convolution layer (CONV) and an upsampling layer (UP) connected in sequence, where an "UP" layer represents an upsampling layer based on linear interpolation.
The first encoder and the first decoder are arranged symmetrically, and the output of each feature extraction subunit of the first encoder is also connected to the input of the corresponding feature fusion subunit of the first decoder; each convolution layer in the first decoder is connected to the second convolution unit.
Optionally, as shown in fig. 5, the first convolution unit of the first module adopts a U-Net network structure, and a feature-pyramid-style structure is used between the decoding part of the first convolution unit and the second convolution unit, so that the output of each convolution layer in the first decoder is connected to the input of the second convolution unit. The 4th block shown in fig. 5 serves as the bottom layer of the U-shaped network layout and can be regarded as belonging to both the encoding part and the decoding part. When the encoding part is provided with 3 pooling layers, the decoding part is correspondingly provided with 3 upsampling layers. Optionally, the output of each convolution layer of the decoding part of the first module passes through a 1x1 convolution that compresses the number of channels to 2, the result is restored to the original image size by upsampling ("UP" layers), the 12 channels of features from the last 6 convolution layers (the last 6 convolution layers in the first convolution unit, 6 layers x 2 channels) are stacked together, and finally a 1x1 convolution (the second convolution unit) produces the final segmentation prediction.
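For illustration only, the deep-supervision head described above might be sketched as follows; the decoder channel counts and the output channel count are assumptions, and only the 1x1 compression to 2 channels, the upsampling, the stacking of 12 channels and the final 1x1 convolution follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHead(nn.Module):
    """Illustrative head for the first module: each decoder convolution output
    is compressed to 2 channels by a 1x1 convolution, upsampled to the input
    size, stacked (6 layers x 2 channels = 12), and mapped to the prediction
    by a final 1x1 convolution (the second convolution unit)."""
    def __init__(self, decoder_channels=(256, 128, 64, 64, 32, 32), num_classes=2):
        super().__init__()
        self.compress = nn.ModuleList(
            [nn.Conv2d(c, 2, kernel_size=1) for c in decoder_channels])
        self.final = nn.Conv2d(2 * len(decoder_channels), num_classes, kernel_size=1)

    def forward(self, decoder_feats, out_size):
        stacked = [F.interpolate(conv(f), size=out_size, mode="bilinear",
                                 align_corners=False)
                   for conv, f in zip(self.compress, decoder_feats)]
        return self.final(torch.cat(stacked, dim=1))
```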
It should be noted that fig. 5 is merely an exemplary network structure; the numbers of blocks, convolution layers, pooling layers, upsampling layers and the like may be arranged as required, which is not limited by the embodiments of the present application.
Alternatively, the network structure for performing the downsampling operation on the first image may also be considered as a part of the first module, that is, the first module is further configured with a downsampling unit before the first convolution unit, where the downsampling unit is configured to downsample the input image to obtain a second image with a different size than the original input image.
In a possible embodiment, as shown in fig. 10, performing, for the first image and the second image obtained by downsampling the first image, segmentation processing of the target object through the first module in the pre-trained image segmentation model to obtain the corresponding first feature map and second feature map includes performing the following steps S101a to S101c on the first image and the second image respectively:
Step S101a: performing, by the first encoder, a feature extraction operation with respect to the target object on the input image (the first image or the second image).
Optionally, as shown in fig. 5, in the first 4 sequentially connected blocks of the first module, the input image is downsampled through convolution and max pooling to obtain four levels of feature maps.
Step S101b: performing, by the first decoder, a feature fusion operation with respect to the target object on the output of the first encoder.
Optionally, as shown in fig. 5, in the 5th to 7th sequentially connected blocks of the first module, each level of feature map output by the first encoder is fused, via skip connections, with the corresponding upsampled feature map.
Step S101c: performing, by the second convolution unit, a convolution operation with respect to the target object on the outputs of the convolution layers in the first decoder to obtain the feature map output by the first module.
Optionally, as shown in fig. 5, in the last block of the first module, the feature maps of each level output by the first decoder are convolved to obtain the feature map corresponding to the input image (such as the semantic segmentation prediction).
The feature fusion part of the embodiments of the present application is described in detail below.
Optionally, in terms of network structure, the feature fusion part may be a part of the first module of the image segmentation model, i.e., a feature fusion unit connected after the second convolution unit; alternatively, it may be an independent feature fusion module of the image segmentation model arranged between the first module and the second module, i.e., the output of the first module serves as the input of the feature fusion module.
Optionally, a first-stage segmentation result (the fused feature map) is obtained by fusing the predictions for the first image and the second image according to a certain strategy; this result can be used to extract the skeleton line of the tubular structure, which in turn is used for the attention computation of the second stage (performed by the second module of the image segmentation model).
In a possible embodiment, as shown in fig. 10, fusing the first feature map and the second feature map in step S102 to obtain the fused feature map includes the following steps A1 to A4:
Step A1: upsampling the second feature map so that the size of the upsampled second feature map is consistent with that of the first feature map.
As shown in fig. 6, since the first image and the second image are images of different sizes, the first feature map and the second feature map are also feature maps of different sizes, so before feature fusion the size of the second feature map needs to be restored to the size of the first feature map. Optionally, given that the second image is obtained by 2x downsampling, a 2x upsampling operation may be performed on the second feature map to obtain the upsampled second feature map.
Step A2: adding the first feature map and the upsampled second feature map to obtain an overall feature map.
As shown in fig. 6 (taking a fundus angiogram as an example), comparing the prediction at the original size (the first feature map) with the prediction at the small size (the second feature map), it can be seen that the small-size prediction is more sensitive to the thick blood vessels of the peripheral region, while the original-size prediction, limited by the receptive field of the convolution kernel, has some missed detections. The embodiments of the present application complement the missing parts of the original-size prediction by adding the two predictions (the flow shown in the lower half of fig. 6) to obtain the overall feature map, so as to enrich and refine the features learned by the network.
In a possible embodiment, as shown in fig. 10, in step A2, an addition operation is performed on the first feature map and the up-sampled second feature map to obtain an overall feature map, which includes the following steps a21 to a23:
step A21: and carrying out connected component analysis on the up-sampled second feature map so as to remove false positive areas in the second feature map and obtain an analyzed second feature map.
Step A22: and extracting the center line of the target object in the graph aiming at the analyzed second characteristic graph, and performing expansion operation on the center line to obtain a processed second characteristic graph.
Step A23: and adding the first characteristic diagram and the processed second characteristic diagram to obtain an overall characteristic diagram.
As shown in fig. 6 (taking a fundus angiogram as an example), the up-sampled second feature map is post-processed before the addition operation with the first feature map, because the segmentation prediction for the second image may contain many false positives on the background, particularly around the periphery of the eyeball. The post-processing comprises the following three steps:
The first step is maximum connected-component extraction, which aims to remove the large peripheral false-positive regions in the up-sampled second feature map.
The second step is skeleton-line extraction, performed on the basis of the map with false-positive regions removed.
The third step is an image-processing dilation (expansion) operation applied to the extracted skeleton lines. Since the small-size prediction may be noticeably coarser after up-sampling than the original-size prediction, i.e. there are obvious false positives at the edges, the embodiment of the application retains only the skeleton-line portion and thickens it with the dilation operation in order to remove these edge false positives. Comparing the original-size prediction with the small-size prediction, as shown in fig. 6, the small-size prediction is more sensitive to the coarse vessels of the peripheral region, while the original-size prediction, limited by the receptive field of the convolution kernel, misses a few portions. By adding the two predictions, the parts missed by the original-size prediction can be complemented with the small-size prediction.
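The three post-processing steps may be sketched as follows, assuming scikit-image and SciPy as helper libraries; the function name and the number of dilation iterations are illustrative, not limiting:

```python
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.measure import label
from skimage.morphology import skeletonize

def postprocess_small_size_prediction(up_pred, dilation_iters=2):
    """Post-process the up-sampled small-size prediction before fusion.

    up_pred: binary prediction (H, W) already up-sampled to the original size.
    Returns a map keeping only the (thickened) skeleton of the largest
    connected component, which removes peripheral false positives.
    """
    # Step 1: keep the largest connected component only.
    labeled = label(up_pred > 0)
    if labeled.max() == 0:
        return np.zeros_like(up_pred)
    sizes = np.bincount(labeled.ravel())
    sizes[0] = 0                      # ignore the background label
    largest = labeled == sizes.argmax()

    # Step 2: extract the skeleton (bone) line.
    skeleton = skeletonize(largest)

    # Step 3: dilate the skeleton so the kept structure regains some width.
    thickened = binary_dilation(skeleton, iterations=dilation_iters)
    return thickened.astype(up_pred.dtype)
```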
Step A3: and performing subtraction operation on the first characteristic diagram and the up-sampled second characteristic diagram to obtain a detail characteristic diagram.
In order to better direct attention to the small ducts, the embodiment of the present application also extracts the small ducts based on the predictions at the two sizes (first feature map and second feature map), as shown in fig. 6 (taking a fundus angiogram as an example). Considering that an object of smaller tube diameter may be only one or two pixels wide in the image, it may become essentially invisible after downsampling, so small ducts that can be segmented at the original size may be missed in the small-size prediction (second feature map). In the embodiment of the application, the up-sampled small-size prediction (second feature map) is therefore subtracted from the original-size prediction (first feature map) to extract the small ducts and obtain a detail feature map; i.e. the flow shown in the upper part of fig. 6.
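As a non-limiting illustration, the addition and subtraction branches of the first-stage fusion might look as follows, assuming binary prediction maps clipped to [0, 1]:

```python
import numpy as np

def fuse_predictions(first_pred, up_second_pred, processed_second_pred):
    """First-stage fusion sketch; all inputs are assumed to be binary maps
    already brought to the same (original) size."""
    first = (np.asarray(first_pred) > 0).astype(np.float32)
    up_second = (np.asarray(up_second_pred) > 0).astype(np.float32)
    processed = (np.asarray(processed_second_pred) > 0).astype(np.float32)

    # Addition branch: supplement parts missed by the original-size prediction.
    whole = np.clip(first + processed, 0.0, 1.0)
    # Subtraction branch: keep small ducts visible only at the original size.
    detail = np.clip(first - up_second, 0.0, 1.0)
    return whole, detail
```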
Step A4: and based on the integral feature map and the detail feature map, extracting the center line of the target object to obtain a fusion feature map in a mask form.
Alternatively, for the overall prediction (overall feature map) obtained by addition and the small-duct prediction (detail feature map) obtained by subtraction, the center lines of all the ducts in these maps can be extracted and the remaining predictions fused, such as the parts other than the center line, e.g. duct-interior and edge pixels, so that more refined features are taken as inputs to the second module.
In a possible embodiment, the target objects include a first object having a pipe diameter less than or equal to a preset pixel threshold and a second object having a pipe diameter greater than the preset pixel threshold. For example, a pipe with a diameter of 3 pixels or less is referred to as a small pipe (first object), and a pipe with a diameter of more than 3 pixels as a large pipe (second object). Alternatively, the preset pixel threshold may be adjusted according to the actual segmentation object, which is not limited in this application.
Alternatively, the overall feature map may be denoted as P_whole and the detail feature map as P_detail. In step A4, based on the overall feature map and the detail feature map, an extraction operation is performed for the center line of the target object to obtain a fusion feature map in mask form, which includes the following steps A41 to A47:
Step A41: and extracting the center line of the target object in the integral feature map to obtain first center line information.
Wherein a skeleton-line extraction is applied to P_whole to obtain the overall skeleton line S_whole.
Step A42: and multiplying the first central line information with the detail characteristic diagram to obtain central line information of the first object.
Wherein the skeleton line of the small pipelines is calculated as S_small = S_whole * P_detail (element-wise multiplication).
Step A43: and subtracting the first central line information from the central line information of the first object to obtain the central line information of the second object.
Wherein the skeleton line of the large pipelines is calculated as S_large = S_whole - S_small.
Step A44: performing expansion operation on the center line indicated by the center line information of the first object, and obtaining first rest information of the first object except the center line information based on the center line information of the first object after expansion, the overall characteristic diagram and the center line information of the first object before expansion.
Wherein the other predictions of the small pipelines can be obtained, for example, as the first remaining information O_small = D(S_small) * P_whole - S_small, where D(·) represents the expansion operation.
Step A45: and obtaining second rest information of the second object except the central line information based on the central line information of the expanded first object, the central line information of the second object and the integral feature map.
Wherein the other predictions of the large pipelines can be calculated, for example, as the second remaining information O_large = P_whole - D(S_small) * P_whole - S_large, where D(·) represents the expansion operation.
Step A46: and determining background information in the first image based on a difference value between a preset numerical value and the integral feature map.
Wherein the background portion can be represented as B = 1 - P_whole; the preset value 1 indicates all pixel positions in the first image.
Step A47: and obtaining a fusion characteristic diagram based on the background information, the central line information of the first object, the first rest information, the central line information of the second object and the second rest information.
Wherein the results obtained in steps A41 to A46 can be stacked into a 5-channel fusion feature map, which can be expressed as M = [B, S_small, O_small, S_large, O_large].
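Steps A41 to A47 may be sketched as follows; the expressions used for the two "rest" channels follow one reading of steps A44 and A45, and the dilation amount is only an example:

```python
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.morphology import skeletonize

def build_fusion_mask(whole, detail, dilation_iters=2):
    """Stack the five mask channels of steps A41-A47.

    whole, detail: binary overall / detail maps of the same size.
    Returns an array of shape (5, H, W) ordered as
    [background, small-pipe skeleton, small-pipe rest,
     large-pipe skeleton, large-pipe rest].
    """
    whole = (np.asarray(whole) > 0).astype(np.float32)
    detail = (np.asarray(detail) > 0).astype(np.float32)

    s_whole = skeletonize(whole > 0).astype(np.float32)              # A41
    s_small = s_whole * detail                                       # A42
    s_large = s_whole - s_small                                      # A43

    d_small = binary_dilation(s_small > 0,
                              iterations=dilation_iters).astype(np.float32)
    o_small = np.clip(d_small * whole - s_small, 0.0, 1.0)           # A44
    o_large = np.clip(whole * (1.0 - d_small) - s_large, 0.0, 1.0)   # A45
    background = 1.0 - whole                                         # A46

    return np.stack([background, s_small, o_small, s_large, o_large])  # A47
```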
The following describes a specific network structure of the second module in the image segmentation model provided in the embodiment of the present application.
In a possible embodiment, the second module comprises a third convolution unit and a fourth convolution unit; the third convolution unit includes a second encoder and a second decoder.
The second encoder includes at least two feature extraction subunits comprising at least one block of at least one attention layer and a pooling layer connected to the attention layer, and at least one block of at least one convolution layer and a pooling layer connected to the convolution layer, connected in sequence.
The second decoder includes at least two feature fusion subunits comprising at least one block consisting of at least one convolutional layer and an upsampling layer connected to the convolutional layer, and at least one block consisting of at least one attention layer and an upsampling layer connected to the attention layer, connected in sequence.
The second encoder and the second decoder are symmetrically arranged, and the output end of the characteristic extraction subunit of the second encoder is also connected with the input end of the characteristic fusion subunit of the second decoder; each convolution layer and each attention layer in the second decoder are connected to the fourth convolution unit.
It will be appreciated that the second module differs from the first module in that, at the front end of the encoding portion and the back end of the decoding portion, the convolution (CONV) layers are replaced with attention layers (STT, Transformer with Skeleton-based Tokens). The input of an STT layer comprises the output feature map of the previous level and a mask used for the attention calculation, as shown in fig. 7.
In a possible embodiment, as shown in fig. 10, for the first image and the fused feature map in step S103, a segmentation process is performed on the target object by a second module based on an attention mechanism in the image segmentation model, so as to obtain a segmentation result map corresponding to the target object, which includes steps S103 a-S103 c:
Step S103a: and carrying out feature extraction operation on the target object on the first image based on an attention mechanism by the second encoder in combination with the fusion feature map.
Optionally, as shown in fig. 5, in the first 4 sequentially connected blocks of the second module, the input (for an STT layer, the output feature map of the previous level together with the fusion feature map; for a CONV layer, the output feature map of the previous level) undergoes four downsampling operations through convolution (or attention calculation) and max pooling, so as to obtain feature maps at four levels.
Step S103b: and performing, by the second decoder, a feature fusion operation with respect to the target object for an output of the second encoder based on an attention mechanism in combination with the fusion feature map.
Optionally, as shown in fig. 5, in the 5th to 7th sequentially connected blocks of the second module, the features at each level output by the second encoder are fused, via skip connections, with the feature maps obtained through deconvolution (for an STT layer, the fusion feature map is additionally included in the input).
Step S103c: and carrying out convolution operation on the target object according to the output of each convolution layer and each attention layer in the second decoder through the fourth convolution unit to obtain a segmentation result diagram corresponding to the target object.
Optionally, as shown in fig. 5, in the last block of the second module, the feature map of each level output by the second decoder is convolved to obtain a segmentation result map corresponding to the first image.
In a possible embodiment, as shown in fig. 7, the attention layer comprises a first branch for performing an attention calculation based on the input feature map and a second branch for extracting tokens based on the fused feature map.
The first branch comprises a first convolution network, an attention network and a second convolution network which are connected in sequence; the output end of the first convolution network is connected with the output end of the second convolution network in a jumping manner; the second branch is connected to the attention network.
Optionally, as shown in fig. 10, in the embodiment of the present application, attention calculation is performed by the attention layer for the input feature map and the fused feature map, including steps B1 to B3:
step B1: and carrying out convolution operation on the input feature map through the first convolution network to obtain a feature map to be processed.
Optionally, the input feature map is passed through the first convolution network, composed of a 3×3 convolution layer, an IN layer and a ReLU activation layer, for feature extraction, so as to obtain a feature map to be processed.
Optionally, as shown in fig. 10, for the input mask, in this embodiment of the present application, the token is extracted by the second branch based on the fused feature map, including steps C1-C2:
step C1: and dividing the fusion feature map to obtain at least two subgraphs.
As shown in fig. 7, the input fusion feature map is divided into 4 square small blocks, that is, 4 subgraphs are obtained, which can be recorded as 4 local areas. Alternatively, the division manner of the fusion feature map may be adjusted according to the requirement, for example, dividing the fusion feature map into two rectangular small blocks, or dividing the fusion feature map into irregularly shaped areas, which is not limited in this application.
Step C2: and performing aggregation operation on the mask information included in each sub-graph to obtain tokens corresponding to the information. The mask information comprises background information, central line information of a first object, first rest information, central line information of a second object and second rest information; the first object comprises an object with the pipe diameter smaller than or equal to a preset pixel threshold value in the target object; the second object comprises an object with the pipe diameter larger than the preset pixel threshold value in the target object; the first rest information comprises information except centerline information in the first object; the second remaining information includes information of the second object other than centerline information.
Wherein, as shown in fig. 7, within each local area (i.e. for each sub-graph), each channel of the mask M = [B, S_small, O_small, S_large, O_large] is aggregated into one token. The embodiment of the application may extract the tokens by averaging, i.e. a token is the average of the features of the pixels whose value is 1 in the corresponding channel. In step C2, 5 tokens can thus be extracted in each local area, corresponding respectively to the background information, the small-pipeline skeleton line, the rest of the small pipelines (the first remaining information), the large-pipeline skeleton line, and the rest of the large pipelines (the second remaining information).
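A minimal sketch of the token aggregation within one local area, assuming PyTorch tensors and the averaging rule described above (the function name is illustrative):

```python
import torch

def extract_tokens(features, mask):
    """Aggregate one token per mask channel within one local area.

    features: (C, H, W) feature tensor of the local area.
    mask:     (5, H, W) fusion mask of the same area, values in {0, 1}.
    Returns:  (5, C) tokens, each being the average feature of the pixels
              whose value is 1 in the corresponding mask channel.
    """
    c, h, w = features.shape
    flat_feat = features.reshape(c, h * w)               # (C, HW)
    flat_mask = mask.reshape(mask.shape[0], h * w)       # (5, HW)
    counts = flat_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return flat_mask @ flat_feat.t() / counts            # (5, C)

# Illustrative usage with random data.
tokens = extract_tokens(torch.rand(32, 16, 16),
                        (torch.rand(5, 16, 16) > 0.5).float())
print(tokens.shape)  # torch.Size([5, 32])
```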
Step B2: and calculating an inquiry value for each sub-graph based on the feature graph to be processed through the attention network, calculating a key and a value for each token in each sub-graph, and performing attention calculation based on the inquiry value, the key and the value to obtain an attention feature graph.
In this embodiment, as shown in fig. 7, within each local area (i.e. for each sub-graph), a query Q may be calculated at each pixel point using a linear layer, and a key K and a value V may be calculated for each token. The ATTN layer then performs the calculation in the self-attention manner of the Transformer, e.g. Attention(Q, K, V) = softmax(Q K^T / √d) V, so as to obtain the attention feature map output by the ATTN layer.
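The pixel-to-token attention calculation may be sketched as follows; the linear projections, the feature dimension and the usage values are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def token_attention(features, tokens, q_proj, k_proj, v_proj):
    """Scaled dot-product attention between pixels (queries) and the
    5 region tokens (keys / values).

    features: (C, H, W) region features; tokens: (5, C);
    q_proj, k_proj, v_proj: torch.nn.Linear layers mapping C -> D.
    """
    c, h, w = features.shape
    q = q_proj(features.reshape(c, h * w).t())     # (HW, D) one query per pixel
    k = k_proj(tokens)                             # (5, D)  one key per token
    v = v_proj(tokens)                             # (5, D)  one value per token
    attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)  # (HW, 5)
    out = attn @ v                                 # (HW, D)
    return out.t().reshape(-1, h, w)               # (D, H, W)

# Illustrative usage with random data (tokens computed as channel-wise
# masked averages, as in the aggregation sketch above).
feats = torch.rand(32, 16, 16)
mask = (torch.rand(5, 16, 16) > 0.5).float()
flat_m = mask.reshape(5, -1)
tokens = flat_m @ feats.reshape(32, -1).t() / flat_m.sum(1, keepdim=True).clamp(min=1)
proj_q, proj_k, proj_v = (torch.nn.Linear(32, 32) for _ in range(3))
out = token_attention(feats, tokens, proj_q, proj_k, proj_v)   # (32, 16, 16)
```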
Step B3: and performing convolution operation on the attention feature map through the second convolution network to obtain the output of the second convolution network, and adding the output to the output of the first convolution network to obtain the feature map of the current attention layer output.
Alternatively, as shown in FIG. 7, after the attention feature map is processed by a 1×1 convolution layer, a Layer Normalization layer and a GeLU activation layer, it is added via a skip link to the features output by the first convolution network, and the final output of the STT layer can be obtained.
The tokens in the STT layer provided in this embodiment extract feature expressions of the skeleton line, the background and the other foreground positions. For a pixel at an edge position, if its features are more similar to those of the skeleton line, the similarity is further increased after the attention calculation; conversely, if its features are more similar to those of the background, the pixel becomes even more background-like after the attention calculation. Because the STT layer corrects the features of each pixel based on attention to the skeleton points, the discriminability of the edge parts can be effectively improved, and the segmentation accuracy is ultimately increased.
The following specifically describes training content of an image segmentation model in an embodiment of the present application.
In a possible embodiment, the image segmentation model is obtained by training the following operations corresponding to step D1-step D2:
step D1: training data is acquired, the training data including a plurality of training images, and each training image having a real marker corresponding to a real point on a centerline of the target object.
Wherein the real markers may indicate location information of each real point. The centerline of the same target object includes a plurality of real points. It will be appreciated that the actual marking is information obtained by labeling the target object in the training image.
Step D2: and carrying out iterative training on the image segmentation model based on the training data until a preset stopping condition is met, so as to obtain the trained image segmentation model.
Alternatively, the preset stopping condition may be that the number of iterative training rounds reaches a preset number. Different training data are adopted for the segmentation tasks of different scenes, and when the training data differ, the number of training rounds required by the model can be adjusted according to the requirements of the segmentation task.
In each iteration training, a target loss value between a predicted value and a true value output by a model is calculated based on a predicted mark and the true mark of a predicted point corresponding to a central line of the target object in a predicted segmentation graph output by the model through a preset loss function, and network parameters of the model are adjusted based on the target loss value. Alternatively, the predicted values output by the model include a map of the segmented prediction results obtained by processing the model, in which each predicted point forming the centerline of the target object is predicted. The true values include respective true points annotated for the centerline of the training image corresponding to the target object.
In a possible embodiment, the preset loss function comprises a first function employing a joint loss function.
In step D2, a target loss value between a predicted value and a true value output by a model is calculated based on a predicted mark and the true mark corresponding to a predicted point on a central line of the target object in a predicted segmentation graph output by the model through a preset loss function, and the method includes steps D21-D22:
step D21: for each real point in the current training image, determining a weight value of the distance from the current pixel to the real point.
Step D22: and calculating a joint loss value between the predicted value and the true value output by the model based on the weight value, the predicted mark and the true mark through the first function, and determining a target loss value based on the joint loss value.
Optionally, when the preset loss function includes the first function, the calculation of the target loss value is as shown in the following formula (1):
formula (1)
In formula (1), w is the weight associated with the distance d from the current pixel to the nearest skeleton point, with w = 1 - d / d_max, where d_max is the maximum of all such distances d. The weight-decaying portion is located at the edges of the coarse (large) pipes, while the weights at the small pipes are fully preserved. This way of calculating the loss value increases the attention paid to small pipelines during training by constraining the edge portions of large pipelines, thereby alleviating the imbalance in area ratio between large and small pipelines. Formula (1) further contains a hyperparameter that can help the training process focus more on the misclassified foreground portions; the remaining symbols denote the skeleton points corresponding to the prediction marks and to the real marks, respectively.
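One possible reading of the distance-based weight, sketched with SciPy's distance transform (restricting d_max to the foreground is an assumption of this sketch):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def skeleton_distance_weights(skeleton, foreground):
    """Per-pixel weights w = 1 - d / d_max, where d is the distance to the
    nearest annotated skeleton point and d_max the largest such distance
    inside the foreground. With this choice the weight decays towards the
    edges of thick pipes, while thin pipes keep weights close to 1."""
    skeleton = skeleton > 0
    foreground = foreground > 0
    d = distance_transform_edt(~skeleton)          # distance to the skeleton
    d_max = d[foreground].max() if foreground.any() else 1.0
    d_max = max(float(d_max), 1.0)                 # guard against division by 0
    return np.clip(1.0 - d / d_max, 0.0, 1.0)
```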
In one possible embodiment, in order to further enhance the segmentation effect at the small pipeline, the embodiment of the present application provides a regularization term based on shape constraint. Optionally, the preset loss function further includes a second function, and the second function includes a regular term based on shape constraint.
In step D2, calculating a target loss value between a predicted value and a true value output by the model based on a predicted mark and the true mark corresponding to a predicted point on a centerline of the target object in a predicted segmentation map output by the model by a preset loss function, including steps D23-D25:
step D23: for each first object in the current training image, sampling operation is carried out based on a predicted point corresponding to the predicted mark to obtain predicted sampling information, and sampling operation is carried out based on a real point corresponding to the real mark to obtain real sampling information; the first object is an object of which the pipe diameter in the target object is smaller than or equal to a preset pixel threshold value.
Optionally, the sampling operation includes: and sparse sampling is carried out on each point on the central line by taking a point on the central line of the first object as a center and taking half of the average pipe diameter of the first object as a sampling interval, so as to obtain other preset numerical sampling points related to the point, and sampling information is obtained by taking position information corresponding to the sampling points.
Wherein, as shown in fig. 8, for the skeleton points on each small-pipeline branch, 9 sparse samples (the number can be set as required) are taken centered on each skeleton point, with a sampling interval of ⌊r/2⌋, where ⌊·⌋ denotes rounding down and r is the average pipe diameter of the branch. On this basis, the diameter-based sampling points form a compact description of the local shape. When used for the calculation of the regularization term, the annotation and the prediction can both be sampled at each skeleton point, based on the position of the annotated skeleton point, resulting in sample sets {p_i^j} and {y_i^j}, where p_i^j is the j-th sample point at the i-th skeleton point in the prediction and y_i^j is the j-th sample point at the i-th skeleton point in the real label.
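A sketch of the sparse sampling around a single skeleton point; the 3×3 grid layout is an assumption, as the description above only fixes the number of points and the sampling interval:

```python
import numpy as np

def sample_around_skeleton_point(prob_map, center, mean_diameter):
    """Sample 9 sparse points around one skeleton point of a small-pipe
    branch, spaced by half the branch's average pipe diameter."""
    step = max(int(mean_diameter // 2), 1)       # interval = floor(r / 2)
    r, c = center
    h, w = prob_map.shape
    samples = []
    for dr in (-step, 0, step):
        for dc in (-step, 0, step):
            rr = int(np.clip(r + dr, 0, h - 1))
            cc = int(np.clip(c + dc, 0, w - 1))
            samples.append(prob_map[rr, cc])
    return np.asarray(samples)                   # shape (9,)
```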
Step D24: and calculating a cosine similarity value between a predicted value and a true value output by the model based on the predicted sampling information and the true sampling information through the second function.
Optionally, the similarity value corresponding to the second function is calculated as shown in the following formula (2):
formula (2)
Wherein N is the number of skeleton points belonging to small pipelines and cos(·, ·) denotes cosine similarity; the term may, for example, take the form L_shape = (1/N) * Σ_i (1 - cos(P_i, Y_i)), where P_i and Y_i are the vectors of sampled prediction values and label values at the i-th such skeleton point. This regularization term constrains the shape similarity between the prediction and the annotation near the small-pipeline skeleton points.
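A sketch of the shape-constraint term; since formula (2) is not reproduced here, the mean of one minus the cosine similarity is an assumption consistent with step D25:

```python
import numpy as np

def shape_regularizer(pred_samples, label_samples, eps=1e-8):
    """Shape-constraint term over small-pipe skeleton points.

    pred_samples, label_samples: arrays of shape (N, 9), the sampled
    prediction / label vectors at each of the N small-pipe skeleton points.
    Returns the mean of (1 - cosine similarity), so lower is better.
    """
    dot = (pred_samples * label_samples).sum(axis=1)
    norm = (np.linalg.norm(pred_samples, axis=1)
            * np.linalg.norm(label_samples, axis=1) + eps)
    return float(np.mean(1.0 - dot / norm))
```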
Step D25: and determining the sum of the joint loss value and the cosine similarity value as a target loss value.
Alternatively, the final loss function obtained by combining the first function and the second function may be expressed as shown in the following formula (3):
formula (3): L_total = L_joint + λ * L_shape
In formula (3), λ is a weight hyperparameter, L_joint is the joint loss value given by formula (1), and L_shape is the similarity value given by formula (2).
The acquiring training data in step D1 includes: dividing the acquired training data into a training set, a verification set and a test set; and training results corresponding to the verification set are used for determining the super parameters in the model.
Wherein the training data comprises a data set corresponding to at least one of a fundus vessel segmentation task, a coronary artery segmentation task and a three-dimensional tubular structure segmentation task.
The network parameters of the model may be adjusted by a stochastic gradient descent method, with the iteration rounds set for the initial stage trained at a set first learning rate and the iteration rounds set for the later stage trained at a set second learning rate; the first learning rate is greater than the second learning rate.
In the fundus blood vessel segmentation task, the PRIME dataset (containing 15 cases of data, consisting of wide-angle fundus images and corresponding pixel-level annotations, with a resolution of 4000×4000) is used as a test set. The 80 images of the private dataset BUVS (image resolution 3072×3900) are divided into a training set of 50 cases, a verification set of 10 cases and a test set of 20 cases. The training set is used for training the model, and the hyperparameters, including those in the loss function, are determined according to the results on the verification set. The input of the network is the green channel of the wide-angle fundus picture. The network is trained for a total of 80 rounds; in each round, 40 slices are sampled, and the manner of sampling the slices may be performed with reference to the related art and is not limited herein. The optimization may adopt stochastic gradient descent with an initial learning rate of 0.05, decreased to 0.01 and 0.001 at the 40th and 70th rounds, respectively. In the inference stage, a sliding-window mode can be adopted, with a sliding distance of 512.
In the coronary artery segmentation task, a dataset (acquired from multiple patients using an imaging system, image size 512×512, pixel spacing 0.3×0.3) is divided into a training set of 143 cases, a verification set of 12 cases and a test set of 36 cases. The input size is the full image size of 512×512. The network is trained for 200 rounds; the optimization may adopt stochastic gradient descent with an initial learning rate of 0.05, decreased to 0.01 and 0.001 at the 150th and 190th rounds, respectively.
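A sketch of the optimization setup described above; PyTorch, the momentum value and the placeholder network are assumptions of this sketch:

```python
import torch

# `model` stands in for the segmentation network described in this application.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)   # placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

def learning_rate(epoch, schedule=((0, 0.05), (40, 0.01), (70, 0.001))):
    """Piecewise-constant schedule: 0.05, then 0.01 from round 40 and 0.001
    from round 70 (fundus task; the coronary task uses rounds 150 and 190
    over 200 rounds in total)."""
    lr = schedule[0][1]
    for start, value in schedule:
        if epoch >= start:
            lr = value
    return lr

for epoch in range(80):
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... sample 40 slices per round and run forward / backward / step here ...
```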
In the three-dimensional tubular-structure segmentation task, three-dimensional sample images, such as CT images of cardiac blood vessels and airways, can be used as training data for the model. In the three-dimensional image segmentation task, the segmentation processing may also be performed along the depth dimension.
In the present embodiment, one difficulty in considering the segmentation of tubular structures is the accurate extraction of small ducts. As can be seen from the fundus vessel tree of the fundus photograph shown in fig. 3 and the coronary artery in the X-ray angiogram shown in fig. 4, the trunk portion of the vessel tree is thicker, but as the vessel tree diverges, the tube diameter becomes smaller and smaller, and the narrowest branch visible in the figure is only 1 to 2 pixels wide. Segmentation of thicker vessels is less difficult, but segmentation of small vessels such as end vessels is a very challenging task. Taking blood vessel segmentation as an example, on one hand, the difference between a small blood vessel and a peripheral background in an image is not obvious enough, and a model needs special design to accurately capture the related characteristics of a small pipeline; on the other hand, the area ratio of small blood vessels in the whole blood vessel tree is very low, and the model does not actively learn the characteristics related to the small blood vessels if special attention is not paid to the loss function. In order to improve the segmentation accuracy of the small pipeline, the embodiment of the application designs a special network structure frame (as shown in fig. 5) and improves the loss function. When designing the network structure, a priori knowledge of the tubular structure is utilized, i.e. the local tubular structure in the image has stronger consistency along the bone line direction. To this end, within each local slice, 5 tokens are extracted based on the skeletal line of the large and small pipeline, each pixel point interacts with these 5 tokens with a self-attention mechanism, and the own features are corrected in this way. The corrected feature has a higher differentiation at the edge portion of the tube. Furthermore, an additional regularization term for the small pipe is added to the loss function. In the embodiment of the application, sampling is performed by taking each bone point of the small pipeline as a center and taking the pipe diameter of each branch as a distance, and the collection of sampling points forms a description of the local shape. By comparing the predictions and labels of the collection of sample points in a group manner, the mode can reflect the difference between model prediction and real labels in shape, and the supervised network learns more accurate shape characteristics.
To better illustrate the effects achieved by the examples of the present application, the following description is made in conjunction with tables 1 and 2.
Table 1 Comparison of the accuracy (%) of the present application with prior-art schemes on the BUVS and PRIME datasets
Table 1 gives the quantitative results of the scheme proposed in the present application on BUVS and PRIME. The evaluation indexes selected are the Dice coefficient, sensitivity, accuracy, and detected length ratio. The detected length ratio is calculated as the number of detected skeleton points divided by the total number of skeleton points. In addition, comparisons are made on all blood vessels and on the small-vessel portions, respectively. To evaluate the segmentation near the small vessels, the terminal region is first extracted. Specifically, a region R1 is obtained by performing a dilation operation on the vascular tree with a kernel of size 5. Each foreground point is then assigned to the nearest skeleton point according to the distance between coordinates, and the pixels assigned to terminal-branch skeleton points constitute a region R2. The intersection of regions R1 and R2 is the region near the terminal vessels, and the four metrics are recomputed in this region as an evaluation of the terminal-portion segmentation effect.
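The detected length ratio and the terminal-region extraction may be sketched as follows, assuming SciPy helpers and using the Euclidean distance transform to assign pixels to their nearest skeleton point:

```python
import numpy as np
from scipy.ndimage import binary_dilation, distance_transform_edt

def detected_length_ratio(pred, skeleton):
    """Detected skeleton points divided by the total number of skeleton points."""
    skel = skeleton > 0
    detected = np.logical_and(skel, pred > 0).sum()
    return detected / max(int(skel.sum()), 1)

def terminal_vessel_region(vessel_tree, skeleton, terminal_skeleton, size=5):
    """Region near the terminal vessels: R1 (vessel tree dilated with a
    size-5 kernel) intersected with R2 (pixels whose nearest skeleton point
    lies on a terminal branch)."""
    r1 = binary_dilation(vessel_tree > 0, structure=np.ones((size, size), bool))
    # Assign every pixel to its nearest skeleton point via the EDT indices.
    _, indices = distance_transform_edt(~(skeleton > 0), return_indices=True)
    r2 = terminal_skeleton[indices[0], indices[1]] > 0
    return np.logical_and(r1, r2)
```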
In the semantic segmentation method provided by the related art, the foreground/background labels of the blood vessels are converted into five classes: coarse vessel, small vessel, coarse-vessel edge, small-vessel edge and background. In order to fairly compare the segmentation capability of different methods on small pipelines, the overall segmentation performance is controlled to a comparable level and the weight of the small-vessel class is increased when reproducing the method; in this way 72.3% of the small vessels in the BUVS dataset can be detected, and 79.0% in the PRIME dataset.
In the self-attention method provided by the related art, self-attention feature correction among pixels is performed within a local area. In order to compare only the influence of the network structure, the loss function proposed in the present application is applied when implementing that method, so that, compared with the semantic segmentation method, its indexes on small vessels are clearly improved. In the results of the method proposed in the present application, the segmentation of both the whole vessels and the small-vessel portions is improved. Compared with the self-attention method, the skeleton-line-based attention scheme raises the terminal-vessel DSC index on the BUVS dataset from 74.2% to 75.1%, while on the PRIME dataset the overall DSC index rises from 77.2% to 77.8%. Compared with the semantic segmentation results, the improvement in terminal-vessel DSC on the BUVS and PRIME datasets exceeds 3 percentage points thanks to the improvements in both network structure and loss function.
Table 2 Comparison of the accuracy (%) of the present application with prior-art schemes on coronary artery segmentation
Table 2 gives the quantitative results of the scheme of the present application on coronary artery segmentation. Compared with the semantic segmentation method, the accuracy is improved from 79.9% to 81.3%, and the detected length ratio is improved from 82.0% to 85.5%, i.e. more terminal vessels are extracted with higher accuracy. Compared with the self-attention method, the same loss function is used and the relative improvement is therefore smaller, but the detected length ratio is still improved by 1.6%.
Fig. 11 shows the segmentation results of the method proposed in the embodiment of the present application on two tasks, and the schematic diagram is sequentially ordered into an original image, a true marked image, and a predicted marked image. In the coronary artery segmentation (schematic of the first line) and in the fundus vessel segmentation task (schematic of the second line), the method proposed in the present application can detect some small vessels that are not labeled (compare the schematic of the real markers with the schematic of the predicted markers).
It should be noted that, in the alternative embodiment of the present application, the related data (such as the data related to the first image, the second image, the segmentation result graph, etc.) needs to be licensed or agreed upon by the user when the above embodiment of the present application is applied to a specific product or technology, and the collection, use and processing of the related data needs to comply with the relevant laws and regulations and standards of the relevant country and region. That is, in the embodiments of the present application, if data related to the subject is involved, the data needs to be obtained through the subject authorization consent, and in compliance with relevant laws and regulations and standards of the country and region.
An embodiment of the present application provides an image segmentation apparatus, as shown in fig. 12, the image segmentation apparatus 100 may include: a first segmentation module 101, a feature fusion module 102, and a second segmentation module 103.
The first segmentation module 101 is configured to perform segmentation processing on a target object through a first module in a pre-trained image segmentation model respectively for a first image and a second image obtained based on downsampling of the first image, so as to obtain a corresponding first feature map and a corresponding second feature map; the target object is a tubular structure object included in the first image; the feature fusion module 102 is configured to fuse the first feature map and the second feature map to obtain a fused feature map; the fusion feature map is used for representing the central line of the tubular structure of the target object; and a second segmentation module 103, configured to perform, for the first image and the fused feature map, segmentation processing on the target object by using a second module based on an attention mechanism in the image segmentation model, so as to obtain a segmentation result map corresponding to the target object.
In a possible embodiment, the first module includes a first convolution unit and a second convolution unit; the first convolution unit comprises a first encoder and a first decoder which are distributed by adopting a U-shaped network structure; the output end of each convolution layer in the first decoder is connected with the input end of the second convolution unit;
The first image and the second image obtained based on the downsampling of the first image are respectively subjected to segmentation processing on a target object through a first module in a pre-trained image segmentation model to obtain a corresponding first feature map and a corresponding second feature map, and the method comprises the following steps of respectively executing the following operations on the first image and the second image:
performing, by the first encoder, a feature extraction operation with respect to the target object with respect to an input image;
performing, by the first decoder, a feature fusion operation with respect to the target object with respect to an output of the first encoder;
and carrying out convolution operation on the target object according to the output of each convolution layer in the first decoder through the second convolution unit to obtain a feature map output by the first module.
In a possible embodiment, the fusion feature module 102 is further specifically configured to, when performing fusion of the first feature map and the second feature map to obtain a fusion feature map:
upsampling the second feature map so that the upsampled second feature map size is consistent with the first feature map;
performing addition operation on the first feature map and the up-sampled second feature map to obtain an overall feature map;
Subtracting the first characteristic diagram and the up-sampled second characteristic diagram to obtain a detail characteristic diagram;
and based on the integral feature map and the detail feature map, extracting the center line of the target object to obtain a fusion feature map in a mask form.
In a possible embodiment, when performing the adding operation for the first feature map and the up-sampled second feature map, the fusion feature module 102 is further specifically configured to:
performing connected component analysis on the up-sampled second feature map to remove false positive areas in the second feature map, so as to obtain an analyzed second feature map;
extracting a central line of the target object in the graph aiming at the analyzed second characteristic graph, and performing expansion operation on the central line to obtain a processed second characteristic graph;
and adding the first characteristic diagram and the processed second characteristic diagram to obtain an overall characteristic diagram.
In a possible embodiment, the target object includes a first object having a pipe diameter less than or equal to a preset pixel threshold value and a second object having a pipe diameter greater than the preset pixel threshold value.
The fusion feature module 102 is further specifically configured to, when performing an extraction operation for a center line of the target object based on the overall feature map and the detail feature map to obtain a fusion feature map in a mask form:
extracting the center line of the target object in the integral feature map to obtain first center line information;
multiplying the first central line information with the detail characteristic diagram to obtain central line information of the first object;
subtracting the first central line information from the central line information of the first object to obtain central line information of the second object;
performing expansion operation on the center line indicated by the center line information of the first object, and obtaining first rest information of the first object except the center line information based on the center line information of the first object after expansion, the overall feature map and the center line information of the first object before expansion;
obtaining second rest information of the second object except the central line information based on the central line information of the expanded first object, the central line information of the second object and the integral feature map;
Determining background information in the first image based on a difference value between a preset numerical value and the integral feature map;
and obtaining a fusion characteristic diagram based on the background information, the central line information of the first object, the first rest information, the central line information of the second object and the second rest information.
In a possible embodiment, the second module includes a third convolution unit and a fourth convolution unit; the third convolution unit comprises a second encoder and a second decoder which are distributed by adopting a U-shaped network structure; the second encoder includes at least two feature extraction subunits comprising at least one block composed of at least one attention layer and a pooling layer connected to the attention layer, and at least one block composed of at least one convolution layer and a pooling layer connected to the convolution layer, connected in sequence; the second decoder comprises at least two feature fusion subunits, wherein the feature fusion subunits comprise at least one block consisting of at least one convolution layer and an up-sampling layer connected with the convolution layer and at least one block consisting of at least one attention layer and an up-sampling layer connected with the attention layer which are connected in sequence; each convolution layer and each attention layer in the second decoder are connected to the fourth convolution unit;
The second segmentation module 103 is further specifically configured to, when configured to perform segmentation processing on the target object by using the second module based on an attention mechanism in the image segmentation model with respect to the first image and the fused feature map to obtain a segmentation result map corresponding to the target object:
performing, by the second encoder, a feature extraction operation with respect to the target object for the first image based on an attention mechanism in combination with the fused feature map;
performing, by the second decoder, a feature fusion operation with respect to the target object for an output of the second encoder based on an attention mechanism in combination with the fusion feature map;
and carrying out convolution operation on the target object according to the output of each convolution layer and each attention layer in the second decoder through the fourth convolution unit to obtain a segmentation result diagram corresponding to the target object.
In a possible embodiment, the attention layer comprises a first branch for performing attention computation based on the input feature map and a second branch for extracting tokens based on the fused feature map; the first branch comprises a first convolution network, an attention network and a second convolution network which are connected in sequence; the output end of the first convolution network is connected with the output end of the second convolution network in a jumping manner; the second branch is connected with the attention network;
The second segmentation module 103 is further specifically configured to, when configured to perform attention computation on the input feature map and the fused feature map through the attention layer:
performing convolution operation on the input feature map through the first convolution network to obtain a feature map to be processed;
calculating an inquiry value for each sub-graph output by the second branch on the basis of the feature graph to be processed through the attention network, calculating a key and a value for each token in each sub-graph output by the second branch, and performing attention calculation on the basis of the inquiry value, the key and the value to obtain an attention feature graph;
and performing convolution operation on the attention feature map through the second convolution network to obtain the output of the second convolution network, and adding the output to the output of the first convolution network to obtain the feature map of the current attention layer output.
In a possible embodiment, the second segmentation module 103 is further specifically configured to, when executing the extraction of the token based on the fused feature map by the second branch:
dividing the fusion feature map to obtain at least two subgraphs;
performing aggregation operation on mask information included in each sub-graph to obtain tokens corresponding to all information;
The mask information comprises background information, central line information of a first object, first rest information, central line information of a second object and second rest information; the first object comprises an object with the pipe diameter smaller than or equal to a preset pixel threshold value in the target object; the second object comprises an object with the pipe diameter larger than the preset pixel threshold value in the target object; the first rest information comprises information except centerline information in the first object; the second remaining information includes information of the second object other than centerline information.
In a possible embodiment, the apparatus 100 further includes a training module for performing the following operations training to train the resulting image segmentation model:
acquiring training data, wherein the training data comprises a plurality of training images, and each training image is provided with a real mark corresponding to a real point on the central line of the target object;
performing iterative training on the image segmentation model based on the training data until a preset stopping condition is met, so as to obtain a trained image segmentation model;
in each iteration training, a target loss value between a predicted value and a true value output by a model is calculated based on a predicted mark and the true mark of a predicted point corresponding to a central line of the target object in a predicted segmentation graph output by the model through a preset loss function, and network parameters of the model are adjusted based on the target loss value.
In a possible embodiment, the preset loss function includes a first function that uses a joint loss function; the training module is used for executing the prediction mark and the real mark corresponding to the prediction point on the central line of the target object in the prediction segmentation graph output by the model through the preset loss function, and is specifically used for calculating the target loss value between the prediction value and the real value output by the model when the training module is used for executing the target loss value between the prediction value and the real value output by the model:
for each real point in the current training image, determining a weight value of the distance from the current pixel to the real point;
and calculating a joint loss value between the predicted value and the true value output by the model based on the weight value, the predicted mark and the true mark through the first function, and determining a target loss value based on the joint loss value.
In a possible embodiment, the preset loss function further includes a second function, where the second function includes a regularization term based on a shape constraint; the training module is used for executing the prediction mark and the real mark corresponding to the prediction point on the central line of the target object in the prediction segmentation graph output by the model through the preset loss function, and is specifically used for calculating the target loss value between the prediction value and the real value output by the model when the training module is used for executing the target loss value between the prediction value and the real value output by the model:
For each first object in the current training image, sampling operation is carried out based on a predicted point corresponding to the predicted mark to obtain predicted sampling information, and sampling operation is carried out based on a real point corresponding to the real mark to obtain real sampling information; the first object is an object of which the pipe diameter in the target object is smaller than or equal to a preset pixel threshold value;
calculating a cosine similarity value between a predicted value and a true value output by a model based on the predicted sampling information and the true sampling information through the second function;
and determining the sum of the joint loss value and the cosine similarity value as a target loss value.
In a possible embodiment, the sampling operation includes: and sparse sampling is carried out on each point on the central line by taking a point on the central line of the first object as a center and taking half of the average pipe diameter of the first object as a sampling interval, so as to obtain other preset numerical sampling points related to the point, and sampling information is obtained by taking position information corresponding to the sampling points.
In a possible embodiment, the training module is further specifically configured to divide the acquired training data into a training set, a verification set and a test set when used for executing the acquisition of the training data; and training results corresponding to the verification set are used for determining the super parameters in the model.
Wherein the training data comprises a data set corresponding to at least one of a fundus vessel segmentation task, a coronary artery segmentation task and a three-dimensional tubular structure segmentation task.
The network parameters of the model may be adjusted by a stochastic gradient descent method, with the iteration rounds set for the initial stage trained at a set first learning rate and the iteration rounds set for the later stage trained at a set second learning rate; the first learning rate is greater than the second learning rate.
The apparatus of the embodiments of the present application may perform the method provided by the embodiments of the present application, and implementation principles of the method are similar, and actions performed by each module in the apparatus of each embodiment of the present application correspond to steps in the method of each embodiment of the present application, and detailed functional descriptions of each module of the apparatus may be referred to in the corresponding method shown in the foregoing, which is not repeated herein.
The modules referred to in the embodiments described in the present application may be implemented by software. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first segmentation module may also be described as "a module for performing, for the first image and the second image obtained by downsampling the first image, segmentation processing on the target object through the first module in the pre-trained image segmentation model to obtain the corresponding first feature map and second feature map", as "a first module", and so on.
An embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement steps of an image segmentation method, and compared with the related art, the steps of the method may be implemented: when a first image needing to be subjected to image segmentation is obtained, a second image can be obtained by carrying out downsampling operation based on the first image, and segmentation processing is respectively carried out on target objects in the first image and the second image through a first module in a pre-trained image segmentation model to obtain a first feature image corresponding to the first image and a second feature image corresponding to the second image; the target object is a tubular structure object included in the image; then, the first feature map and the second feature map can be fused, and a fused feature map for representing the central line of the tubular structure of the target object is obtained; on the basis, aiming at the first image and the fusion feature map, the segmentation processing on the target object can be carried out through a second module based on an attention mechanism in the image segmentation model, so as to obtain a segmentation result map corresponding to the target object. According to the implementation of the method, the device and the system, the segmentation processing of the target object can be carried out on input images of different scales to obtain the feature images corresponding to the target object respectively, the obtained feature images are fused to obtain the central line representing the tubular structure of the target object, on the basis, the segmentation processing of the target object is carried out on the first image by combining the fused feature images, the segmentation precision of the tubular structure can be effectively improved, the image segmentation is carried out by combining the extracted central line, the segmentation capability of the target object with smaller pipe diameter can be effectively improved, and therefore the network performance on the segmentation task of the tubular structure is improved.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 13, the electronic device 4000 shown in fig. 13 includes: a processor 4001 and a memory 4003. Wherein the processor 4001 is coupled to the memory 4003, such as via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004, the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The processor 4001 may be a CPU (Central Processing Unit ), general purpose processor, DSP (Digital Signal Processor, data signal processor), ASIC (Application Specific Integrated Circuit ), FPGA (Field Programmable Gate Array, field programmable gate array) or other programmable logic device, transistor logic device, hardware components, or any combination thereof. Which may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, e.g., comprising one or more microprocessor combinations, a combination of a DSP and a microprocessor, etc.
Bus 4002 may include a path to transfer information between the aforementioned components. Bus 4002 may be a PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus or an EISA (Extended Industry Standard Architecture ) bus, or the like. The bus 4002 can be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in fig. 13, but not only one bus or one type of bus.
Memory 4003 may be, but is not limited to, ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, RAM (Random Access Memory ) or other type of dynamic storage device that can store information and instructions, EEPROM (Electrically Erasable Programmable Read Only Memory ), CD-ROM (Compact Disc Read Only Memory, compact disc Read Only Memory) or other optical disk storage, optical disk storage (including compact discs, laser discs, optical discs, digital versatile discs, blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be Read by a computer.
The memory 4003 is used for storing a computer program that executes an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Among them, electronic devices include, but are not limited to: terminal and server.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, may implement the steps and corresponding content of the foregoing method embodiments.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program can implement the steps of the foregoing method embodiments and corresponding content when executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the respective operation steps by arrows, the order of implementation of these steps is not limited to the order indicated by the arrows. In some implementations of embodiments of the present application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages may be flexibly configured according to the requirement, which is not limited in the embodiment of the present application.
The foregoing is merely an optional implementation manner of the implementation scenario of the application, and it should be noted that, for those skilled in the art, other similar implementation manners based on the technical ideas of the application are adopted without departing from the technical ideas of the application, and also belong to the protection scope of the embodiments of the application.

Claims (14)

1. An image segmentation method, comprising:
for a first image and a second image obtained by downsampling the first image, respectively performing segmentation processing with respect to a target object through a first module in a pre-trained image segmentation model to obtain a corresponding first feature map and a corresponding second feature map, wherein the target object is a tubular structure object included in the first image;
fusing the first feature map and the second feature map to obtain a fused feature map, comprising: upsampling the second feature map so that the size of the upsampled second feature map is consistent with that of the first feature map; performing connected component analysis on the upsampled second feature map to remove false positive regions therein, to obtain an analyzed second feature map; extracting a centerline of the target object from the analyzed second feature map, and performing a dilation operation on the centerline to obtain a processed second feature map; adding the first feature map and the processed second feature map to obtain an overall feature map; subtracting the upsampled second feature map from the first feature map to obtain a detail feature map; and performing an extraction operation on the centerline of the target object based on the overall feature map and the detail feature map to obtain a fused feature map in mask form, the fused feature map being used for representing the centerline of the tubular structure of the target object;
and for the first image and the fused feature map, performing segmentation processing with respect to the target object through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object.
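As a minimal sketch of the fusion step above, assuming both feature maps have already been thresholded to binary masks, the following Python illustration uses SciPy and scikit-image; the function name fuse_feature_maps, the component-size threshold and the dilation radius are assumptions rather than values taken from the claim.

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize

def fuse_feature_maps(first_map, second_map, min_component_size=50, dilate_iters=2):
    """Hypothetical fusion of a full-resolution and a downsampled binary map."""
    # 1) Upsample the coarse (second) map to the size of the first map.
    zoom = [t / s for t, s in zip(first_map.shape, second_map.shape)]
    second_up = ndimage.zoom(second_map.astype(np.uint8), zoom, order=0)

    # 2) Connected-component analysis to suppress small false-positive regions.
    labels, num = ndimage.label(second_up)
    sizes = ndimage.sum(second_up, labels, index=range(1, num + 1))
    keep_labels = [i + 1 for i, s in enumerate(sizes) if s >= min_component_size]
    analyzed = np.isin(labels, keep_labels)

    # 3) Centerline (skeleton) of the analyzed map, then a dilation operation.
    centerline = skeletonize(analyzed)
    processed = ndimage.binary_dilation(centerline, iterations=dilate_iters)

    # 4) Overall map = first + processed map; detail map = first - upsampled map.
    overall = np.clip(first_map.astype(np.int16) + processed, 0, 1).astype(np.uint8)
    detail = np.clip(first_map.astype(np.int16) - second_up, 0, 1).astype(np.uint8)

    # 5) A mask-form fused map can then be built from centerlines extracted from
    #    the overall and detail maps (see the multi-class split of claim 3).
    fused_centerline = skeletonize(overall.astype(bool)).astype(np.uint8)
    return overall, detail, fused_centerline
```

In this reading, the overall map recovers structure visible at either scale while the detail map isolates fine structure present only at full resolution.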
2. The method of claim 1, wherein the first module comprises a first convolution unit and a second convolution unit; the first convolution unit comprises a first encoder and a first decoder arranged in a U-shaped network structure; and an output of each convolution layer in the first decoder is connected to an input of the second convolution unit;
wherein the performing segmentation processing with respect to the target object through the first module in the pre-trained image segmentation model for the first image and the second image obtained by downsampling the first image, to obtain the corresponding first feature map and the corresponding second feature map, comprises performing the following operations on each of the first image and the second image, respectively:
performing, by the first encoder, a feature extraction operation with respect to the target object on the input image;
performing, by the first decoder, a feature fusion operation with respect to the target object on an output of the first encoder; and
performing, by the second convolution unit, a convolution operation with respect to the target object on the output of each convolution layer in the first decoder to obtain the feature map output by the first module.
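A small PyTorch sketch of such a first module is given below, under the assumption of a three-level U-shaped encoder-decoder; the class name FirstModule, the channel widths and the use of bilinear upsampling are illustrative choices rather than the claimed architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class FirstModule(nn.Module):
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # first encoder (U-shaped, 3 levels)
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.enc3 = conv_block(base * 2, base * 4)
        # first decoder (2 stages, with skip connections)
        self.dec2 = conv_block(base * 4 + base * 2, base * 2)
        self.dec1 = conv_block(base * 2 + base, base)
        # second convolution unit: takes the output of every decoder layer
        self.second_conv = nn.Conv2d(base * 2 + base, 1, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                                    # feature extraction
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))  # feature fusion
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        # every decoder output is resized to full resolution and convolved jointly
        d2_up = F.interpolate(d2, size=d1.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.second_conv(torch.cat([d2_up, d1], dim=1)))
```

The same module is applied to the first image and to its downsampled counterpart to produce the first and second feature maps, respectively.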
3. The method of claim 1, wherein the target object comprises a first object whose pipe diameter is less than or equal to a preset pixel threshold and a second object whose pipe diameter is greater than the preset pixel threshold;
wherein the performing the extraction operation on the centerline of the target object based on the overall feature map and the detail feature map to obtain the fused feature map in mask form comprises:
extracting the centerline of the target object from the overall feature map to obtain first centerline information;
multiplying the first centerline information by the detail feature map to obtain centerline information of the first object;
subtracting the centerline information of the first object from the first centerline information to obtain centerline information of the second object;
performing a dilation operation on the centerline indicated by the centerline information of the first object, and obtaining first remaining information of the first object other than its centerline information based on the dilated centerline information of the first object, the overall feature map and the centerline information of the first object before dilation;
obtaining second remaining information of the second object other than its centerline information based on the dilated centerline information of the first object, the centerline information of the second object and the overall feature map;
determining background information in the first image based on a difference between a preset value and the overall feature map; and
obtaining the fused feature map based on the background information, the centerline information of the first object, the first remaining information, the centerline information of the second object and the second remaining information.
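The mask decomposition of claim 3 could be pictured with the following Python sketch, assuming binary overall and detail maps as inputs; the class indices 0-4 and the helper name build_fused_mask are assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation
from skimage.morphology import skeletonize

def build_fused_mask(overall, detail, dilate_iters=2):
    """Hypothetical multi-class mask: 0=background, 1=thin centerline,
    2=thin remainder, 3=thick centerline, 4=thick remainder."""
    overall = overall.astype(bool)
    detail = detail.astype(bool)

    centerline_all = skeletonize(overall)                  # first centerline information
    thin_center = centerline_all & detail                  # centerline of the first (thin) object
    thick_center = centerline_all & ~thin_center           # centerline of the second (thick) object

    thin_dilated = binary_dilation(thin_center, iterations=dilate_iters)
    thin_rest = thin_dilated & overall & ~thin_center      # first remaining information
    thick_rest = overall & ~thin_dilated & ~thick_center   # second remaining information

    fused = np.zeros(overall.shape, dtype=np.uint8)        # label 0: background (preset value 1 - overall)
    fused[thin_center] = 1
    fused[thin_rest] = 2
    fused[thick_center] = 3
    fused[thick_rest] = 4
    return fused
```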
4. The method of claim 1, wherein the second module comprises a third convolution unit and a fourth convolution unit; the third convolution unit comprises a second encoder and a second decoder arranged in a U-shaped network structure; the second encoder comprises at least two feature extraction subunits, the feature extraction subunits comprising, connected in sequence, at least one block composed of at least one attention layer and a pooling layer connected to the attention layer, and at least one block composed of at least one convolution layer and a pooling layer connected to the convolution layer; the second decoder comprises at least two feature fusion subunits, the feature fusion subunits comprising, connected in sequence, at least one block composed of at least one convolution layer and an upsampling layer connected to the convolution layer, and at least one block composed of at least one attention layer and an upsampling layer connected to the attention layer; and each convolution layer and each attention layer in the second decoder is connected to the fourth convolution unit;
wherein the performing segmentation processing with respect to the target object through the attention-based second module in the image segmentation model for the first image and the fused feature map to obtain the segmentation result map corresponding to the target object comprises:
performing, by the second encoder, a feature extraction operation with respect to the target object on the first image based on an attention mechanism in combination with the fused feature map;
performing, by the second decoder, a feature fusion operation with respect to the target object on an output of the second encoder based on an attention mechanism in combination with the fused feature map; and
performing, by the fourth convolution unit, a convolution operation with respect to the target object on the output of each convolution layer and each attention layer in the second decoder to obtain the segmentation result map corresponding to the target object.
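A structural PyTorch sketch of such a second module is shown below. The attention layer is replaced here by a plain self-attention stand-in (the claimed layer is the mask-guided one of claim 5), and the stem convolution, stage depths and channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAttnLayer(nn.Module):
    """Stand-in attention layer operating on flattened spatial tokens."""
    def __init__(self, ch, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(ch)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, HW, C)
        t = self.norm(t + self.attn(t, t, t, need_weights=False)[0])
        return t.transpose(1, 2).reshape(b, c, h, w)

def conv_block(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class SecondModule(nn.Module):
    def __init__(self, in_ch=1, base=16, n_classes=1):
        super().__init__()
        self.stem = conv_block(in_ch, base)
        # second encoder: an attention block + pooling, then a convolution block + pooling
        self.enc = nn.ModuleList([
            nn.Sequential(SimpleAttnLayer(base), nn.MaxPool2d(2)),
            nn.Sequential(conv_block(base, base * 2), nn.MaxPool2d(2)),
        ])
        # second decoder: a convolution block + upsampling, then an attention block + upsampling
        self.dec = nn.ModuleList([
            nn.Sequential(conv_block(base * 2, base),
                          nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)),
            nn.Sequential(SimpleAttnLayer(base),
                          nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)),
        ])
        # fourth convolution unit over the concatenated decoder outputs
        self.fourth_conv = nn.Conv2d(base * 2, n_classes, kernel_size=1)

    def forward(self, x):
        x = self.stem(x)
        for stage in self.enc:                  # feature extraction
            x = stage(x)
        outs = []
        for stage in self.dec:                  # feature fusion
            x = stage(x)
            outs.append(x)
        size = outs[-1].shape[-2:]
        outs = [F.interpolate(o, size=size, mode="bilinear", align_corners=False) for o in outs]
        return self.fourth_conv(torch.cat(outs, dim=1))   # segmentation result map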
5. The method of claim 4, wherein the attention layer comprises a first branch for performing attention calculation based on an input feature map and a second branch for performing token extraction based on the fused feature map; the first branch comprises a first convolution network, an attention network and a second convolution network connected in sequence; an output of the first convolution network is skip-connected to an output of the second convolution network; and the second branch is connected to the attention network;
wherein performing attention calculation on the input feature map and the fused feature map through the attention layer comprises:
performing a convolution operation on the input feature map through the first convolution network to obtain a feature map to be processed;
calculating, through the attention network, a query for each sub-graph output by the second branch based on the feature map to be processed, calculating a key and a value for each token in each sub-graph output by the second branch, and performing attention calculation based on the query, the key and the value to obtain an attention feature map; and
performing a convolution operation on the attention feature map through the second convolution network to obtain an output of the second convolution network, and adding this output to the output of the first convolution network to obtain the feature map output by the current attention layer.
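One possible reading of this attention layer is sketched below in PyTorch: queries come from windowed patches ("sub-graphs") of the convolved feature map, while keys and values come from per-window mask tokens supplied by the second branch (see the token-extraction sketch after claim 6). The window size, single-head formulation and projection layout are assumptions, and the spatial size is assumed to be divisible by the window size.

```python
import torch
import torch.nn as nn

class MaskGuidedAttentionLayer(nn.Module):
    def __init__(self, ch, window=8):
        super().__init__()
        self.window = window
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)     # first convolution network
        self.q_proj = nn.Linear(ch, ch)
        self.kv_proj = nn.Linear(ch, 2 * ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)     # second convolution network

    def forward(self, x, tokens):
        # x:      (B, C, H, W) input feature map (first branch)
        # tokens: (B, n_windows, n_tokens, C) mask tokens from the second branch
        b, c, h, w = x.shape
        f = self.conv1(x)                                 # feature map to be processed
        win = self.window
        # partition into non-overlapping windows ("sub-graphs")
        fw = f.reshape(b, c, h // win, win, w // win, win)
        fw = fw.permute(0, 2, 4, 3, 5, 1).reshape(b, -1, win * win, c)   # (B, nW, win*win, C)

        q = self.q_proj(fw)                               # queries from the feature sub-graphs
        k, v = self.kv_proj(tokens).chunk(2, dim=-1)      # keys/values from the mask tokens
        attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
        out = attn @ v                                    # (B, nW, win*win, C) attention feature map

        out = out.reshape(b, h // win, w // win, win, win, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return self.conv2(out) + f                        # skip from conv1 output to conv2 output
```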
6. The method of claim 5, wherein performing token extraction based on the fused feature map through the second branch comprises:
dividing the fused feature map to obtain at least two sub-graphs; and
performing an aggregation operation on each type of mask information included in each sub-graph to obtain tokens corresponding to the respective types of information;
wherein the mask information comprises background information, centerline information of a first object, first remaining information, centerline information of a second object and second remaining information; the first object comprises an object in the target object whose pipe diameter is less than or equal to a preset pixel threshold; the second object comprises an object in the target object whose pipe diameter is greater than the preset pixel threshold; the first remaining information comprises information of the first object other than its centerline information; and the second remaining information comprises information of the second object other than its centerline information.
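The second-branch token extraction could, for instance, be sketched as below: the fused mask is partitioned into the same windows as the attention layer, and each of the five mask classes is aggregated into one token per window. The occupancy-average aggregation and the linear embedding are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskTokenExtractor(nn.Module):
    def __init__(self, ch, n_classes=5, window=8):
        super().__init__()
        self.window = window
        self.n_classes = n_classes
        self.embed = nn.Linear(1, ch)   # lift per-class occupancy to C channels

    def forward(self, fused_mask):
        # fused_mask: (B, H, W) with integer labels
        # {0: background, 1: thin centerline, 2: thin remainder, 3: thick centerline, 4: thick remainder}
        b, h, w = fused_mask.shape
        win = self.window
        onehot = F.one_hot(fused_mask.long(), self.n_classes).float()    # (B, H, W, K)
        oh = onehot.reshape(b, h // win, win, w // win, win, self.n_classes)
        occ = oh.mean(dim=(2, 4))                                        # per-window class occupancy
        occ = occ.reshape(b, -1, self.n_classes, 1)                      # (B, nW, K, 1)
        return self.embed(occ)                                           # (B, nW, K, C) tokens
```

The (B, n_windows, n_tokens, C) output matches the tokens argument of the attention-layer sketch after claim 5.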
7. The method of claim 1, wherein the image segmentation model is trained by:
acquiring training data, wherein the training data comprises a plurality of training images, and each training image carries a real mark corresponding to a real point on the centerline of the target object; and
performing iterative training on the image segmentation model based on the training data until a preset stopping condition is met, to obtain the trained image segmentation model;
wherein in each training iteration, a target loss value between a predicted value output by the model and a true value is calculated through a preset loss function based on the real mark and a predicted mark of a predicted point corresponding to the centerline of the target object in a predicted segmentation map output by the model, and network parameters of the model are adjusted based on the target loss value.
8. The method of claim 7, wherein the preset loss function comprises a first function employing a joint loss function;
wherein the calculating, through the preset loss function, the target loss value between the predicted value output by the model and the true value based on the predicted mark of the predicted point corresponding to the centerline of the target object in the predicted segmentation map output by the model and the real mark comprises:
for each real point in the current training image, determining a weight value based on the distance from the current pixel to the real point; and
calculating, through the first function, a joint loss value between the predicted value output by the model and the true value based on the weight value, the predicted mark and the real mark, and determining the target loss value based on the joint loss value.
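A minimal sketch of such a distance-weighted joint loss is given below, assuming a binary centerline target; treating the joint loss as weighted binary cross-entropy plus soft Dice, and the exponential decay of the weight with distance to the nearest real point, are assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def centerline_weight(real_centerline, sigma=5.0):
    # distance of every pixel to the closest real (ground-truth) centerline point
    dist = distance_transform_edt(1 - real_centerline.astype(np.uint8))
    return 1.0 + np.exp(-dist / sigma)       # pixels near the centerline weigh more

def joint_loss(pred_logits, real_mark, weight, eps=1e-6):
    # weight can be built as: torch.from_numpy(centerline_weight(real_centerline_np)).float()
    bce = F.binary_cross_entropy_with_logits(pred_logits, real_mark, weight=weight)
    prob = torch.sigmoid(pred_logits)
    inter = (prob * real_mark).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + real_mark.sum() + eps)
    return bce + dice                        # joint loss value used as the target loss
```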
9. The method of claim 8, wherein the preset loss function further comprises a second function comprising a regularization term based on a shape constraint;
wherein the calculating, through the preset loss function, the target loss value between the predicted value output by the model and the true value based on the predicted mark of the predicted point corresponding to the centerline of the target object in the predicted segmentation map output by the model and the real mark comprises:
for each first object in the current training image, performing a sampling operation based on the predicted point corresponding to the predicted mark to obtain predicted sampling information, and performing a sampling operation based on the real point corresponding to the real mark to obtain real sampling information, the first object being an object in the target object whose pipe diameter is less than or equal to a preset pixel threshold;
calculating, through the second function, a cosine similarity value between the predicted value output by the model and the true value based on the predicted sampling information and the real sampling information; and
determining the sum of the joint loss value and the cosine similarity value as the target loss value.
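The shape-constraint term could be illustrated as follows, comparing local direction vectors derived from the predicted and real sample positions; the direction-vector formulation and the use of (1 - cosine similarity) so that better agreement lowers the loss are assumptions about the claim's convention.

```python
import torch
import torch.nn.functional as F

def shape_regularizer(pred_samples, real_samples):
    # pred_samples, real_samples: (N, K, 2) sampled (y, x) positions for N thin
    # vessels, K samples each (see the sampling operation of claim 10).
    pred_dir = pred_samples[:, 1:] - pred_samples[:, :-1]   # local direction vectors
    real_dir = real_samples[:, 1:] - real_samples[:, :-1]
    cos = F.cosine_similarity(pred_dir, real_dir, dim=-1)   # (N, K-1)
    return (1 - cos).mean()

def target_loss(joint_loss_value, pred_samples, real_samples):
    # sum of the joint loss and the cosine-based shape term
    return joint_loss_value + shape_regularizer(pred_samples, real_samples)
```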
10. The method of claim 9, wherein the sampling operation comprises:
performing sparse sampling on each point on the centerline of the first object, by taking a point on the centerline as a center and half of the average pipe diameter of the first object as a sampling interval, to obtain a preset number of other sampling points related to that point, and obtaining the sampling information from the position information corresponding to the sampling points.
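A NumPy sketch of this sampling operation, assuming the centerline points of a thin vessel are already ordered along the vessel, might look like the following; the sample count and the clipping at the vessel ends are assumptions.

```python
import numpy as np

def sparse_sample(centerline_points, avg_diameter, center_idx, num_samples=4):
    # centerline_points: (M, 2) ordered (y, x) points along one centerline
    step = max(int(round(avg_diameter / 2.0)), 1)   # sampling interval = half the average pipe diameter
    idx = center_idx + step * (np.arange(num_samples + 1) - num_samples // 2)
    idx = np.clip(idx, 0, len(centerline_points) - 1)
    return centerline_points[idx]                   # positions form the sampling information
```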
11. The method of claim 7, wherein the acquiring training data comprises: dividing the acquired training data into a training set, a validation set and a test set, wherein the training results corresponding to the validation set are used for determining hyperparameters of the model;
wherein the training data comprises a data set corresponding to at least one of a fundus vessel segmentation task, a coronary artery segmentation task and a three-dimensional tubular structure segmentation task; and
wherein the network parameters of the model are adjusted by stochastic gradient descent, training is performed for a set number of iteration rounds in an initial stage based on a set first learning rate and for a set number of iteration rounds in a later stage based on a set second learning rate, and the first learning rate is greater than the second learning rate.
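The two-stage schedule of claim 11 corresponds, for example, to the following PyTorch sketch; the concrete learning rates, momentum and epoch counts are assumptions, and the training-loop body is elided.

```python
import torch

def make_optimizer(model, lr_first=1e-2, lr_second=1e-3, initial_epochs=50, total_epochs=200):
    # stochastic gradient descent with the larger first learning rate
    opt = torch.optim.SGD(model.parameters(), lr=lr_first, momentum=0.9)
    # drop to the smaller second learning rate after the initial stage
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[initial_epochs], gamma=lr_second / lr_first)
    return opt, sched, total_epochs

# usage (train_one_epoch is a placeholder for the elided loop body):
# opt, sched, total_epochs = make_optimizer(model)
# for epoch in range(total_epochs):
#     train_one_epoch(model, opt)
#     sched.step()
```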
12. An image segmentation apparatus, comprising:
a first segmentation module, configured to, for a first image and a second image obtained by downsampling the first image, respectively perform segmentation processing with respect to a target object through a first module in a pre-trained image segmentation model to obtain a corresponding first feature map and a corresponding second feature map, wherein the target object is a tubular structure object included in the first image;
a feature fusion module, configured to fuse the first feature map and the second feature map to obtain a fused feature map, including: upsampling the second feature map so that the size of the upsampled second feature map is consistent with that of the first feature map; performing connected component analysis on the upsampled second feature map to remove false positive regions therein, to obtain an analyzed second feature map; extracting a centerline of the target object from the analyzed second feature map, and performing a dilation operation on the centerline to obtain a processed second feature map; adding the first feature map and the processed second feature map to obtain an overall feature map; subtracting the upsampled second feature map from the first feature map to obtain a detail feature map; and performing an extraction operation on the centerline of the target object based on the overall feature map and the detail feature map to obtain a fused feature map in mask form, the fused feature map being used for representing the centerline of the tubular structure of the target object; and
a second segmentation module, configured to, for the first image and the fused feature map, perform segmentation processing with respect to the target object through an attention-based second module in the image segmentation model to obtain a segmentation result map corresponding to the target object.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-11.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-11.
CN202311249539.6A 2023-09-26 2023-09-26 Image segmentation method, device, electronic equipment and storage medium Active CN116993762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311249539.6A CN116993762B (en) 2023-09-26 2023-09-26 Image segmentation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116993762A CN116993762A (en) 2023-11-03
CN116993762B true CN116993762B (en) 2024-01-19

Family

ID=88521784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311249539.6A Active CN116993762B (en) 2023-09-26 2023-09-26 Image segmentation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116993762B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7250166B2 (en) * 2020-07-30 2023-03-31 インファービジョン メディカル テクノロジー カンパニー リミテッド Image segmentation method and device, image segmentation model training method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443813A (en) * 2019-07-29 2019-11-12 腾讯医疗健康(深圳)有限公司 Blood vessel, the dividing method of eye fundus image, device, equipment and readable storage medium storing program for executing
WO2021031066A1 (en) * 2019-08-19 2021-02-25 中国科学院深圳先进技术研究院 Cartilage image segmentation method and apparatus, readable storage medium, and terminal device
WO2021179205A1 (en) * 2020-03-11 2021-09-16 深圳先进技术研究院 Medical image segmentation method, medical image segmentation apparatus and terminal device
CN114511702A (en) * 2022-01-13 2022-05-17 武汉光庭信息技术股份有限公司 Remote sensing image segmentation method and system based on multi-scale weighted attention
CN114418987A (en) * 2022-01-17 2022-04-29 北京工业大学 Retinal vessel segmentation method and system based on multi-stage feature fusion
CN116128895A (en) * 2023-01-18 2023-05-16 上海介航机器人有限公司 Medical image segmentation method, apparatus and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhe Xu et al., "Denoising for Relaxing: Unsupervised Domain Adaptive Fundus Image Segmentation Without Source Data," Medical Image Computing and Computer Assisted Intervention - MICCAI 2022, pp. 214-224 *

Also Published As

Publication number Publication date
CN116993762A (en) 2023-11-03

Similar Documents

Publication Publication Date Title
WO2020215984A1 (en) Medical image detection method based on deep learning, and related device
CN111709409B (en) Face living body detection method, device, equipment and medium
Habib et al. Ensemble of CheXNet and VGG-19 feature extractor with random forest classifier for pediatric pneumonia detection
CN111476292A (en) Small sample element learning training method for medical image classification processing artificial intelligence
CN110838125B (en) Target detection method, device, equipment and storage medium for medical image
Srivastav et al. Human pose estimation on privacy-preserving low-resolution depth images
CN107169974A (en) It is a kind of based on the image partition method for supervising full convolutional neural networks more
CN110298844B (en) X-ray radiography image blood vessel segmentation and identification method and device
CN113034495B (en) Spine image segmentation method, medium and electronic device
CN110110808B (en) Method and device for performing target labeling on image and computer recording medium
CN116309648A (en) Medical image segmentation model construction method based on multi-attention fusion
CN111860528A (en) Image segmentation model based on improved U-Net network and training method
Shu et al. LVC-Net: Medical image segmentation with noisy label based on local visual cues
CN113435236A (en) Home old man posture detection method, system, storage medium, equipment and application
CN112037212A (en) Pulmonary tuberculosis DR image identification method based on deep learning
CN116452618A (en) Three-input spine CT image segmentation method
CN112149689A (en) Unsupervised domain adaptation method and system based on target domain self-supervised learning
CN113221731B (en) Multi-scale remote sensing image target detection method and system
Ghali et al. Vision transformers for lung segmentation on CXR images
CN117237351B (en) Ultrasonic image analysis method and related device
CN110009641A (en) Lens segmentation method, device and storage medium
CN116894844B (en) Hip joint image segmentation and key point linkage identification method and device
Cao et al. X-ray classification of tuberculosis based on convolutional networks
CN113822323A (en) Brain scanning image identification processing method, device, equipment and storage medium
CN116993762B (en) Image segmentation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant