CN115471499A - Image target detection and segmentation method, system, storage medium and electronic equipment - Google Patents


Info

Publication number
CN115471499A
CN115471499A (application CN202211281775.1A)
Authority
CN
China
Prior art keywords
image
original
training image
training
target
Prior art date
Legal status
Pending
Application number
CN202211281775.1A
Other languages
Chinese (zh)
Inventor
袁铭康
李叶
许乐乐
徐金中
郭丽丽
马忠松
金山
Current Assignee
Technology and Engineering Center for Space Utilization of CAS
Original Assignee
Technology and Engineering Center for Space Utilization of CAS
Priority date
Filing date
Publication date
Application filed by Technology and Engineering Center for Space Utilization of CAS filed Critical Technology and Engineering Center for Space Utilization of CAS
Priority to CN202211281775.1A priority Critical patent/CN115471499A/en
Publication of CN115471499A publication Critical patent/CN115471499A/en

Classifications

    • G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
    • G06T 7/12 — Segmentation; edge-based segmentation
    • G06N 3/08 — Neural networks; learning methods
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06T 2207/20021 — Dividing image into blocks, subimages or windows
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]
    • G06V 2201/07 — Target detection


Abstract

The invention relates to an image target detection and segmentation method, system, storage medium and electronic device, comprising: training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model; and inputting an image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation for the image to be detected. By training the improved deep learning model, the invention enhances the model's ability to represent image targets and improves the precision of target detection and segmentation of objects in an image.

Description

Image target detection and segmentation method, system, storage medium and electronic equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image target detection and segmentation method, system, storage medium, and electronic device.
Background
With the development and maturation of computer vision technology, it has been widely applied in various fields. Target detection and segmentation is one of the important problems in computer vision research and an important basis for understanding the high-level semantic features of an image; its task is to return the rectangular bounding-box coordinates or regions of one or more specific objects in a given image. Existing image target detection and segmentation algorithms can generally be classified into two categories. One is the two-stage model, such as Faster R-CNN, which extracts candidate regions in a separate stage: it first screens out candidate regions in which objects may exist from the input image, determines whether a target exists in each candidate region, and then outputs the target class and position features or segmentation region. The other is the one-stage model, such as YOLO, which does not extract candidate regions separately and directly takes the image as input to obtain the object classes and corresponding position features or segmentation regions present in the image. However, both kinds of algorithms suffer from low precision in target detection and segmentation.
Therefore, it is desirable to provide a technical solution to solve the above technical problems.
Disclosure of Invention
In order to solve the technical problem, the invention provides an image target detection and segmentation method, an image target detection and segmentation system, a storage medium and electronic equipment.
The technical scheme of the image target detection and segmentation method is as follows:
S1, training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model;
S2, inputting the image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation of the image to be detected.
The image target detection and segmentation method has the following beneficial effects:
By training the improved deep learning model, the method of the invention enhances the model's ability to represent image targets and improves the precision of target detection and segmentation of objects in an image.
On the basis of the scheme, the image target detection and segmentation method can be further improved as follows.
Further, the preset deep learning model includes: an original backbone network, an original neck network, and a plurality of original header networks; before S1, further comprising:
s01, labeling each original training image by adopting at least one labeling mode to obtain at least one labeled training image corresponding to each original training image;
the S1 comprises:
s11, inputting any original training image into the original backbone network for multi-scale image feature extraction to obtain a first training image corresponding to any original training image;
s12, inputting a first training image corresponding to any original training image into the original neck network for image feature extraction to obtain a second training image corresponding to any original training image;
s13, respectively inputting a second training image corresponding to any original training image into each original head network for prediction to obtain a training prediction result of any original training image in each original head network until a training prediction result of each original training image in each original head network is obtained;
And S14, performing loss calculation on all training prediction results corresponding to each original training image against its at least one labeled training image, optimizing the preset deep learning model according to the loss calculation result, taking the optimized model as the preset deep learning model, and returning to step S11 for iterative training; when the preset deep learning model converges, the optimized model obtained at convergence is determined as the target deep learning model.
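The S11–S14 procedure amounts to a standard iterative optimization: forward pass, loss against the labeled training images, parameter update, and a convergence check. A minimal sketch under stated assumptions, using a toy one-parameter "model" with squared-error loss in place of the patent's networks and losses (all names here are illustrative, not from the patent):

```python
def train(w, images, labels, lr=0.1, tol=1e-6, max_iter=1000):
    """Iterate S11-S14: forward pass, loss calculation, optimization,
    and return to the forward pass until the loss change falls below tol."""
    prev_loss = float("inf")
    for _ in range(max_iter):
        preds = [w * x for x in images]  # toy stand-in for backbone/neck/heads
        loss = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(images)
        if abs(prev_loss - loss) < tol:  # convergence check (S14)
            break
        grad = sum(2 * (w * x - y) * x
                   for x, y in zip(images, labels)) / len(images)
        w -= lr * grad                   # optimize, then return to S11
        prev_loss = loss
    return w
```

For example, `train(0.0, [1.0, 2.0], [2.0, 4.0])` converges to a weight near 2.0, the value minimizing the toy loss.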
Further, the original backbone network comprises: the output end of each first convolution layer is correspondingly connected with the input end of one first down-sampling layer; the S11 comprises:
inputting any original training image to a first convolution layer of the original backbone network, sequentially passing through all the first convolution layers and all the first down-sampling layers, and performing multi-scale image feature extraction on any original training image to obtain a first training image corresponding to any original training image.
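The backbone's alternation of convolution and down-sampling layers can be illustrated with a toy sketch; `conv2d` and `downsample` below are minimal stand-ins (valid convolution and stride-2 subsampling), not the patent's actual layers:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution -- stand-in for one 'first convolution layer'."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def downsample(img, factor=2):
    """Stride-2 subsampling -- stand-in for one 'first down-sampling layer'."""
    return img[::factor, ::factor]

# Alternating conv -> downsample stages yield features at multiple scales.
img = np.arange(64, dtype=float).reshape(8, 8)
k = np.ones((3, 3)) / 9.0
f1 = downsample(conv2d(img, k))                                # scale-1 features
f2 = downsample(conv2d(f1, k)) if min(f1.shape) >= 3 else f1   # scale-2 features
```

Each stage halves the spatial resolution, which is what makes the extracted features multi-scale.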
Further, the original neck network comprises: the output end of each second convolution layer is correspondingly connected with the input end of one first up-sampling layer; the S12 includes:
Inputting the first training image corresponding to any original training image to the first of the second convolution layers of the original neck network, passing it sequentially through all the second convolution layers and all the first up-sampling layers, and performing image feature extraction on it to obtain a second training image corresponding to that original training image.
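The neck's convolution-then-up-sampling stage can be sketched as follows; `conv1x1` and `upsample` are hypothetical minimal stand-ins (point-wise scaling and nearest-neighbour up-sampling), not the patent's layers:

```python
import numpy as np

def conv1x1(feat, weight):
    """Point-wise (1x1) convolution -- stand-in for one 'second convolution layer'."""
    return weight * feat

def upsample(feat, factor=2):
    """Nearest-neighbour up-sampling -- stand-in for one 'first up-sampling layer'."""
    return feat.repeat(factor, axis=0).repeat(factor, axis=1)

# A convolution stage followed by an up-sampling stage restores spatial resolution.
feat = np.array([[1.0, 2.0], [3.0, 4.0]])
second = upsample(conv1x1(feat, 0.5))
```

This mirrors FPN-style necks, where low-resolution backbone features are transformed and up-sampled back toward the input resolution.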
Further, the plurality of original header networks comprises: presetting a target classification head network, a target detection head network, an image segmentation head network and a central skeleton head network; the training prediction result of any original training image comprises: a first training prediction result, a second training prediction result, a third training prediction result, and a fourth training prediction result; the step of inputting the second training image corresponding to any original training image into each original head network respectively for prediction to obtain a training prediction result of any original training image in each original head network includes:
inputting a second training image corresponding to any original training image into the preset target classification head network for prediction to obtain a first training prediction result obtained by performing target classification on any original training image;
inputting a second training image corresponding to any original training image into the preset target detection head network for prediction to obtain a second training prediction result obtained by performing target detection on any original training image;
inputting a second training image corresponding to any original training image into the preset image segmentation head network for prediction to obtain a third training prediction result obtained by performing image segmentation on any original training image;
inputting a second training image corresponding to any original training image into the preset central skeleton head network for prediction, and obtaining a fourth training prediction result obtained by performing central skeleton extraction on any original training image.
The beneficial effect of adopting the further technical scheme is that: by adding the central skeleton network into the head network of the deep learning model, the target detection head network and the image segmentation head network can be helped to acquire more characteristics of object forms, so that the accuracy of object detection and image segmentation is improved.
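The four parallel heads all consume the same second training image (the shared neck features). A toy sketch, with random linear maps standing in for the real head sub-networks (all names, shapes, and the seed are illustrative assumptions):

```python
import numpy as np

HEAD_NAMES = ("classification", "detection", "segmentation", "central_skeleton")

def make_heads(feat_dim=8, out_dim=4, seed=0):
    """Four parallel head networks over the shared neck output; random
    linear maps stand in for the real head sub-networks."""
    rng = np.random.default_rng(seed)
    return {name: rng.standard_normal((out_dim, feat_dim)) for name in HEAD_NAMES}

def predict(heads, features):
    """Each head produces its own training prediction from the same features."""
    return {name: w @ features for name, w in heads.items()}

preds = predict(make_heads(), np.ones(8))  # one prediction per head
```

Because every head reads the same features, gradients from the central skeleton head can shape the shared representation used by the detection and segmentation heads.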
Further, the step of labeling any original training image to obtain at least one labeled training image corresponding to any original training image includes:
labeling each object in any original training image based on the class of the object to obtain a first labeled training image containing labeled class information of each object;
labeling each object in any original training image based on the position of the object to obtain a second labeling training image containing the position information of each object;
masking each object in any original training image to obtain a third labeling training image containing mask information of each object;
and acquiring the central skeleton of each object in the third annotation training image corresponding to any original training image, and arranging all the central skeletons according to a preset arrangement sequence to obtain a fourth annotation training image.
The beneficial effect of adopting the further technical scheme is that: by extracting the central skeleton of each object instance in the training image and expressing it as a point array arranged in a set order, the object retains features related to its shape; this enhances the model's ability to represent image targets, makes it easier for the model to learn the feature relationships of the object, and improves the precision of object detection and segmentation.
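The four annotations produced for one object can be gathered into a single record; the field names below are illustrative, not the patent's:

```python
import numpy as np

def annotate(obj_class, bbox, mask, skeleton_points):
    """Collect the four annotations of one object: labeled class,
    position (bounding box), mask, and ordered central-skeleton points."""
    return {
        "class": obj_class,                       # first labeled training image
        "bbox": tuple(bbox),                      # second: position information
        "mask": np.asarray(mask, dtype=bool),     # third: mask information
        "skeleton": np.asarray(skeleton_points),  # fourth: head-to-tail point array
    }

ann = annotate("worm", (0, 0, 4, 2), [[1, 1], [1, 1]], [(0, 1), (2, 1), (4, 1)])
```

One such record per object supplies the supervision targets for the four head networks respectively.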
Further, the target prediction result comprises: the target detection result of the image to be detected and the image segmentation result of the image to be detected.
The technical scheme of the image target detection and segmentation system is as follows:
the method comprises the following steps: a processing module and an operation module;
the processing module is used for: training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model;
the operation module is used for: and inputting the image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation of the image to be detected.
The image target detection and segmentation system has the following beneficial effects:
By training the improved deep learning model, the system of the invention enhances the model's ability to represent image targets and improves the precision of target detection and segmentation of objects in an image.
On the basis of the scheme, the image target detection and segmentation system can be further improved as follows.
Further, the preset deep learning model includes: an original backbone network, an original neck network, and a plurality of original header networks; before the processing module, the method further comprises the following steps: a labeling module;
the labeling module is used for: labeling each original training image by adopting at least one labeling mode to obtain at least one labeled training image corresponding to each original training image;
The processing module comprises: a first processing module, a second processing module, a third processing module, and a fourth processing module.
The first processing module is configured to: inputting any original training image into the original backbone network for multi-scale image feature extraction to obtain a first training image corresponding to any original training image;
the second processing module is configured to: inputting a first training image corresponding to any original training image into the original neck network for image feature extraction to obtain a second training image corresponding to any original training image;
the third processing module is configured to: inputting a second training image corresponding to any original training image into each original head network respectively for prediction to obtain a training prediction result of any original training image in each original head network until obtaining a training prediction result of each original training image in each original head network respectively;
The fourth processing module is configured to: perform loss calculation based on all training prediction results corresponding to each original training image and its at least one labeled training image, optimize the preset deep learning model according to the loss calculation result, take the optimized model as the preset deep learning model, and return to invoke the first processing module for iterative training until the preset deep learning model converges; the optimized model obtained at convergence is determined as the target deep learning model.
Further, the original backbone network comprises: the output end of each first convolution layer is correspondingly connected with the input end of one first down-sampling layer; the first processing module is specifically configured to:
inputting any original training image to a first convolution layer of the original backbone network, sequentially passing through all the first convolution layers and all the first down-sampling layers, and performing multi-scale image feature extraction on any original training image to obtain a first training image corresponding to any original training image.
Further, the original neck network comprises: the output end of each second convolution layer is correspondingly connected with the input end of one first up-sampling layer; the second processing module is specifically configured to:
inputting the first training image corresponding to any original training image to a first second convolution layer of the original neck network, sequentially passing through all second convolution layers and all first up-sampling layers, and performing image feature extraction on the first training image corresponding to any original training image to obtain a second training image corresponding to any original training image.
Further, the plurality of original header networks comprises: presetting a target classification head network, a target detection head network, an image segmentation head network and a central skeleton head network; the training prediction result of any original training image comprises: a first training prediction result, a second training prediction result, a third training prediction result, and a fourth training prediction result;
the third processing module is specifically configured to:
inputting a second training image corresponding to any original training image into the preset target classification head network for prediction to obtain a first training prediction result obtained by performing target classification on any original training image;
inputting a second training image corresponding to any original training image into the preset target detection head network for prediction to obtain a second training prediction result obtained by performing target detection on any original training image;
inputting a second training image corresponding to any original training image into the preset image segmentation head network for prediction to obtain a third training prediction result obtained by performing image segmentation on any original training image;
inputting a second training image corresponding to any original training image into the preset central skeleton head network for prediction, and obtaining a fourth training prediction result obtained by performing central skeleton extraction on any original training image.
The beneficial effect of adopting the further technical scheme is that: by adding the central skeleton network into the head network of the deep learning model, the target detection head network and the image segmentation head network can be helped to acquire more characteristics of object forms, so that the accuracy of object detection and image segmentation is improved.
Further, the labeling module is specifically configured to:
labeling each object in any original training image based on the class of the object to obtain a first labeled training image containing labeled class information of each object;
labeling each object in any original training image based on the position of the object to obtain a second labeled training image containing the position information of each object;
masking each object in any original training image to obtain a third labeling training image containing mask information of each object;
and acquiring the central skeleton of each object in the third labeling training image corresponding to any original training image, and arranging all the central skeletons according to a preset arrangement sequence to obtain a fourth labeling training image.
The beneficial effect of adopting the further technical scheme is that: by extracting the central skeleton of each object instance in the training image and expressing it as a point array arranged in a set order, the object retains features related to its shape; this enhances the model's ability to represent image targets, makes it easier for the model to learn the feature relationships of the object, and improves the precision of object detection and segmentation.
Further, the target prediction result comprises: the target detection result of the image to be detected and the image segmentation result of the image to be detected.
The technical scheme of the storage medium of the invention is as follows:
the storage medium has stored therein instructions which, when read by a computer, cause the computer to perform the steps of the image object detection and segmentation method according to the invention.
The technical scheme of the electronic equipment is as follows:
The electronic equipment comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform the steps of the image target detection and segmentation method according to the invention.
Drawings
FIG. 1 is a schematic flow chart of a method for detecting and segmenting an image target according to an embodiment of the present invention;
FIG. 2 is a structural diagram of a preset deep learning model in the image target detection and segmentation method according to the embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image target detection and segmentation system according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1, the image target detecting and segmenting method according to the embodiment of the present invention includes the following steps:
s1, training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model.
The original training images are images containing strip-shaped (elongated) objects, for example worms, vines, bridges, and trees.
The preset deep learning model comprises three parts: a backbone network, a neck network, and head networks. It is an untrained deep learning model that can be used to perform target detection and segmentation on objects (especially strip-shaped objects) in an image. The target deep learning model is the trained deep learning model.
Specifically, a plurality of original training images each containing at least one strip-shaped object are obtained, and each original training image is input into the preset deep learning model for iterative training, continuously improving the model's ability to represent image targets, until the preset deep learning model converges and the target deep learning model for image target detection and segmentation is obtained.
S2, inputting the image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation of the image to be detected.
The image to be detected is an arbitrarily selected image. The target prediction result comprises: the target detection result of the image to be detected and the image segmentation result of the image to be detected.
It should be noted that the target detection result is the category and position of each object in the image to be detected, obtained by performing target detection on the image. The image segmentation result is the image region corresponding to each object, obtained by performing image segmentation on the image to be detected.
Preferably, the preset deep learning model comprises: an original backbone network, an original neck network, and a plurality of original header networks.
In the present embodiment, the model structure of the preset deep learning model is as shown in fig. 2.
The original backbone network is an untrained backbone network that can be used to extract multi-scale features of an image; it may adopt a mature backbone network such as a ResNet50 or ResNet101 network. In this embodiment, the original backbone network includes: a plurality of first convolution layers and a plurality of first down-sampling layers.
The original neck network is an untrained neck network that can be used to extract image features; it may adopt, for example, an FPN network. In this embodiment, the original neck network comprises: at least one second convolution layer and at least one first up-sampling layer.
Wherein the original head network is an untrained head network that can be used to predict the image. The plurality of original header networks includes: the method comprises the steps of presetting a target classification head network, presetting a target detection head network, presetting an image segmentation head network and presetting a center skeleton head network.
Before S1, further comprising:
and S01, labeling each original training image by adopting at least one labeling mode to obtain at least one labeled training image corresponding to each original training image.
The at least one labeling mode comprises: labeling the category of an object in the image, labeling the position of the object in the image, labeling a mask of the object in the image, and extracting the central skeleton of the labeled object mask in the image.
Specifically, the step of labeling any original training image to obtain at least one labeled training image corresponding to any original training image includes:
and labeling each object in any original training image based on the class of the object to obtain a first labeled training image containing labeled class information of each object in any original training image.
The annotation category information includes, but is not limited to, rectangular frames enclosing the objects, where the rectangular frames corresponding to each category are displayed in different colors. The first labeled training image is an image containing the rectangular frame corresponding to each object.
And labeling each object in any original training image based on the position of the object to obtain a second labeled training image containing the position information of each object.
The position information includes, but is not limited to, the position of each object in the image. The second labeled training image may be the same image as, or a different image from, the one carrying the annotation category information; this is not limited here.
And masking each object in any original training image to obtain a third labeling training image containing mask information of each object.
The process of masking an image is prior art and is not described here in detail.
And acquiring the central skeleton of each object in the third labeling training image corresponding to any original training image, and arranging all the central skeletons according to a preset arrangement sequence to obtain a fourth labeling training image.
Specifically, the mask-processed third labeling training image is used to extract the central skeleton of each object in it, and the extracted central skeletons of all objects are represented as point arrays arranged in a preset order. A point array may consist of equidistant or non-equidistant points on the centre line; for a strip-shaped object, the head and tail of the object have an order. The preset arrangement order may be from head to tail or from tail to head, according to the actual situation.
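Resampling a centre line into an ordered point array can be sketched as follows; `equidistant_points` is a hypothetical helper illustrating equidistant sampling along a head-to-tail polyline, not the patent's algorithm:

```python
import numpy as np

def equidistant_points(polyline, n):
    """Resample a head-to-tail centre line into n equidistant points,
    preserving the head-to-tail order of the strip-shaped object."""
    pts = np.asarray(polyline, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n)               # equidistant arc positions
    out = np.empty((n, pts.shape[1]))
    for k, t in enumerate(targets):
        i = min(np.searchsorted(cum, t, side="right") - 1, len(seg) - 1)
        r = (t - cum[i]) / seg[i] if seg[i] > 0 else 0.0  # fraction within segment
        out[k] = pts[i] + r * (pts[i + 1] - pts[i])
    return out

pts = equidistant_points([(0, 0), (2, 0), (4, 0)], 5)    # head-to-tail order kept
```

Because the points are emitted in arc-length order, the resulting array encodes the object's head-to-tail direction as well as its shape.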
It should be noted that the above only describes the process of labeling one original training image; the remaining original training images may all be labeled by the same process, which is not repeated here.
The S1 comprises:
s11, inputting any original training image into the original backbone network for multi-scale image feature extraction, and obtaining a first training image corresponding to any original training image.
The first training image is the image obtained by performing multi-scale image feature extraction on the original training image.
In this embodiment, the original backbone network takes two first convolution layers and two first down-sampling layers as an example. The output end of each first convolution layer is correspondingly connected with the input end of one first down-sampling layer.
Specifically, any original training image is input to the first of the first convolution layers of the original backbone network and passes sequentially through the first convolution layer, the first down-sampling layer, the second convolution layer, and the second down-sampling layer, so that multi-scale image features of the original training image are extracted and the first training image corresponding to that original training image is obtained.
It should be noted that the process of extracting image features through convolutional and down-sampling layers is prior art and is not described in detail here.
And S12, inputting the first training image corresponding to any original training image into the original neck network for image feature extraction to obtain a second training image corresponding to any original training image.
The second training image is the image obtained by performing image feature extraction on the first training image through the original neck network.
In this embodiment, the original neck network is exemplified by one second convolutional layer and one first up-sampling layer, with the output end of the second convolutional layer connected to the input end of the first up-sampling layer.
Specifically, the first training image corresponding to any original training image is input to the second convolutional layer of the original neck network and then passes through the first up-sampling layer, so that image features are extracted and the second training image corresponding to that original training image is obtained.
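A minimal single-channel sketch of the neck's conv-then-upsample step, under the same simplifying assumptions as before (naive 3x3 convolution, nearest-neighbour upsampling, a pass-through kernel for the demo); this is illustrative only, not the patented network.

```python
import numpy as np

def conv3x3(x, kernel):
    """Naive 'same'-padded single-channel 3x3 convolution (second convolutional layer)."""
    padded = np.pad(x, 1)
    return np.array([[np.sum(padded[i:i + 3, j:j + 3] * kernel)
                      for j in range(x.shape[1])] for i in range(x.shape[0])])

def upsample2x(x):
    """First up-sampling layer: 2x nearest-neighbour upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def neck(feat, kernel):
    """Hypothetical neck: second convolutional layer feeding the up-sampling layer."""
    return upsample2x(conv3x3(feat, kernel))

identity = np.zeros((3, 3))
identity[1, 1] = 1.0                       # pass-through kernel, for the demo only
feat = np.arange(16.0).reshape(4, 4)       # stand-in for a first training image
second = neck(feat, identity)              # "second training image" feature map
print(second.shape)
```

The neck thus reverses the backbone's resolution reduction (4 → 8 here), restoring spatial detail before the feature map is handed to the head networks.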
It should be noted that the process of extracting image features through convolutional and up-sampling layers is prior art and is not described in detail here.
S13, inputting the second training image corresponding to any original training image into each original head network respectively for prediction, and obtaining a training prediction result of any original training image in each original head network until obtaining a training prediction result of each original training image in each original head network respectively.
Wherein the training prediction result of any original training image comprises: a first training prediction result, a second training prediction result, a third training prediction result, and a fourth training prediction result.
Specifically, a second training image corresponding to any original training image is input into the preset target classification head network for prediction, and a first training prediction result obtained by performing target classification on any original training image is obtained.
The preset target classification head network comprises at least one first full connection layer and is used for carrying out target classification on the image.
Inputting a second training image corresponding to any original training image into the preset target detection head network for prediction, and obtaining a second training prediction result obtained by performing target detection on any original training image.
The preset target detection head network comprises at least one second full connection layer and is used for carrying out target detection on the image.
Inputting a second training image corresponding to any original training image into the preset image segmentation head network for prediction, and obtaining a third training prediction result obtained by performing image segmentation on any original training image.
The preset image segmentation head network comprises at least one third convolution layer and is used for carrying out image segmentation on the image.
Inputting a second training image corresponding to any original training image into the preset central skeleton head network for prediction, and obtaining a fourth training prediction result obtained by performing central skeleton extraction on any original training image.
The preset central skeleton head network comprises at least one fourth convolution layer and at least one third full-connection layer and is used for extracting the central skeleton of the image.
It should be noted that both the preset center skeleton head network and the preset target detection head network predict points, and therefore the same network structure may be adopted.
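The four parallel heads above can be sketched as follows. The output shapes (3 classes, one box, 5 skeleton points) and the use of a threshold as a segmentation stand-in are illustrative assumptions; as noted, the detection and skeleton heads both regress points, so both are shown as fully connected layers over the shared feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    """Fully connected layer applied to the flattened feature map."""
    return x.ravel() @ w + b

feat = rng.standard_normal((8, 8))   # stand-in for a second training image
n = feat.size

# All head shapes below are illustrative assumptions, not the patent's dimensions.
w_cls, b_cls = rng.standard_normal((n, 3)), np.zeros(3)      # 3 object classes
w_det, b_det = rng.standard_normal((n, 4)), np.zeros(4)      # box (x, y, w, h)
w_skel, b_skel = rng.standard_normal((n, 10)), np.zeros(10)  # 5 ordered points

cls_scores = fc(feat, w_cls, b_cls)                # target classification head (FC layer)
box = fc(feat, w_det, b_det)                       # target detection head (FC layer)
seg_mask = (feat > 0).astype(float)                # segmentation head stand-in (conv omitted)
skeleton = fc(feat, w_skel, b_skel).reshape(5, 2)  # central skeleton head: ordered (y, x) points

print(cls_scores.shape, box.shape, seg_mask.shape, skeleton.shape)
```

The four outputs correspond one-to-one to the first through fourth training prediction results compared against the four labeled training images during loss calculation.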
And S14, performing loss calculation based on all training prediction results corresponding to each original training image and the at least one labeled training image, and optimizing the preset deep learning model according to the loss calculation result; taking the optimized preset deep learning model as the preset deep learning model and returning to step S11 for iterative training until the preset deep learning model converges; and determining the optimized preset deep learning model at convergence as the target deep learning model.
Convergence of the preset deep learning model means that the error between the predicted value and the true value obtained through the model is smaller than a preset threshold, which can be set according to user requirements.
The loss calculation process is prior art. For example, with cross-entropy loss, the true value and the model's predicted value are substituted into the loss function to compute the difference between them; the lower the loss value, the better the prediction effect of the deep learning model.
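For concreteness, a minimal cross-entropy calculation between a one-hot ground-truth label and two candidate predicted distributions; the class probabilities are made up for the demo.

```python
import numpy as np

def cross_entropy(pred_probs, true_onehot, eps=1e-12):
    """Cross-entropy between the labeled truth and the model's predicted distribution."""
    return float(-np.sum(true_onehot * np.log(pred_probs + eps)))

truth = np.array([0.0, 1.0, 0.0])   # ground truth: class 1 (one-hot)
good = np.array([0.1, 0.8, 0.1])    # confident, correct prediction
bad = np.array([0.6, 0.2, 0.2])     # incorrect prediction

# the better prediction yields the lower loss value
print(cross_entropy(good, truth), cross_entropy(bad, truth))
```

The same comparison is run for each head's output against the corresponding labeled training image, and the per-head losses drive the optimization step.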
The target deep learning model comprises: a target backbone network, a target neck network, and a plurality of target head networks; the plurality of target head networks includes a trained target classification head network, a trained target detection head network, a trained image segmentation head network, and a trained central skeleton head network.
According to the technical scheme, the improved deep learning model is trained, so that the expression capability of the model on the image target is enhanced, and the target detection and segmentation precision of the object in the image is improved.
As shown in fig. 3, the image target detecting and segmenting system 200 according to the embodiment of the present invention includes: a processing module 210 and an execution module 220;
the processing module 210 is configured to: training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model;
the operation module 220 is configured to: and inputting the image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation of the image to be detected.
Preferably, the preset deep learning model comprises: an original backbone network, an original neck network, and a plurality of original head networks; the system further comprises a labeling module that operates before the processing module 210;
the labeling module is used for: labeling each original training image by adopting at least one labeling mode to obtain at least one labeled training image corresponding to each original training image;
the processing module 210 includes: a first processing module 211, a second processing module 212, a third processing module 213, and a fourth processing module 214;
the first processing module 211 is configured to: inputting any original training image into the original backbone network to perform multi-scale image feature extraction to obtain a first training image corresponding to any original training image;
the second processing module 212 is configured to: inputting a first training image corresponding to any original training image into the original neck network for image feature extraction to obtain a second training image corresponding to any original training image;
the third processing module 213 is configured to: inputting a second training image corresponding to any original training image into each original head network respectively for prediction to obtain a training prediction result of any original training image in each original head network until a training prediction result of each original training image in each original head network is obtained;
the fourth processing module 214 is configured to: perform loss calculation based on all training prediction results corresponding to each original training image and the at least one labeled training image, optimize the preset deep learning model according to the loss calculation result, take the optimized preset deep learning model as the preset deep learning model and call the first processing module again for iterative training until the preset deep learning model converges, and determine the optimized preset deep learning model at convergence as the target deep learning model.
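The optimize-until-convergence loop of the fourth processing module can be sketched abstractly as below. The one-parameter toy "model", the gradient step, the learning rate, and the threshold are all assumptions for illustration; only the loop structure (predict, compute loss, stop when loss falls below a preset threshold, otherwise optimize and repeat) reflects the scheme described here.

```python
def train(model, images, labels, loss_fn, optimize, threshold=0.01, max_iters=100):
    """Hypothetical iterative training loop: optimize until loss drops below threshold."""
    for _ in range(max_iters):
        preds = [model(x) for x in images]
        loss = sum(loss_fn(p, y) for p, y in zip(preds, labels)) / len(images)
        if loss < threshold:   # convergence: error smaller than the preset threshold
            break
        model = optimize(model, loss)
    return model

def make_model(w):
    """Toy one-parameter 'model' predicting y = w * x (stand-in for the deep model)."""
    def predict(x):
        return w * x
    predict.w = w
    return predict

xs, ys = [1.0, 2.0], [2.0, 4.0]          # toy data whose true parameter is w = 2
loss_fn = lambda p, y: (p - y) ** 2      # squared error as the loss function

def optimize(m, loss):
    # crude gradient step for the toy squared-error model above
    grad = sum(2 * (m.w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return make_model(m.w - 0.1 * grad)

trained = train(make_model(0.0), xs, ys, loss_fn, optimize)
print(trained.w)
```

In the actual system the loop body is the first through third processing modules (backbone, neck, heads) and the loss aggregates all four heads' prediction results against the labeled training images.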
Preferably, the original backbone network comprises a plurality of first convolutional layers and a plurality of first down-sampling layers, the output end of each first convolutional layer being connected to the input end of one first down-sampling layer; the first processing module 211 is specifically configured to:
inputting any original training image to a first convolution layer of the original backbone network, sequentially passing through all the first convolution layers and all the first down-sampling layers, and performing multi-scale image feature extraction on any original training image to obtain a first training image corresponding to any original training image.
Preferably, the original neck network comprises a plurality of second convolutional layers and a plurality of first up-sampling layers, the output end of each second convolutional layer being connected to the input end of one first up-sampling layer; the second processing module 212 is specifically configured to:
inputting the first training image corresponding to any original training image to a first second convolution layer of the original neck network, sequentially passing through all second convolution layers and all first up-sampling layers, and performing image feature extraction on the first training image corresponding to any original training image to obtain a second training image corresponding to any original training image.
Preferably, the plurality of original header networks comprises: presetting a target classification head network, a target detection head network, an image segmentation head network and a central skeleton head network; the training prediction result of any original training image comprises: a first training prediction result, a second training prediction result, a third training prediction result, and a fourth training prediction result;
the third processing module 213 is specifically configured to:
inputting a second training image corresponding to any original training image into the preset target classification head network for prediction to obtain a first training prediction result obtained by performing target classification on any original training image;
inputting a second training image corresponding to any original training image into the preset target detection head network for prediction to obtain a second training prediction result obtained by performing target detection on any original training image;
inputting a second training image corresponding to any original training image into the preset image segmentation head network for prediction to obtain a third training prediction result obtained by performing image segmentation on any original training image;
inputting a second training image corresponding to any original training image into the preset central skeleton head network for prediction, and obtaining a fourth training prediction result obtained by performing central skeleton extraction on any original training image.
Preferably, the labeling module is specifically configured to:
labeling each object in any original training image based on the class of the object to obtain a first labeling training image containing labeling class information of each object;
labeling each object in any original training image based on the position of the object to obtain a second labeled training image containing the position information of each object;
masking each object in any original training image to obtain a third labeling training image containing mask information of each object;
and acquiring the central skeleton of each object in the third labeling training image corresponding to any original training image, and arranging all the central skeletons according to a preset arrangement sequence to obtain a fourth labeling training image.
Preferably, the target prediction result comprises: the target detection result of the image to be detected and the image segmentation result of the image to be detected.
According to the technical scheme, the improved deep learning model is trained, so that the expression capability of the model on the image target is enhanced, and the target detection and segmentation precision of the object in the image is improved.
For the steps by which each parameter and module in the image target detection and segmentation system 200 of this embodiment implements its corresponding function, refer to the parameters and steps in the above embodiments of the image target detection and segmentation method, which are not repeated here.
An embodiment of the present invention provides a storage medium storing instructions. When a computer reads the instructions, it executes the steps of the image target detection and segmentation method described above; for details, refer to the parameters and steps in the above method embodiments, which are not repeated here.
Computer storage media such as: flash disks, portable hard disks, and the like.
An electronic device provided in an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the electronic device performs the steps of the image target detection and segmentation method described above; for details, refer to the parameters and steps in the above method embodiments, which are not repeated here.
Those skilled in the art will appreciate that the present invention may be embodied as methods, systems, storage media and electronic devices.
Thus, the present invention may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software, which may be referred to herein generally as a "circuit," "module," or "system." Furthermore, in some embodiments, the invention may also be embodied in the form of a computer program product in one or more computer-readable media having computer-readable program code embodied therein. Any combination of one or more computer-readable media may be employed. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. An image target detection and segmentation method is characterized by comprising the following steps:
s1, training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model;
s2, inputting the image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation of the image to be detected.
2. The image target detection and segmentation method according to claim 1, wherein the predetermined deep learning model includes: an original backbone network, an original neck network, and a plurality of original header networks; before S1, further comprising:
s01, labeling each original training image by adopting at least one labeling mode to obtain at least one labeled training image corresponding to each original training image;
the S1 comprises:
s11, inputting any original training image into the original backbone network to perform multi-scale image feature extraction to obtain a first training image corresponding to any original training image;
s12, inputting a first training image corresponding to any original training image into the original neck network for image feature extraction to obtain a second training image corresponding to any original training image;
s13, respectively inputting a second training image corresponding to any original training image into each original head network for prediction to obtain a training prediction result of any original training image in each original head network until a training prediction result of each original training image in each original head network is obtained;
and S14, loss calculation is carried out on all training prediction results corresponding to each original training image and at least one kind of labeled training image, the preset deep learning model is obtained and optimized according to the loss calculation result, the optimized preset deep learning model is used as the preset deep learning model and returns to the step S11 for iterative training, and when the preset deep learning model converges, the optimized preset deep learning model corresponding to the preset deep learning model converging is determined as the target deep learning model.
3. The image object detection and segmentation method of claim 2, wherein the original backbone network comprises a plurality of first convolutional layers and a plurality of first down-sampling layers, the output end of each first convolutional layer being connected to the input end of one first down-sampling layer; S11 comprises:
inputting any original training image to a first convolution layer of the original backbone network, sequentially passing through all the first convolution layers and all the first down-sampling layers, and performing multi-scale image feature extraction on any original training image to obtain a first training image corresponding to any original training image.
4. The image target detection and segmentation method of claim 2, wherein the original neck network comprises a plurality of second convolutional layers and a plurality of first up-sampling layers, the output end of each second convolutional layer being connected to the input end of one first up-sampling layer; S12 comprises:
inputting the first training image corresponding to any original training image to a first second convolution layer of the original neck network, sequentially passing through all second convolution layers and all first up-sampling layers, and performing image feature extraction on the first training image corresponding to any original training image to obtain a second training image corresponding to any original training image.
5. The image object detection and segmentation method of claim 2 wherein the plurality of original header networks comprises: presetting a target classification head network, a target detection head network, an image segmentation head network and a central skeleton head network; the training prediction result of any original training image comprises: a first training prediction result, a second training prediction result, a third training prediction result, and a fourth training prediction result; the step of inputting the second training image corresponding to any original training image into each original head network respectively for prediction to obtain a training prediction result of any original training image in each original head network includes:
inputting a second training image corresponding to any original training image into the preset target classification head network for prediction to obtain a first training prediction result obtained by performing target classification on any original training image;
inputting a second training image corresponding to any original training image into the preset target detection head network for prediction to obtain a second training prediction result obtained by performing target detection on any original training image;
inputting a second training image corresponding to any original training image into the preset image segmentation head network for prediction to obtain a third training prediction result obtained by performing image segmentation on any original training image;
inputting a second training image corresponding to any original training image into the preset central skeleton head network for prediction, and obtaining a fourth training prediction result obtained by performing central skeleton extraction on any original training image.
6. The image target detecting and segmenting method according to claim 2, wherein the step of labeling any original training image to obtain at least one labeled training image corresponding to the any original training image includes:
labeling each object in any original training image based on the class of the object to obtain a first labeling training image containing labeling class information of each object;
labeling each object in any original training image based on the position of the object to obtain a second labeled training image containing the position information of each object;
masking each object in any original training image to obtain a third labeling training image containing mask information of each object;
and acquiring the central skeleton of each object in the third annotation training image corresponding to any original training image, and arranging all the central skeletons according to a preset arrangement sequence to obtain a fourth annotation training image.
7. The image target detection and segmentation method of claim 1, wherein the target prediction result comprises: the target detection result of the image to be detected and the image segmentation result of the image to be detected.
8. An image object detection and segmentation system, comprising: a processing module and an operation module;
the processing module is used for: training a preset deep learning model for image target detection and segmentation based on a plurality of original training images to obtain a target deep learning model;
the operation module is used for: and inputting the image to be detected into the target deep learning model to obtain a target prediction result of target detection and segmentation of the image to be detected.
9. A storage medium having stored therein instructions which, when read by a computer, cause the computer to execute the image object detection and segmentation method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, causes the computer to perform the image object detection and segmentation method according to any one of claims 1 to 7.
CN202211281775.1A 2022-10-19 2022-10-19 Image target detection and segmentation method, system, storage medium and electronic equipment Pending CN115471499A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211281775.1A CN115471499A (en) 2022-10-19 2022-10-19 Image target detection and segmentation method, system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115471499A true CN115471499A (en) 2022-12-13

Family

ID=84337267

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022083123A1 (en) * 2020-10-19 2022-04-28 北京捷通华声科技股份有限公司 Certificate positioning method
CN114677596A (en) * 2022-05-26 2022-06-28 之江实验室 Remote sensing image ship detection method and device based on attention model
CN114782680A (en) * 2022-05-13 2022-07-22 北京地平线信息技术有限公司 Training method and device of target detection model, and target detection method and device
CN114937086A (en) * 2022-07-19 2022-08-23 北京鹰瞳科技发展股份有限公司 Training method and detection method for multi-image target detection and related products
CN114943682A (en) * 2022-02-25 2022-08-26 清华大学 Method and device for detecting anatomical key points in three-dimensional angiography image
CN114998582A (en) * 2022-05-10 2022-09-02 深圳市第二人民医院(深圳市转化医学研究院) Coronary artery blood vessel segmentation method, device and storage medium
CN115115947A (en) * 2022-07-14 2022-09-27 云南大学 Remote sensing image detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination