WO2018232592A1 - Fully convolutional instance-aware semantic segmentation - Google Patents

Fully convolutional instance-aware semantic segmentation

Info

Publication number
WO2018232592A1
Authority
WO
WIPO (PCT)
Prior art keywords
pixel cell
segmentation
pixel
interest
detection
Application number
PCT/CN2017/089189
Other languages
French (fr)
Inventor
Yichen Wei
Jifeng Dai
Han HU
Original Assignee
Microsoft Technology Licensing, LLC
Application filed by Microsoft Technology Licensing, LLC
Priority to PCT/CN2017/089189
Publication of WO2018232592A1

Classifications

    • G06T 7/11 Image analysis; Segmentation; Edge detection; Region-based segmentation
    • G06F 18/254 Pattern recognition; Fusion techniques of classification results, e.g. of results related to same input data
    • G06T 7/143 Image analysis; Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • G06T 7/194 Image analysis; Segmentation; Edge detection involving foreground-background segmentation
    • G06V 10/25 Image or video recognition or understanding; Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/82 Image or video recognition or understanding using neural networks

Abstract

Embodiments herein present a fully convolutional approach for instance-aware semantic segmentation. Embodiments herein, for a given visual image, locate predefined categories of objects therein and produce pixel masks for each object instance. One method embodiment includes receiving an image and generating a pixel-wise score map for a given region of interest within the received image, with a score for each pixel cell present therein. For each pixel cell within the region of interest, the method may detect whether the pixel cell belongs to an object to obtain a detection result and determine whether the pixel cell is inside an object instance boundary to obtain a segmentation result. The method may then fuse the results to obtain a result of inside or outside for each pixel cell and form at least one mask based on those values.

Description

FULLY CONVOLUTIONAL INSTANCE-AWARE SEMANTIC SEGMENTATION
BACKGROUND INFORMATION
Fully convolutional networks (FCNs) have recently dominated the field of semantic image segmentation. An FCN takes an input image of arbitrary size, applies a series of convolutional layers, and produces per-pixel likelihood score maps for all semantic categories. Thanks to the simplicity, efficiency, and the local weight sharing properties of convolution, FCNs provide an accurate, fast, and end-to-end solution for semantic segmentation.
However, conventional FCNs often do not work well for instance-aware semantic segmentation tasks, which require the detection and segmentation of individual object instances. This limitation is inherent in such solutions. Because convolution is translation invariant, the same image pixel receives the same responses, and thus the same classification scores, irrespective of its relative position in the context. However, instance-aware semantic segmentation needs to operate at the region level, and the same pixel can have different semantics in different regions. This behavior is not modeled by a single FCN applied to the whole image.
Certain translation-variant properties are required to solve such problems. In a prevalent family of instance-aware semantic segmentation approaches, these problems are addressed by adopting different types of sub-networks in three stages: 1) an FCN is applied on the whole image to generate intermediate and shared feature maps; 2) from the shared feature maps, a pooling layer warps each region of interest (ROI) into fixed-size per-ROI feature maps; and 3) one or more fully-connected (fc) layers in the last network convert the per-ROI feature maps to per-ROI masks. Note that the translation-variant property is introduced by the fc layer(s) in the last step.
But again, such solutions have several drawbacks. First, the ROI pooling step loses spatial details due to feature warping and resizing, which, however, is necessary to obtain a fixed-size representation (e.g., 14 x 14) for the fc layers. Such distortion and fixed-size representation degrade the segmentation accuracy, especially for large objects. Second, the fc layers over-parametrize the task without using the regularization of local weight sharing. For example, the last fc layer has a high-dimensional 784-way output to estimate a 28 x 28 mask. Last, the per-ROI network computation in the last step is not shared among ROIs. A considerably complex sub-network in the last step is necessary to obtain good accuracy. It is therefore slow for a large number of ROIs (typically hundreds or thousands of region proposals). For example, in the MNC method (see generally, J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016), 10 layers in the ResNet-101 model are kept in the per-ROI sub-network. Such an approach has been measured to take 1.4 seconds per image, where more than 80% of the time is spent on the last per-ROI step.
Recently, a fully convolutional approach has been brought forth for instance mask proposal generation. This approach extends the translation invariant score maps in conventional FCNs to position-sensitive score maps, which are somewhat translation-variant. However, this approach is only used for mask proposal generation and presents several drawbacks. It is blind to semantic categories and requires a downstream network for detection. The object segmentation and detection sub-tasks are separated and the solution is not end-to-end. It operates on square, fixed-size sliding windows (224 x 224 pixels) and adopts a time-consuming image pyramid scanning to find instances at different scales.
SUMMARY
The various embodiments herein present systems, methods, and software of an end-to-end, fully convolutional approach for instance-aware semantic segmentation. Instance-aware semantic segmentation is a fundamental technique for many vision applications, such as photograph management and advanced driver assistance systems (ADAS), among other applications. Embodiments herein, for a given visual image, locate predefined categories of objects therein and produce pixel masks for each object instance. While the background section above sets forth some solutions to obtain such outputs, the present embodiments do so with greater efficiency (e.g., speed) and accuracy.
One such embodiment, in the form of a method, includes receiving an image and generating a pixel-wise score map for a given region of interest (ROI) within the received image, a score generated within the ROI for each pixel cell present therein. The method then, for each pixel cell within the region of interest, may detect whether the pixel cell belongs to an object to obtain a detection result and determine whether the pixel cell is inside an object instance boundary to obtain a segmentation result. The method may then fuse the detection and segmentation results of each pixel cell to obtain a result of inside or outside for each respective pixel cell and form at least one mask based on the inside and outside values of at least one of the pixel cells.
Another embodiment is a system that includes an input, at least one processor, and a memory device that stores instructions executable on the at least one processor to perform data processing activities. The data processing activities may include generating a pixel-wise score map for a given ROI within an image received via the input, a score generated within the ROI for each pixel cell present therein. The data processing activities also include processing pixel cells of the ROI to detect whether the pixel cell belongs to an object to obtain a detection result and determining whether the pixel cell is inside an object instance boundary to obtain a segmentation result. The data processing activities may then fuse the detection and segmentation results of each pixel cell to obtain a result of inside or outside for each respective pixel cell and form at least one mask based on the inside and outside values of at least one of the pixel cells.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example of a fully convolutional instance-aware semantic segmentation method, according to an example embodiment.
FIG. 2 illustrates an instance fully convolutional network according to some instance segment embodiments.
FIG. 3 illustrates segmentation and classification results of different regions of interest for a category “person”, according to an example embodiment.
FIG. 4 illustrates an architecture of a fully convolutional instance-aware semantic segmentation network, according to an example embodiment.
FIG. 5 is a block flow diagram of a method, according to an example embodiment.
FIG. 6 is a block diagram of a computing device, according to an example embodiment.
DETAILED DESCRIPTION
The various embodiments herein present systems, methods, and software of an end-to-end, fully convolutional approach for instance-aware semantic segmentation. Instance-aware semantic segmentation is a fundamental technique for many vision applications, such as photograph management and advanced driver assistance systems (ADAS), among other applications. Embodiments herein, for a given visual image, locate predefined categories of objects therein and produce pixel masks for each object instance. While the background section above sets forth some solutions to obtain such outputs, the present embodiments do so with greater efficiency (e.g., speed) and accuracy.
Referred to as fully convolutional instance-aware semantic segmentation, or FCIS, embodiments herein resolve challenges not addressed in prior solutions, such as discussed above in the Background section, while exploiting the merits of FCNs for end-to-end instance-aware semantic segmentation. In some such embodiments, the underlying convolutional representation and the score maps are fully shared for object segmentation and detection sub-tasks, via a novel, joint formulation with no extra parameters. The FCN network structure is also highly integrated and efficient. The per-ROI computation is simple, fast, and does not involve any image warping or resizing operations. The approach is briefly illustrated in FIG. 1. Some of these embodiments operate on box proposals instead of sliding windows.
Extensive experiments have verified that the approach of the embodiments herein is state-of-the-art in both accuracy and efficiency. For example, the background section above refers to a prior solution that took 1.4 seconds to process each image. On the same test, embodiments as set forth herein achieved speeds of 0.24 seconds per image, an 83-percent improvement. At the same time, the accuracy of the various embodiments herein was compared against prior solutions. The accuracy of these embodiments exceeded all other solutions measured, outperforming the next closest with a twelve-percent accuracy improvement.
These and other embodiments are described herein with reference to the figures.
In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments in which the inventive subject matter may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice them, and it is to be understood that other embodiments may be utilized and that structural, logical, and electrical changes may be made without departing from the scope of the inventive subject matter. Such embodiments of the inventive subject matter may be referred to, individually and/or collectively, herein by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed.
The following description is, therefore, not to be taken in a limited sense, and the scope of the inventive subject matter is defined by the appended claims.
The functions or algorithms described herein are implemented in hardware, software, or a combination of software and hardware in one embodiment. The software comprises computer-executable instructions stored on computer-readable media such as memory or other types of storage devices. Further, described functions may correspond to modules, which may be software, hardware, firmware, or any combination thereof. Multiple functions are performed in one or more modules as desired, and the embodiments described are merely examples. The software is executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a system, such as a personal computer, server, router, or other device capable of processing data, including network interconnection devices.
Some embodiments implement the functions in two or more specific interconnected hardware modules or devices with related control and data signals communicated between and through the modules, or as portions of an application-specific integrated circuit. Thus, the exemplary process flow is applicable to software, firmware, and hardware implementations.
In FCNs, a classifier is trained to predict each pixel’s likelihood score of “the pixel belongs to some object category”. Such predictions are translation invariant and unaware of individual object instances. For example, the same pixel can be foreground with regard to one object but background with regard to another, or adjacent, object. A single score map per category is insufficient to distinguish these two cases.
To introduce translation-variant properties, a fully convolutional solution is first applied in some embodiments. Some such embodiments utilize k² position-sensitive score maps that correspond to k x k evenly partitioned cells of objects. This is illustrated in FIG. 2 (k = 3). Each score map has the same spatial extent as the original image (at a lower resolution, e.g., 16x smaller). Each score represents the likelihood of “the pixel belongs to some object instance at a relative position”. For example, the first map is for “at top left position” in FIG. 2.
During training and inference, for a fixed-size square sliding window (224 x 224 pixels), its pixel-wise foreground likelihood map may be produced by assembling its k x k cells from the corresponding score maps. In this way, a pixel can have different scores in different instances as long as the pixel is at different relative positions in the instances. This approach works well for the object mask proposal task. However, it may also be limited by the task. Only a fixed-size square sliding window is used in some such embodiments. The FCN is then applied on multi-scale images to find object instances of different sizes. The approach is therefore blind to the object categories; only a separate “objectness” classification sub-network is used to categorize the window as object or background. For the instance-aware semantic segmentation task, a separate downstream network may be used to further classify the mask proposals into object categories.
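As an illustration only, the assembling operation just described might be sketched as follows. This is hypothetical NumPy code; the score-map layout, the window coordinates, and the function name are assumptions for exposition, not notation from the embodiments themselves.

import numpy as np

def assemble_window_scores(score_maps, y0, x0, h, w, k=3):
    # score_maps: (k*k, H, W) position-sensitive maps; map i*k + j scores
    # "pixel belongs to an instance at relative cell (i, j)".
    # (y0, x0, h, w): window position and size on the score-map grid.
    assembled = np.zeros((h, w), dtype=score_maps.dtype)
    for i in range(k):            # vertical cell index
        for j in range(k):        # horizontal cell index
            src_y = slice(y0 + i * h // k, y0 + (i + 1) * h // k)
            src_x = slice(x0 + j * w // k, x0 + (j + 1) * w // k)
            dst_y = slice(i * h // k, (i + 1) * h // k)
            dst_x = slice(j * w // k, (j + 1) * w // k)
            # copy cell (i, j) of the window from its dedicated score map
            assembled[dst_y, dst_x] = score_maps[i * k + j, src_y, src_x]
    return assembled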
For the instance-aware semantic segmentation task, other approaches, such as simultaneous detection and segmentation (“SDS”), Hypercolumn, convolution feature masking, instance-aware semantic segmentation via multi-task network cascades (“MNC”), and MultiPathNet, share a similar structure. For example, two sub-networks may be used in some embodiments for the object segmentation and detection sub-tasks, separately and sequentially. However, the design choices in such efforts, e.g., the two-network structure, parameters, and execution order, are somewhat arbitrary. These design choices appear to have been made in some instances for convenience rather than for fundamental considerations. It appears that the separated sub-network design has not fully exploited the tight correlation between the two tasks.
Various embodiments herein enhance position-sensitive score mapping to perform object segmentation and detection sub-tasks jointly and simultaneously. The same set of score maps may then be shared for the two sub-tasks, as well as the underlying convolutional representation. In some embodiments, this approach brings no extra parameters and eliminates non-essential design choices to better exploit the strong correlation between the two sub-tasks.
The approach of some such embodiments is illustrated in FIG. 1 and FIG. 3. For example, given a region of interest (ROI), pixel-wise score maps may be produced by an assembling operation within the ROI. In some embodiments, for each pixel in an ROI, there are two tasks: 1) detection: whether the pixel belongs to an object bounding box at a relative position (detection+) or not (detection-); and 2) segmentation: whether the pixel is inside an object instance’s boundary (segmentation+) or not (segmentation-). A simple solution of some embodiments is to train two classifiers, separately.
In some such embodiments, the two answers may then be fused into two scores: inside and outside. There are three possible result cases in such embodiments: 1) high inside score and low outside score: detection+, segmentation+; 2) low inside score and high outside score: detection+, segmentation-; and 3) both scores low: detection-, segmentation-. The two scores answer the two questions jointly via softmax and max operations. For detection, some embodiments use the max operation to differentiate cases 1) and 2) (detection+) from case 3) (detection-). The detection score of the whole ROI may then be obtained via average pooling over all pixels’ likelihoods (followed by a softmax operator across all the categories). For segmentation, some embodiments may use softmax to differentiate case 1) (segmentation+) from case 2) (segmentation-) at each pixel. The foreground mask, considered in probabilities, of the ROI is the union of the per-pixel segmentation scores for each category. The two sets of scores come from two 1 x 1 convolutional layers. The inside/outside classifiers may be trained jointly in some embodiments, as they receive back-propagated gradients from both the segmentation and detection losses.
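A minimal sketch of this fusion step follows, assuming inside and outside are the two assembled score maps of one ROI for one category (NumPy; the names and array shapes are illustrative assumptions):

import numpy as np

def fuse_inside_outside(inside, outside):
    # inside, outside: (h, w) assembled position-sensitive scores
    m = np.maximum(inside, outside)
    # segmentation: per-pixel softmax over the (inside, outside) pair gives
    # the foreground probability of each pixel
    e_in, e_out = np.exp(inside - m), np.exp(outside - m)
    seg_prob = e_in / (e_in + e_out)
    # detection: per-pixel max, then average pooling over all pixels of the
    # ROI; the cross-category softmax described above is applied afterwards
    det_score = m.mean()
    return seg_prob, det_score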
Such embodiments have many desirable properties. The per-ROI components shown in FIG. 1 have no free parameters. The score maps in some embodiments are produced by a single FCN, without involving any feature warping, resizing, or fc layers. All the features and score maps in some such embodiments respect the aspect ratio of the original image. The local weight sharing property of FCNs is preserved and serves as a regularization mechanism. All per-ROI computation is simple (k² cell division, score map copying, softmax, max, average pooling) and fast, giving rise to a negligible per-ROI computation cost.
FIG. 4 illustrates an architecture of an end-to-end solution of some embodiments. While any convolutional network architecture can be used, some embodiments may be deployed that adopt the ResNet model. In some such embodiments, the last fully-connected layer for 1000-way classification may be discarded, retaining only the previous convolutional layers. The resulting feature maps in some embodiments have 2048 channels. Additionally, a 1 x 1 convolutional layer may be added in some such embodiments to reduce the dimension to 1024.
In the original ResNet, the effective feature stride (i.e., the decrease in feature map resolution) at the top of the network is 32. This may be too coarse for some embodiments of instance-aware semantic segmentation. To reduce the feature stride while maintaining the field of view, the “hole algorithm” (dilated convolution) is applied: the stride in the first block of the conv5 convolutional layers may be decreased in such embodiments from 2 to 1, reducing the effective feature stride to 16, and the “hole algorithm” may be applied on all the convolutional layers of conv5 by setting the dilation to 2.
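For example, in a PyTorch-style sketch (illustrative only; the channel sizes are placeholders, and a real ResNet conv5 block contains several convolutions and a shortcut connection):

import torch.nn as nn

# conv5 3x3 convolution: stride reduced from 2 to 1, so the effective
# feature stride stays at 16 rather than dropping to 32; dilation 2 with
# padding 2 preserves both the field of view and the spatial size
# (the "hole"/dilated-convolution algorithm).
conv5_3x3 = nn.Conv2d(512, 512, kernel_size=3, stride=1,
                      padding=2, dilation=2)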
Some embodiments may also use a region proposal network (RPN) to generate ROIs. For fair comparison with the MNC method, the RPN may be added on top of the conv4 layers in the same way. Note that RPN is also fully convolutional.
From the conv5 feature maps, 2k²(C+1) score maps are produced (C object categories, one background category, and two sets of k² score maps per category, with k = 7 by default in some embodiments) using a 1 x 1 convolutional layer. Over the score maps, each ROI may be projected into a 16x smaller region. The segmentation probability maps and classification scores over all the categories may be computed as described above and elsewhere herein.
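A sketch of this score-map head and the 16x ROI projection (PyTorch; the 1024 input channels follow the dimension-reduction layer described above, while C = 80 and the ROI format are assumptions for illustration):

import torch.nn as nn

C, k = 80, 7  # number of object categories (assumed) and the default k
# two sets (inside/outside) of k*k position-sensitive maps for each of the
# C categories plus background, produced by a single 1 x 1 convolution
score_head = nn.Conv2d(1024, 2 * k * k * (C + 1), kernel_size=1)

def project_roi(roi, feature_stride=16):
    # project an (x0, y0, x1, y1) ROI from image pixels onto the
    # 16x smaller score-map grid
    return [c / feature_stride for c in roi]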
Bounding box (bbox) regression may then be used in some embodiments to refine the initial input ROIs. A sibling 1 x 1 convolutional layer with 4k² channels may be added on the conv5 feature maps to estimate the bounding box shift in location and size.
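The exact refinement encoding is not spelled out above; the following sketch assumes the common center-shift-and-log-scale parameterization purely as an example:

import numpy as np

def apply_bbox_deltas(roi, deltas):
    # refine one (x0, y0, x1, y1) ROI with predicted (dx, dy, dw, dh) deltas
    w, h = roi[2] - roi[0], roi[3] - roi[1]
    cx, cy = roi[0] + 0.5 * w, roi[1] + 0.5 * h
    cx, cy = cx + deltas[0] * w, cy + deltas[1] * h      # shift the center
    w, h = w * np.exp(deltas[2]), h * np.exp(deltas[3])  # rescale the size
    return [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]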
With regard to inference, for an input image in some embodiments, the 300 ROIs with the highest scores may be generated from the RPN. The ROIs then pass through the bbox regression branch, giving rise to another 300 ROIs. For each ROI, classification scores and a foreground mask (in probability) may be obtained for all categories. FIG. 3 illustrates such an example. Non-maximum suppression (NMS) with an intersection-over-union (IoU) threshold of 0.3 may be used in some embodiments to filter out highly overlapping ROIs. The remaining ROIs may be classified as the categories with the highest classification scores. Their foreground masks may be obtained by mask voting in some embodiments as follows: for an ROI under consideration, find all the ROIs from the 600 ROIs (or other number, depending on the particular embodiment) with IoU scores higher than 0.5. Foreground masks of a category may be averaged on a per-pixel basis and weighted by their classification scores. The averaged mask may then be binarized as the output in some embodiments.
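The mask-voting step might be sketched as follows (NumPy; the 0.5 IoU threshold follows the text, while the 0.5 binarization threshold, the helper names, and the assumption that all masks are resampled to one common grid are illustrative):

import numpy as np

def box_iou(a, b):
    # intersection-over-union of two (x0, y0, x1, y1) boxes
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mask_vote(target_roi, rois, masks, scores, iou_thresh=0.5):
    # rois: (N, 4) boxes; masks: (N, H, W) per-ROI foreground probabilities
    # on a common grid; scores: (N,) classification scores for the category
    keep = [i for i, r in enumerate(rois) if box_iou(target_roi, r) > iou_thresh]
    w = scores[keep] / scores[keep].sum()         # score-proportional weights
    voted = np.tensordot(w, masks[keep], axes=1)  # weighted per-pixel average
    return voted >= 0.5                           # binarized output mask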
With regard to training, an ROI may be considered positive in some embodiments when its box IoU with respect to the nearest ground-truth object is larger than 0.5, and negative otherwise. Each ROI has three loss terms with equal weights: a softmax detection loss over the C + 1 categories, a softmax segmentation loss over the foreground mask of the ground-truth category only, and a bbox regression loss. The segmentation loss may sum the per-pixel losses over the ROI and normalize the sum by the ROI’s size in some embodiments. The latter two loss terms may be effective only on the positive ROIs.
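A sketch of this per-ROI loss under the stated equal weighting (PyTorch; the smooth-L1 box loss and the tensor shapes are assumptions consistent with common practice, not details taken from the text):

import torch
import torch.nn.functional as F

def roi_loss(det_logits, gt_class, seg_logits, gt_mask, box_pred, gt_box,
             is_positive):
    # det_logits: (C+1,) category scores; seg_logits: (2, h, w) inside/
    # outside scores for the ground-truth category; gt_mask: (h, w) int64
    # in {0, 1}; box_pred, gt_box: (4,) box regression values
    loss = F.cross_entropy(det_logits[None], gt_class[None])
    if is_positive:  # the latter two terms apply to positive ROIs only
        # per-pixel softmax loss over the mask; the default mean reduction
        # normalizes the summed per-pixel losses by the ROI's size
        loss = loss + F.cross_entropy(seg_logits[None], gt_mask[None])
        loss = loss + F.smooth_l1_loss(box_pred, gt_box)
    return loss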
During training, the model may be initialized from a model pre-trained on ImageNet classification in some embodiments. Layers absent in the pre-trained model may be randomly initialized. The training images may be resized to a fixed shorter side, such as 600 pixels. Some embodiments use stochastic gradient descent (“SGD”) optimization.
Generally, as the per-ROI computation is negligible, training benefits from inspecting more ROIs at small training cost. Some embodiments may apply online hard example mining (OHEM) during training. In some embodiments, in each mini-batch, forward propagation is performed on the 300 proposed ROIs of one image. Among them, the 128 ROIs with the highest losses are selected to back-propagate their error gradients.
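Because the per-ROI cost is negligible, OHEM reduces to a sort over the per-ROI losses, as in this sketch (per_roi_losses is assumed to be the vector of losses for the 300 forwarded ROIs):

import torch

def ohem_select(per_roi_losses, num_hard=128):
    # keep only the hardest ROIs; only their losses receive gradients
    hard = torch.topk(per_roi_losses, k=num_hard).indices
    return per_roi_losses[hard].sum()

# total = ohem_select(per_roi_losses); total.backward()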
For the RPN proposals, 9 anchors (3 scales x 3 aspect ratios) may be used by default. Three additional anchors at a finer scale may be used in some embodiments for experiments, such as on the well-known COCO dataset. To enable feature sharing between FCIS and RPN in such sharing embodiments, joint training is performed.
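The default anchor set can be enumerated directly, as in the sketch below (the base scales follow the common Faster R-CNN choice and are an assumption here; a finer scale would simply be prepended for the COCO experiments):

import itertools

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # 9 anchors = 3 scales x 3 aspect ratios, centered at the origin
    anchors = []
    for s, r in itertools.product(scales, ratios):
        w, h = s * r ** 0.5, s / r ** 0.5  # area stays s*s; w/h ratio is r
        anchors.append((-w / 2.0, -h / 2.0, w / 2.0, h / 2.0))
    return anchors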
FIG. 5 is a block flow diagram of a method 500, according to an example embodiment. The method 500 is an example method that may be performed on a computing device, such as a personal computer or server when processing photographs or a computing device of an advanced driver assistance system that controls operation of an automobile.
The method 500 includes receiving 502 an image and generating 504 a pixel-wise score map for a given ROI within the received 502 image, a score generated within the ROI for each pixel cell present therein. The method 500 then continues, for each pixel cell within the region of interest, by detecting 506 whether the pixel cell belongs to an object to obtain a detection result, such as detection+ or detection-. Also with regard to each pixel cell, the method 500 includes determining 508 whether the pixel cell is inside an object instance boundary to obtain a segmentation result, such as segmentation+ or segmentation-. The method 500 may then fuse 510 the detection and segmentation results of each pixel cell to obtain a result of inside or outside for each respective pixel cell. In some such embodiments, result pairs of (detection+, segmentation+) and (detection+, segmentation-) indicate inside the object, and (detection-, segmentation+) or (detection-, segmentation-) indicate outside the object. The method 500 may then form 512 at least one mask based on the inside and outside values of at least one of the pixel cells, although some embodiments may take into account the values of all the pixel cells.
In some embodiments of the method 500, generating 504 a pixel-wise score map includes determining a probability that the respective pixel cell belongs to some object at the relative position. In some such embodiments, the score map is a position-sensitive score map within the received 502 image. In these and some other embodiments of the method 500, each pixel cell of the pixel-wise score map is of a k² size, where k is a number of pixels. In some such embodiments, each pixel cell may be formed by evenly partitioning a region of interest into k² parts, where k is a number of pixels. Further, the method 500 may be performed iteratively against the received 502 image with varied values of k to identify object locations at a plurality of resolutions.
In some embodiments of the method 500, forming 512 the at least one mask includes forming a foreground mask and a background mask. In such embodiments, the foreground mask may include pixel cells with an inside value and the background mask may include pixel cells with an outside value.
The method 500 may be performed iteratively in some embodiments against a sequence of images received from an imaging device. In some such embodiments, the method 500 is performed by, and the imaging device is that of, an advanced driver assistance system.
FIG. 6 is a block diagram of a computing device, according to an example embodiment. In one embodiment, multiple such computer systems are utilized in a distributed network to implement multiple components in a transaction-based environment. An object-oriented, service-oriented, or other architecture may be used to implement such functions and communicate between the multiple systems and components. One example computing device in the form of a computer 610 may include a processing unit 602 (which may be or include a graphics processing unit), memory 604, removable storage 612, and non-removable storage 614. Memory 604 may include volatile memory 606 and non-volatile memory 608. Computer 610 may include, or have access to a computing environment that includes, a variety of computer-readable media, such as volatile memory 606 and non-volatile memory 608, removable storage 612, and non-removable storage 614. Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM) and electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions. Computer 610 may include or have access to a computing environment that includes input 616, output 618, and a communication connection 620. The computer may operate in a networked environment using a communication connection to connect to one or more remote computers, such as database servers. The remote computer may include a personal computer (PC), server, router, network PC, a peer device, or other common network node, or the like. The communication connection may include a Local Area Network (LAN), a Wide Area Network (WAN), or other networks.
Computer-readable instructions stored on a computer-readable medium are executable by the processing unit 602 of the computer 610. A hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium. For example, a computer program 625 capable of providing a generic technique to perform an access control check for data access and/or for doing an operation on one of the servers in a component object model (COM) based system according to the teachings of the present invention may be included on a CD-ROM and loaded from the CD-ROM to a hard drive. The computer-readable instructions allow computer 610 to provide generic access controls in a COM-based computer network system having multiple users and servers.
It will be readily understood by those skilled in the art that various other changes in the details, materials, and arrangements of the parts and method stages which have been described and illustrated in order to explain the nature of the inventive subject matter may be made without departing from the principles and scope of the inventive subject matter as expressed in the subjoined claims.

Claims (20)

  1. A method comprising:
    receiving an image;
    generating a pixel-wise score map for a given region of interest (ROI) within the received image, a score generated within the ROI for each pixel cell present therein;
    for each pixel cell within the region of interest:
    detecting whether the pixel cell belongs to an object to obtain a detection result of detection+ or detection-; and
    determining whether the pixel cell is inside an object instance boundary to obtain a segmentation result of segmentation+ or segmentation-;
    fusing the detecting and segmentation results of each pixel cell to obtain a result of inside or outside for each respective pixel cell, a result pair of (detection+, segmentation+) and (detection+, segmentation-) indicating inside and (detection-, segmentation+) or (detection-, segmentation-) indicating outside; and
    forming at least one mask based on the inside and outside values of at least one of the pixel cells.
  2. The method of claim 1, wherein generating a pixel-wise score map includes determining a probability that the respective pixel cell belongs to some object at the relative position.
  3. The method of claim 2, wherein the score map is a position-sensitive score map within the received image.
  4. The method of claim 3, wherein each pixel cell is formed by evenly partitioning a region of interest into k² parts.
  5. The method of claim 4, wherein k is a number of pixels.
  6. The method of claim 1, wherein determining whether the pixel cell is inside an object instance boundary includes applying a softmax operation.
  7. The method of claim 1, wherein forming the at least one mask includes forming a foreground mask and a background mask, the foreground mask including pixel cells with an inside value and the background mask including pixel cells with an outside value.
  8. The method of claim 1, wherein the method is performed iteratively against a sequence of images received from an imaging device.
  9. The method of claim 8, wherein the method is performed by, and the imaging device is that of, an advanced driver assistance system.
  10. The method of claim 1, wherein the method is performed with regard to a plurality of regions of interest within the received image.
  11. The method of claim 10, wherein the plurality of regions of interest are generated by a region proposal network.
  12. The method of claim 10, further comprising:
    performing a bounding box regression to refine the plurality of regions of interest.
  13. A system comprising:
    an input;
    at least one processor; and
    a memory device storing instructions executable on the at least one processor to perform data processing activities comprising:
    generating a pixel-wise score map for a given region of interest (ROI) within an image received via the input, a score generated within the ROI for each pixel cell present therein;
    for each pixel cell within the region of interest:
    detecting whether the pixel cell belongs to an object to obtain a detection result; and
    determining whether the pixel cell is inside an object instance boundary to obtain a segmentation result;
    fusing the detecting and segmentation results of each pixel cell to obtain a result of inside or outside for each respective pixel cell; and
    forming at least one mask based on the inside and outside values of at least one of the pixel cells.
  14. The system of claim 13, wherein the input is an imaging device.
  15. The system of claim 14, wherein the system is an advanced driver assistance system and the data processing activities are performed iteratively against a stream of images received from the imaging device.
  16. The system of claim 13, wherein the at least one processor includes at least one graphics processing unit (GPU).
  17. The system of claim 13, wherein:
    the score map is a position-sensitive score map within the received image; and
    each pixel cell is formed by evenly partitioning a region of interest into k² parts, where k is a number of pixels.
  18. The system of claim 13, wherein forming the at least one mask includes forming a foreground mask and a background mask, the foreground mask including pixel cells with an inside value and the background mask including pixel cells with an outside value.
  19. The system of claim 13, wherein the data processing activities are performed with regard to each of a plurality of regions of interest within each of a plurality of received images.
  20. The system of claim 19, wherein the plurality of regions of interest are generated by a region proposal network and are refined by performing a bounding box regression on the plurality of regions of interest.
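
For orientation, the fused detection/segmentation scoring recited in claims 1, 6, 7 and 13 can be sketched in a few lines of Python. The sketch below is illustrative only and is not the claimed implementation: the function name fuse_roi_scores, the 6 x 6 ROI, the k = 3 grid and the random stand-in scores are all assumptions introduced for this example. Per pixel, detection fuses the inside/outside score pair with an element-wise maximum, while segmentation applies a softmax to the same pair to decide inside versus outside.

    import numpy as np

    def fuse_roi_scores(inside, outside):
        # Detection (claims 1 and 13): a pixel supports the object
        # category whether it lies inside or outside the instance
        # boundary, so the two scores are fused with a maximum.
        detection = np.maximum(inside, outside)
        # Segmentation (claim 6): a softmax over the (inside, outside)
        # pair yields a per-pixel probability of being inside the
        # object instance boundary.
        e_in, e_out = np.exp(inside), np.exp(outside)
        inside_prob = e_in / (e_in + e_out)
        return detection, inside_prob

    # Stand-in scores for a 6 x 6 ROI partitioned into a k x k grid
    # (k = 3, i.e. k² = 9 pixel cells); in the claimed method each
    # cell would read these values from the position-sensitive score
    # map matching its relative position.
    rng = np.random.default_rng(0)
    inside = rng.normal(size=(6, 6))
    outside = rng.normal(size=(6, 6))

    detection, inside_prob = fuse_roi_scores(inside, outside)
    fg_mask = inside_prob > 0.5   # foreground mask (claims 7 and 18)
    bg_mask = ~fg_mask            # background mask (claims 7 and 18)
    roi_score = detection.mean()  # single detection score for the ROI

The two fusion rules serve different goals: the maximum rewards a confident score in either state, which suits category detection, while the softmax forces the two states to compete, which suits the per-pixel inside/outside decision.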

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/089189 WO2018232592A1 (en) 2017-06-20 2017-06-20 Fully convolutional instance-aware semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/089189 WO2018232592A1 (en) 2017-06-20 2017-06-20 Fully convolutional instance-aware semantic segmentation

Publications (1)

Publication Number Publication Date
WO2018232592A1 (en) 2018-12-27

Family

ID=64736258

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/089189 WO2018232592A1 (en) 2017-06-20 2017-06-20 Fully convolutional instance-aware semantic segmentation

Country Status (1)

Country Link
WO (1) WO2018232592A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009078957A1 (en) * 2007-12-14 2009-06-25 Flashfoto, Inc. Systems and methods for rule-based segmentation for objects with full or partial frontal view in color images
US9058664B2 (en) * 2011-09-07 2015-06-16 Siemens Aktiengesellschaft 2D-2D fusion for interventional guidance in trans-catheter aortic valve implantation
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN106296728A (en) * 2016-07-27 2017-01-04 昆明理工大学 Moving object segmentation method for unconstrained scenes based on a fully convolutional network
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on a deep convolutional network

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008808A (en) * 2018-12-29 2019-07-12 北京迈格威科技有限公司 Panorama segmentation method, device and system and storage medium
CN110008808B (en) * 2018-12-29 2021-04-09 北京迈格威科技有限公司 Panorama segmentation method, device and system and storage medium
CN110009573A (en) * 2019-01-29 2019-07-12 北京奇艺世纪科技有限公司 Model training method, image processing method, device, electronic device and computer-readable storage medium
CN110009573B (en) * 2019-01-29 2022-02-01 北京奇艺世纪科技有限公司 Model training method, image processing method, device, electronic equipment and storage medium
CN109978893A (en) * 2019-03-26 2019-07-05 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium for an image semantic segmentation network
CN110399840A (en) * 2019-05-22 2019-11-01 西南科技大学 Rapid lawn semantic segmentation and boundary detection method
CN110399840B (en) * 2019-05-22 2024-04-02 西南科技大学 Rapid lawn semantic segmentation and boundary detection method
WO2021068182A1 (en) * 2019-10-11 2021-04-15 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for instance segmentation based on semantic segmentation
CN111127502A (en) * 2019-12-10 2020-05-08 北京地平线机器人技术研发有限公司 Method and device for generating instance mask and electronic equipment
CN111127502B (en) * 2019-12-10 2023-08-29 北京地平线机器人技术研发有限公司 Method and device for generating instance mask and electronic equipment
CN111415364A (en) * 2020-03-29 2020-07-14 中国科学院空天信息创新研究院 Method, system and storage medium for converting image segmentation samples in computer vision
CN111415364B (en) * 2020-03-29 2024-01-23 中国科学院空天信息创新研究院 Conversion method, system and storage medium for image segmentation sample in computer vision
CN111627029A (en) * 2020-05-28 2020-09-04 北京字节跳动网络技术有限公司 Method and device for acquiring image instance segmentation result
CN112633185A (en) * 2020-09-04 2021-04-09 支付宝(杭州)信息技术有限公司 Image processing method and device
CN112509583B (en) * 2020-11-27 2023-07-18 贵州电网有限责任公司 Auxiliary supervision method and system based on scheduling operation ticket system
CN112509583A (en) * 2020-11-27 2021-03-16 贵州电网有限责任公司 Auxiliary supervision method and system based on scheduling operation ticket system
CN115294066A (en) * 2022-08-09 2022-11-04 重庆科技学院 Sandstone particle size detection method
CN115620199A (en) * 2022-10-24 2023-01-17 四川警察学院 Traffic safety risk diagnosis method and device
CN115620199B (en) * 2022-10-24 2023-06-13 四川警察学院 Traffic safety risk diagnosis method and device
CN116402999A (en) * 2023-06-05 2023-07-07 电子科技大学 SAR (synthetic aperture radar) instance segmentation method combining quantum random number and deep learning
CN116402999B (en) * 2023-06-05 2023-09-15 电子科技大学 SAR (synthetic aperture radar) instance segmentation method combining quantum random number and deep learning

Similar Documents

Publication Publication Date Title
WO2018232592A1 (en) Fully convolutional instance-aware semantic segmentation
US10896351B2 (en) Active machine learning for training an event classification
US9830529B2 (en) End-to-end saliency mapping via probability distribution prediction
WO2016037300A1 (en) Method and system for multi-class object detection
US9892326B2 (en) Object detection in crowded scenes using context-driven label propagation
WO2021088365A1 (en) Method and apparatus for determining neural network
US10275667B1 (en) Learning method, learning device for detecting lane through lane model and testing method, testing device using the same
CN110610143B (en) Crowd counting network method, system, medium and terminal for multi-task combined training
CN113272827A (en) Validation of classification decisions in convolutional neural networks
US9715639B2 (en) Method and apparatus for detecting targets
CN111274981B (en) Target detection network construction method and device and target detection method
US20230186100A1 (en) Neural Network Model for Image Segmentation
Lee et al. Dynamic belief fusion for object detection
CN110826411B (en) Vehicle target rapid identification method based on unmanned aerial vehicle image
US20150356350A1 (en) unsupervised non-parametric multi-component image segmentation method
US20210272295A1 (en) Analysing Objects in a Set of Frames
CN114998592A (en) Method, apparatus, device and storage medium for instance segmentation
CN110992404B (en) Target tracking method, device and system and storage medium
CN114998595A (en) Weakly supervised semantic segmentation method, semantic segmentation method and readable storage medium
US11816185B1 (en) Multi-view image analysis using neural networks
US20230134508A1 (en) Electronic device and method with machine learning training
CN112241758A (en) Apparatus and method for evaluating a saliency map determiner
Shen et al. Stereo matching using random walks
US20210406693A1 (en) Data sample analysis in a dataset for a machine learning model
Huang Moving object detection in low-luminance images

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 17915185

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 17915185

Country of ref document: EP

Kind code of ref document: A1