WO2023096011A1

WO2023096011A1 - Device and method for zero-shot semantic segmentation

Info

Publication number: WO2023096011A1
Application number: PCT/KR2021/019077
Authority: WO
Inventors: 함범섭; 백동현; 오영민
Original assignee: 연세대학교 산학협력단
Priority date: 2021-11-26
Filing date: 2021-12-15
Publication date: 2023-06-01
Also published as: KR102659399B1; KR20230078134A

Abstract

A device and a method for zero-shot semantic segmentation are disclosed. The disclosed device comprises: a visual encoder for receiving an input image and outputting a visual feature map through a neural network operation; a semantic encoder for receiving a feature vector for each class and outputting a prototype vector for each class through a neural network operation; and a semantic segmentation unit for comparing the prototype vector for each class with a channel vector for each pixel of the visual feature map so as to designate a class for each pixel of the visual feature map, wherein the semantic segmentation unit designates, as a class of a specific pixel, a class corresponding to a prototype vector most similar to a channel vector of the pixel, the prototype vector and the channel vector are configured to have the same length, and the visual encoder and the semantic encoder share at least one same loss and are trained simultaneously. According to the disclosed device and method, semantic segmentation is performed on an unlearned class in a discriminant manner, so that continuous classifier learning is not required, and a bias problem of classifying the unlearned class into a pre-learned class can be reduced.

Description

Zero-shot semantic segmentation device and method

The present invention relates to an apparatus and method for semantic segmentation, and more particularly, to a zero-shot semantic segmentation apparatus and method capable of semantic segmentation even for an unlearned class.

Semantic segmentation means segmenting an input image into regions corresponding to each identifiable class, and can be applied to various application fields such as autonomous driving, medical imaging, and image editing. The goal of this semantic image segmentation is to label each of a plurality of pixels of an input image by classifying objects such as people, cars, and bicycles into designated classes.

Deep learning-based semantic image segmentation techniques using artificial neural networks such as convolutional neural networks (CNNs) show excellent performance. For training, however, they require a large quantity of pixel-level training data in which each object class is labeled pixel by pixel, so that the object area of each class is represented accurately.

Zero-shot semantic segmentation, on the other hand, refers to a technique that uses additional information to perform semantic segmentation not only for classes learned during training but also for unlearned classes. Existing zero-shot semantic segmentation has used a generative method. The generative method performs semantic segmentation in multiple stages and handles unlearned classes by separately generating features for them: in the final stage, a feature for an unlearned class is generated and input to a classifier.

Such a generative method performs semantic segmentation without considering semantic features and therefore causes a bias problem, that is, a problem of classifying an unlearned class as a learned class.

In addition, the existing generative method has a problem in that a classifier must be newly trained every time a new class appears or disappears, making it difficult to use in practice.

The present invention proposes a zero-shot semantic segmentation apparatus and method that does not require continuous classifier learning by performing semantic segmentation on unlearned classes in a discriminant manner.

In addition, the present invention proposes a zero-shot semantic segmentation apparatus and method capable of reducing the bias problem of classifying an unlearned class into a pre-learned class.

According to one aspect of the present invention, there is provided a semantic segmentation device comprising: a visual encoder that receives an input image and outputs a visual feature map through a neural network operation; a semantic encoder that receives a feature vector for each class and outputs a prototype vector for each class through a neural network operation; and a semantic segmentation unit that designates a class for each pixel of the visual feature map by comparing the prototype vector for each class with a channel vector for each pixel of the visual feature map, wherein the semantic segmentation unit designates, as the class of a specific pixel, the class corresponding to the prototype vector most similar to the channel vector of that pixel, the prototype vector and the channel vector are set to the same length, and the visual encoder and the semantic encoder share at least one same loss and are trained simultaneously.

The loss shared by the visual encoder and the semantic encoder includes a prototype loss, wherein the prototype loss corresponds to the loss between a prototype vector of a specific class output from the semantic encoder and a median value of the channel vectors of the corresponding class in the visual feature map.

Based on the prototype loss, training proceeds in a direction in which the median value of the channel vectors of a specific class output from the visual encoder becomes the same as the prototype vector of the corresponding class output from the semantic encoder.

The loss shared by the visual encoder and the semantic encoder includes a cross entropy loss. Through the cross entropy loss, the visual encoder is trained so that channel vectors of the same class are located relatively close together in the embedding space and channel vectors of different classes are located relatively far apart.

The semantic encoder is trained using semantic loss such that a distance between feature vectors for each class input to the semantic encoder is equal to a distance between prototype vectors for each class output from the semantic encoder.

To calculate the prototype loss, a first semantic segmentation map generated by applying a feature vector for each class to the input image is input to the semantic encoder.

Prototype loss using the first semantic segmentation map is calculated as in the following equation.

In the above equation, L_center is the prototype loss, c is the class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, μ(p) is the prototype vector output from the semantic encoder when the feature vector at pixel position p in the first semantic segmentation map is input, and d() is a function that outputs the distance between two variables.

To calculate the prototype loss, a second semantic segmentation map is input to the semantic encoder. The second semantic segmentation map is generated by applying a feature vector for each class to the input image, reducing the image to which the feature vectors have been applied, and enlarging the reduced image to the original image size by linear interpolation.

Prototype loss using the second semantic segmentation map is calculated as in the following equation.

In the above equation, L_bar is the prototype loss, c is the class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, the term that takes the place of μ(p) is the prototype vector output from the semantic encoder when the feature vector at pixel position p in the second semantic segmentation map is input, and d() is a function that outputs the distance between two variables.

According to another aspect of the present invention, there is provided a semantic segmentation method comprising: step (a) of receiving an input image and outputting a visual feature map through a neural network operation of a visual encoder; step (b) of receiving a feature vector for each class and outputting a prototype vector for each class through a neural network operation of a semantic encoder; and step (c) of designating a class for each pixel of the visual feature map by comparing the prototype vector for each class with a channel vector for each pixel of the visual feature map, wherein step (c) designates, as the class of a specific pixel, the class corresponding to the prototype vector most similar to the channel vector of that pixel, the prototype vector and the channel vector are set to the same length, and the visual encoder and the semantic encoder share at least one same loss and are trained simultaneously.

Therefore, according to the present invention, semantic segmentation is performed on unlearned classes in a discriminant manner, so continuous classifier learning is not required, and the bias problem of classifying unlearned classes into pre-learned classes can be reduced.

FIG. 1 is a block diagram showing the overall structure of a zero-shot semantic segmentation apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example of an embedding space of channel vectors output from a visual encoder according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating a principle of performing semantic segmentation on an unlearned class according to an embodiment of the present invention.

FIG. 4 is a diagram showing a learning structure of a visual encoder and a semantic encoder according to an embodiment of the present invention.

FIG. 5 is a diagram illustrating a principle of generating a first semantic segmentation map used for learning of a semantic segmentation device according to an embodiment of the present invention.

FIG. 6 is a diagram illustrating a principle of generating a second semantic segmentation map used for learning of a semantic segmentation device according to an embodiment of the present invention.

FIG. 7 is a diagram for conceptually explaining prototype loss according to an embodiment of the present invention.

FIG. 8 is a diagram for explaining semantic loss according to an embodiment of the present invention.

FIG. 9 is a flowchart illustrating a learning method of a zero-shot semantic segmentation apparatus according to an embodiment of the present invention.

In order to fully understand the present invention and its operational advantages and objectives achieved by the practice of the present invention, reference should be made to the accompanying drawings illustrating preferred embodiments of the present invention and the contents described in the accompanying drawings.

Hereinafter, the present invention will be described in detail by describing preferred embodiments of the present invention with reference to the accompanying drawings. However, the present invention may be embodied in many different forms and is not limited to the described embodiments. And, in order to clearly describe the present invention, parts irrelevant to the description are omitted, and the same reference numerals in the drawings indicate the same members.

Throughout the specification, when a part "includes" a certain component, this means that it may further include other components rather than excluding them, unless otherwise stated. In addition, terms such as "… unit", "module", and "block" described in the specification mean a unit that processes at least one function or operation, which can be implemented as hardware, software, or a combination of hardware and software.

FIG. 1 is a block diagram showing the overall structure of a zero-shot semantic segmentation device according to an embodiment of the present invention.

Referring to FIG. 1 , a zero-shot semantic segmentation apparatus according to an embodiment of the present invention includes a visual encoder 100, a semantic encoder 200, and a semantic segmentation unit 300.

An image 50 for semantic segmentation is input to the visual encoder 100. The visual encoder 100 outputs a visual feature map 150 for the input image 50 through a neural network operation, for example a CNN. When the input image 50 has a size of H₁ × W₁, the visual feature map 150 output by the visual encoder 100 may have a size of H₂ × W₂ × C. Here, H denotes height and W denotes width. The height and width of the visual feature map may be the same as or different from those of the input image. C denotes the length of the channel vector of each pixel of the visual feature map. The visual feature map 150 has a channel vector for each pixel, and the visual encoder 100 is trained to output similar channel vectors for pixels of the same object class in the input image 50.

FIG. 2 is a diagram illustrating an example of an embedding space of channel vectors output from a visual encoder according to an embodiment of the present invention.

In FIG. 2, points located around the first point 210 are channel vectors of pixels representing cars, and points located around the second point 220 are channel vectors of pixels representing bicycles. As shown in FIG. 2, the visual encoder 100 is trained so that pixels representing cars are positioned adjacent to each other in the embedding space and, likewise, pixels representing bicycles are positioned adjacent to each other in the embedding space.

Meanwhile, a feature vector of each class to be segmented is input to the semantic encoder 200. For example, if the target classes are cars and bicycles, the feature vectors of cars and bicycles are input to the semantic encoder 200. The semantic encoder 200 outputs, through a neural network operation, the prototype vector 250 corresponding to the feature vector of each class. For example, when a feature vector for a car and a feature vector for a bicycle are input, the semantic encoder 200 outputs a prototype vector for the car and a prototype vector for the bicycle, respectively. According to an embodiment of the present invention, the semantic encoder 200 may be a fully connected (FC) neural network, but is not limited thereto.

Here, the feature vector input to the semantic encoder 200 is a vector that can be obtained from a commercial database. For example, databases such as Wikipedia provide feature vectors for each class, and such commercially obtainable feature vectors are input to the semantic encoder 200.

According to an embodiment of the present invention, the length of the prototype vector output from the semantic encoder 200 and the channel vector for each pixel output through the visual encoder 100 are set to be the same.

The semantic segmentation unit 300 performs semantic segmentation using the prototype vector for each class output from the semantic encoder 200 and the visual feature map output from the visual encoder 100. It does so by comparing each channel vector of the visual feature map with the prototype vector for each class and designating a class for each pixel of the visual feature map. Specifically, the semantic segmentation unit 300 calculates the similarity between the channel vector of a specific pixel and the prototype vectors for each class output from the semantic encoder 200. It then determines the prototype vector having the highest similarity with the channel vector of that pixel and designates the class of that prototype vector as the class of the pixel. When this pixel-by-pixel class designation has been performed on all pixels, semantic segmentation is complete for the classes input to the semantic encoder 200. For example, when the classes input to the semantic encoder 200 are bicycle, car, and background, one of bicycle, car, and background is designated for each pixel.

In short, the semantic segmentation unit 300 performs semantic segmentation by finding, for each pixel, the class prototype vector most similar to the pixel's channel vector and designating the class of that prototype vector as the class of the pixel.
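The per-pixel comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the text does not name the similarity measure, so cosine similarity is assumed here, and the function names (`cosine_similarity`, `segment`) and the toy values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def segment(feature_map, prototypes):
    """Assign to each pixel the class whose prototype vector is most similar.

    feature_map: H2 x W2 nested list of channel vectors, each of length C.
    prototypes:  dict mapping class name -> prototype vector of length C.
    """
    return [
        [max(prototypes, key=lambda c: cosine_similarity(v, prototypes[c]))
         for v in row]
        for row in feature_map
    ]


# Toy 2 x 2 visual feature map with C = 2 (hypothetical values).
feature_map = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.1, 0.9], [0.2, 0.7]],
]
prototypes = {"car": [1.0, 0.0], "bicycle": [0.0, 1.0]}

label_map = segment(feature_map, prototypes)
print(label_map)
```

Because the prototype vector and the channel vector have the same length C, the comparison needs no extra projection step.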

The operation of the semantic segmentation apparatus described with reference to FIG. 1 is the operation in the inference step, that is, after training of the visual encoder 100 and the semantic encoder 200 has been completed.

The semantic segmentation apparatus shown in FIG. 1 enables semantic segmentation even for classes that have not been learned in advance.

It is assumed that the classes learned in advance by the visual encoder 100 and the semantic encoder 200 are bicycle, car, and background. When the user of the neural network determines that semantic segmentation must also be performed for a person, the user inputs the feature vector of the person to the semantic encoder 200 so that zero-shot semantic segmentation is performed on a class that has not been learned in advance.

In this case, the neural network user inputs feature vectors for a bicycle, a car, and a person into the semantic encoder 200 together. The semantic encoder 200 outputs a prototype vector for a bicycle, a prototype vector for a car, and a prototype vector for a person according to a pre-learned method.

Referring to FIG. 3, an input image 50 is input to the visual encoder 100. In addition, the feature vectors for car and bicycle, which are pre-learned classes, and the feature vector for person, which is an unlearned class, are input to the semantic encoder 200.

The semantic segmentation unit 300 performs semantic segmentation using the bicycle prototype vector, the car prototype vector, and the person prototype vector output from the semantic encoder 200.

As shown in FIG. 3, in the embedding space, pixels whose channel vectors are similar to the prototype vector for a person are classified into the person class, pixels whose channel vectors are similar to the prototype vector for a car are classified into the car class, and pixels whose channel vectors are similar to the prototype vector for a bicycle are classified into the bicycle class.
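The following sketch illustrates why no classifier has to be retrained when an unlearned class is added: the new class only needs a feature vector, which the already trained semantic encoder maps to a prototype. Everything concrete here is assumed for illustration, namely the single fully connected layer standing in for the semantic encoder, its weights `W` and `B`, the 3-dimensional class feature vectors, and the use of squared Euclidean distance as the (dis)similarity measure.

```python
def fc_layer(x, weights, bias):
    """One fully connected layer, y = W x + b (stand-in semantic encoder)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(v, prototypes):
    """Return the class whose prototype is nearest to channel vector v."""
    return min(prototypes, key=lambda c: squared_distance(v, prototypes[c]))


# Hypothetical "trained" weights mapping 3-D feature vectors to C = 2.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]
B = [0.0, 0.0]

# Hypothetical class feature vectors; "person" was not seen in training.
class_features = {
    "car": [1.0, 0.0, 0.0],
    "bicycle": [0.0, 1.0, 0.0],
    "person": [0.0, 0.0, 2.0],
}

# Prototypes for learned and unlearned classes come from the same encoder.
prototypes = {name: fc_layer(f, W, B) for name, f in class_features.items()}

# Three hypothetical channel vectors from the visual feature map.
pixels = [[0.9, 0.1], [0.1, 1.1], [1.1, 0.9]]
labels = [classify(v, prototypes) for v in pixels]
print(labels)
```

Adding or removing a class only changes the entries of `class_features`; the encoder weights stay untouched.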

FIG. 4 is a diagram illustrating learning structures of a visual encoder and a semantic encoder according to an embodiment of the present invention.

As described above, the visual encoder 100 and the semantic encoder 200 are artificial neural networks, and neural network weights of the visual encoder 100 and the semantic encoder 200 are set through learning.

One of the main features of the learning structure of the present invention is that the visual encoder 100 and the semantic encoder 200 are trained while sharing the same loss. Through this learning structure, semantic segmentation can be performed on the feature map output from the visual encoder by using the prototype vectors output from the semantic encoder 200.

Referring to FIG. 4, the visual encoder 100 and the semantic encoder 200 according to an embodiment of the present invention are trained while sharing a prototype loss and a cross entropy loss.

The prototype loss refers to a loss between the prototype vector of a specific class output from the semantic encoder 200 and the median value of the channel vectors of that class output from the visual encoder 100. That is, through the prototype loss, the visual encoder 100 and the semantic encoder 200 are trained in a direction in which the median value of the channel vectors of a specific class output from the visual encoder 100 and the prototype vector of that class output from the semantic encoder 200 become the same.

For example, through the prototype loss, the visual encoder 100 and the semantic encoder 200 are trained so that the median value of the channel vectors representing the car in the visual feature map output from the visual encoder 100 is the same as the prototype vector for the car output from the semantic encoder 200.

The cross entropy loss is a loss for learning to bring channel vectors and prototype vectors of the same class closer to each other and channel vectors and prototype vectors of different classes to move away from each other.

Let S be the total set of classes, p the pixel coordinate, R the visual feature map, c a class, and Rc the set of channel vectors of pixels representing a specific class in the visual feature map. The cross entropy loss can then be calculated as in Equation 1.

In Equation 1 above, ω_c is the weight vector of a specific class c, ω_j ranges over the weight vectors of all classes, and v(p) is the channel vector at pixel position p. The weight vectors can be set through learning: channel vectors of a given class are trained to have a large dot product with the weight vector of that class and a small dot product with the weight vectors of other classes. In this way, channel vectors and prototype vectors of different classes move away from each other in the embedding space, and channel vectors and prototype vectors of the same class approach each other.
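Equation 1 itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes the standard softmax cross entropy over the dot products ω_j · v(p), which matches the symbol definitions given above but may differ from the actual equation in normalization. The function name `cross_entropy_loss` and the toy values are hypothetical.

```python
import math


def cross_entropy_loss(channel_vectors, labels, class_weights):
    """Mean softmax cross entropy over pixels, with logits w_j . v(p).

    channel_vectors: list of channel vectors v(p), one per pixel.
    labels:          list of class indices c, one per pixel.
    class_weights:   list of weight vectors w_j, one per class.
    """
    total = 0.0
    for v, c in zip(channel_vectors, labels):
        logits = [sum(w * x for w, x in zip(w_j, v)) for w_j in class_weights]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum - logits[c]  # -log softmax probability of class c
    return total / len(channel_vectors)


class_weights = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical w_0, w_1
vectors = [[2.0, 0.0], [0.0, 2.0]]         # hypothetical v(p)

loss_aligned = cross_entropy_loss(vectors, [0, 1], class_weights)
loss_swapped = cross_entropy_loss(vectors, [1, 0], class_weights)
print(loss_aligned, loss_swapped)
```

The loss is small when each channel vector has a large dot product with its own class weight vector and grows when the labels are swapped, which is the behaviour the paragraph above describes.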

Meanwhile, according to a preferred embodiment of the present invention, a separate semantic segmentation map may be generated and used for training with the prototype loss. As described with reference to FIG. 1, in the inference step, a commercially available feature vector of each class is input to the semantic encoder 200. In the learning step, however, it is preferable to use a semantic segmentation map rather than the class feature vectors themselves: the semantic segmentation map is input to the semantic encoder 200, and the visual encoder 100 and the semantic encoder 200 are trained with the prototype loss and the cross entropy loss.

FIG. 5 is a diagram illustrating a principle of generating a first semantic segmentation map used for learning of a semantic segmentation apparatus according to an embodiment of the present invention.

FIG. 5(a) shows an input image, and FIG. 5(b) shows a principle of generating a first semantic segmentation map from an input image.

As shown in (a) of FIG. 5, the input image is a photographed image of a table and a person. The input image of FIG. 5(a) is divided into three classes, table, person, and background.

Referring to (b) of FIG. 5, each class region is firstly distinguished from the input image. Since the input image is an image for which the correct answer for the class is already known, it is possible to distinguish the class area for it.

When class regions are classified in the input image, a prepared feature vector is applied to each class region. That is, the feature vector for the table, the feature vector for the person, and the feature vector for the background are applied. Each feature vector is obtained from a commercially available database.

In this way, the first semantic segmentation map 500 is generated by applying the feature vector corresponding to the class of each class area to the input image, and the first semantic segmentation map 500 is input to the semantic encoder.

The semantic encoder 200 receives the first semantic segmentation map 500 and outputs a prototype vector for a person, a prototype vector for a table, and a prototype vector for a background.
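Constructing the first semantic segmentation map amounts to replacing each pixel's ground-truth class label with that class's feature vector. The sketch below shows this under assumptions: the label map, the 3-dimensional feature vectors, and the helper name `build_semantic_map` are all hypothetical stand-ins for annotated training data and database-supplied vectors.

```python
def build_semantic_map(label_map, class_features):
    """Replace each pixel's class label with that class's feature vector."""
    return [[list(class_features[c]) for c in row] for row in label_map]


# Hypothetical 2 x 2 ground-truth label map for a training image.
label_map = [
    ["background", "person"],
    ["table", "table"],
]

# Hypothetical class feature vectors (real ones would come from a database).
class_features = {
    "background": [0.0, 0.0, 1.0],
    "person": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
}

first_map = build_semantic_map(label_map, class_features)
print(first_map)
```

Every pixel inside one class region carries exactly the same vector, so the map changes abruptly at region boundaries.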

The first prototype loss (L_center) obtained when the first semantic segmentation map 500 is used may be calculated as in Equation 2 below.

In Equation 2 above, c is a class, S is the total set of classes, p is a pixel, Rc is the set of pixels of a specific class, v(p) is the channel vector at pixel position p in the visual feature map, and μ(p) is the prototype vector at pixel position p obtained from the first semantic segmentation map. In addition, d(a, b) is a function representing the distance between a and b, and training with Equation 2 proceeds so that the sum of v(p) and the sum of μ(p) become equal.

As a result, learning is performed so that the median value of the channel vector of a specific class is equal to the prototype vector of the corresponding class through the first prototype loss as shown in Equation 2.
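Since Equation 2 is not reproduced in this text, the following is only one plausible reading of the definitions above, not the patent's formula: for each class c, the per-class average of v(p) over Rc is compared with the per-class average of μ(p) using a Euclidean distance d, and the result is averaged over classes. The averaging, the choice of Euclidean distance, and the names `mean_vector`, `euclidean`, and `prototype_loss` are assumptions.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance, used here as the assumed d(a, b)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prototype_loss(v, mu, labels):
    """Average over classes of d(mean of v(p), mean of mu(p)), p in R_c.

    v, mu:  per-pixel vectors from the visual feature map and from the
            semantic encoder output (flattened to lists of equal length).
    labels: per-pixel class ids defining the pixel sets R_c.
    """
    classes = sorted(set(labels))
    loss = 0.0
    for c in classes:
        idx = [i for i, label in enumerate(labels) if label == c]
        loss += euclidean(mean_vector([v[i] for i in idx]),
                          mean_vector([mu[i] for i in idx]))
    return loss / len(classes)


v = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]    # hypothetical channel vectors
mu = [[2.0, 0.0], [2.0, 0.0], [0.0, 1.0]]   # hypothetical prototypes
labels = [0, 0, 1]

loss = prototype_loss(v, mu, labels)
print(loss)
```

In this toy case class 0 already matches its prototype on average and contributes nothing, while class 1 is one unit away from its prototype.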

Through such simultaneous learning of the visual encoder 100 and the semantic encoder 200, semantic segmentation of the feature map output from the visual encoder 100 becomes possible based on the prototype vectors of the semantic encoder 200.

However, a bias problem may occur when learning is performed using the first semantic segmentation map. Because the visual feature map output through the visual encoder 100 is produced by a convolutional neural network (CNN), its values vary continuously. The values output from the semantic encoder 200, however, are discrete, as shown in (b) of FIG. 5. This difference can hinder learning and cause a bias problem.

Accordingly, the present invention also proposes a structure for learning using a different type of prototype loss (second prototype loss) using the second semantic segmentation map.

FIG. 6 is a diagram illustrating a principle of generating a second semantic segmentation map used for learning of a semantic segmentation apparatus according to an embodiment of the present invention.

In FIG. 6, the input image is the same as in FIG. 5, and the task of classifying the input image according to the class is performed in the same way.

When the regions for each class have been divided in the input image, a reduced image 600 is generated by reducing the input image, and a feature vector for each class is then applied to the reduced image. For the first semantic segmentation map, the feature vector for each class is applied to the non-reduced image, whereas for the second semantic segmentation map it is applied to the reduced image 600.

After the feature vector for each class is applied to the reduced image 600, the second semantic segmentation map 610 is generated by enlarging the reduced image to which the feature vector for each class is applied through linear interpolation. As the image is enlarged through linear interpolation, the feature vector in the boundary area can have continuous features.

Accordingly, prototype vectors output from the semantic encoder may also have continuous values in the boundary area. As a result, when learning is performed using the second semantic segmentation map 610, the bias problem occurring during learning can be solved because it has continuous features in the boundary area like the feature map of the visual encoder.
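The reduce, apply, and enlarge procedure can be sketched as follows. The details are assumptions made for illustration: the reduction keeps every `factor`-th pixel of the label map, the enlargement is bilinear interpolation with aligned corners, and the helper names, the reduction factor, and the toy feature vectors are hypothetical.

```python
def downsample_labels(label_map, factor):
    """Reduce a label map by keeping every `factor`-th pixel."""
    return [row[::factor] for row in label_map[::factor]]


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize an h x w grid of vectors (corners aligned)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Blend the four neighbouring vectors component by component.
            row.append([
                (1 - fy) * ((1 - fx) * a + fx * b)
                + fy * ((1 - fx) * c + fx * d)
                for a, b, c, d in zip(grid[y0][x0], grid[y0][x1],
                                      grid[y1][x0], grid[y1][x1])
            ])
        out.append(row)
    return out


# Hypothetical 4 x 4 label map: left half person, right half table.
label_map = [["person", "person", "table", "table"] for _ in range(4)]
class_features = {"person": [1.0, 0.0], "table": [0.0, 1.0]}

reduced = downsample_labels(label_map, 2)                      # 2 x 2 labels
reduced_vectors = [[class_features[c] for c in row] for row in reduced]
second_map = bilinear_resize(reduced_vectors, 4, 4)            # back to 4 x 4
print(second_map[0])
```

In the top row the vectors change gradually from the person vector to the table vector instead of switching abruptly, which is the continuity at class boundaries that the text attributes to the second semantic segmentation map.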

Equation 3 below shows a method of calculating the second prototype loss (L_bar) using the second semantic segmentation map 610.

Compared with the first prototype loss in Equation 2, the second prototype loss differs in that μ(p) is replaced by a term denoting the prototype vectors output when the second semantic segmentation map is input to the semantic encoder.

Referring to FIG. 7, a feature vector for a car and a feature vector for a bicycle are input to the semantic encoder 200, and the semantic encoder 200 outputs a prototype vector 700 for the car and a prototype vector 710 for the bicycle.

Meanwhile, an input image is input to the visual encoder 100, and the visual encoder outputs a channel vector for each pixel. The channel vectors may be represented in an embedding space 720. The channel vectors representing the car and the channel vectors representing the bicycle are each projected onto the embedding space 720, and the prototype loss is calculated so that the median of the channel vectors representing the car becomes the same as the prototype vector for the car. The prototype loss is calculated in the same way for the bicycle.

As a result, the embedding space 720 shown in FIG. 7 can be expressed as a joint embedding space in which the prototype vector of the semantic encoder 200 and the channel vector of the visual encoder 100 are projected together.

Meanwhile, referring to FIG. 4 again, additional learning is performed for the semantic encoder 200 using a semantic loss. The semantic loss is the difference between the distance between the input feature vectors of two classes and the distance between the prototype vectors of those classes output from the semantic encoder. Using the semantic loss, the semantic encoder 200 is trained so that the distances between the input class feature vectors become equal to the distances between the output class prototype vectors.

For example, although dogs and cats are different classes, they are similar in shape, so the distance between their feature vectors will be relatively small. Dogs and humans, however, have low shape similarity, so the distance between their feature vectors will be relatively large. Learning with the semantic loss ensures that these distances between feature vectors are also reflected in the prototype vectors.

Referring to FIG. 8 , a feature vector embedding space 800 and a prototype vector embedding space 810 are shown. The classes are Table, Cat, Dog, and Human.

The feature vectors of each class obtained from the commercial database are projected onto the embedding space 800 of the feature vector, and the prototype vectors output after each feature vector is input to the semantic encoder 200 are projected onto the embedding space 810 of the prototype vector.

The difference between the distance between two classes in the embedding space 800 of the feature vector and the distance between the same two classes in the embedding space 810 of the prototype vector can be defined as the semantic loss, which can be calculated as in Equation 4.

In Equation 4 above, i and j denote classes, S denotes the set of classes, r_ij denotes the distance between class i and class j in the embedding space of the feature vector, and the corresponding term for the prototype vector denotes the distance between class i and class j in the embedding space of the prototype vector.
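Equation 4 is not reproduced in this text, so the sketch below is only an assumed form consistent with the description: the mean squared difference between pairwise Euclidean distances in the feature vector space and in the prototype vector space. The actual equation may normalize or weight the distances differently, and the name `semantic_loss` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def semantic_loss(features, prototypes):
    """Mean squared gap between pairwise distances in the two spaces.

    features[i] and prototypes[i] belong to the same class i; the two
    spaces may have different dimensionality.
    """
    pairs = list(combinations(range(len(features)), 2))
    return sum(
        (euclidean(features[i], features[j])
         - euclidean(prototypes[i], prototypes[j])) ** 2
        for i, j in pairs
    ) / len(pairs)


features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]       # hypothetical inputs
preserved = [[0.0, 0.0], [0.0, 5.0], [0.0, 10.0]]     # same pairwise distances
shrunk = [[0.0, 0.0], [0.0, 4.0], [0.0, 8.0]]         # distances scaled down

print(semantic_loss(features, preserved), semantic_loss(features, shrunk))
```

The loss is zero whenever the prototypes keep the pairwise class distances of the feature vectors, regardless of how the prototypes are oriented in their own space.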

FIG. 9 is a flowchart illustrating a learning method of a zero-shot semantic segmentation apparatus according to an embodiment of the present invention. FIG. 9 illustrates the case in which learning is performed using the second semantic segmentation map.

Referring to FIG. 9 , an input image is input to the visual encoder 100 to generate a visual feature map (step 900).

After classifying the input image by class, the input image is reduced (step 902).

A feature vector corresponding to each class is applied to the reduced input image (step 904).

The reduced image to which the feature vector is applied is enlarged through linear interpolation and restored to the original size to generate a second semantic segmentation map (step 906).

The second semantic segmentation map is input to the semantic encoder 200 to output prototype vectors for each class (step 908).

A second prototype loss is calculated using the channel vectors for each class of the visual feature map and the prototype vector for each class output from the semantic encoder 200 (step 910). The second prototype loss can be calculated as in Equation 3.

Meanwhile, a cross entropy loss is calculated using the channel vectors of the visual feature map (step 912). The cross entropy loss can be calculated as in Equation 1. As described above, the cross entropy loss is used to train channel vectors of the same class to move closer together in the embedding space and channel vectors of different classes to move apart in the embedding space.

A semantic loss is calculated using the distance between feature vectors for each class input to the semantic encoder and the distance between prototype vectors for each class output from the semantic encoder (step 914). Semantic loss can be calculated as in Equation 4.

The visual encoder 100 and the semantic encoder 200 are trained simultaneously (step 916): the weights of both encoders are updated together using the second prototype loss and the cross entropy loss. In addition, the weight update of the semantic encoder 200 also reflects the semantic loss.
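How the three losses are distributed over the two encoders in step 916 can be summarized in a few lines. The weighting coefficients `w_bar` and `w_sem` and the function name are hypothetical; the text does not state how the losses are weighted.

```python
def encoder_objectives(l_bar, l_ce, l_sem, w_bar=1.0, w_sem=1.0):
    """Return the objective each encoder is updated with (assumed weighted sum).

    l_bar : second prototype loss (shared by both encoders)
    l_ce  : cross entropy loss (shared by both encoders)
    l_sem : semantic loss (reflected only in the semantic encoder)
    """
    shared = l_ce + w_bar * l_bar
    return {"visual": shared, "semantic": shared + w_sem * l_sem}
```

Since the semantic loss is computed only from the inputs and outputs of the semantic encoder, minimizing a single summed objective with a gradient-based optimizer would update the visual encoder in the same way as the "visual" entry above.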

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data, including read-only memory (ROM), random access memory (RAM), compact disk (CD)-ROM, digital video disk (DVD)-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, this is only exemplary, and those skilled in the art will understand that various modifications and equivalent other embodiments are possible therefrom.

Therefore, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.

Claims

a visual encoder that receives an input image and outputs a visual feature map through neural network operation;

a semantic encoder that receives a feature vector for each class and outputs a prototype vector for each class through a neural network operation;

A semantic segmentation unit for designating a class for each pixel of the visual feature map by comparing the prototype vector for each class with a channel vector for each pixel of the visual feature map;

The semantic segmentation unit designates a class corresponding to a prototype vector most similar to a channel vector of a specific pixel as a class of the corresponding pixel;

The prototype vector and the channel vector are set to the same length, and the visual encoder and the semantic encoder share at least one common loss and are trained simultaneously.
According to claim 1,

The loss shared by the visual encoder and the semantic encoder includes a prototype loss,

The prototype loss corresponds to a loss between a prototype vector of a specific class output from the semantic encoder and a median value of channel vectors of the corresponding class of the visual feature map.
According to claim 2,

Based on the prototype loss, the visual encoder is trained so that a median value of channel vectors of a specific class output from the visual encoder becomes equal to the prototype vector of the corresponding class output from the semantic encoder.
According to claim 2,

The loss shared by the visual encoder and the semantic encoder includes a cross entropy loss,

According to the cross entropy loss, the visual encoder is trained so that channel vectors of the same class are relatively close in an embedding space and channel vectors of different classes are relatively far apart in the embedding space.
According to claim 1,

The semantic segmentation device, characterized in that the semantic encoder is trained using a semantic loss so that a distance between feature vectors for each class input to the semantic encoder and a distance between prototype vectors for each class output from the semantic encoder become equal.
According to claim 3,

The semantic segmentation device characterized in that for calculating the prototype loss, a first semantic segmentation map generated by applying a feature vector for each class to the input image is input to the semantic encoder.
According to claim 6,

The semantic segmentation device, characterized in that the prototype loss is calculated by the following equation.

In the above equation, L_center is the prototype loss, c is a class, S is the total set of classes, p is a pixel, R_c is the set of pixels of a specific class c, v(p) is the channel vector at pixel position p in the visual feature map, μ(p) is a prototype vector output from the semantic encoder by inputting the feature vector at pixel position p in the first semantic segmentation map, and d() is a function outputting the distance between two variables.
According to claim 3,

The semantic segmentation device, characterized in that, in order to calculate the prototype loss, a second semantic segmentation map is input to the semantic encoder, the second semantic segmentation map being generated by applying a feature vector for each class to the input image, reducing the image to which the feature vector for each class is applied, and enlarging the reduced image to the original image size by linear interpolation.
According to claim 8,

The prototype loss is calculated by the following equation.

In the above equation, L_bar is the prototype loss, c is a class, S is the total set of classes, p is a pixel, R_c is the set of pixels of a particular class c, v(p) is the channel vector at pixel position p in the visual feature map, the prototype term is a prototype vector output from the semantic encoder by inputting the feature vector at pixel position p in the second semantic segmentation map, and d() is a function outputting the distance between two variables.
(a) receiving an input image and outputting a visual feature map through a neural network operation of a visual encoder;

(b) receiving a feature vector for each class and outputting a prototype vector for each class through a neural network operation of a semantic encoder;

(c) designating a class for each pixel of the visual feature map by comparing the prototype vector for each class with a channel vector for each pixel of the visual feature map;

In step (c), a class corresponding to a prototype vector most similar to a channel vector of a specific pixel is designated as a class of the corresponding pixel,

The prototype vector and the channel vector are set to the same length, and the visual encoder and the semantic encoder share at least one common loss and are trained simultaneously.
According to claim 10,

The loss shared by the visual encoder and the semantic encoder includes a prototype loss,

The prototype loss corresponds to a loss between a prototype vector of a specific class output from the semantic encoder and a median value of channel vectors of the corresponding class of the visual feature map.
According to claim 11,

Based on the prototype loss, the visual encoder is trained so that a median value of channel vectors of a specific class output from the visual encoder becomes equal to the prototype vector of the corresponding class output from the semantic encoder.
According to claim 11,

The loss shared by the visual encoder and the semantic encoder includes a cross entropy loss,

According to the cross entropy loss, the visual encoder is trained so that channel vectors of the same class are relatively close in an embedding space and channel vectors of different classes are relatively far apart in the embedding space.
According to claim 10,

The semantic segmentation method, characterized in that the semantic encoder is trained using a semantic loss so that a distance between feature vectors for each class input to the semantic encoder and a distance between prototype vectors for each class output from the semantic encoder become equal.
According to claim 10,

The semantic segmentation method, characterized in that the semantic encoder is trained using a semantic loss so that a distance between feature vectors for each class input to the semantic encoder and a distance between prototype vectors for each class output from the semantic encoder become equal.
According to claim 15,

The semantic segmentation method, characterized in that the prototype loss is calculated by the following equation.

In the above equation, L_center is the prototype loss, c is a class, S is the total set of classes, p is a pixel, R_c is the set of pixels of a specific class c, v(p) is the channel vector at pixel position p in the visual feature map, μ(p) is a prototype vector output from the semantic encoder by inputting the feature vector at pixel position p in the first semantic segmentation map, and d() is a function outputting the distance between two variables.
According to claim 12,

The semantic segmentation method, characterized in that, in order to calculate the prototype loss, a second semantic segmentation map is input to the semantic encoder, the second semantic segmentation map being generated by applying a feature vector for each class to the input image, reducing the image to which the feature vector for each class is applied, and enlarging the reduced image to the original image size by linear interpolation.
According to claim 17,

The semantic segmentation method, characterized in that the prototype loss is calculated by the following equation.

In the above equation, L_bar is the prototype loss, c is a class, S is the total set of classes, p is a pixel, R_c is the set of pixels of a particular class c, v(p) is the channel vector at pixel position p in the visual feature map, the prototype term is a prototype vector output from the semantic encoder by inputting the feature vector at pixel position p in the second semantic segmentation map, and d() is a function outputting the distance between two variables.