KR20200115706A - Semantic image segmentation method based on deep-learning and apparatus thereof

Info

Publication number: KR20200115706A
Application number: KR1020190027472A
Authority: KR
Inventors: 이록규; 박지혁; 강민석; 김현기
Original assignee: 엔에이치엔 주식회사
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2020-10-08
Also published as: KR102214465B1

Abstract

The present invention relates to a method for deep learning-based image segmentation, and a device therefor. According to the present invention, the device comprises: a communication unit for receiving a training data set; a storage unit for storing a deep learning model; and a processor for training the deep learning model by using the training data set. According to the present invention, segmentation of the character faces in a cartoon image can be rapidly and accurately performed.

Description

Deep learning-based image segmentation method and its device {SEMANTIC IMAGE SEGMENTATION METHOD BASED ON DEEP-LEARNING AND APPARATUS THEREOF}

The present invention relates to a deep learning-based image segmentation method and an apparatus therefor. More specifically, it relates to a method and apparatus for segmenting a face in a cartoon image using a deep learning neural network.

As user terminals such as smartphones, tablet PCs, personal digital assistants (PDAs), and notebook computers have become widespread and information processing technology has advanced, research is active on image editing technology that lets users capture images or videos with a user terminal and edit the captured footage to suit their needs.

One commercial application of image editing is cartoon face editing, in which the skin color, hairstyle, or hair color of a character's face in a cartoon image is changed, or the face is replaced with another face. To edit only the character's face region, the face must first be segmented within the cartoon image. Image segmentation is a technique for separating regions of an image based on a threshold, edges, or similar cues. In particular, separating a meaningful region such as a character's face is called semantic segmentation. However, to obtain accurate, fine-grained, noise-free segmentation across different images, an image editor must spend considerable time and effort finding a suitable threshold or edge for each image.
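To make the threshold-based baseline concrete, the following is a minimal sketch, assuming a grayscale image stored as nested lists of values from 0 to 255. The function name `threshold_segment` and the threshold value 128 are illustrative choices and do not come from the patent.

```python
def threshold_segment(image, threshold):
    """Return a binary mask: 1 where the pixel is >= threshold, else 0."""
    return [[1 if px >= threshold else 0 for px in row] for row in image]


# Toy 2x3 grayscale image (hypothetical values).
image = [
    [10, 200, 210],
    [15, 190, 30],
]

# The threshold has to be chosen by hand for each image,
# which is the manual effort the background section describes.
mask = threshold_segment(image, 128)
```

A threshold that works for one image can fail on another with different brightness or line art, which is why a hand-tuned value per image becomes costly.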

To overcome this problem, image segmentation using deep learning has recently been actively explored. Representative deep learning-based segmentation techniques include U-Net, SegNet, Tiramisu, and DeepLab. These techniques are limited in that they are supervised learning methods: the neural network can be trained only if a large training data set is available. The training data set contains training images from the same domain as the images to be segmented, together with ground truth labels for those training images. For cartoon images, however, training data sets for segmentation are very rare, and building one by hand is time-consuming and costly.

Domain adaptation techniques such as XGAN and DTN have been introduced to overcome the supervised approach's need for a training data set. They succeeded at translating images of one domain into images of another domain, but could not be used for image segmentation. (A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Mosseri, F. Cole, and K. Murphy. XGAN: Unsupervised Image-to-Image Translation for Many-to-Many Mappings. arXiv preprint arXiv:1711.05139, 2018; Y. Taigman, A. Polyak, and L. Wolf. Unsupervised Cross-Domain Image Generation. In ICLR, 2017.)

In addition, a medical image segmentation technique using domain adaptation has been introduced. Because it is built from a very simple neural network, however, when the feature difference between domains exceeds a certain level, domain adaptation either cannot be applied or, even when applied, fails to produce accurate segmentation. (M. Javanmardi and T. Tasdizen. Domain Adaptation for Biomedical Image Segmentation Using Adversarial Training. In ISBI, 2018.)

Non-Patent Document 1: O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015.
Non-Patent Document 2: V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:2481-2495, 2017.
Non-Patent Document 3: S. Jegou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. In CVPR, 2017.
Non-Patent Document 4: L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587, 2017.

To solve the problems described above, the present invention proposes a deep learning-based image segmentation method and an apparatus therefor; more specifically, a method and apparatus for segmenting a face in a cartoon image using a deep learning neural network.

Specifically, an object of the present invention is to provide a deep learning-based image segmentation method and apparatus capable of segmenting a character's face in a cartoon image so that the face can be changed.

Another object of the present invention is to provide a deep learning-based image segmentation method and apparatus capable of segmenting a character's face in a cartoon image quickly, accurately, and in fine detail.

Another object of the present invention is to provide a deep learning-based image segmentation method and apparatus capable of segmenting cartoon images by training a neural network without a training data set for cartoon images.

Another object of the present invention is to provide a deep learning-based image segmentation method and apparatus capable of building a deep learning model that segments cartoon images using a training data set of real photographic images.

A deep learning-based image segmentation apparatus according to an embodiment includes: a communication unit that receives a training data set including a target domain image that is a cartoon image, a source domain image that is a real photographic image, and a ground truth label in which the face in the source domain image is segmented; a storage unit that stores a deep learning model; and a processor that trains the deep learning model with the training data set to segment faces and performs face segmentation of the target domain image using the trained deep learning model. The processor uses the deep learning model to extract a common content space from the source domain image and a common content space from the target domain image. The common content space extracted from the source domain image is a feature representation of the face region of the source domain image, and the common content space extracted from the target domain image is a feature representation of the face region of the target domain image. The processor may train the deep learning model to perform image segmentation using the common content space extracted from the source domain image, and may perform face segmentation of the target domain image using the common content space extracted from the target domain image.

In addition, the deep learning model may include: a source content encoder that extracts a common content space from the source domain image; a target content encoder that extracts a common content space from the target domain image; a segmentation unit that performs image segmentation using the common content space extracted from the source domain image or from the target domain image; and a domain classifier that distinguishes whether a common content space was extracted from the source domain image or from the target domain image. The processor may train the domain classifier to distinguish whether a common content space was extracted by the source content encoder or by the target content encoder, and may train the source content encoder and the target content encoder so that the domain classifier cannot make that distinction.
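The opposing training goals of the domain classifier and the two content encoders can be illustrated with a small numeric sketch. It assumes a standard binary cross-entropy formulation, which this section of the patent does not spell out; the function names and the choice of a 0.5 target for the encoders are illustrative assumptions, not the patent's stated loss.

```python
import math


def bce(p, label):
    """Binary cross-entropy for one predicted probability p and a target label in [0, 1]."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def classifier_loss(p_source, p_target):
    """Domain classifier objective: output 1 for source features, 0 for target features."""
    return bce(p_source, 1.0) + bce(p_target, 0.0)


def encoder_loss(p_source, p_target):
    """Encoder objective (assumed form): make the classifier maximally unsure (0.5)."""
    return bce(p_source, 0.5) + bce(p_target, 0.5)


# p_source / p_target are the classifier's "this is source" probabilities
# for features from the source and target content encoders (toy values).
confident = (0.9, 0.1)   # classifier separates the domains well
confused = (0.5, 0.5)    # classifier cannot tell the domains apart
```

With these toy values, `classifier_loss` is lower for the confident case while `encoder_loss` is lower for the confused case, which is the tug-of-war the paragraph above describes.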

In addition, the deep learning model may further include: a source attribute encoder that extracts a specific attribute space from the source domain image; a source generator that generates a source virtual image using the specific attribute space and the common content space extracted from the source domain image; and a source discriminator that determines whether the source virtual image generated by the source generator is real or fake by comparing it with the source domain image. The processor may train the source discriminator to determine whether the source virtual image is fake or real, and may train the source generator so that the source discriminator cannot determine whether the source virtual image is real or fake.

In addition, the deep learning model may further include: a target attribute encoder that extracts a specific attribute space from the target domain image; a target generator that generates a target virtual image using the specific attribute space and the common content space extracted from the target domain image; and a target discriminator that determines whether the target virtual image generated by the target generator is real or fake by comparing it with the target domain image. The processor may train the target discriminator to determine whether the target virtual image is fake or real, and may train the target generator so that the target discriminator cannot determine whether the target virtual image is real or fake.

In addition, the source attribute encoder may be a convolutional neural network (CNN), the target attribute encoder may be a convolutional neural network (CNN), the source content encoder may be a fully convolutional network (FCN), the target content encoder may be a fully convolutional network (FCN), the source generator may be composed of a plurality of deconvolution layers, the target generator may be composed of a plurality of deconvolution layers, the segmentation unit may be a fully convolutional network (FCN), and the domain classifier may be a convolutional neural network (CNN).
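The component-to-network assignment listed above can be summarized as a small lookup table. This is only an organizational sketch: the dictionary keys and the helper `components_of` are hypothetical names, and the values simply restate the network families named in the paragraph.

```python
# Component -> network family, restating the paragraph above.
ARCHITECTURE = {
    "source_attribute_encoder": "CNN",
    "target_attribute_encoder": "CNN",
    "source_content_encoder": "FCN",
    "target_content_encoder": "FCN",
    "source_generator": "deconvolution layers",
    "target_generator": "deconvolution layers",
    "segmentation_unit": "FCN",
    "domain_classifier": "CNN",
}


def components_of(kind):
    """Return the sorted component names that use the given network family."""
    return sorted(name for name, k in ARCHITECTURE.items() if k == kind)


fcn_parts = components_of("FCN")
```

Grouping by family shows that the parts that must preserve spatial layout (the content encoders and the segmentation unit) are the fully convolutional ones.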

In addition, the processor may train the deep learning model using a domain adversarial loss, a self-reconstruction loss, an image adversarial loss, or a segmentation loss.

In addition, the processor may train the deep learning model using a cross-cycle loss or a semantic consistency loss.
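When several of the loss terms named above are used together, one common way to combine them is a weighted sum. The sketch below assumes that approach; the loss values, the weights, and the function name `total_loss` are hypothetical and are not taken from the patent.

```python
def total_loss(losses, weights):
    """Weighted sum of named loss terms; terms without a weight default to 1.0."""
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Toy per-batch loss values (hypothetical numbers).
losses = {
    "domain_adversarial": 0.7,
    "self_reconstruction": 0.2,
    "image_adversarial": 0.5,
    "segmentation": 0.3,
}

# Hypothetical weighting that emphasizes reconstruction.
weights = {"self_reconstruction": 10.0}

total = total_loss(losses, weights)
```

A cross-cycle or semantic consistency term could be added to `losses` in the same way without changing the function.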

A deep learning-based image segmentation method according to an embodiment performs face segmentation of a cartoon image by means of an image segmentation apparatus that includes a communication unit, a storage unit storing a deep learning model, and a processor that runs the deep learning model. The method includes: receiving, by the communication unit, a target domain image that is the cartoon image; receiving, by the communication unit, a training data set including a source domain image that is a real photographic image and a ground truth label in which the face in the source domain image is segmented; training, by the processor, the deep learning model with the training data set so that it segments the face in the target domain image; completing the training, by the processor, to build an image segmentation deep learning model; and inputting, by the processor, the target domain image into the image segmentation deep learning model to segment the face in the target domain image. The processor uses the deep learning model to extract a common content space from the source domain image and a common content space from the target domain image. The common content space extracted from the source domain image is a feature representation of the face region of the source domain image, and the common content space extracted from the target domain image is a feature representation of the face region of the target domain image. The processor may train the deep learning model to perform image segmentation using the common content space extracted from the source domain image, and may perform face segmentation of the target domain image using the common content space extracted from the target domain image.

In addition, the deep learning model may include: a source content encoder that extracts a common content space from the source domain image; a target content encoder that extracts a common content space from the target domain image; a segmentation unit that performs image segmentation using the common content space extracted from the source domain image or from the target domain image; and a domain classifier that distinguishes whether a common content space was extracted from the source domain image or from the target domain image. The processor may train the domain classifier to distinguish whether a common content space was extracted by the source content encoder or by the target content encoder, and may train the source content encoder and the target content encoder so that the domain classifier cannot make that distinction.

In addition, the source attribute encoder may be a convolutional neural network (CNN), the target attribute encoder may be a convolutional neural network (CNN), the source content encoder may be a fully convolutional network (FCN), the target content encoder may be a fully convolutional network (FCN), the source generator may be composed of a plurality of deconvolution layers, the target generator may be composed of a plurality of deconvolution layers, the segmentation unit may be a fully convolutional network (FCN), and the domain classifier may be a convolutional neural network (CNN).

In addition, the processor may train the deep learning model using a domain adversarial loss, a self-reconstruction loss, an image adversarial loss, or a segmentation loss.

The deep learning-based image segmentation method and apparatus according to the embodiment can segment a character's face in a cartoon image so that the face can be changed.

In addition, the deep learning-based image segmentation method and apparatus according to the embodiment can segment a character's face in a cartoon image quickly, accurately, and in fine detail.

In addition, the deep learning-based image segmentation method and apparatus according to the embodiment can segment cartoon images by training a neural network without a training data set for cartoon images.

In addition, the deep learning-based image segmentation method and apparatus according to the embodiment can build a deep learning model that segments cartoon images using a training data set of real photographic images.

FIG. 1 is a conceptual diagram of a deep learning-based image segmentation system according to an embodiment of the present invention.
FIG. 2 is an internal block diagram of a user terminal for executing an application that requests a deep learning-based image segmentation function according to an embodiment of the present invention.
FIG. 3 is an internal block diagram of a machine learning server for executing an application having a deep learning-based image segmentation function according to an embodiment of the present invention.
FIG. 4 shows faces segmented in a source domain image and in a target domain image, respectively.
FIG. 5 is a diagram for explaining the structure of a deep learning model according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining how a deep learning model is trained according to an embodiment of the present invention.
FIG. 7 shows a target domain image being segmented using a deep learning model according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining losses used to train a deep learning model according to one embodiment.
FIG. 9 is a diagram for explaining losses used to train a deep learning model according to another embodiment.
FIG. 10 compares an image segmentation result produced by a deep learning model according to an embodiment of the present invention with one produced by a deep learning model according to the prior art.
FIG. 11 is a signal flow diagram of a deep learning-based image segmentation system according to an embodiment of the present invention.
FIG. 12 shows a deep learning-based image segmentation method according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. The effects and features of the present invention, and the methods of achieving them, will become clear with reference to the embodiments described in detail below together with the drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various forms. In the following embodiments, terms such as first and second are not used in a limiting sense but to distinguish one component from another. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "include" or "have" mean that the features or components described in the specification are present, and do not exclude in advance the possibility that one or more other features or components may be added. In the drawings, the size of components may be exaggerated or reduced for convenience of description. For example, the size and thickness of each component shown in the drawings are chosen arbitrarily for convenience of description, so the present invention is not necessarily limited to what is shown.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 is a conceptual diagram of a deep learning-based image segmentation system according to an embodiment of the present invention; FIG. 2 is an internal block diagram of a user terminal for executing an application that requests a deep learning-based image segmentation function according to an embodiment of the present invention; FIG. 3 is an internal block diagram of a machine learning server for executing an application having a deep learning-based image segmentation function according to an embodiment of the present invention; and FIG. 4 shows faces segmented in a source domain image and in a target domain image, respectively.

Referring to FIG. 1, a deep learning-based image segmentation system according to an embodiment may include a terminal 100, an image providing server 200, and a machine learning server 300.

The components of FIG. 1 may be connected through a network. Here, a network means a connection structure that allows information to be exchanged between nodes such as the terminal 100, the image providing server 200, and the machine learning server 300. Examples of such a network include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the Internet, a LAN (Local Area Network), a wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

<Terminal (100)>

First, the terminal 100 is the terminal of a user who wishes to receive a cartoon image face editing service. The terminal 100 may connect to the machine learning server 300 through wired or wireless communication and be provided with an image segmentation function that segments the face in a cartoon image. The terminal 100 may then edit the character's face using the segmented image it receives.

The terminal 100 is not limited to any particular type of device, as long as it can execute an application with a function for editing faces in cartoon images. For example, the terminal 100 may be a smartphone, a mobile phone, a laptop computer, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a navigation device, a tablet PC, an ultrabook, a wearable device, a glasses-type terminal (smart glass), or a kiosk.

The terminal 100 may include a processor that processes data and a memory in which a cartoon image face editing program is installed. The processor may read the cartoon image face editing program and provide the user with the cartoon image face editing function.

자세히, 사용자는 단말기(100)를 통해 카툰 이미지의 얼굴 편집 프로그램을 실행한 후 얼굴을 바꾸고자 하는 카툰 이미지를 선택하고, 선택한 카툰 이미지를 머신러닝서버(300)에 송신하면, 머신러닝서버(300)로부터 해당 카툰 이미지의 얼굴이 영역화된 영역화 이미지를 획득하고, 영역화된 얼굴의 피부색, 헤어 스타일, 헤어 색을 변경하거나 다른 얼굴 형태로 변경하는 등의 편집 기능을 제공할 수 있다.In detail, the user executes a face editing program of the cartoon image through the terminal 100, selects a cartoon image to change the face, and transmits the selected cartoon image to the machine learning server 300, the machine learning server 300 ), and an editing function such as changing the skin color, hair style, and hair color of the regionized face, or changing to another face shape, may be provided.

- Internal units of the terminal 100 -

Hereinafter, the specific configuration of the terminal 100 will be described in detail with reference to FIG. 2.

The terminal 100 according to the embodiment may include an input unit 110, a display 140, a communication unit 120, a memory 130, and a processor 150.

First, the terminal 100 may include an input unit 110 that detects user input. For example, the input unit 110 may detect an execution input that turns the power of the terminal 100 on or off, as well as setting and execution inputs for the various functions of the terminal 100.

The input unit 110 may detect user input by including at least one of a touch input unit that senses the user's touch (for example, a touch sensor, a touch key, or a mechanical push key) and a microphone that senses voice input.

The terminal 100 may also include a display 140 that shows the progress screen of the cartoon image face editing function. The display 140 may form a mutual layer structure with a touch sensor, or be formed integrally with it, to implement a touch screen. Such a touch screen may function as the user input unit 110 that provides an input interface between the terminal 100 and the user, and at the same time provide an output interface between the terminal 100 and the user.

The display 140 may include at least one of a liquid crystal display (LCD), a thin film transistor-liquid crystal display (TFT LCD), an organic light-emitting diode (OLED) display, a flexible display, a 3D display, and an e-ink display.

Through the touch screen, the user can select the hairstyle, hair color, skin color, face shape, and so on to be applied, and can check the cartoon image whose face has been converted as selected.

The terminal 100 may also include a communication unit 120 that communicates wirelessly with the machine learning server 300.

In detail, the terminal 100 may transmit a cartoon image (hereinafter, the target domain image (TD)) to the machine learning server 300 through the communication unit 120. Conversely, the terminal 100 may receive an image in which the face has been segmented from the machine learning server 300 through the communication unit 120.

The communication unit 120 may transmit and receive wireless signals to and from at least one of a base station, an external terminal, and a server over a network built according to communication schemes such as the following: mobile communication standards (for example, GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), HSDPA (High Speed Downlink Packet Access), HSUPA (High Speed Uplink Packet Access), LTE (Long Term Evolution), and LTE-A (Long Term Evolution-Advanced)), WLAN (Wireless LAN), Wi-Fi (Wireless Fidelity), Wi-Fi Direct, DLNA (Digital Living Network Alliance), WiBro (Wireless Broadband), and WiMAX (Worldwide Interoperability for Microwave Access).

The memory 130 may store a plurality of application programs (applications) that run on the terminal 100, as well as data and instructions for the operation of the terminal 100.

In detail, the cartoon image face editing program may be stored in the memory 130, installed on the terminal 100, and driven by the processor 150 to perform the operations (or functions) of the terminal 100.

In hardware terms, the memory 130 may be any of various storage devices such as a ROM, RAM, EPROM, flash drive, or hard drive. The memory 130 may also be web storage that performs the storage function of the memory 130 over the Internet.

Finally, the terminal 100 may include a processor 150 that controls the overall operation of each unit and performs the data processing needed to provide the cartoon image face editing service.

The processor 150 may be implemented as ASICs (application specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field programmable gate arrays), controllers, micro-controllers, microprocessors, or any other form of processor for performing functions.

<Image providing server (200)>

The image providing server 200 may provide a training data set to the machine learning server 300. More specifically, the training data set may include source domain images and ground truth labels for those images. Referring to FIG. 4, a source domain image (SD) may be a photographic image that contains a human face. A ground truth label may be label data in which the face in the source domain image (SD) is correctly segmented. As one example, the image providing server 200 may store in its database a training data set covering a large volume of previously published source domain images (SD). As another example, the image providing server 200 may build, in its data processing unit, a deep learning neural network for segmenting faces in new source domain images (SD) and thereby generate a new training data set.
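The pairing of a source domain image with its ground truth label can be pictured with a small data structure. The following is only an illustrative sketch: the names `TrainingSample` and `face_ratio` are hypothetical, and the binary label encoding (1 for face pixels, 0 for everything else) is an assumption, since the description does not specify the label format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


@dataclass
class TrainingSample:
    """One hypothetical training pair: a source domain image and its label."""
    image: List[List[Pixel]]  # H x W grid of RGB pixels
    label: List[List[int]]    # H x W grid; assumed encoding: 1 = face, 0 = other

    def __post_init__(self) -> None:
        # The label must align pixel-for-pixel with the image.
        if len(self.image) != len(self.label):
            raise ValueError("image and label heights differ")
        for img_row, lbl_row in zip(self.image, self.label):
            if len(img_row) != len(lbl_row):
                raise ValueError("image and label widths differ")


def face_ratio(sample: TrainingSample) -> float:
    """Fraction of pixels marked as face in the ground truth label."""
    total = sum(len(row) for row in sample.label)
    return sum(sum(row) for row in sample.label) / total
```

A check such as `face_ratio` is one simple way to spot empty or misaligned labels before training.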

The image providing server 200 may include a communication module for communicating with the terminal 100, a data processing unit that processes data, and a database that stores various data.

<Machine learning server (300)>

The machine learning server 300 may include a separate cloud server or computing device. The machine learning server 300 may also be a neural network system installed in the processor of the terminal 100 or in the data processing unit of the image providing server 200; in the following, however, it is described as a device separate from the terminal 100 and the image providing server 200.

Referring to FIG. 3, the machine learning server 300 may receive the target domain image (TD) from the terminal 100 and the training data set from the image providing server 200 through the communication unit 310. Referring to FIG. 3, the target domain image (TD) may be a cartoon image, entered by the user on the terminal 100, that contains a character's face. As described above, the training data set may include source domain images (SD) containing the faces of real people and ground truth labels in which the faces of those images are segmented. Because the target domain image (TD) is a cartoon image and the source domain image (SD) is a photographic image, the two belong to different domains.

The machine learning server 300 may build an image segmentation deep learning model using the training data set that contains the source domain images (SD), and then perform image segmentation on the target domain image (TD). As an example, referring to FIG. 4, the machine learning server 300 may input a source domain image (SD) into the deep learning neural network, compare the resulting source segmentation image (SS) with the ground truth label, and repeat this comparison until the network has learned to segment images sufficiently well. The machine learning server 300 may then input the target domain image (TD) into the trained network (that is, the image segmentation deep learning model) and perform image segmentation to generate a target segmentation image (TS). Accordingly, the present invention can implement zero-shot learning, in which the neural network is trained without any training data set for the target domain and can nevertheless segment the target domain image (TD). How the machine learning server 300 trains the image segmentation deep learning model is described in detail with reference to FIGS. 5 to 12.

The machine learning server 300 may also include a processor 330 that processes data and a storage unit 320 that stores the image segmentation deep learning model. The processor 330 reads the image segmentation deep learning model and, according to the constructed neural network system, performs the image segmentation deep learning described below on the target domain image (TD).

Depending on the embodiment, the processor 330 may be configured to include a main processor that controls all units and a plurality of graphics processing units (GPUs) that handle the large volume of computation required to run the neural network for image segmentation deep learning.

FIG. 5 is a diagram illustrating the structure of a deep learning model according to an embodiment of the present invention, FIG. 6 is a diagram illustrating how the deep learning model according to an embodiment of the present invention is trained, and FIG. 7 shows a target domain image being segmented using the deep learning model according to an embodiment of the present invention.

Referring to FIG. 5, the deep learning model according to an embodiment of the present invention may include a plurality of encoders. An encoder may extract a common content space (shared content space) or a specific attribute space from the input domain image. The common content space may be a feature vector or feature representation used for image segmentation. For example, the feature representation of the face region in a source domain image (photographic image) and the feature representation of the face region in a target domain image (cartoon image) may both belong to the common content space. In other words, the common content space may be a shared feature representation that is present in images from different domains. The specific attribute space may be a feature representation that is specific to one domain. For example, the background, facial skin tone, color, and texture of the source domain image may belong to the specific attribute space, and so may the facial skin tone, color, and texture of the target domain image. That is, the specific attribute space may be a feature representation that is not used for image segmentation.

The plurality of encoders may include a plurality of content encoders: a source content encoder 11 and a target content encoder 13, which may be separate components. A content encoder may extract, from the input domain image, the common content space used to segment the face. For example, the source content encoder 11 may extract the common content space from the source domain image, and the target content encoder 13 may extract the common content space from the target domain image. A content encoder may be configured as a neural network. As one example, a content encoder may be a fully convolutional network (FCN). For example, a content encoder may consist of several stacked 3X3 convolutional layers, take a 64X64X3 RGB image (the domain image) as input, and output a feature representation.
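The description fixes only the input size (64X64X3) and the kernel size (3X3) of the content encoder. The sketch below shows how the spatial size of the feature representation follows from a stack of such layers, using the standard convolution output-size formula. The channel counts, strides, and padding are hypothetical values chosen for illustration and are not taken from the description.

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_shape(
    height: int, width: int, layers: List[Tuple[int, int]]
) -> Tuple[int, int, int]:
    """Track (H, W, C) through stacked 3x3 convolutions.

    layers: list of (out_channels, stride) pairs, one per convolution.
    """
    channels = 3  # RGB input
    for out_channels, stride in layers:
        height = conv_out(height, stride=stride)
        width = conv_out(width, stride=stride)
        channels = out_channels
    return height, width, channels


# Hypothetical plan: three 3x3 convolutions, the last two downsampling by 2.
shape = encoder_output_shape(64, 64, [(32, 1), (64, 2), (128, 2)])
print(shape)  # (16, 16, 128)
```

Because a fully convolutional network has no fully connected layer, the output keeps a spatial layout, which is what allows a later stage to predict a label per pixel.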

The plurality of encoders may include a plurality of attribute encoders: a source attribute encoder 12 and a target attribute encoder 14, which may be separate components. An attribute encoder may extract, from the input domain image, the specific attribute space that is not needed for image segmentation. For example, the source attribute encoder 12 may extract a specific attribute space such as the background, facial skin tone, color, and texture from the source domain image, and the target attribute encoder 14 may extract the same kinds of attributes from the target domain image. An attribute encoder may be configured as a neural network. As one example, an attribute encoder may be a convolutional neural network (CNN). For example, an attribute encoder may consist of several stacked 3X3 convolutional layers followed by a fully connected layer as the last layer, take a 64X64X3 RGB image (the domain image) as input, and output a feature representation.

The deep learning model according to an embodiment of the present invention may include a domain classifier 40. The domain classifier 40 may discriminate which domain a feature map input from the content encoders came from. For example, the domain classifier 40 may determine whether an input common content space was output by the source content encoder 11 or by the target content encoder 13. As described later, the domain classifier 40 may be trained to tell which domain a common content space belongs to, while the content encoders may be trained so that the domain classifier 40 cannot tell. That is, the domain classifier 40 and the content encoders are trained competitively against each other in the manner of a generative adversarial network (GAN). For reference, a GAN is a machine learning model that consists of a generator, which produces synthetic data samples, and a discriminator, which determines whether an input sample is real data, and that is built through adversarial training between the two. Since those skilled in the field of machine learning are already familiar with the basics of GANs, further description is omitted.

The domain classifier 40 may be configured as a neural network. As one example, the domain classifier 40 may be a convolutional neural network (CNN). For example, the domain classifier 40 may be a convolutional neural network with a single output node and a fully connected layer as the last layer, taking a common content space as input and outputting a sigmoid probability value in the range of 0 to 1. For example, the domain classifier 40 may judge the input to be a common content space output by the source content encoder if the sigmoid probability exceeds 0.5, and a common content space output by the target content encoder if it is below 0.5. After GAN-based training, the domain classifier 40 may output 0.5 regardless of which content encoder supplied the common content space. That is, the domain classifier 40 can no longer distinguish which domain image the feature map was extracted from. Put differently, the domain classifier 40 judges the common content spaces output by content encoders of different domains to have come from a content encoder of the same domain.
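The 0.5 decision rule of the domain classifier can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the single scalar `logit` stands in for the output node of the classifier network before the sigmoid is applied.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify_domain(logit: float) -> str:
    """Apply the 0.5 threshold described for the domain classifier."""
    p = sigmoid(logit)
    if p > 0.5:
        return "source"   # judged to come from the source content encoder
    if p < 0.5:
        return "target"   # judged to come from the target content encoder
    return "undecided"    # exactly 0.5: the domains are indistinguishable
```

Under this reading, successful adversarial training pushes the classifier's output toward 0.5 for features from either encoder, which corresponds to the "undecided" case.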

The deep learning model according to an embodiment of the present invention may include a plurality of generators: a source generator 21 and a target generator 22. A generator may receive the specific attribute space from an attribute encoder and the common content space from a content encoder and generate a virtual image. For example, the source generator 21 may generate a source virtual image (SF) using the specific attribute space of the source attribute encoder 12 and the common content space of the source content encoder 11. The target generator 22 may generate a target virtual image (TF) using the specific attribute space of the target attribute encoder 14 and the common content space of the target content encoder 13. A generator may be configured as a neural network. As one example, a generator may be a network composed of a plurality of deconvolutional layers that takes as input the feature representation obtained by concatenating the outputs of the content encoder and the attribute encoder, and outputs an RGB image.
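The description says only that the outputs of the content encoder and the attribute encoder are concatenated before entering the generator. One common way to combine a spatial feature map with a flat attribute vector, assumed here purely for illustration, is to repeat the attribute vector at every spatial position and join it along the channel axis. The function name and the nested-list layout are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # H x W x C


def concat_content_attribute(content: FeatureMap, attribute: List[float]) -> FeatureMap:
    """Append the attribute vector to the channels of every spatial position.

    content:   H x W x C feature map from a content encoder
    attribute: length-A vector from an attribute encoder
    returns:   H x W x (C + A) generator input
    """
    return [[list(pixel) + list(attribute) for pixel in row] for row in content]
```

With a 2x2x2 content map and a length-3 attribute vector, the result is a 2x2x5 map in which the last three channels are identical at every position.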

The deep learning model according to an embodiment of the present invention may include a plurality of discriminators: a source discriminator 31 and a target discriminator 32, which may be separate components. A discriminator may determine whether a virtual image produced by a generator is real or fake, and in doing so helps the generator produce realistic, high-quality images. As described later, a discriminator may be trained to recognize the virtual image produced by the generator as fake, while the generator may be trained so that the discriminator judges its virtual image to be real. That is, the generator and the discriminator are trained competitively against each other in the manner of a GAN. For example, the source discriminator 31 is trained to determine whether the source virtual image (SF) produced by the source generator 21 is fake or real, and the source generator 21 is trained so that the source discriminator 31 cannot tell whether the source virtual image (SF) is real or fake. Likewise, the target discriminator 32 is trained to determine whether the target virtual image (TF) produced by the target generator 22 is fake or real, and the target generator 22 is trained so that the target discriminator 32 cannot tell whether the target virtual image (TF) is real or fake.

A discriminator may be configured as a neural network. As one example, a discriminator may be a convolutional neural network (CNN). For example, a discriminator may be a convolutional neural network with a single output node and a fully connected layer as the last layer, receiving as input the source domain image (SD) and the virtual image output by the generator, and outputting a sigmoid probability value in the range of 0 to 1.
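The opposing objectives of a discriminator and a generator can be written as two small loss functions over the discriminator's sigmoid outputs. The sketch below uses the widely used binary cross-entropy form with a non-saturating generator loss. This particular form is an assumption for illustration and is not stated in the description, and the function names are hypothetical.

```python
import math


def _clip(p: float, eps: float = 1e-7) -> float:
    """Keep probabilities away from 0 and 1 so that log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Low when the discriminator scores real images near 1 and fakes near 0."""
    return -(math.log(_clip(d_real)) + math.log(1.0 - _clip(d_fake)))


def generator_loss(d_fake: float) -> float:
    """Low when the discriminator scores the generated image near 1 (real)."""
    return -math.log(_clip(d_fake))
```

Here `d_real` and `d_fake` are the discriminator's outputs for a real domain image and a generated virtual image respectively. Lowering one loss tends to raise the other, which is the competitive training the paragraph describes.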

The deep learning model according to an embodiment of the present invention may include a segmentation unit 50. The segmentation unit 50 is trained to perform image segmentation using the training data set for source domain images and, once trained, can segment a target domain image when one is input. This is possible because the content encoders are trained to deceive the domain classifier 40, so that the common content space output by the source content encoder 11 and the common content space output by the target content encoder 13 are aligned with each other in the latent space. The segmentation unit 50 may take this domain-invariant common content space as input and output a segmented image. As an example, referring to FIG. 5, the source content encoder 11 may extract the common content space from the source domain image (SD) and input it to the segmentation unit 50. The segmentation unit 50 may perform image segmentation using the input common content space to generate a source segmentation image (SS), and may be trained to segment correctly by comparing the segmentation result with the ground truth labels. Referring to FIG. 6, the sufficiently trained segmentation unit 50 may receive the common content space that the target content encoder 13 extracted from the target domain image (TD) and perform image segmentation to generate a target segmentation image (TS). The segmentation unit 50 may be configured as a neural network. As one example, the segmentation unit 50 may be a fully convolutional network (FCN).
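The comparison between the segmentation result and the ground truth labels can be expressed as a per-pixel loss. The sketch below uses pixel-wise binary cross-entropy over a predicted face-probability mask. Both the loss form and the binary label encoding (1 for face pixels, 0 for everything else) are assumptions made for illustration, since the description does not fix them here, and `pixelwise_bce` is a hypothetical name.

```python
import math
from typing import List


def pixelwise_bce(pred: List[List[float]], label: List[List[int]], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a probability mask and a 0/1 label mask."""
    total, count = 0.0, 0
    for p_row, y_row in zip(pred, label):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count
```

A mask that matches the label gives a loss near zero, while an uninformative mask of 0.5 everywhere gives log 2. Back-propagating such a loss is one way the segmentation unit could be driven toward the ground truth labels.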

The deep learning model according to an embodiment of the present invention may be trained to perform image segmentation using the losses described with reference to FIG. 8 and/or FIG. 9. As an example, the deep learning model may be trained by back-propagating each of the losses described below to the corresponding components.

FIG. 8 is a diagram illustrating the losses used for training a deep learning model according to an embodiment.

Referring to FIG. 8, the deep learning model according to an embodiment may use a domain adversarial loss during training. The domain adversarial loss places the common content space output by the source content encoder 11 and the common content space output by the target content encoder 13 in the same latent space, so that the two cannot be distinguished from each other. That is, the domain classification unit 40 may be trained to maximize its classification accuracy, while the source content encoder 11 and the target content encoder 13 may be trained to minimize that accuracy. The domain adversarial loss is given by Equation 1.

(Equation 1)

In Equation 1, E is the expectation function, x_s is the source domain image (SD), and x_t is the target domain image (TD).
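
The opposing training objectives described above can be sketched in a few lines of Python. This is a minimal illustration and not the form of Equation 1: it assumes that the domain classification unit outputs a probability that a common content space came from the source domain, and it uses a binary cross-entropy, which is a common choice for this kind of objective. All function names are hypothetical.

```python
import math


def bce(p, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability p of 'source'."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def domain_classifier_loss(p_source_feats, p_target_feats):
    """Loss minimized by the domain classification unit.

    p_source_feats: predicted 'source' probabilities for source content spaces.
    p_target_feats: predicted 'source' probabilities for target content spaces.
    The classifier wants source -> 1 and target -> 0.
    """
    losses = [bce(p, 1) for p in p_source_feats]
    losses += [bce(p, 0) for p in p_target_feats]
    return sum(losses) / len(losses)


def encoder_adversarial_loss(p_source_feats, p_target_feats):
    """Loss minimized by the content encoders (assumed: negated classifier loss).

    The encoders are rewarded when the classifier cannot tell the domains apart.
    """
    return -domain_classifier_loss(p_source_feats, p_target_feats)
```

Under these assumptions, a classifier that outputs 0.5 for every input yields the lowest encoder loss among the cases compared in practice, which corresponds to content spaces that are indistinguishable across domains.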

The deep learning model according to an embodiment may use a self-reconstruction loss during training. The self-reconstruction loss drives the virtual image generated by the deep learning model to reproduce the input domain image as it is. As an example, the source self-reconstruction loss compares the source domain image (SD) with the source virtual image (SF) and expresses, as a loss, how closely the source virtual image (SF) reconstructs the original source domain image (SD). The target self-reconstruction loss compares the target domain image (TD) with the target virtual image (TF) and expresses, as a loss, how closely the target virtual image (TF) reconstructs the original target domain image (TD).
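
As an illustration of the comparison described above, the following sketch measures reconstruction quality as a mean absolute pixel difference. The text does not state which distance is used, so the L1 distance here is an assumption, and the function names are hypothetical. Images are represented as nested lists of pixel values.

```python
def l1_loss(image_a, image_b):
    """Mean absolute pixel difference between two equally sized 2D images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(image_a, image_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def self_reconstruction_loss(sd, sf, td, tf):
    """Sum of the source term (SD vs. SF) and the target term (TD vs. TF)."""
    return l1_loss(sd, sf) + l1_loss(td, tf)
```

A perfect reconstruction gives a loss of zero, and the loss grows as the virtual image drifts from the input image.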

The deep learning model according to an embodiment may use an image adversarial loss during training. The image adversarial loss follows the GAN concept and makes the virtual image output by a generator look realistic. For example, the source image adversarial loss expresses, as a loss, the degree to which the source discriminator 31 fails to identify the source virtual image (SF) output by the source generator 21 as fake. The target image adversarial loss expresses, as a loss, the degree to which the target discriminator 32 fails to identify the target virtual image (TF) output by the target generator 22 as fake.
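
The generator/discriminator relationship described above can be sketched with the standard GAN objective. The text only says that the loss follows the GAN concept, so the log-loss form and the non-saturating generator loss below are assumptions, and the function names are hypothetical. Here d_real and d_fake are the probabilities that a discriminator assigns to "real" for a domain image and for a virtual image, respectively.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Low when the discriminator rates real images high and fakes low."""
    return -(math.log(max(d_real, eps)) + math.log(max(1.0 - d_fake, eps)))


def generator_loss(d_fake, eps=1e-12):
    """Low when the discriminator is fooled into rating the fake as real."""
    return -math.log(max(d_fake, eps))
```

The same pair of functions would apply to both domains: once for the source generator 21 with the source discriminator 31, and once for the target generator 22 with the target discriminator 32.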

The deep learning model according to an embodiment may use a segmentation loss during training. The segmentation loss may be a loss that measures how much the image segmentation result for the source domain image (SD) differs from the ground truth label. Because the target domain image (TD) has no ground truth label, the segmentation loss may be applied only to the source domain image (SD). The segmentation loss is given by Equation 2.

(Equation 2)

In Equation 2, the two terms denote an output pixel of the segmentation unit 50 and the corresponding pixel of the ground truth label, respectively.
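
A pixel-wise comparison between the output of the segmentation unit and the ground truth label can be sketched as follows. The exact form of Equation 2 is not reproduced here; the per-pixel binary cross-entropy below is an assumed, commonly used choice for face/background segmentation, and the function name is hypothetical.

```python
import math


def segmentation_loss(pred, label, eps=1e-12):
    """Mean per-pixel binary cross-entropy.

    pred:  2D list of predicted face probabilities in [0, 1].
    label: 2D list of ground truth values (1 = face, 0 = background).
    """
    total, count = 0.0, 0
    for pred_row, label_row in zip(pred, label):
        for p, y in zip(pred_row, label_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count
```

Only source domain images would be passed to this function during training, since only they have ground truth labels.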

FIG. 9 is a diagram illustrating losses for training a deep learning model according to another embodiment.

Referring to FIG. 9, the deep learning model according to another embodiment may use a cross-cycle loss during training. "Cross-cycle loss" is short for pixel-level cross-cycle consistency loss. The cross-cycle loss may be a loss for improving the domain adaptation performance of the entire neural network of the deep learning model. When the content encoder output for an image of one domain is swapped into the generator of the other domain, that generator may generate a first virtual image that reflects the feature information obtained from the image of the one domain. When the first virtual image is then used as input and the content encoder output is swapped into the generator of the other domain once more, the resulting second virtual image should be restored to the original image of the one domain.

As an example, the deep learning model may generate four virtual images. A first source virtual image (SF1) is generated using the first specific attribute space extracted from the source domain image (SD) by the source attribute encoder 12 and the first common content space extracted from the target domain image (TD) by the target content encoder 13. A first target virtual image (TF1) is generated using the first common content space extracted from the source domain image (SD) by the source content encoder 11 and the first specific attribute space extracted from the target domain image (TD) by the target attribute encoder 14. A second source virtual image (SF2) is generated using the second specific attribute space extracted from the first source virtual image (SF1) by the source attribute encoder 12 and the second common content space extracted from the first target virtual image (TF1) by the target content encoder 13. A second target virtual image (TF2) is generated using the second common content space extracted from the first source virtual image (SF1) by the source content encoder 11 and the second specific attribute space extracted from the first target virtual image (TF1) by the target attribute encoder 14.

The loss obtained by comparing the source domain image (SD) with the second source virtual image (SF2) may be expressed as the source cross-cycle loss, and the loss obtained by comparing the target domain image (TD) with the second target virtual image (TF2) may be expressed as the target cross-cycle loss. The total cross-cycle loss may be expressed as the sum of the source cross-cycle loss and the target cross-cycle loss, as in Equation 3.

(Equation 3)
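
The two swaps described above can be traced with a toy model. In this sketch an image is reduced to a pair of labels (content, attribute), and the encoders and generators are idealized stand-ins that extract and recombine those labels exactly. These stand-ins and all names are hypothetical; real encoders and generators are neural networks, and the cross-cycle loss penalizes whatever difference remains between the inputs and the twice-translated images.

```python
from collections import namedtuple

# Toy image: 'content' is the shared structure (the face),
# 'attribute' is the domain-specific style (photo vs. cartoon).
Image = namedtuple("Image", ["content", "attribute"])


def content_encoder(image):
    """Idealized content encoder: returns the common content space."""
    return image.content


def attribute_encoder(image):
    """Idealized attribute encoder: returns the specific attribute space."""
    return image.attribute


def generator(content, attribute):
    """Idealized generator: recombines content and attribute into an image."""
    return Image(content, attribute)


def cross_cycle(sd, td):
    """Return (SF1, TF1, SF2, TF2) for a source image SD and target image TD."""
    # First swap: each generator receives the other domain's content.
    sf1 = generator(content_encoder(td), attribute_encoder(sd))
    tf1 = generator(content_encoder(sd), attribute_encoder(td))
    # Second swap: contents are exchanged again, which should undo the first.
    sf2 = generator(content_encoder(tf1), attribute_encoder(sf1))
    tf2 = generator(content_encoder(sf1), attribute_encoder(tf1))
    return sf1, tf1, sf2, tf2
```

With ideal components, SF2 equals SD and TF2 equals TD, so both cross-cycle terms are zero. During training, the loss pushes the real networks toward this behavior.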

The deep learning model according to another embodiment may use a semantic consistency loss during training. The semantic consistency loss may be a loss for improving the domain adaptation performance of the entire neural network of the deep learning model. When the first common content space output by the content encoder for an image of one domain is swapped into the generator of the other domain, that generator may generate a first virtual image that reflects the feature information obtained from the image of the one domain. Using the first virtual image as input, the content encoder of the other domain may output a second common content space. The semantic consistency loss measures whether the second common content space of the other domain preserves the first common content space of the one domain. This loss enables the neural network to capture high-level features such as hair color, hairstyle, and facial contours.

As an example, the semantic consistency loss for the source-to-target translation may be obtained by comparing the first common content space output by the source content encoder 11 with the second common content space output by the target content encoder 13. Likewise, the semantic consistency loss for the target-to-source translation may be obtained by comparing the first common content space output by the target content encoder 13 with the second common content space output by the source content encoder 11. The semantic consistency loss for the source-to-target translation is given by Equation 4.

(Equation 4)

The semantic consistency loss for the target-to-source translation can be obtained from Equation 4 in the same way.
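
The comparison of the first and second common content spaces can be sketched as follows. The distance used in Equation 4 is not reproduced here, so the mean absolute difference is an assumption. The encoders and the generator are passed in as plain callables, and the toy stand-ins at the bottom are hypothetical placeholders for the neural networks.

```python
def mean_abs_diff(u, v):
    """Mean absolute difference between two equally long feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def semantic_consistency_loss(image, encode_own, generate_other,
                              encode_other, other_attribute):
    """Content drift after translating an image into the other domain."""
    first_content = encode_own(image)            # first common content space
    first_virtual = generate_other(first_content, other_attribute)
    second_content = encode_other(first_virtual)  # second common content space
    return mean_abs_diff(first_content, second_content)


# Toy stand-ins: an image is a list whose first two values are the content.
def toy_encoder(image):
    return image[:2]


def toy_generator(content, attribute):
    return list(content) + list(attribute)


def drifting_encoder(image):
    """An imperfect encoder that shifts every content value by 0.5."""
    return [value + 0.5 for value in image[:2]]
```

Calling the function with the source encoder and target generator gives the source-to-target term; swapping the roles gives the target-to-source term.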

The deep learning model according to an embodiment of the present invention may be trained to perform image segmentation using a full loss, which is a linear combination of the losses described with reference to FIG. 8 and/or FIG. 9. The full loss is given by Equation 5.

(Equation 5)

In Equation 5, the coefficients of the individual losses are hyperparameters. The user can adjust the relative importance of each loss by adjusting these hyperparameters.
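
A linear combination of named losses with one weight per loss can be sketched as below. The loss names and the weight values in the usage note are hypothetical and are not taken from Equation 5.

```python
def full_loss(losses, weights):
    """Weighted sum of individual loss values.

    losses:  dict mapping a loss name to its current value.
    weights: dict mapping the same names to hyperparameter weights.
    """
    missing = sorted(set(losses) - set(weights))
    if missing:
        raise KeyError("no weight given for: " + ", ".join(missing))
    return sum(weights[name] * value for name, value in losses.items())
```

For example, full_loss({"segmentation": 2.0, "cross_cycle": 1.0}, {"segmentation": 1.0, "cross_cycle": 10.0}) evaluates to 12.0. Setting a weight to zero removes the corresponding loss from training, which is one way to study the contribution of each term.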

<Comparative Example>

FIG. 10 compares image segmentation results of the deep learning model according to an embodiment of the present invention with those of deep learning models according to the prior art.

Referring to FIG. 10, the figure shows the result of segmenting a target domain image (Input), which is a cartoon image, with the deep learning model according to an embodiment of the present invention (Ours), alongside the results of segmenting the same image with network models conventionally used for semantic segmentation. Because no training data set exists for cartoon images, a deep learning model must perform zero-shot learning to segment them. Among the conventional models, U-Net failed to learn to segment cartoon images properly. Mehran is an image segmentation deep learning model that uses the domain adaptation technique from the medical field described in the background art. Mehran segments cartoon images to some degree, but its results are visibly inferior to those of the deep learning model of the present invention (Ours).

<Table 1>

Referring to Table 1, the deep learning model of the present invention (Ours) outperforms the conventional deep learning models U-Net and Mehran in both mean intersection over union (mean IoU) and pixel accuracy on cartoon image segmentation.
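
For reference, the two metrics named above can be computed as in the following sketch, which uses their usual definitions for label maps stored as nested lists. The function names and the two-class default (0 = background, 1 = face) are assumptions for illustration.

```python
def _pairs(pred, label):
    """Flatten two 2D label maps into a list of (predicted, true) pairs."""
    return [(p, y) for pred_row, label_row in zip(pred, label)
            for p, y in zip(pred_row, label_row)]


def pixel_accuracy(pred, label):
    """Fraction of pixels whose predicted class equals the true class."""
    pairs = _pairs(pred, label)
    return sum(p == y for p, y in pairs) / len(pairs)


def mean_iou(pred, label, num_classes=2):
    """Average over classes of intersection / union of the class masks."""
    pairs = _pairs(pred, label)
    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and y == c for p, y in pairs)
        union = sum(p == c or y == c for p, y in pairs)
        if union:  # skip classes absent from both prediction and label
            ious.append(intersection / union)
    return sum(ious) / len(ious)
```

For pred = [[1, 1], [0, 0]] and label = [[1, 0], [0, 0]], the pixel accuracy is 0.75 and the mean IoU is (2/3 + 1/2) / 2 = 7/12.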

<Comparative Example: Effect of Separating Feature Representations During Training>

<Table 2>

Referring to Table 2, the deep learning model according to an embodiment of the present invention (Ours) is trained by inputting only the common content space of the real-life image to the segmentation unit 50. Compared with a deep learning model (Baseline) trained by inputting both the common content space and the specific attribute space of the real-life image to the segmentation unit 50, the present invention (Ours) outperforms Baseline in both mean IoU and pixel accuracy on cartoon image segmentation. This is because content features carry more structural information than attribute features, which allows the face to be segmented more effectively.

<Comparative Example: Effect of Each Loss>

<Table 3>

Referring to Table 3, an experiment measured how much each loss function contributes to performance when training the deep learning model for image segmentation. That is, the deep learning model was trained with the effect of a specific loss removed from the full loss. The results show that the deep learning model performs best when it is trained with all of the losses.

Accordingly, the deep learning-based image segmentation apparatus according to an embodiment can segment a character face in a cartoon image quickly, accurately, and in fine detail. In addition, the apparatus can train a neural network to segment cartoon images without a training data set for cartoon images. Furthermore, the apparatus can build a deep learning model capable of segmenting cartoon images using a training data set of real-life images.

FIG. 11 is a signal flow diagram of a deep learning-based image segmentation system according to an embodiment of the present invention.

Referring to FIG. 11, the terminal 100 may acquire a target domain image (TD) (S11). The target domain image (TD) may be a cartoon image including a character face. The terminal 100 may transmit the target domain image (TD) to the machine learning server 300 and request image segmentation of the target domain image (TD) (S12). Upon receiving the request, the machine learning server 300 may request a training data set from the image providing server 200 (S13). The training data set may be data in which real-life images containing a person's face are paired with their ground truth labels. Upon receiving the request, the image providing server 200 may generate the training data set (S14). As an example, the image providing server 200 may load a public training data set stored in a database. As another example, the image providing server 200 may perform image segmentation on real-life images using an image segmentation deep learning model, use the resulting segmentation images as ground truth labels, and generate a data set in which each real-life image is paired with its ground truth label. The image providing server 200 may transmit the training data set to the machine learning server 300. The machine learning server 300 may train on the received training data set and the target domain image (TD) to generate an image segmentation deep learning model (S16). The training of the deep learning model follows the description given with reference to FIGS. 5 to 10. Once the image segmentation deep learning model is built, the machine learning server 300 may perform image segmentation on the target domain image (TD) (S17). The machine learning server 300 may transmit the target segmentation image (TS) to the terminal 100 (S18). The terminal 100 may edit the face region of the cartoon image using the target segmentation image (TS) (S19).
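
The server-side part of this flow can be summarized as a short function. This is only a structural sketch: the three callables stand in for the image providing server, the training procedure, and the trained model, and every name below is hypothetical.

```python
def handle_segmentation_request(target_image, image_server, trainer, segmenter):
    """Machine learning server handling one request from the terminal (S12)."""
    training_set = image_server()                # S13-S14: request and receive data
    model = trainer(training_set, target_image)  # S16: train with both domains
    return segmenter(model, target_image)        # S17: result is sent back in S18


# Stub components, for illustration only.
def stub_image_server():
    """Returns (real-life image, ground truth label) pairs."""
    return [("real_face_1", "label_1"), ("real_face_2", "label_2")]


def stub_trainer(training_set, target_image):
    """Pretends to train and returns a description of the model."""
    return {"trained_on": len(training_set), "adapted_to": target_image}


def stub_segmenter(model, target_image):
    """Pretends to segment the target image with the trained model."""
    return "mask_of_" + target_image
```

Note that the target domain image is passed to the trainer without any label, which reflects the description above: only the real-life images carry ground truth labels.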

FIG. 12 illustrates a deep learning-based image segmentation method according to an embodiment of the present invention.

Referring to FIG. 12, the deep learning-based image segmentation method according to an embodiment of the present invention may include a step in which the machine learning server 300 receives a target domain image (TD) from the terminal 100 (S1102).

The method may include a step in which the machine learning server 300 receives a training data set from the image providing server 200 (S1101).

The method may include a step in which the machine learning server 300 trains the deep learning model to perform image segmentation (S1103). The training of the deep learning model follows the description given with reference to FIGS. 5 to 10.

The method may include a step in which, once the model has been sufficiently trained, the machine learning server 300 builds a deep learning model that segments the target domain image (S1104).

The method may include a step in which the machine learning server 300 inputs the target domain image (TD) to the built deep learning model to segment the target domain image (S1105). In this case, the machine learning server 300 may generate a target segmentation image (TS).

The method may include a step in which the machine learning server 300 transmits the target segmentation image (TS) to the terminal 100 (S1106).

Accordingly, the deep learning-based image segmentation method according to an embodiment can segment a character face in a cartoon image quickly, accurately, and in fine detail. In addition, the method can train a neural network to segment cartoon images without a training data set for cartoon images. Furthermore, the method can build a deep learning model capable of segmenting cartoon images using a training data set of real-life images.

The embodiments of the present invention described above may be implemented in the form of program instructions that can be executed by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. A hardware device may be replaced by one or more software modules to perform the processing according to the present invention, and vice versa.

The specific implementations described in the present invention are examples and do not limit the scope of the present invention in any way. For brevity of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects of the systems may be omitted. In addition, the connecting lines or connecting members between components shown in the drawings illustrate functional connections and/or physical or circuit connections by way of example, and in an actual device they may be replaced by, or supplemented with, various other functional, physical, or circuit connections. Furthermore, unless a component is specifically described with terms such as "essential" or "important", it may not be a necessary component for applying the present invention.

Although the detailed description of the present invention has been given with reference to preferred embodiments, those skilled in the art or those having ordinary knowledge in the relevant technical field will understand that the present invention can be modified and changed in various ways without departing from the spirit and technical scope of the present invention set forth in the claims below. Therefore, the technical scope of the present invention should not be limited to the content of the detailed description of the specification, but should be determined by the claims.

100 terminal
200 image providing server
300 machine learning server
11 source content encoder
12 source attribute encoder
13 target content encoder
14 target attribute encoder
21 source generator
22 target generator
31 source discriminator
32 target discriminator
40 domain classification unit
50 segmentation unit
SD source domain image
TD target domain image
SS source segmentation image
TS target segmentation image
SF source virtual image
TF target virtual image

Claims

A deep learning-based image segmentation apparatus comprising:
a communication unit configured to receive a training data set including a target domain image that is a cartoon image, a source domain image that is a real-life image, and a ground truth label in which a face in the source domain image is segmented;
a storage unit configured to store a deep learning model; and
a processor configured to train the deep learning model to segment a face using the training data set, and to perform face segmentation of the target domain image using the trained deep learning model,
wherein the processor extracts a common content space from the source domain image and a common content space from the target domain image using the deep learning model,
wherein the common content space extracted from the source domain image is a feature representation of a face region of the source domain image,
wherein the common content space extracted from the target domain image is a feature representation of a face region of the target domain image, and
wherein the processor trains the deep learning model to perform image segmentation using the common content space extracted from the source domain image, and performs face segmentation of the target domain image using the common content space extracted from the target domain image.

The apparatus of claim 1,
wherein the deep learning model comprises:
a source content encoder for extracting a common content space from the source domain image;
a target content encoder for extracting a common content space from the target domain image;
a segmentation unit for performing image segmentation using the common content space extracted from the source domain image or the common content space extracted from the target domain image; and
a domain classification unit for discriminating whether an input is the common content space extracted from the source domain image or the common content space extracted from the target domain image,
wherein the processor trains the domain classification unit to discriminate whether a common content space was extracted by the source content encoder or by the target content encoder, and trains the source content encoder and the target content encoder so that the domain classification unit cannot discriminate whether a common content space was extracted by the source content encoder or by the target content encoder.

The apparatus of claim 2,
wherein the deep learning model further comprises:
a source attribute encoder for extracting a specific attribute space from the source domain image;
a source generator for generating a source virtual image using the specific attribute space and the common content space extracted from the source domain image; and
a source discriminator for determining whether the source virtual image generated by the source generator is real or fake by comparing it with the source domain image,
wherein the processor trains the source discriminator to determine whether the source virtual image is fake or real, and trains the source generator so that the source discriminator cannot determine whether the source virtual image is real or fake.

The apparatus of claim 3,
wherein the deep learning model further comprises:
a target attribute encoder for extracting a specific attribute space from the target domain image;
a target generator for generating a target virtual image using the specific attribute space and the common content space extracted from the target domain image; and
a target discriminator for determining whether the target virtual image generated by the target generator is real or fake by comparing it with the target domain image,
wherein the processor trains the target discriminator to determine whether the target virtual image is fake or real, and trains the target generator so that the target discriminator cannot determine whether the target virtual image is real or fake.

The apparatus of claim 4,
wherein the source attribute encoder is a convolutional neural network (CNN),
the target attribute encoder is a convolutional neural network (CNN),
the source content encoder is a fully convolutional network (FCN),
the target content encoder is a fully convolutional network (FCN),
the source generator is composed of a plurality of deconvolution layers,
the target generator is composed of a plurality of deconvolution layers,
the segmentation unit is a fully convolutional network (FCN), and
the domain classification unit is a convolutional neural network (CNN).

The apparatus of claim 4,
wherein the processor trains the deep learning model using a domain adversarial loss, a self-reconstruction loss, an image adversarial loss, or a segmentation loss.

The apparatus of claim 6,
Wherein the processor trains the deep learning model using a cross cycle loss or a semantic coherence loss.
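
When more than one of the named losses is used, one possible way to combine them is a weighted sum. The sketch below is an assumption for illustration only: the claims do not state how the losses are combined, and the dictionary keys, the weights, and the default weight of 1.0 are hypothetical.

```python
from typing import Dict


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over whichever loss terms are supplied.

    Any loss without an explicit weight is given weight 1.0
    (an illustrative default, not taken from the claims).
    """
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-batch loss values and weights.
example = total_loss(
    {"domain_adversarial": 0.7, "self_reconstruction": 0.2, "segmentation": 0.4},
    {"self_reconstruction": 10.0},
)
```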

In a deep learning-based image segmentation method for performing face segmentation of a cartoon image by an image segmentation apparatus including a communication unit, a storage unit storing a deep learning model, and a processor that drives the deep learning model,
Receiving, by the communication unit, a target domain image that is the cartoon image;
Receiving, by the communication unit, a training data set including a source domain image that is a photorealistic image and a ground-truth label in which the face in the source domain image is segmented;
Training, by the processor, the deep learning model to segment the face of the target domain image using the training data set;
Completing, by the processor, the training to build an image segmentation deep learning model; And
Inputting, by the processor, the target domain image to the image segmentation deep learning model to segment the face of the target domain image,
The processor extracts a common content space from the source domain image using the deep learning model and extracts a common content space from the target domain image,
The common content space extracted from the source domain image is a feature representation of the face area of the source domain image,
The common content space extracted from the target domain image is a feature representation of the face area of the target domain image,
The processor trains the deep learning model to perform image segmentation using the common content space extracted from the source domain image, and performs face segmentation of the target domain image using the common content space extracted from the target domain image. Deep learning-based image segmentation method.
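
The data flow recited above, in which one segmentation step consumes the common content space regardless of which domain the image came from, can be sketched with toy stand-ins. Everything in the block is hypothetical: the real encoders and segmentation step are trained networks, whereas here the encoders only rescale pixel values and the segmentation step is a fixed threshold.

```python
from typing import Callable, List

Grid = List[List[float]]


def source_content_encoder(image: Grid) -> Grid:
    # Toy stand-in: source pixels are assumed to lie in [0, 1] already.
    return [[px for px in row] for row in image]


def target_content_encoder(image: Grid) -> Grid:
    # Toy stand-in: target pixels are assumed to lie in [0, 255],
    # so they are rescaled into the same [0, 1] "content space".
    return [[px / 255.0 for px in row] for row in image]


def segmentation_unit(content: Grid) -> List[List[int]]:
    # Toy stand-in for the trained segmentation step: 1 = face, 0 = background.
    return [[1 if v > 0.5 else 0 for v in row] for row in content]


def segment(image: Grid, content_encoder: Callable[[Grid], Grid]) -> List[List[int]]:
    # The same segmentation step is applied to either domain's content features.
    return segmentation_unit(content_encoder(image))


photo = [[0.9, 0.1], [0.8, 0.2]]          # hypothetical source domain image
cartoon = [[230.0, 20.0], [200.0, 40.0]]  # hypothetical target domain image
source_mask = segment(photo, source_content_encoder)
target_mask = segment(cartoon, target_content_encoder)
```

The point of the sketch is only the wiring: labels are needed to fit the segmentation step on source content features, and the target image is then segmented through its own encoder without any target labels.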

The method of claim 8,
The deep learning model,
A source content encoder for extracting a common content space from the source domain image;
A target content encoder for extracting a common content space from the target domain image;
A segmentation unit for performing image segmentation using the common content space extracted from the source domain image or the common content space extracted from the target domain image; And
A domain classification unit for discriminating whether a common content space was extracted from the source domain image or from the target domain image,
The processor trains the domain classification unit to discriminate whether a common content space was extracted by the source content encoder or by the target content encoder, and trains the source content encoder and the target content encoder so that the domain classification unit cannot discriminate whether a common content space was extracted by the source content encoder or by the target content encoder. Deep learning-based image segmentation method.
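
The two opposing objectives in this claim can be written down as a short sketch. It assumes cross-entropy objectives, with the encoders trained toward a maximally uncertain classifier output of 0.5; this is one common formulation and is not stated in the claims. All function names are hypothetical, and the scalar inputs stand in for the domain classification unit's output probabilities.

```python
import math


def _clip(p: float, eps: float = 1e-12) -> float:
    # Keep probabilities strictly inside (0, 1) so log() is defined.
    return min(max(p, eps), 1.0 - eps)


def domain_classifier_loss(p_source_on_source: float, p_source_on_target: float) -> float:
    """Loss for the domain classification unit.

    Each argument is the classifier's probability that a content feature
    came from the source content encoder. Low loss means source features
    are scored near 1 and target features near 0.
    """
    return -math.log(_clip(p_source_on_source)) - math.log(1.0 - _clip(p_source_on_target))


def encoder_confusion_loss(p_source: float) -> float:
    """Loss for either content encoder.

    Cross-entropy against a 50/50 target: smallest when the classifier
    outputs 0.5, i.e. when it cannot tell which encoder produced the feature.
    """
    p = _clip(p_source)
    return -0.5 * math.log(p) - 0.5 * math.log(1.0 - p)
```

Driving `encoder_confusion_loss` down for both encoders is what makes the content space common: features from photographs and from cartoons become indistinguishable to the classifier.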

The method of claim 9,
The deep learning model,
A source attribute encoder for extracting a specific attribute space from the source domain image;
A source generator for generating a source virtual image using a specific attribute space and a common content space extracted from the source domain image; And
Further comprising a source determination unit for determining whether the source virtual image generated by the source generation unit is real or fake by comparing it with the source domain image,
Wherein the processor trains the source determination unit to determine whether the source virtual image is real or fake, and trains the source generation unit so that the source determination unit cannot determine whether the source virtual image is real or fake. Deep learning-based image segmentation method.

The method of claim 10,
The deep learning model,
A target attribute encoder for extracting a specific attribute space from the target domain image;
A target generation unit for generating a target virtual image using a specific attribute space and a common content space extracted from the target domain image; And
Further comprising a target determination unit for determining whether the target virtual image generated by the target generation unit is real or fake by comparing it with the target domain image,
Wherein the processor trains the target determination unit to determine whether the target virtual image is real or fake, and trains the target generation unit so that the target determination unit cannot determine whether the target virtual image is real or fake. Deep learning-based image segmentation method.

The method of claim 11,
The source attribute encoder is a convolutional neural network (CNN),
The target attribute encoder is a convolutional neural network (CNN),
The source content encoder is a fully convolutional network (FCN),
The target content encoder is a fully convolutional network (FCN),
The source generator is composed of a plurality of deconvolution layers,
The target generator is composed of a plurality of deconvolution layers,
The segmentation unit is a fully convolutional network (FCN),
The domain classification unit is a convolutional neural network (CNN). Deep learning-based image segmentation method.

The method of claim 11,
The processor trains the deep learning model using a domain adversarial loss, a self-reconstruction loss, an image adversarial loss, or a segmentation loss.

The method of claim 13,
The processor trains the deep learning model using a cross cycle loss or a semantic coherence loss.