KR102259207B1

KR102259207B1 - Automatic taging system and method thereof

Info

Publication number: KR102259207B1
Application number: KR1020130169041A
Authority: KR
Inventors: 김병민; 유창동; 이경님; 권재철; 박상혁; 이동훈; 정준영
Original assignee: 주식회사 케이티; 한국과학기술원
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2021-05-31
Also published as: KR20150079064A

Abstract

An automatic tagging system and a method thereof are disclosed. The automatic tagging system includes an input unit that receives a still image; a physical information extraction unit that analyzes the still image to infer physical information, including the visual characteristics of objects; a semantic information extraction unit that analyzes the still image to infer semantic information corresponding to attributes of the still image; and an automatic tagging unit that tags the still image by integrating the metadata of the still image, the physical information, and the semantic information.

Description

Automatic tagging system and its method {AUTOMATIC TAGING SYSTEM AND METHOD THEREOF}

The present invention relates to an automatic tagging system and a method thereof.

As services such as cloud computing and social networks have become popular, the volume of still-image and video content produced by individuals is growing exponentially. To manage this content efficiently, there is a growing need for a system that automatically analyzes the image context and attaches tags. Here, context refers to the contextual and situational information carried by the image content.

With recent advances in image signal processing and machine learning, many algorithms have been developed that, given an input image, automatically separate its background and foreground and recognize the objects it contains.

Accordingly, conventional image processing systems extract physical and semantic information about the background and objects contained in image content. This information can be used in applications such as automatic image tagging and image search.

Conventional automatic tagging of image content has mainly either used only the metadata recorded when the content was created (time, location, exposure, and capture device) or relied on collective intelligence.

In addition, the conventional approach has mainly been for users to enter tags for image content directly. This can be regarded as a way of exploiting collective intelligence, chiefly by encouraging users to share information with one another.

As such, conventional tagging of image content either depends on manual tagging by humans, whether by a crowd (collective intelligence) or by individuals, or, even when it is automatic, is limited to the physical information in the content.

Therefore, the technical problem addressed by the present invention is to provide a system and method that receive a still image a user wishes to store, automatically understand the image context, and automatically tag the image by inferring physical information and semantic information and combining them with the metadata.

According to one feature of the present invention, an automatic tagging system includes an input unit that receives a still image; a physical information extraction unit that analyzes the still image to infer physical information, including the visual characteristics of objects; a semantic information extraction unit that analyzes the still image to infer semantic information corresponding to image attributes that describe an abstract concept or situation; and an automatic tagging unit that tags the still image by integrating the metadata of the still image, the physical information, and the semantic information.

The physical information extraction unit may include a background separation module that separates the background and foreground of the still image, and an object recognition module that recognizes objects by extracting features from the segmented image in which the background and foreground have been separated.

The background separation module may generate superpixels by combining pixels of the still image that have similar characteristics, extract feature vectors from the superpixels, and use the feature vectors to merge the superpixels into a segmented image in which the background and foreground are separated.

The background separation module may extract feature vectors that include color, texture, shape, position, and visual words.

The object recognition module may recognize a plurality of objects in the segmented image through object classification.

The semantic information extraction unit may extract image attributes by applying a predefined generative model to the still image and to the background/foreground image.

According to another feature of the present invention, an automatic tagging method includes the steps of: receiving, by an automatic tagging system, a still image; analyzing the still image to infer physical information, including the visual characteristics of objects; analyzing the still image to infer semantic information corresponding to image attributes that describe an abstract concept or situation; and tagging the still image by integrating the metadata of the still image, the physical information, and the semantic information. The step of inferring the physical information and the step of inferring the semantic information may be performed simultaneously, in parallel.

The step of inferring the physical information may include separating the background and foreground of the still image, and recognizing objects by extracting features from the segmented image in which the background and foreground have been separated.

The separating step may include generating superpixels by combining pixels of the still image that have similar characteristics, extracting feature vectors from the superpixels, and using the feature vectors to merge the superpixels into a segmented image in which the background and foreground are separated.

The step of recognizing objects may include extracting, from the segmented image, features for object recognition, including color, pixel brightness, gradient, and features invariant to scale and rotation, and recognizing objects by passing the extracted features through a machine learning algorithm.

The step of inferring the semantic information may extract image attributes by applying a predefined generative model to the still image and to the background/foreground image separated from the still image.

The step may infer image attributes using a model trained on still images and on attributes that describe the abstract concepts or situations of those still images.

According to an embodiment of the present invention, physical and semantic information are inferred from a still image so that the image can be understood and tagged automatically. This removes the inconvenience of users having to attach tags themselves and enables accurate, efficient image retrieval.

In addition, in services such as cloud computing or social networks, the system automatically tags and stores the still images submitted to the server, so users can later search efficiently for the images they want.

The invention can also be used in applications that attach a brief description to an image, based on the tags attached to the user's still image and a natural language processing algorithm.

FIG. 1 is a block diagram showing the configuration of an automatic tagging system according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram of automatic tagging according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an automatic tagging method according to an embodiment of the present invention.
FIG. 4 is a flowchart showing step S103 of FIG. 3 in detail.
FIG. 5 is a flowchart showing step S105 of FIG. 3 in detail.
FIG. 6 is a flowchart showing step S107 of FIG. 3 in detail.
FIG. 7 is an exemplary diagram of a generative model for semantic information extraction according to an embodiment of the present invention.
FIG. 8 is a schematic diagram of an automatic tagging system according to another embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

In addition, the terms "... unit" and "... module" used in the specification denote a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, an automatic tagging system and a method thereof according to an embodiment of the present invention are described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of an automatic tagging system according to an embodiment of the present invention, and FIG. 2 is a conceptual diagram of automatic tagging according to an embodiment of the present invention.

Referring first to FIG. 1, when an individual stores or uploads a still image from a personal collection to a storage device, the automatic tagging system 1 automatically infers the physical and semantic information in the image and then generates tag information together with the metadata, so that the image can be managed efficiently. To this end, the automatic tagging system 1 may be mounted on an image storage device (not shown) or connected to one.

The automatic tagging system 1 may also be mounted on a cloud computing server (not shown) or a social network server (not shown).

Using image signal processing and machine learning, the automatic tagging system 1 recognizes multiple objects in a still image to infer their class information, infers the semantic information carried by the background and foreground, and tags the image automatically.

The automatic tagging system 1 uses independent modules to extract the physical information and the semantic information of a still image, and trains each module with algorithms based on image signal processing and machine learning. Because the metadata is already contained in the image, no separate learning algorithm is used for it. Physical information may be extracted with background/foreground separation and object recognition algorithms, and semantic information may be extracted with a generative model based on deep learning.

The automatic tagging system 1 includes a still image input unit 100, a physical information extraction unit 200, a semantic information extraction unit 300, and an automatic tagging unit 400.

The still image input unit 100 receives a still image uploaded from a user terminal (not shown).

When a still image is input in this way, the physical information extraction unit 200 and the semantic information extraction unit 300 infer the physical information and the semantic information, respectively, in parallel.
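The data flow among units 100 to 400 can be sketched as follows. This is only an illustrative outline, not the implementation described in the specification: `TaggedImage`, `extract_physical`, `extract_semantic`, and `auto_tag` are hypothetical names, the two extractors are placeholders that return fixed example tags, and a thread pool is assumed as one possible way to run them in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class TaggedImage:
    """Hypothetical container for the output of the automatic tagging unit (400)."""
    metadata: dict
    physical_tags: list = field(default_factory=list)
    semantic_tags: list = field(default_factory=list)


def extract_physical(image):
    # Placeholder for unit 200 (background/foreground separation + object recognition).
    return ["person", "grass", "tree"]


def extract_semantic(image):
    # Placeholder for unit 300 (attribute inference with a generative model).
    return ["family", "picnic", "sunny day"]


def auto_tag(image, metadata):
    """Run both extractors concurrently, then integrate the results with the metadata."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        physical = pool.submit(extract_physical, image)
        semantic = pool.submit(extract_semantic, image)
        return TaggedImage(metadata, physical.result(), semantic.result())


# Hypothetical input: raw image bytes plus metadata read from the file.
result = auto_tag(image=b"raw-bytes", metadata={"time": "2013-12-31"})
```

Because the two extractors share no state, either one can be replaced or scaled independently, which matches the description of them as independent modules.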

When a still image is input, the physical information extraction unit 200 extracts physical information using background/foreground separation and object recognition algorithms. That is, it extracts features from the still image to separate the background and foreground, and then extracts features from the segmented regions to recognize objects. The physical information extraction unit 200 includes a background separation module 210 and an object recognition module 230.

The semantic information extraction unit 300 extracts semantic information from the still image. That is, it infers semantic information by modeling the relationship between still images and attributes.

Here, semantic information corresponds to the attributes associated with a still image. Examples of attributes are abstract descriptions such as 'has a mane' or 'young boy'. Listing the attributes alone is enough to describe the image context. Such information is clearly distinct from the low-level information that corresponds to the visual characteristics of an object.

There are also methods that understand an image and find its attributes. Examples of such attributes include 'spherical', 'cloudy', 'clear', 'at sunset', 'mammal', and 'furry'. This abstract information is clearly distinct from low-level information, which depends heavily on the visual appearance of an object.

In practice, when people converse or describe a particular situation to someone else, semantic information carries more weight than physical (low-level) information.

As shown in (a) of FIG. 2, a given still image contains physical information, but in practice it contains even more semantic information.

Referring to (b) of FIG. 2, the physical information extraction unit 200 extracts features from the still image, separates the background and foreground, and divides the still image into a background/foreground image and a segmented image. The physical information extraction unit 200 then extracts features from the image in which the background and foreground are separated and recognizes the objects, which constitute the physical information. In (a) of FIG. 2, the objects may be, for example, an adult man wearing glasses and a child.

The semantic information extraction unit 300, in turn, extracts features from the still image received in (a) of FIG. 2 and, through the generative model, understands the objects and the situation to generate attributes that correspond to semantic information. That is, it recognizes the event and generates semantic information such as 'picnic', 'family', and 'daughter'.

The automatic tagging unit 400 adds the confidence of the inferred physical and semantic information, together with the metadata supplied with the still image, and then performs the final automatic tagging. That is, as shown in (c) of FIG. 2, the automatic tagging unit 400 generates physical information tags such as 'person, grass, tree', and semantic information tags such as 'family, picnic, adult man, girl, wearing glasses, sunny day, playing'.
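One way to picture the integration step is the following minimal sketch. It assumes each extractor returns a mapping from tag to confidence; the function name `integrate_tags`, the record layout, the confidence values, and the `min_confidence` threshold are all assumptions made for illustration, since the specification states only that confidences and metadata are attached to the tags.

```python
def integrate_tags(metadata, physical, semantic, min_confidence=0.5):
    """Merge metadata with inferred tags, keeping tags that pass a confidence threshold.

    `physical` and `semantic` map tag -> confidence in [0, 1]. The threshold
    is an assumption of this sketch, not something stated in the source.
    """
    def keep(scored):
        # Rank by confidence (highest first) and drop low-confidence tags.
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        return [{"tag": t, "confidence": c} for t, c in ranked if c >= min_confidence]

    return {
        "metadata": dict(metadata),
        "physical": keep(physical),
        "semantic": keep(semantic),
    }


# Hypothetical inputs; the confidence values are made up for illustration.
record = integrate_tags(
    metadata={"time": "2013-12-31T10:00", "device": "example-camera"},
    physical={"person": 0.97, "grass": 0.88, "tree": 0.81, "car": 0.12},
    semantic={"family": 0.90, "picnic": 0.76, "sunny day": 0.64},
)
```

Keeping the three sources in separate fields of one record lets a later search distinguish, for example, a capture time from an inferred attribute.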

The operation of the automatic tagging system 1 is described below on the basis of the configuration explained with reference to FIGS. 1 and 2. Components identical to those in FIGS. 1 and 2 are referred to by the same reference numerals.

FIG. 3 is a flowchart illustrating an automatic tagging method according to an embodiment of the present invention, FIG. 4 is a flowchart showing step S103 of FIG. 3 in detail, FIG. 5 is a flowchart showing step S105 of FIG. 3 in detail, FIG. 6 is a flowchart showing step S107 of FIG. 3 in detail, and FIG. 7 is an exemplary diagram of a generative model for semantic information extraction according to an embodiment of the present invention.

Referring to FIG. 3, the still image input unit 100 receives a still image (S101). The physical information extraction unit 200 separates the background and foreground of the received still image (S103).

Here, the background separation module 210 of the physical information extraction unit 200 separates the background and foreground from the still image. This serves as a preprocessing step not only for object recognition but also for understanding the regions of the still image that correspond to the background and recognizing each of them as an object.

There are various ways to separate the background and foreground. According to one embodiment, a method based on correlation clustering is used so that the separation applies generally, even to objects not included in the training data. The general procedure is first to convert the pixels of the still image into larger units called superpixels, and then to merge the superpixels into progressively larger segmented regions. The regions are merged as far as possible, subject to the condition that a region must not cross the boundary of any object other than the one it belongs to, and the resulting regions are then classified as objects.

Referring to FIG. 4, when a still image is input, the background separation module 210 generates superpixels by combining pixels with similar properties (S201). Superpixel generation is a preprocessing step that makes background/foreground separation efficient; according to one embodiment, a UCM (ultrametric contour map) is used. For background/foreground separation, feature vectors consisting of color, texture, shape, position, visual words, and the like are extracted from the superpixels (S203). The extracted feature vectors are then fed to a correlation clustering algorithm to perform background/foreground separation (S205). Correlation clustering is an algorithm trained on the basis of energy minimization: it considers the feature vector of each single superpixel and the feature vector between each pair of adjacent superpixels, and merges superpixels in the direction that minimizes the energy.
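Steps S201 to S205 can be illustrated with a deliberately simplified sketch. It does not implement UCM or correlation clustering: fixed square blocks stand in for UCM superpixels, mean intensity stands in for the colour/texture/shape/position feature vector, and a greedy threshold-based merge with union-find stands in for the trained energy-minimizing clustering. All function names and the toy image are hypothetical.

```python
def make_superpixels(image, block):
    """S201 (stand-in): group pixels into block x block superpixels."""
    rows, cols = len(image) // block, len(image[0]) // block
    cells = {}
    for r in range(rows):
        for c in range(cols):
            cells[(r, c)] = [image[y][x]
                             for y in range(r * block, (r + 1) * block)
                             for x in range(c * block, (c + 1) * block)]
    return cells


def feature(pixels):
    """S203 (stand-in): a one-dimensional feature, the mean intensity."""
    return sum(pixels) / len(pixels)


def merge_superpixels(cells, threshold):
    """S205 (stand-in): union adjacent superpixels whose features are close."""
    parent = {k: k for k in cells}

    def find(k):
        # Union-find root lookup with path halving.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    feats = {k: feature(v) for k, v in cells.items()}
    for (r, c) in cells:
        for nb in ((r + 1, c), (r, c + 1)):  # lower and right neighbours
            if nb in cells and abs(feats[(r, c)] - feats[nb]) < threshold:
                parent[find(nb)] = find((r, c))
    return {k: find(k) for k in cells}


# 4x4 toy image: dark left half (background), bright right half (foreground).
image = [[10, 12, 200, 205],
         [11, 13, 198, 202],
         [9, 10, 201, 199],
         [12, 11, 203, 200]]
segments = merge_superpixels(make_superpixels(image, block=2), threshold=20)
num_regions = len(set(segments.values()))
```

On the toy image the four superpixels collapse into two regions, one per half, which is the kind of segmented output that the object recognition step then classifies.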

To obtain good performance, the design of the higher-order term, which covers combinations of three or more superpixels, is important. The higher-order term is a constraint that guides superpixels to merge correctly. It assumes a hierarchical structure in which superpixels in the still image combine into chunks, chunks combine into larger chunks, and large chunks combine to form an object.

The result of background/foreground separation is a set of large chunks that make up objects, which correspond to the segmented regions (S207). Here, an RBM (restricted Boltzmann machine) is used to improve on the existing higher-order term. An RBM is an undirected graphical model that expresses probability in terms of energy. It is a generative model that learns the structure of the training data by unsupervised learning and minimizes the energy for the training data. The probability distribution of the RBM is given as follows.
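In its standard form, which is assumed here to correspond to Equation 1 of the specification, the marginal distribution of an RBM over the visible nodes is:

```latex
p(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}, \qquad Z = \sum_{v, h} e^{-E(v, h)}
```

The symbols E(v, h) (the energy of a joint configuration) and Z (the normalizing constant) follow standard RBM notation and are not named in the source.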

Here, p(v) is the probability distribution over v, v denotes the visible nodes, and h denotes the hidden nodes.

The background separation module 210 takes the features of multiple superpixels as input. If the input superpixels belong to the region of the same object, it assigns a low energy so that they are merged; if they belong to different objects, it assigns a high energy so that they are not merged. According to Equation 1, the lower the energy, the larger the value of p(v) (the probability of v), which means a higher probability that the superpixels belong to the same object.
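The inverse relationship between energy and probability can be checked numerically on a very small RBM. The sketch below assumes binary units and uses the standard free energy F(v), for which p(v) is proportional to exp(-F(v)); the weights and biases are made-up values, and a two-element binary vector stands in for the features of a candidate group of superpixels.

```python
import itertools
import math

# Hypothetical parameters of a tiny binary RBM (2 visible, 2 hidden units).
W = [[1.5, -0.5],
     [1.0, 0.8]]      # W[i][j]: weight between visible unit i and hidden unit j
b = [0.1, -0.2]       # visible biases
c = [-0.3, 0.2]       # hidden biases


def free_energy(v):
    """F(v) = -b.v - sum_j log(1 + exp(c_j + sum_i v_i W_ij))."""
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = sum(
        math.log1p(math.exp(c[j] + sum(v[i] * W[i][j] for i in range(len(v)))))
        for j in range(len(c))
    )
    return -visible_term - hidden_term


# Enumerate every visible configuration and normalize exp(-F(v)) exactly.
states = list(itertools.product([0, 1], repeat=2))
Z = sum(math.exp(-free_energy(v)) for v in states)
prob = {v: math.exp(-free_energy(v)) / Z for v in states}

lowest_energy = min(states, key=free_energy)
most_probable = max(states, key=prob.get)
```

The configuration with the lowest free energy is also the most probable one, which is the property the merge decision relies on: a low-energy group of superpixels is treated as likely to belong to one object.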

Because the RBM is trained by unsupervised learning, the higher-order term can be designed effectively without significant additional effort.

Referring again to FIG. 3, the object recognition module 230 recognizes objects from the segmented regions (or segmented image) output by the background separation module 210 (S105).

The object recognition module 230 uses a general-purpose object classifier; in one embodiment, an SVM (support vector machine) may be used. An SVM is a machine learning algorithm that, among the hyperplanes that separate the data, selects the one with the largest margin (max-margin) to the support vectors.

Referring to FIG. 5, the object recognition module 230 uses the object classifier to determine which object each segmented region produced by the background/foreground separation algorithm corresponds to.

When a segmented image is input, the object recognition module 230 extracts features for object recognition (S301), for which color, pixel brightness, gradient, and SIFT (Scale Invariant Feature Transform) may be used. SIFT extracts features that are invariant to scale and rotation, which are then applied to detection or recognition. The SVM used for object classification may use an RBF (radial basis function) kernel.
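How an RBF-kernel SVM scores the feature vector of one segmented region can be sketched as follows. The kernel and decision formulas are the standard ones; the support vectors, coefficients, `gamma`, the two-dimensional feature, and the class names are hypothetical values chosen for illustration, not a trained model from the specification.

```python
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def svm_decision(x, support_vectors, dual_coefs, bias, gamma):
    """Signed score sum_i (alpha_i * y_i) * K(s_i, x) + b of a trained kernel SVM."""
    return sum(coef * rbf_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


# Hypothetical trained model separating "grass" (+1) from "person" (-1)
# on a made-up 2-D feature (e.g. mean green value, gradient strength).
support_vectors = [(0.9, 0.1), (0.2, 0.8)]
dual_coefs = [1.0, -1.0]          # alpha_i * y_i for each support vector
bias, gamma = 0.0, 2.0

score = svm_decision((0.85, 0.15), support_vectors, dual_coefs, bias, gamma)
label = "grass" if score > 0 else "person"
```

Applying such a classifier to every segmented region, rather than to the whole image once, is what allows several objects to be recognized in a single still image.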

That is, by classifying the segmented regions after background/foreground separation (S303), the object recognition module 230 can recognize multiple objects in a still image (S305). Conventional general object recognition, by contrast, could recognize only one object per still image.

Referring again to FIG. 3, the semantic information extraction unit 300 generates image attributes (S107).

Referring to FIG. 6, the semantic information extraction unit 300 receives a still image and, in order to extract attributes that describe an abstract concept or situation (S401, S403), uses a generative model based on deep learning (S405) to generate the image attributes. The attributes generated in this way are used for automatic tagging. The extraction may consist of extracting semantic features from the still image and extracting semantic features from the background/foreground image.

The generative model is trained on data from two modalities, still images and attributes, and models the input data of each modality (image or attribute) with several latent layers.

A generative model, which is one of the graphical model structures proposed in the field of machine learning, can do two main things. Given a still image, it generates the attributes of that image, such as 'at sunset', 'clear weather', 'with mountains', or 'with sea'. Given attributes as input, it generates or retrieves still images that contain those attributes or whose context is highly similar.

The generative model must first be "trained". The aim of machine learning is to obtain a function that yields the desired output for a given input by optimizing an objective function mathematically, mainly on the basis of statistics and probability. This requires training data (experience) that covers both inputs and outputs. Just as humans learn from experience, the generative model (the machine) optimizes its own parameters using the training data and eventually learns, by itself, a function that produces an appropriate output for each input.

That is, the generative model receives image attributes as input; during training it receives still images together with the attributes of those still images. Once the generative model has been trained on many training still image-attribute pairs, it can generate attributes for an entirely new still image by using the parameters optimized during training. Here, the data are divided into training attributes, which are used to train the generative model in advance, and test images, which are completely new kinds of still images that arrive in actual use.

A latent layer is added at the top of the model to connect the two modes. The resulting generative model generates attributes that are highly related to a still image when one comes in and, conversely, retrieves still images that are highly related to the input attributes when several attributes are given as input.
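
The retrieval direction (attributes in, related still images out) can be sketched as follows. This is not the generative model itself: as a stand-in, the sketch assumes each stored image already has a set of attributes and scores relatedness with Jaccard overlap. The function name `retrieve`, the index layout, and the file names are hypothetical.

```python
def retrieve(query_attrs, image_index, top_k=3):
    """Return ids of the images whose attributes best match the query.

    image_index maps an image id to its attribute list. Relatedness is
    approximated by Jaccard overlap, an assumption made for illustration.
    """
    query = set(query_attrs)
    scored = []
    for image_id, attrs in image_index.items():
        attrs = set(attrs)
        union = query | attrs
        score = len(query & attrs) / len(union) if union else 0.0
        scored.append((score, image_id))
    # Highest score first; ties are broken by image id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [image_id for score, image_id in scored[:top_k] if score > 0]


index = {
    "a.jpg": ["sunset", "sea"],
    "b.jpg": ["mountain", "clear"],
    "c.jpg": ["sunset", "sea", "mountain"],
}
matches = retrieve(["sunset", "sea"], index)
```

In the described system the relatedness score would come from the shared latent layer rather than from set overlap.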

In general, the latent layer connected at the top serves as a bridge for modeling the relationship between the two kinds of input, but it reaches its limit when the input data are not consistent and follow diverse distributions. For example, image data consisting only of natural landscapes is consistent, but real image data may include natural landscapes, urban backgrounds, and indoor images.

According to one embodiment, to better model the relationship between various kinds of images (natural landscape, urban background, indoor, person, object) and the input query, the single latent layer at the top of the model is expanded into several latent layers, each of which models a different distribution, for example natural landscapes, urban backgrounds, or indoor images.
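
One way to picture several top-level latent layers working together is a soft gate that weights each layer's activation by how well the input matches that layer's distribution. The sketch below is an assumption-laden illustration, not the structure of FIG. 7: the softmax gate, the single-matrix "expert" layers, and all weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_top_layer(features, experts, gates):
    """Mix per-distribution latent layers using softmax gate weights.

    experts maps a distribution name to a weight matrix (one row per latent
    unit); gates maps the same names to a gate weight vector.
    """
    names = sorted(experts)
    weights = softmax([dot(gates[name], features) for name in names])
    mixed = [0.0] * len(experts[names[0]])
    for name, weight in zip(names, weights):
        for k, row in enumerate(experts[name]):
            mixed[k] += weight * sigmoid(dot(row, features))
    return dict(zip(names, weights)), mixed


features = [1.0, 0.0]  # hypothetical lower-layer activations
experts = {"natural": [[1.0, 0.0]], "urban": [[0.0, 1.0]], "indoor": [[-1.0, 0.0]]}
gates = {"natural": [5.0, 0.0], "urban": [0.0, 5.0], "indoor": [0.0, 0.0]}
responsibilities, mixed = gated_top_layer(features, experts, gates)
```

With these example weights the gate assigns most of the responsibility to the "natural" layer, so that layer dominates the mixed activation.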

Referring to FIG. 7, there are two generative models, one for still image data and one for attribute data. The two generative models are connected through the latent layers at their top and a gating function. The latent layers of each generative model are trained sequentially, layer by layer, using an RBM, and h_natural^3, h_urban^3, h_indoor^3, h_i^2, and h_t^2 at the top of the whole model are trained as a mixture of RBMs. The entire training process of this generative model follows an unsupervised learning technique.
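
Layer-by-layer training with an RBM (restricted Boltzmann machine) can be sketched as below. This is a minimal illustration under stated assumptions: binary units, one step of contrastive divergence (CD-1), and arbitrary layer sizes, learning rate, and epoch count. The mixture of RBMs at the top of the model and the gating function are not implemented here.

```python
import math
import random


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b_h):
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]


def visible_probs(h, W, b_v):
    return [sigmoid(b_v[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(b_v))]


def train_rbm(data, n_hidden, epochs=20, lr=0.1, seed=0):
    """Train one RBM with CD-1; no labels are used (unsupervised)."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    b_v, b_h = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, b_h)
            h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]  # sample hidden
            v1 = visible_probs(h0, W, b_v)  # reconstruction (probabilities)
            ph1 = hidden_probs(v1, W, b_h)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                b_v[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                b_h[j] += lr * (ph0[j] - ph1[j])
    return W, b_v, b_h


def train_stack(data, layer_sizes, **kwargs):
    """Train RBMs one layer at a time, feeding each layer's output upward."""
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        W, b_v, b_h = train_rbm(inputs, n_hidden, **kwargs)
        layers.append((W, b_v, b_h))
        inputs = [hidden_probs(v, W, b_h) for v in inputs]
    return layers


toy_data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
layers = train_stack(toy_data, [3, 2], epochs=5)
```

Each call to `train_rbm` sees only the activations of the layer beneath it, which is what makes the procedure sequential per layer.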

Referring again to FIG. 3, the automatic tagging unit 400 automatically generates tags by integrating the metadata of the still image with the physical information inferred in step S105 and the semantic information inferred in step S107 (S109). That is, the inferred physical information and semantic information are combined with the metadata contained in the still image, and these three categories of information are automatically tagged to the image, which is then stored.
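
The integration step S109 can be sketched as a merge of three groups of information into one tag list. The function name `integrate_tags`, the "key=value" rendering of metadata, and the duplicate handling are assumptions made for illustration; the description does not specify a tag format.

```python
def integrate_tags(metadata, physical_info, semantic_info):
    """Merge metadata, physical, and semantic information into one tag list.

    metadata is a dict (for example capture date); the other two arguments
    are lists of strings. Each tag keeps a record of which group it came from.
    """
    groups = (
        ("metadata", [f"{key}={value}" for key, value in metadata.items()]),
        ("physical", physical_info),
        ("semantic", semantic_info),
    )
    tags, seen = [], set()
    for source, values in groups:
        for value in values:
            tag = value.strip().lower()
            # Skip empty strings and tags already contributed by an earlier group.
            if tag and tag not in seen:
                seen.add(tag)
                tags.append((source, tag))
    return tags


tags = integrate_tags(
    {"date": "2013-12-31"},  # hypothetical metadata
    ["Dog", "tree"],         # hypothetical recognized objects
    ["sunny", "dog"],        # hypothetical generated attributes
)
```

Here the semantic "dog" is dropped because the physical group already supplied it; a real system might instead keep both sources for the same tag.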

Meanwhile, FIG. 8 is a schematic diagram of an automatic tagging system according to another embodiment of the present invention. It shows a device that can be used to perform at least some of the functions of the still image input unit 100, the physical information extraction unit 200, the semantic information extraction unit 300, and the automatic tagging unit 400 of the automatic tagging system described with reference to FIG. 1.

Referring to FIG. 8, the automatic tagging system 500 includes a processor 501, a memory 503, at least one storage device 505, an input/output (I/O) interface 507, and a network interface 509.

The processor 501 may be implemented as a central processing unit (CPU), another chipset, a microprocessor, or the like, and the memory 503 may be implemented as a medium such as RAM, for example dynamic random access memory (DRAM), Rambus DRAM (RDRAM), synchronous DRAM (SDRAM), or static RAM (SRAM).

The storage device 505 may be implemented as a permanent or volatile storage device such as a hard disk, an optical disc such as a CD-ROM (compact disc read-only memory), CD-RW (CD rewritable), DVD-ROM (digital video disc ROM), DVD-RAM, DVD-RW, or Blu-ray disc, flash memory, or various forms of RAM.

In addition, the I/O interface 507 allows the processor 501 and/or the memory 503 to access the storage device 505, and the network interface 509 allows the processor 501 and/or the memory 503 to access a network (not shown).

In this case, the processor 501 may load into the memory 503 program instructions for implementing at least some of the functions of the still image input unit 100, the physical information extraction unit 200, the semantic information extraction unit 300, and the automatic tagging unit 400, and may control the system so that the operations described with reference to FIG. 1 are performed.

In addition, the memory 503 or the storage device 505 may operate in conjunction with the processor 501 so that the functions of the still image input unit 100, the physical information extraction unit 200, the semantic information extraction unit 300, and the automatic tagging unit 400 are performed.

The processor 501, the memory 503, the storage device 505, the I/O interface 507, and the network interface 509 shown in FIG. 8 may be implemented in a single computer or may be distributed across a plurality of computers.

The embodiments of the present invention described above are not implemented only through an apparatus and a method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which the program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of rights of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of rights of the present invention.

Claims

An input unit that receives a still image uploaded by a user terminal,
A physical information extraction unit for inferring physical information including visual characteristics of an object by analyzing the still image,
A semantic information extraction unit that interprets the still image to infer semantic information corresponding to an attribute of an image describing an abstract concept or situation, and
An automatic tagging unit for tagging the still image by integrating metadata of the still image, the physical information, and the semantic information,
The semantic information extraction unit,
By learning, using a deep learning technology, various still image data and attribute data that represent an abstract concept of the still image data or describe a situation, a generative model that models the relationship between the still image data and the attribute data is derived,
Extracting attribute data related to the input still image as the semantic information by using the still image input from the input unit as an input of the generative model,
The physical information extraction unit,
A background separation module for classifying a background foreground from the still image into a background and a foreground in pixel units, and
An object recognition module for recognizing an object by extracting a feature from the divided image from which the background foreground is separated,
The background separation module,
An automatic tagging system that generates a super pixel by combining pixels having similar characteristics among the pixels of the still image, extracts a feature vector from the super pixel, and generates, using the feature vector, a segmented image in which the background foreground to which the super pixels are combined is separated.

delete

The system of claim 1,
The background separation module,
An automatic tagging system that extracts feature vectors including color, texture, shape, location, and visual words.

The system of claim 4,
The object recognition module,
An automatic tagging system for recognizing a plurality of objects through object classification in the divided image.

The system of claim 1,
The semantic information extraction unit,
An automatic tagging system for extracting attribute data related to each of the still image and the background foreground image by using the still image and the background foreground image as inputs of the generative model.

A step in which the automatic tagging system receives a still image,
Interpreting the still image to infer physical information including visual characteristics of the object,
Interpreting the still image to infer semantic information corresponding to an attribute of an image describing an abstract concept or situation, and
Integrating metadata of the still image, the physical information, and the semantic information, and tagging the still image,
The step of inferring the physical information and the step of inferring the semantic information are performed simultaneously in parallel,
Inferring the semantic information,
By learning, using a deep learning technology, various still image data and attribute data that represent an abstract concept of the still image data or describe a situation, a generative model that models the relationship between the still image data and the attribute data is derived,
Extracting attribute data related to the received still image as the semantic information by using the received still image as an input of the generative model,
Inferring the physical information,
Separating a background foreground from the still image, and
Recognizing an object by extracting a feature from the divided image from which the background foreground is separated,
The separating step,
Generating a super pixel by combining pixels having similar characteristics among the pixels of the still image,
Extracting a feature vector from the super pixel, and
Generating a segmented image in which the background foreground to which the super pixels are combined is separated using the feature vector
An automatic tagging method comprising the foregoing steps.

delete

The method of claim 7,
Recognizing the object,
Extracting features for object recognition including features that are invariant in color, pixel brightness, gradient, size, and rotation from the divided image, and
Recognizing an object by passing the extracted features through a machine learning algorithm
An automatic tagging method comprising the foregoing steps.

The method of claim 10,
Inferring the semantic information,
An automatic tagging method for extracting image attribute data related to each of the still image and the background foreground image by using the still image and a background foreground image separated from the still image as inputs of the generative model.

delete