KR102622958B1

KR102622958B1 - System and method for automatic generation of image caption

Info

Publication number: KR102622958B1
Application number: KR1020190023268A
Authority: KR
Inventors: 최호진; 한승호
Original assignee: 한국전력공사; 한국과학기술원
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2024-01-10
Also published as: KR20200104663A

Abstract

The present invention relates to a system and method for automatically generating image captions, in which deep learning is used to extract attribute information and object information from an image and generate a caption, and the generated caption is restructured by predicting the relationships between the objects.
An automatic caption generation system according to an embodiment of the present invention, which automatically generates a caption describing an image, includes a client that provides the image to be captioned, and a caption generator that analyzes the image received from the client to generate a caption describing it and transmits the generated caption and the basis for generating the caption to the client.

Description

System and method for automatic generation of image caption

The present invention relates to a system and method for automatically generating image captions and, more specifically, to a system and method that use deep learning to extract attribute information and object information from an image to generate a caption, and that restructure the generated caption by predicting the relationships between the objects.

Image captioning is the task of generating a natural-language sentence that describes a given image. With recent advances in artificial intelligence, techniques for generating captions automatically by machine are being developed.

Such techniques have generated captions by using a large collection of existing images together with the label attached to each image (a single word describing the image): either images with the same label are retrieved, or the labels of similar images are assigned to a single image to form its caption.

Image captioning generates, for a given image, a caption in the form of a natural-language sentence that describes the image. With recent advances in artificial intelligence, techniques for generating captions automatically by machine are being developed.

Automatic caption generation by machine can be performed using a large number of existing images and the label information for each image (a single word describing the image). That is, a caption for an image can be generated by retrieving images with the same label or by assigning the labels of similar images to a single image.

However, because this approach generates a caption for a new image using only the stored image and label data, it is difficult to produce a caption in the form of a natural-language sentence, and even when a sentence is produced, its quality is poor.

The present invention is intended to solve the problem described above. Its object is to provide a system and method for automatically generating image captions that use deep learning to extract attribute information and object information from an image to generate a caption, and that restructure the generated caption by predicting the relationships between the objects.

In addition to the technical object mentioned above, other features and advantages of the present invention are described below, or will be clearly understood from that description by those of ordinary skill in the art to which the present invention pertains.

To achieve the object described above, an automatic caption generation system according to an embodiment of the present invention, which automatically generates a caption describing an image, may include a client that provides the image to be captioned, and a caption generator that analyzes the image received from the client to generate a caption describing the image and transmits the generated caption and the basis for generating the caption to the client.

Meanwhile, to achieve the object described above, an automatic caption generation method according to an embodiment of the present invention, which automatically generates a caption describing an image, may include: extracting, in a caption generation module, attribute information and object information from the image using deep learning and generating a caption using the attribute information and the object information; predicting, in a relationship generation module, the relationships between objects in the image and generating a tuple set in which the predicted relationships are structured as tuples; and restructuring, in a description generation module, the caption using the generated caption and the tuple set to generate an extended caption, and visualizing the extended caption and a graph of the tuple set.

Because the system and method for automatically generating image captions according to an embodiment of the present invention use deep learning to generate captions that reflect the attribute information and object information in the image, they can improve caption generation performance for images.

In addition, other features and advantages of the present invention may be newly identified through its embodiments.

FIG. 1 is a diagram showing the configuration of an automatic image caption generation system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of a caption generator according to an embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of a caption generation module according to an embodiment of the present invention.
FIG. 4 is a diagram showing the configuration of a relationship generation module according to an embodiment of the present invention.
FIG. 5 is a diagram showing the configuration of a description generation module according to an embodiment of the present invention.
FIG. 6 is a diagram showing caption generation for an image according to an embodiment of the present invention.
FIG. 7 is a diagram showing extended caption generation according to an embodiment of the present invention.
FIG. 8 is a diagram showing a method for automatically generating image captions according to an embodiment of the present invention.
FIG. 9 is a diagram showing a method for generating a caption according to an embodiment of the present invention.
FIG. 10 is a diagram showing a method for generating an extended caption according to an embodiment of the present invention.

To describe the present invention clearly, parts unrelated to the description are omitted, and identical or similar components are given the same reference numerals throughout the specification.

The terminology used herein is intended only to describe specific embodiments and is not intended to limit the present invention. Singular forms used herein include plural forms unless the wording clearly indicates otherwise. As used in the specification, "comprising" specifies particular characteristics, regions, integers, steps, operations, elements, and/or components, and does not exclude the presence or addition of other characteristics, regions, integers, steps, operations, elements, and/or components.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not to be interpreted in an idealized or overly formal sense unless so defined.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. However, the present invention may be implemented in many different forms and is not limited to the embodiments described herein.

FIG. 1 is a diagram showing the configuration of an automatic image caption generation system according to an embodiment of the present invention.

Referring to FIG. 1, the automatic image caption generation system 1000 according to an embodiment of the present invention may include a client 100 and a caption generator 200.

The client 100 may provide an image for which a caption is to be generated. The client 100 may provide the image to the caption generator 200 through a user device such as a smartphone or a tablet PC.

The caption generator 200 may analyze the image received from the client 100 to generate a caption describing the image, and may transmit the generated caption and the basis for generating the caption to the client 100.

Here, the caption generator 200 may analyze the image through deep learning. Specifically, the caption generator 200 may have been trained on images and the ground-truth ("correct") captions for those images.

Using the learned images and their ground-truth captions, the caption generator 200 may generate a caption for a new image, that is, for the image provided by the client 100. Here, a ground-truth caption may be a sentence containing five or more phrases that a user has arbitrarily set for the image. In addition, the caption generator 200 may extract the objects in the provided image, predict the relationships between the objects, and apply the predicted relationships to the generated caption to produce a more extended caption.

The caption generator 200 may deliver the extended caption and the basis on which the caption was generated to the client 100, and the client 100 may interpret the results of the deep learning through the caption and the basis received from the caption generator 200. The client 100 and the caption generator 200 may be connected by wire or wirelessly.

FIG. 2 is a diagram showing the configuration of a caption generator according to an embodiment of the present invention.

Referring to FIG. 2, the caption generator 200 according to an embodiment of the present invention may include a caption generation module 210, a relationship generation module 220, and a description generation module 230.

The caption generation module 210 has been trained on images and the ground-truth captions for those images, and may use what it has learned to generate a caption for the provided image. The caption generation module 210 may extract attribute information and object information from the image and generate a caption using the extracted attribute information and object information. Here, the attribute information may be words related to the image, and the object information may be the key objects in the provided image. For example, for an image of a dog in front of a sofa, the attribute information may be 'dog' and 'sofa', and the object information may be the 'dog' and the 'sofa' in the image.

The relationship generation module 220 may predict the relationships between objects in the image and generate a tuple set in which the predicted relationships are structured as tuples. Here, a tuple is an enumeration of elements, which may be written as elements listed inside parentheses '( )' and separated by commas ','. For example, when an image of a dog in front of a sofa is provided, the relationship generation module 220 may predict the relationship between the two objects, the dog and the sofa. That is, the relationship generation module 220 may predict that the dog is in front of the sofa and structure the predicted relationship as (sofa, in front of, dog). Here, '(sofa, in front of, dog)' may be a tuple set.
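As a minimal sketch of the tuple form described above, the following Python snippet stores one predicted relationship as a three-element tuple and renders it in the parenthesized, comma-separated notation. The names `Relation` and `format_relation` are illustrative assumptions and do not appear in the patent.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One predicted relationship: (first noun, predicate, second noun)."""
    first_noun: str
    predicate: str
    second_noun: str


# The sofa/dog example from the text, held in a one-element tuple set.
tuple_set = {Relation("sofa", "in front of", "dog")}


def format_relation(rel: Relation) -> str:
    """List the elements inside parentheses, separated by commas."""
    return "(" + ", ".join(rel) + ")"


for rel in tuple_set:
    print(format_relation(rel))
```

A `NamedTuple` is used here only so that the three positions have readable names; a plain tuple would serve equally well.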

The description generation module 230 may restructure the caption, using the caption generated by the caption generation module 210 and the tuple set generated by the relationship generation module 220, to generate an extended caption. That is, the description generation module 230 may generate a more extended caption by reflecting the relationships between objects predicted by the relationship generation module 220 in the caption generated by the caption generation module 210. In addition, the description generation module 230 may visualize the extended caption together with a graph of the tuple set, which is the basis on which the caption was generated, and transmit them to the client 100.
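The data flow between the three modules can be sketched as follows. This is only an illustration of the order of operations described above: the function and class names are hypothetical, and the three callables stand in for the trained caption generation, relationship generation, and description generation modules.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Triple = Tuple[str, str, str]  # (first noun, predicate, second noun)


@dataclass
class CaptionResult:
    extended_caption: str
    basis: List[Triple]  # tuple set returned to the client as the basis


def run_caption_generator(
    image: Any,
    generate_caption: Callable[[Any], str],             # stands in for module 210
    predict_relations: Callable[[Any], List[Triple]],   # stands in for module 220
    restructure: Callable[[str, List[Triple]], str],    # stands in for module 230
) -> CaptionResult:
    caption = generate_caption(image)
    relations = predict_relations(image)
    extended = restructure(caption, relations)
    return CaptionResult(extended_caption=extended, basis=relations)
```

Passing the modules in as callables keeps the sketch independent of any particular model implementation.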

FIG. 3 is a diagram showing the configuration of a caption generation module according to an embodiment of the present invention.

Referring to FIG. 3, the caption generation module 210 according to an embodiment of the present invention may include an attribute extraction model 212, an object recognition model 214, and an image caption model 216.

The attribute extraction model 212 may extract attribute information from the provided image and convert the attribute information into tuple form. Here, the attribute extraction model 212 may have been trained on images and their captions. That is, many images and the words related to each image may be stored in the attribute extraction model 212, mapped into a single vector space. Accordingly, the attribute extraction model 212 may use the stored information to output words related to a new image, and the output words may be used for learning.

In addition, the attribute extraction model 212 may extract words from the captions of each image, taking the verb-form words in the captions (including gerunds and participles) and the noun-form words that appear identically three or more times. The attribute extraction model 212 may use a deep learning model to learn to embed each image and its extracted words in a single vector space.
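The word selection rule above can be sketched as follows, under one reading of the text: every verb-form word is kept, and a noun is kept only if it occurs three or more times across the captions of an image. The input format (captions already tagged with coarse part-of-speech labels `"VERB"` and `"NOUN"`) and the function name are assumptions made for illustration; the patent does not name a tagger.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Each caption is a list of (word, part-of-speech) pairs; the tag names are assumed.
TaggedCaption = List[Tuple[str, str]]


def extract_attribute_words(
    captions: Iterable[TaggedCaption], min_noun_count: int = 3
) -> Set[str]:
    verbs: Set[str] = set()
    noun_counts: Counter = Counter()
    for caption in captions:
        for word, tag in caption:
            if tag == "VERB":      # verbs, gerunds, and participles
                verbs.add(word)
            elif tag == "NOUN":
                noun_counts[word] += 1
    # Keep only nouns that occur at least min_noun_count times.
    frequent_nouns = {w for w, c in noun_counts.items() if c >= min_noun_count}
    return verbs | frequent_nouns
```

With three captions that all mention "dog" but only two that mention "sofa", this sketch would keep "dog" and any verbs, and drop "sofa".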

Accordingly, the attribute extraction model 212 may extract the words most relevant to the provided image using the learned image and caption data.

The object recognition model 214 may extract the important objects in the image and convert the object regions containing the extracted objects into tuple form. Using a deep-learning-based object recognition model such as the Mask R-CNN algorithm, the object recognition model 214 may extract, as the object regions of the provided image, the regions in the image that correspond to predefined object regions.

The image caption model 216 may generate a caption for the provided image using the words extracted by the attribute extraction model 212 and the object regions extracted by the object recognition model 214.

The image caption model 216 is implemented as a deep learning algorithm and may be based on a recurrent neural network (RNN). Accordingly, the image caption model 216 may predict the relationships between objects in the image in a time-series manner.

The image caption model 216 according to an embodiment of the present invention may include an attribute attention model 216a, an object attention model 216b, a grammar learning model 216c, and a language generation model 216d.

The attribute attention model 216a may assign a word attention score to each of the words extracted by the attribute extraction model 212. The attribute attention model 216a may assign word attention scores in order of relevance to the word tag generated by the language generation model 216d at the current time step. Here, the word attention score is a value between 0 and 1, and may be closer to 1 the more relevant the word is to the word tag.

The object attention model 216b may assign a region attention score to each of the object regions extracted by the object recognition model 214. The object attention model 216b may assign region attention scores in order of relevance to the word tag generated by the language generation model 216d at the current time step. Here, the region attention score is a value between 0 and 1, and may be closer to 1 the more relevant the region is to the word tag.
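The text states only that attention scores lie between 0 and 1 and grow with relevance to the current word tag; it does not give a formula. The sketch below is therefore an assumption: it measures relevance as a dot product between a vector for the current word tag and a vector for each candidate (word or region), and squashes it with a sigmoid. The function names and the vector representation are hypothetical.

```python
import math
from typing import List, Sequence


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function, mapping any real to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def attention_scores(
    query: Sequence[float], candidates: List[Sequence[float]]
) -> List[float]:
    """Score each candidate vector against the query (current word-tag vector)."""
    scores = []
    for vec in candidates:
        relevance = sum(q * v for q, v in zip(query, vec))  # dot product (assumed)
        scores.append(_sigmoid(relevance))
    return scores
```

The same function could serve for both word attention and region attention, since both are described as relevance to the same word tag; a softmax over candidates would be another plausible choice.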

The grammar learning model 216c may learn the grammar of the sentences in image captions. The grammar learning model 216c may tag each word in the ground-truth caption sentence of an image using a grammar tagging tool such as EasySRL, and learn the grammar of the ground-truth caption sentence. Because the grammar learning model 216c learns the grammar of caption sentences, grammatical aspects can be taken into account when a caption is generated for the provided image.

The language generation model 216d may generate a word tag and a grammar tag for the caption at each time step, based on the words extracted by the attribute extraction model 212, the object regions extracted by the object recognition model 214, the word attention scores generated by the attribute attention model 216a, and the region attention scores generated by the object attention model 216b.

The language generation model 216d may predict the word tag and the grammar tag at the current time step by taking into account all of the following: the word attention values, the region attention values, the mean vector of the words converted into tuple form by the attribute extraction model 212, the mean vector of the object regions converted into tuple form by the object recognition model 214, the word generated by the language generation model 216d at the previous time step, and compressed information about all the words the language generation model 216d has generated so far. The language generation model 216d may compare the predicted word tag and grammar tag with the ground-truth caption sentence and calculate a loss value for each of the generated word tag and grammar tag. The language generation model 216d may update the learning parameters of the caption generation module 210 by reflecting the loss values for the word tag and the grammar tag.
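The per-step loss computation can be illustrated with a small sketch. The patent does not name the loss function or say how the two loss values are combined, so the cross-entropy loss and the simple sum used here are assumptions, as are the function names and the example tag `"NP"`.

```python
import math
from typing import Dict, Tuple


def cross_entropy(predicted: Dict[str, float], target: str) -> float:
    """Negative log-probability the model assigned to the ground-truth tag."""
    return -math.log(predicted[target])


def step_loss(
    word_probs: Dict[str, float],
    word_target: str,
    grammar_probs: Dict[str, float],
    grammar_target: str,
) -> Tuple[float, float, float]:
    """Return (word-tag loss, grammar-tag loss, combined loss) for one time step."""
    word_loss = cross_entropy(word_probs, word_target)
    grammar_loss = cross_entropy(grammar_probs, grammar_target)
    return word_loss, grammar_loss, word_loss + grammar_loss


# One time step: the model is unsure about the word but certain about the grammar tag.
print(step_loss({"dog": 0.5, "cat": 0.5}, "dog", {"NP": 1.0}, "NP"))
```

In a real training loop the combined value would be backpropagated to update the parameters; that step is omitted here.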

Accordingly, the language generation model 216d may use the word tags and grammar tags to generate, for the provided image, a caption sentence that takes grammar into account.

FIG. 4 is a diagram showing the configuration of a relationship generation module according to an embodiment of the present invention.

Referring to FIG. 4, the relationship generation module 220 according to an embodiment of the present invention may include an object extraction model 222, a relationship prediction model 224, and a relationship graph generation model 226.

The object extraction model 222 may extract the important object regions in the provided image. That is, the object extraction model 222 may extract the important objects in the provided image and extract the object regions containing those objects.

The relationship prediction model 224 may predict the relationships between the extracted object regions and structure the predicted relationships as tuples. Here, the relationship prediction model 224 may structure each predicted relationship in the form (first noun, predicate, second noun).

The relationship graph generation model 226 may generate a single graph for the generated tuple set. For each tuple in the set, the relationship graph generation model 226 may generate the graph by, for example, drawing an arrow from the first noun to the predicate and an arrow from the predicate to the second noun.
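The arrow rule above maps directly onto a list of directed edges. The following sketch assumes tuples of the form (first noun, predicate, second noun); the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (first noun, predicate, second noun)
Edge = Tuple[str, str]         # directed edge (source, target)


def build_relation_edges(tuples: Iterable[Triple]) -> List[Edge]:
    """Emit first noun -> predicate and predicate -> second noun for each tuple."""
    edges: List[Edge] = []
    for first, predicate, second in tuples:
        edges.append((first, predicate))
        edges.append((predicate, second))
    return edges
```

Note that in this simple form two tuples sharing the same predicate text would share a predicate node; an implementation that needs them kept apart would have to give each predicate node a unique identifier.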

FIG. 5 is a diagram showing the configuration of a description generation module according to an embodiment of the present invention.

Referring to FIG. 5, the description generation module 230 according to an embodiment of the present invention may include a sentence restructuring model 232 and a visualization model 234.

The sentence restructuring model 232 may use the caption generated by the caption generation module 210 and the tuple set generated by the relationship generation module 220 to replace, according to an algorithm, some words in the caption with phrases derived from the tuples, thereby extending the generated caption. That is, the sentence restructuring model 232 may further extend the caption generated by the caption generation module 210 by reflecting in it the tuple set generated by the relationship generation module 220.

The sentence restructuring model 232 may remove, from the tuples generated by the relationship generation module 220, those already covered by the caption generated by the caption generation module 210. Specifically, if the first noun, the second noun, and the predicate of a tuple are all contained in the caption generated by the caption generation module 210, the tuple is judged to be a duplicate and may be deleted.
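The duplicate check can be sketched as a simple filter. How "contained in the caption" is tested is not specified in the text; plain substring matching is used here as an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (first noun, predicate, second noun)


def remove_duplicate_tuples(caption: str, tuples: List[Triple]) -> List[Triple]:
    """Drop tuples whose three elements all already appear in the caption."""
    return [
        t for t in tuples
        if not all(element in caption for element in t)  # substring match (assumed)
    ]
```

A tuple survives as soon as any one of its elements is missing from the caption, so a relationship that adds even one new word is kept.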

After removing the duplicate tuples, the sentence restructuring model 232 may convert the remaining tuples into sentence form. The ordering rules below follow Korean word order. When the predicate of a tuple is a preposition, the sentence restructuring model 232 may convert the tuple into sentence form by arranging it in the order first noun, preposition, second noun. When the predicate of a tuple is a verb, the sentence restructuring model 232 may convert the tuple into sentence form by arranging it in the order second noun, verb, first noun.

For example, for the tuple (sofa, in front of, dog), in Korean (소파, 앞의, 개), the predicate is a preposition, so the sentence restructuring model 232 may convert the tuple into '소파 앞의 개' ("the dog in front of the sofa"). As another example, for the tuple (person, lie, bed), in Korean (사람, 눕다, 침대), the predicate is a verb, so the sentence restructuring model 232 may convert the tuple into '침대에 누워있는 사람' ("the person lying on the bed").
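The ordering rule can be sketched as below. Korean tokens are used because the rule is stated for Korean word order. The sketch only reorders and joins the tuple elements; the particles and verb inflection needed to get from (사람, 눕다, 침대) to '침대에 누워있는 사람' are not modeled. The function name and the `predicate_type` argument are hypothetical, and how the predicate type is determined is left outside the sketch.

```python
from typing import Tuple

Triple = Tuple[str, str, str]  # (first noun, predicate, second noun)


def tuple_to_phrase(triple: Triple, predicate_type: str) -> str:
    """Order the tuple elements by predicate type and join them with spaces."""
    first, predicate, second = triple
    if predicate_type == "preposition":
        ordered = (first, predicate, second)   # first noun, preposition, second noun
    elif predicate_type == "verb":
        ordered = (second, predicate, first)   # second noun, verb, first noun
    else:
        raise ValueError(f"unknown predicate type: {predicate_type!r}")
    return " ".join(ordered)


print(tuple_to_phrase(("소파", "앞의", "개"), "preposition"))
```

For the prepositional example the plain join already yields the phrase given in the text; for the verb example it yields only the uninflected word order.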

The sentence restructuring model 232 may convert a tuple into sentence form and reflect the converted phrase in the caption. The caption with the converted phrase applied (the extended caption) is then compared with the ground-truth caption to calculate a score, and the phrase with the highest score may be selected. The sentence restructuring model 232 may repeat this cycle of converting a tuple into sentence form, applying it to the caption, and selecting the phrase with the highest score until no tuples remain. The sentence restructuring model 232 may then take the result of the last selection as the final extended caption.
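One way to read the loop above is as a greedy procedure: at each round, apply every remaining phrase to the current caption, keep the best-scoring result, and continue until no phrases remain. The sketch below follows that reading. Several details are assumptions not given in the text: the scoring metric (a simple word-overlap fraction stands in for it), the way a phrase is applied (replacing the first occurrence of a given word), and all names.

```python
from typing import Callable, List, Tuple

# (word in the caption to replace, phrase derived from a tuple); format is assumed.
Candidate = Tuple[str, str]


def overlap_score(caption: str, reference: str) -> float:
    """Stand-in metric: fraction of reference words that occur in the caption."""
    ref_words = reference.split()
    if not ref_words:
        return 0.0
    cap_words = set(caption.split())
    return sum(w in cap_words for w in ref_words) / len(ref_words)


def extend_caption(
    caption: str,
    candidates: List[Candidate],
    reference: str,
    score: Callable[[str, str], float] = overlap_score,
) -> str:
    remaining = list(candidates)
    current = caption
    while remaining:
        # Apply each remaining phrase to the current caption and score the result.
        scored = [
            (score(current.replace(word, phrase, 1), reference), i)
            for i, (word, phrase) in enumerate(remaining)
        ]
        _, best_index = max(scored)
        word, phrase = remaining.pop(best_index)
        current = current.replace(word, phrase, 1)
    return current  # the last selected result is the final extended caption
```

Because the loop runs until the candidate list is empty, every remaining phrase is eventually applied; the scoring only decides the order in which they are applied.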

The visualization model 234 may visualize the caption extended by the sentence restructuring model 232 by matching it with the tuple sets. Specifically, the visualization model 234 may match the extended caption with the tuple sets to generate a graph representing the relationships among the tuple sets. The visualization model 234 may also transmit this graph to the client 100 so that the user can check the basis on which the extended caption was generated.

The visualization model 234 may display, on the provided image, the object regions corresponding to the tuple sets reflected in the caption. The visualization model 234 may display each object region in a different color or with a different line (for example, a different line type or thickness). The visualization model 234 may also display the phrase in the final caption that corresponds to an object region in the same color as that object region. For example, when the final caption is 'a dog in front of the sofa and a cat near the laptop, lying on the floor', the visualization model 234 may mark the sofa and the dog in the provided image as a single object region with a red line, and may display 'dog in front of the sofa' in red text in the final caption. Displaying corresponding phrases and object regions in the same color allows the user to recognize the correspondence at a glance.
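The color matching between object regions and caption phrases can be illustrated with a small sketch. The description does not say how colors are chosen, so this sketch assumes evenly spaced hues; the function names and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical, and the actual drawing on the image is left out.

```python
import colorsys
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), an assumed layout


def distinct_colors(n: int) -> List[str]:
    """Return n hex colors with evenly spaced hues (an assumed color scheme)."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, 1.0, 1.0)
        colors.append("#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255)))
    return colors


def match_colors(regions: Dict[str, Box]) -> Dict[str, Tuple[Box, str]]:
    """Give each caption phrase and its object region one shared color."""
    colors = distinct_colors(len(regions))
    return {phrase: (box, color) for (phrase, box), color in zip(regions.items(), colors)}


styled = match_colors({
    "dog in front of the sofa": (10, 0, 120, 70),
    "cat near the laptop": (200, 40, 320, 110),
})
for phrase, (box, color) in styled.items():
    # The same color would be used for the box outline and for the phrase text.
    print(phrase, box, color)
```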

FIG. 6 is a diagram illustrating caption generation for an image according to an embodiment of the present invention.

Referring to FIG. 6, when an image 10 is provided from the client 100, the attribute extraction model 212 may extract attribute information 1 from the provided image 10. The attribute extraction model 212 may extract the attribute information 1 based on the images and ground-truth captions on which it was trained. For example, the attribute extraction model 212 may extract 'dog', 'cat', 'floor', and the like as the attribute information 1.

In addition, at the same time as the attribute extraction model 212 extracts the attribute information 1, the object recognition model 214 may extract object information from the provided image 10 together with object regions 2 containing the objects. The object recognition model 214 may extract the object regions 2 based on the images and ground-truth captions on which it was trained. For example, the object recognition model 214 may extract 'dog', 'cat', 'floor', and the like as object information, and may extract the object regions 2 containing that object information.

In addition, the image caption model 216 may generate a caption 3 for the provided image 10 using the attribute information extracted by the attribute extraction model 212 and the object information extracted by the object recognition model 214. For example, the image caption model 216 may generate the caption 3 'a living room photo of a dog and a cat lying on the floor'.

FIG. 7 is a diagram illustrating extended caption generation according to an embodiment of the present invention.

Referring to FIG. 7, the object extraction model 222 may extract object information from the provided image 10 together with object regions 2 containing the objects. The object extraction model 222 may extract the object information based on the images and ground-truth captions on which it was trained. For example, the object extraction model 222 may extract 'dog', 'cat', 'sofa', 'laptop', 'door', and the like as object information, and may extract the object regions 2 containing that object information. Here, the object extraction model 222 may extract each object region 2 so that it contains two or more of the extracted objects, which allows the relationship prediction model 224 to predict the relationships between the objects within the object region 2.
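One simple way to obtain object regions that contain two or more objects is to take, for each pair of detected objects, the smallest box enclosing both. The description does not state how such regions are formed, so the pairing strategy, the `union_box` and `pair_regions` names, and the `(x_min, y_min, x_max, y_max)` box layout below are all assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed layout


def union_box(a: Box, b: Box) -> Box:
    """Smallest axis-aligned box that encloses both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def pair_regions(objects: Dict[str, Box]) -> Dict[Tuple[str, str], Box]:
    """Build one candidate region per pair of detected objects (assumed strategy)."""
    return {
        (name_a, name_b): union_box(box_a, box_b)
        for (name_a, box_a), (name_b, box_b) in combinations(objects.items(), 2)
    }


detections = {"dog": (10, 20, 50, 60), "sofa": (30, 0, 120, 70)}
print(pair_regions(detections))  # {('dog', 'sofa'): (10, 0, 120, 70)}
```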

The relationship prediction model 224 may predict the relationships between the objects extracted by the object extraction model 222 and generate tuple sets 4 from those relationships. For example, the relationship prediction model 224 may predict that the relationship between the extracted objects 'sofa' and 'dog' is that the dog is in front of the sofa, and accordingly generate the tuple set 4 (sofa, in front of, dog). As another example, the relationship prediction model 224 may predict that the relationship between the extracted objects 'cat' and 'door' is that the cat is next to the door, and accordingly generate the tuple set 4 (door, next to, cat).

The sentence restructuring model 232 may use the tuple sets 4 generated by the relationship prediction model 224 to replace some words of the generated caption 3 with phrases for the tuple sets according to an algorithm, thereby extending the caption 3. That is, the sentence restructuring model 232 may further extend the caption 3 generated by the caption generation module 210 by reflecting in it the tuple sets 4 generated by the relationship generation module 220. For example, the sentence restructuring model 232 may extend the caption to 'a living room photo of a dog in front of the sofa and a cat near the laptop next to the door, lying on the floor'.

The relationship graph generation model 226 may generate a relationship graph for the tuple sets 4 generated by the relationship prediction model 224. Here, the relationship graph generation model 226 may represent the predicate of each tuple set 4 as a rectangular box and the nouns of the tuple set as circular boxes, and may connect the boxes in the order first noun - predicate - second noun.
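The graph layout described above (rectangular predicate boxes, circular noun boxes, connected in the order first noun - predicate - second noun) can be written out as Graphviz DOT text. The description does not name a graph format, so the use of DOT and the `to_dot` function are assumptions made for illustration; rendering the text requires a separate Graphviz installation.

```python
from typing import List, Tuple


def to_dot(tuples: List[Tuple[str, str, str]]) -> str:
    """Emit Graphviz DOT text for (first noun, predicate, second noun) tuple sets."""
    lines = ["digraph relations {"]
    nouns: List[str] = []
    for first, _, second in tuples:
        for noun in (first, second):
            if noun not in nouns:
                nouns.append(noun)
    for noun in nouns:
        # Nouns are drawn as circular nodes.
        lines.append(f'  "{noun}" [shape=circle];')
    for i, (first, predicate, second) in enumerate(tuples):
        # Each predicate gets its own rectangular node so repeated predicates stay separate.
        lines.append(f'  "p{i}" [shape=box, label="{predicate}"];')
        lines.append(f'  "{first}" -> "p{i}" -> "{second}";')
    lines.append("}")
    return "\n".join(lines)


print(to_dot([("sofa", "in front of", "dog"), ("door", "next to", "cat")]))
```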

The visualization model 234 may display the phrases of the extended caption on the image as object regions, each of which may be shown in a different color. The visualization model 234 may also visualize the phrases of the extended caption corresponding to each object region by displaying them in the same color as that object region.

FIG. 8 is a diagram illustrating a method for automatically generating an image caption according to an embodiment of the present invention.

Referring to FIG. 8, the caption generation module 210 may extract attribute information and object information from a provided image and generate a caption reflecting the extracted attribute information and object information (S100).

The caption generation module 210 may extract attribute information and object information from the image and generate a caption using them. Here, the attribute information may be words related to the image, and the object information may be the key objects of the provided image. The caption generation module 210 may generate the caption for the provided image based on images, and captions for each of those images, learned through deep learning.

The relationship generation module 220 may predict the relationships between objects in the image and generate tuple sets for the predicted relationships (S200). The relationship generation module 220 may represent each relationship between objects in the image as a tuple set of the form (first noun, predicate, second noun).

The description generation module 230 may generate an extended caption using the caption generated by the caption generation module 210 and the tuple sets generated by the relationship generation module 220 (S300). The description generation module 230 may extend the caption by converting the tuple sets into sentences and reflecting them in the caption.

The description generation module 230 may visualize the extended caption and the relationships between the objects as a graph (S400). The description generation module 230 may generate the graph by matching the extended caption with the relationships between the objects, and may transmit the generated graph to the client 100 so that the user can check the basis on which the extended caption was generated.

FIG. 9 is a diagram illustrating a method for generating a caption according to an embodiment of the present invention.

Referring to FIG. 9, the attribute extraction model 212 may extract attribute information of the image (S110). Here, the attribute extraction model 212 may have been trained on images and their captions. Accordingly, the attribute extraction model 212 may use what it has learned to output attribute information related to a new image.

The object recognition model 214 may extract important objects in the image and convert the object regions containing the extracted objects into tuple form (S120). The object recognition model 214 may use a deep learning-based object recognition model, such as the Mask R-CNN algorithm, to extract the regions of the provided image that correspond to predefined object regions as the object regions of the provided image.

The image caption model 216 may assign word attention and region attention to the attribute information and object regions extracted from the provided image (S130). The image caption model 216 may assign word attention to words in order of their relevance to the word tag generated at the current time step. Here, word attention and region attention take values between 0 and 1, and the more relevant an item is to the word tag, the closer its value is to 1.
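Attention values that lie between 0 and 1 and grow with relevance can be illustrated with a softmax over relevance scores. The description does not state how relevance is measured or normalized, so the dot-product scoring, the softmax normalization, and the `attention_weights` name are assumptions; the vectors are hypothetical embeddings of the current word tag and of the candidate words or regions.

```python
import math
from typing import List


def attention_weights(query: List[float], keys: List[List[float]]) -> List[float]:
    """Softmax over dot-product relevance scores (an assumed formulation)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


word_tag = [1.0, 0.0]                               # hypothetical embedding of the current word tag
candidates = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]  # hypothetical word or region embeddings
weights = attention_weights(word_tag, candidates)
print(weights)  # the most relevant candidate receives the weight closest to 1
```

With a softmax the weights also sum to 1; an element-wise sigmoid would be an alternative that keeps each value between 0 and 1 without that constraint.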

The image caption model 216 may predict a word tag and a grammar tag for the caption at each time step, based on the attribute information extracted by the attribute extraction model 212, the object regions extracted by the object recognition model 214, the word attention, and the region attention (S140). The image caption model 216 may compare the predicted word tags and grammar tags with the ground-truth caption sentence to calculate a loss value for the word tags and a loss value for the grammar tags.

The image caption model 216 may generate the caption reflecting the loss values for the word tags and the grammar tags (S150). Accordingly, the image caption model 216 may use the word tags and grammar tags to generate, and learn, a grammatically informed caption sentence for the provided image.
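The two loss values computed in steps S140 and S150 can be illustrated as follows. The description does not name the loss function or say how the two losses are combined, so cross-entropy, the weighted sum, and the `grammar_weight` parameter are assumptions for illustration; the probability lists stand for the predicted distributions over a hypothetical word vocabulary and a hypothetical grammar-tag set at one time step.

```python
import math
from typing import List


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the ground-truth index (assumed loss function)."""
    return -math.log(max(probs[target], 1e-12))  # clamp to avoid log(0)


def step_loss(word_probs: List[float], word_target: int,
              grammar_probs: List[float], grammar_target: int,
              grammar_weight: float = 1.0) -> float:
    """Combine the word-tag loss and the grammar-tag loss for one time step."""
    word_loss = cross_entropy(word_probs, word_target)
    grammar_loss = cross_entropy(grammar_probs, grammar_target)
    return word_loss + grammar_weight * grammar_loss  # the weighted sum is an assumption


# One time step: predicted distributions and ground-truth indices (hypothetical values).
loss = step_loss([0.7, 0.2, 0.1], 0, [0.6, 0.4], 1)
print(loss)
```

Summing this per-step value over all time steps of a caption would give a single training objective that penalizes both wrong words and wrong grammar tags.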

FIG. 10 is a diagram illustrating a method for generating an extended caption according to an embodiment of the present invention.

Referring to FIG. 10, the description generation module 230 may remove, from among the tuple sets generated by the relationship generation module 220, those tuple sets already contained in the caption generated by the caption generation module 210 (S310). Specifically, when the first noun, second noun, and predicate of a tuple set all appear in the caption generated by the caption generation module 210, the tuple set may be determined to be a duplicate tuple set and deleted.
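The duplicate check in step S310 can be sketched as a simple filter. The description only says that a tuple set is a duplicate when all three of its elements are contained in the caption; the whitespace tokenization, case folding, and the `remove_duplicates` name below are assumptions for illustration.

```python
from typing import List, Tuple

RelationTuple = Tuple[str, str, str]  # (first noun, predicate, second noun)


def remove_duplicates(tuples: List[RelationTuple], caption: str) -> List[RelationTuple]:
    """Drop tuple sets whose three elements all already appear in the caption."""
    caption_words = set(caption.lower().split())

    def covered(term: str) -> bool:
        # A multi-word term counts as covered when every one of its words is present.
        return all(word in caption_words for word in term.lower().split())

    return [t for t in tuples if not all(covered(term) for term in t)]


caption = "a dog lying on the floor"
tuples = [("dog", "on", "floor"), ("sofa", "in front of", "dog")]
print(remove_duplicates(tuples, caption))  # [('sofa', 'in front of', 'dog')]
```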

The description generation module 230 may remove the duplicate tuple sets and convert the remaining tuple sets into sentence form (S320). When the predicate of a tuple set is a preposition, the description generation module 230 may convert the tuple set into sentence form by arranging its elements in the order first noun - preposition - second noun. When the predicate is a verb, the description generation module 230 may instead arrange the elements in the order second noun - verb - first noun.

The description generation module 230 may reflect the sentences converted from the tuple sets in the caption (S330). The caption reflecting a converted sentence (the extended caption) may then be compared with the ground-truth caption to calculate a score, and the phrase with the highest score may be selected. The sentence restructuring model 232 may repeat this cycle of converting a tuple set into sentence form, applying it to the caption, and selecting the phrase with the highest score until no tuple sets remain. The description generation module 230 may then select the last selected phrase as the final extended caption.

The description generation module 230 may visualize the caption extended by the sentence restructuring model 232 by matching it with the tuple sets (S340). The description generation module 230 may match the extended caption with the tuple sets to generate a graph representing the relationships among the tuple sets. The visualization model 234 may also transmit this graph to the client 100 so that the user can check the basis on which the extended caption was generated.

As described above, embodiments of the present invention can provide a system and method for automatically generating image captions that use deep learning to extract attribute information and object information from an image to generate a caption, and that restructure the generated caption by predicting the relationships between the object information.

Those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical idea or essential features, and that the embodiments described above are therefore illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: Client 200: Caption generator
210: Caption generation module 220: Relationship generation module
230: Description generation module 212: Attribute extraction model
214: Object recognition model 216: Image caption model
216a: Attribute attention model 216b: Object attention model
216c: Grammar learning model 216d: Language generation model
222: Object extraction model 224: Relationship prediction model
226: Relationship graph generation model 232: Sentence restructuring model
234: Visualization model

Claims

An automatic caption generation system for automatically generating a caption describing an image, the system comprising:
a client that provides an image for which the caption is to be generated; and
a caption generator that analyzes the image provided by the client, generates a caption describing the image, and transmits the generated caption and the basis for generating the caption to the client,
wherein the caption generator comprises:
a caption generation module that extracts attribute information and object information from the provided image using deep learning and generates the caption using the attribute information and the object information;
a relationship generation module that predicts relationships between objects in the image and generates a tuple set in which the predicted relationships are structured in tuple form; and
a description generation module that generates an extended caption by restructuring the caption using the caption generated by the caption generation module and the tuple set generated by the relationship generation module, and visualizes a graph for the extended caption and the tuple set,
wherein the caption generation module comprises:
an attribute extraction model that extracts the words most related to the provided image and converts each word into tuple form;
an object recognition model that extracts important objects in the image and converts an object region containing the extracted objects into tuple form; and
an image caption model that generates a caption for the image using the words extracted by the attribute extraction model and the object region extracted by the object recognition model,
wherein the image caption model comprises:
an attribute attention model that assigns word attention to the words extracted by the attribute extraction model;
an object attention model that assigns region attention to the object regions extracted by the object recognition model;
a grammar learning model that learns the grammar of sentences for the image and the caption of the image; and
a language generation model that generates a word tag and a grammar tag for the caption at each time step based on the words extracted by the attribute extraction model, the object regions extracted by the object recognition model, the word attention, and the region attention, and
wherein the language generation model predicts the word tag and the grammar tag at the current time step by taking into account all of the following: the word attention, the region attention, the average vector of the words converted into tuple form by the attribute extraction model, the average vector of the object regions converted into tuple form by the object recognition model, the word generated by the language generation model at the previous time step, and compressed information about all words generated by the language generation model.

(Deleted)

The system of claim 1,
wherein the image caption model is executed using a deep learning algorithm, is based on a recurrent neural network (RNN), and predicts relationships between objects in the image as a time series.

(Deleted)

The system of claim 1,
wherein the attribute attention model assigns the word attention to words in order of their relevance to the word tag generated by the language generation model,
the object attention model assigns the region attention in order of relevance to the word tag generated by the language generation model, and
the word attention and the region attention are values between 0 and 1 that are closer to 1 the higher the relevance to the word tag.

The system of claim 1,
wherein the relationship generation module comprises:
an object recognition model that extracts important object regions within the provided image;
a relationship prediction model that predicts relationships between the extracted regions and generates a tuple set by structuring the predicted relationships between the regions in tuple form; and
a relationship graph generation model that generates a graph for the generated tuple set.

The system of claim 7,
wherein the description generation module comprises:
a sentence restructuring model that extends the generated caption by replacing some words with phrases for tuples according to an algorithm, using the caption generated by the caption generation module and the tuple set generated by the relationship generation module; and
a visualization model that visualizes the caption extended by the sentence restructuring model by matching it with the tuple set.

An automatic caption generation method for automatically generating a caption describing an image, the method comprising:
extracting, by a caption generation module, attribute information and object information from an image using deep learning, and generating the caption using the attribute information and the object information;
predicting, by a relationship generation module, relationships between objects in the image, and generating a tuple set in which the predicted relationships are structured in tuple form; and
restructuring, by a description generation module, the caption using the generated caption and the tuple set to generate an extended caption, and visualizing a graph for the extended caption and the tuple set,
wherein generating the caption comprises:
extracting, by the caption generation module, the words most related to the image and converting each word into tuple form;
extracting, by an object recognition model, important objects in the image and converting an object region containing the extracted objects into tuple form; and
generating, by an image caption model, a caption for the image using the extracted words and the extracted object region,
wherein generating the caption for the image comprises:
assigning, by the image caption model, word attention to the extracted words;
assigning, by the image caption model, region attention to the extracted object regions;
predicting a word tag and a grammar tag for the caption at each time step based on the attribute information, the object regions, the word attention, and the region attention; and
generating the caption reflecting loss values for the word tags and the grammar tags.

(Deleted)

The method of claim 9, wherein generating the tuple set comprises:
extracting, by an object recognition model, important object regions within the image;
predicting, by a relationship prediction model, relationships between the extracted regions, and structuring the predicted relationships between the regions in tuple form to generate the tuple set; and
generating, by a relationship graph generation model, a graph for the generated tuple set.

The method of claim 9, wherein visualizing the graph for the tuple set comprises:
extending, by a sentence restructuring model, the generated caption by replacing some words with phrases for tuples according to an algorithm, using the generated caption and the tuple set generated by the relationship generation module; and
visualizing, by a visualization model, the extended caption by matching it with the tuple set.