KR102551960B1

KR102551960B1 - Image captioning method and system based on object information condition

Info

Publication number: KR102551960B1
Application number: KR1020210100468A
Authority: KR
Inventors: 조충상; 이영한
Original assignee: 한국전자기술연구원
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2023-07-06
Also published as: KR20230018657A

Abstract

A method and system for generating image captions conditioned on object information are provided. An automatic caption generation method according to an embodiment of the present invention extracts an object information vector from visual data, extracts a feature vector from the visual data, fuses the extracted object information vector with the feature vector, and generates a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption. Accordingly, when a caption is generated for an image or video, object information such as the presence or absence of objects is learned and analyzed together as a condition, so that caption information that better reflects the characteristics of the objects can be generated.

Description

Image captioning method and system based on object information condition

The present invention relates to artificial intelligence technology and, more particularly, to a method and system for automatically generating captions for images and videos using an artificial intelligence model.

Image caption generation is a technology that automatically generates a sentence describing a given image or video. Rapid advances in artificial intelligence have made it possible to provide this function.

Currently, to generate captions for images, a CNN model is trained using the feature information and caption information of images. The resulting captions sometimes match the content of the image yet read unnaturally because they do not fit the characteristics of the objects in it.

This is attributed to the CNN model relying excessively on the feature vector of the image when generating a caption, and a way to resolve this is needed.

The present invention has been devised to solve the above problem. An object of the present invention is to provide a method and system that, when generating a caption for an image or video, learn and analyze information about objects together as a condition, thereby generating caption information that better reflects the characteristics of the objects.

To achieve the above object, an automatic caption generation method according to an embodiment of the present invention includes: extracting an object information vector from visual data; extracting a feature vector from the visual data; fusing the extracted object information vector with the feature vector; and generating a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption.

The object information vector extraction step may include: detecting an object in the visual data; and converting object information about the detected object into the object information vector.

The object information may include at least one of object type, object count, and object location.

The object information vector may be a vector listing indices, each of which indicates whether the corresponding object type is present in the visual data.

In the object information extraction step, when control information instructing that object information be considered in caption generation is input, the object information vector may be extracted from the visual data; when control information instructing that object information not be considered in caption generation is input, dummy data may be recorded in the indices of the object information vector.

In the fusion step, the feature vector may be concatenated after the object information vector to fuse the two into a single vector.

In the fusion step, the size of the fusion vector may be converted to match the input size of the caption generation model.

In the fusion step, the size of the fusion vector may be converted by inputting the fusion vector into a vector conversion model, which is an artificial intelligence model trained to receive the fusion vector and convert it to match the input size of the caption generation model.

The vector conversion model may be trained so as to reduce the difference between the ground truth (GT) caption and the caption generated by inputting the size-converted fusion vector output by the vector conversion model into the caption generation model.

Meanwhile, an automatic caption generation system according to another embodiment of the present invention includes: an object informatization module that extracts an object information vector from visual data; a feature extraction module that extracts a feature vector from the visual data; a fusion module that fuses the extracted object information vector with the feature vector; and a captioning module that generates a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption.

Meanwhile, an automatic caption generation method according to another embodiment of the present invention includes: fusing an object information vector extracted from visual data with a feature vector; and generating a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption.

Meanwhile, an automatic caption generation system according to another embodiment of the present invention includes: a fusion module that fuses an object information vector extracted from visual data with a feature vector; and a captioning module that generates a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption.

As described above, according to embodiments of the present invention, when a caption is generated for an image or video, object information such as the presence or absence of objects is learned and analyzed together as a condition, so that caption information that better reflects the characteristics of the objects can be generated.

FIG. 1 is a block diagram of a caption generation system according to an embodiment of the present invention;
FIG. 2 is a flowchart provided to explain a caption generation method according to another embodiment of the present invention;
FIG. 3 is a diagram illustrating an object detection method and its result;
FIG. 4 is a diagram illustrating a method and result of converting detected object type information into an object information vector;
FIG. 5 is a diagram illustrating a method and result of generating an object information vector when object information is not considered;
FIG. 6 is a diagram illustrating CNN-based deep learning models for extracting a feature vector from visual data;
FIG. 7 is a diagram illustrating a method and result of fusing an object information vector with a visual data feature vector; and
FIG. 8 is a diagram illustrating a method and result of converting the size of a fusion vector.

Hereinafter, the present invention will be described in more detail with reference to the drawings.

An embodiment of the present invention presents a method and system for generating image captions conditioned on object information. It is a technique for taking object information into account when generating captions for images and videos.

Specifically, when captions for an image are learned and generated, information about objects derived from the image is added as a condition alongside the feature vector of the image, so that an image caption that takes the objects into account can be generated.

FIG. 1 is a block diagram of a caption generation system according to an embodiment of the present invention. As shown, the caption generation system according to this embodiment includes an input unit 110, an object informatization module 120, and a caption generation module 130.

The input unit 110 receives visual data such as images and videos and passes it to the object informatization module 120 and the caption generation module 130.

The object informatization module 120 extracts an object information vector from the visual data and provides it to the caption generation module 130. It includes an object detection engine 121 and an object information embedding module 122.

The caption generation module 130 extracts a feature vector from the visual data, fuses it with the object information vector extracted by the object informatization module 120, and generates a caption using the fused information. It includes a feature extraction module 131, a fusion module 132, and a captioning module 133.

The process by which the system shown in FIG. 1 generates a caption is described in detail with reference to FIG. 2. FIG. 2 is a flowchart provided to explain a caption generation method according to another embodiment of the present invention.

When visual data such as a video or image is input through the input unit 110 (S210), the object informatization module 120 checks control information indicating whether object information is to be considered in caption generation (S220).

If the control information indicates that object information is to be considered (S220-YES), the object detection engine 121 of the object informatization module 120 detects the main objects in the visual data input in step S210 (S230).

FIG. 3 illustrates an object detection method and its result. As shown, the object detection result yields object information such as object type, object count, and object location. The description below assumes that object type is used as the object information. This is only an example, however, and does not exclude replacing object type with other object information or adding other object information.
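The object information obtained in step S230 can be pictured as a small summary built from raw detections. The sketch below is only an illustration: the `Detection` structure, the `(x_min, y_min, x_max, y_max)` box layout, and the `summarize_detections` helper are assumptions introduced here, not details given in the patent.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object (hypothetical structure, not from the patent)."""
    label: str   # object type, e.g. "dog"
    box: tuple   # assumed layout: (x_min, y_min, x_max, y_max) in pixels


def summarize_detections(detections):
    """Collect object type, object count, and object location per type."""
    counts = Counter(d.label for d in detections)
    locations = {}
    for d in detections:
        locations.setdefault(d.label, []).append(d.box)
    return {
        "types": sorted(counts),    # which object types were found
        "counts": dict(counts),     # how many objects of each type
        "locations": locations,     # where each object was found
    }


detections = [
    Detection("dog", (40, 60, 180, 220)),
    Detection("food", (200, 150, 260, 200)),
    Detection("house", (0, 0, 320, 140)),
    Detection("dog", (190, 70, 300, 210)),
]
info = summarize_detections(detections)
```

Only the `types` entry is needed for the object-type encoding assumed in the rest of the description; counts and locations are kept to show the other kinds of object information the text mentions.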

Next, the object information embedding module 122 of the object informatization module 120 converts the object information about the objects detected in step S230 into an object information vector (S240).

When object type is used as the object information, the object information vector in step S240 is a vector listing indices, each of which indicates whether the corresponding object type is present in the visual data.

Specifically, when there are N object types, the object information vector is a 1×N vector, and each index of the vector indicates whether the corresponding object type has been detected, that is, whether that object is present in the visual data.

FIG. 4 illustrates a method and result of converting detected object type information into an object information vector. In FIG. 4, the number of object types (N) is assumed to be 8: the first object type is person, the second is dog, the third is tree, the fourth is flower, the fifth is food, the sixth is house, the seventh is car, and the eighth is toy. In the object information vector shown in FIG. 4, the second, fifth, and sixth indices are encoded as 1, which means that a dog (second type), food (fifth type), and a house (sixth type) are present in the visual data.
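The encoding in the FIG. 4 example can be sketched as a multi-hot vector over the eight object types listed in the text. The function name `encode_object_types` and the use of plain Python lists are assumptions made for illustration.

```python
# Object types in the order given for FIG. 4 (N = 8).
OBJECT_TYPES = ["person", "dog", "tree", "flower", "food", "house", "car", "toy"]


def encode_object_types(detected_labels, object_types=OBJECT_TYPES):
    """Return a 1xN vector with 1 where the object type was detected, else 0."""
    detected = set(detected_labels)  # duplicates collapse: only presence matters
    return [1 if object_type in detected else 0 for object_type in object_types]


# A dog, food, and a house are detected (the dog twice).
vector = encode_object_types(["dog", "food", "house", "dog"])
print(vector)  # [0, 1, 0, 0, 1, 1, 0, 0]
```

The second, fifth, and sixth positions are set, matching the vector described for FIG. 4.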

On the other hand, if the control information indicates that object information is not to be considered (S220-NO), the object detection engine 121 does not detect objects in the input visual data, and the object information embedding module 122 records dummy data in the indices of the object information vector (S250).

FIG. 5 illustrates a method and result of generating an object information vector when object information is not considered. As shown in FIG. 5, every index of the object information vector holds the dummy value "1".
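The branch on control information (S220 to S250) can be sketched as follows. The boolean `consider_objects` flag, the `detect_fn` callback returning a set of detected type indices, and the `fake_detector` stub are assumptions for illustration; the patent does not specify how the control information or the detector interface is represented.

```python
def object_info_vector(consider_objects, image, detect_fn, num_types):
    """Build the object information vector according to the control information."""
    if not consider_objects:
        # S250: detection is skipped and every index holds the dummy value 1.
        return [1] * num_types
    # S230/S240: detect objects, then mark which object types are present.
    present = detect_fn(image)
    return [1 if i in present else 0 for i in range(num_types)]


calls = []


def fake_detector(image):
    """Stand-in detector (hypothetical): returns zero-based type indices."""
    calls.append(image)
    return {1, 4, 5}  # dog, food, house in the FIG. 4 ordering


with_objects = object_info_vector(True, "image.jpg", fake_detector, 8)
without_objects = object_info_vector(False, "image.jpg", fake_detector, 8)
```

After both calls, `calls` holds a single entry, showing that the detector runs only when the control information asks for object information.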

Next, the feature extraction module 131 of the caption generation module 130 extracts a feature vector from the visual data input in step S210 (S260). The feature vector extraction in step S260 can be implemented with a CNN-based deep learning model, and various networks can be used for this purpose, as shown in FIG. 6.

The fusion module 132 of the caption generation module 130 concatenates the visual data feature vector extracted in step S260 after the object information vector generated in step S240 or S250, fusing them into a single vector (S270). FIG. 7 illustrates a method and result of fusing an object information vector with a visual data feature vector.
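The fusion in step S270 is a plain concatenation, with the object information vector first and the feature vector after it. A minimal sketch, assuming both vectors are flat lists of numbers (the feature values below are made up for the example):

```python
def fuse(object_info_vector, feature_vector):
    """Concatenate the feature vector after the object information vector."""
    return [float(v) for v in object_info_vector] + [float(v) for v in feature_vector]


object_info = [0, 1, 0, 0, 1, 1, 0, 0]   # 1x8 vector from the FIG. 4 example
features = [0.12, -0.40, 0.88, 0.05]     # illustrative values, not real CNN output
fusion_vector = fuse(object_info, features)
```

The fused length is N plus the feature dimension (here 8 + 4 = 12), which is why a later size conversion may be needed before the caption generation model can consume it.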

In addition, in step S270, the fusion module 132 converts the size of the fusion vector to match the input size of the caption generation model described below, in case the size of the fusion vector does not match that input size. The caption generation model is an artificial intelligence model trained to receive the fusion vector and generate a caption, and it is trained by the captioning module 133.

FIG. 8 illustrates a method and result of converting the size of a fusion vector. As shown, the fusion vector is input into a vector conversion model, which is an artificial intelligence model trained to receive the fusion vector and convert it to match the input size of the caption generation model, and is thereby converted to that size.

In FIG. 8, the vector conversion model is implemented as an MLP (Multi-Layer Perceptron). The vector conversion model is trained so as to reduce the difference (loss) between the GT (Ground Truth) caption and the caption generated by inputting the size-converted fusion vector output by the vector conversion model into the caption generation model.
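The size conversion can be sketched as the forward pass of a small MLP. Everything specific here is an assumption for illustration: the single hidden layer, the ReLU activation, the random initialization, and the dimensions (24 in, 16 hidden, 12 out) are not given in the patent, and the training against the caption loss is not shown.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Create untrained parameters for a two-layer MLP (illustrative only)."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        scale = 1.0 / n_in ** 0.5
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases

    return [layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)]


def mlp_forward(params, x):
    """Map a fusion vector to the input size expected by the caption model."""
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on the hidden layer only
    return x


# 8 object-type indices followed by a 16-dimensional feature vector (made-up values).
fusion_vector = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0] + [0.1] * 16
params = make_mlp(in_dim=24, hidden_dim=16, out_dim=12)
converted = mlp_forward(params, fusion_vector)
```

In the described scheme these parameters would be learned so that the caption produced from `converted` moves closer to the GT caption; the random weights here only demonstrate the change in vector size.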

Next, the captioning module 133 of the caption generation module 130 generates a caption by inputting the fusion vector produced in step S270 into the caption generation model (S280).
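Putting steps S210 through S280 together, the flow of FIG. 2 can be sketched as one function that wires the components in order. The component interfaces (`detect`, `extract_features`, `convert_size`, `caption_model`) and all stub implementations below are hypothetical placeholders; in the described system each would be a trained model.

```python
def generate_caption(image, consider_objects, *, detect, extract_features,
                     convert_size, caption_model, num_types):
    """Run the captioning flow of FIG. 2 with pluggable components."""
    # S220-S250: build the object information vector.
    if consider_objects:
        present = detect(image)                      # S230: detected type indices
        object_info = [1.0 if i in present else 0.0  # S240: encode presence
                       for i in range(num_types)]
    else:
        object_info = [1.0] * num_types              # S250: dummy data
    features = list(extract_features(image))         # S260: feature vector
    fused = object_info + features                   # S270: concatenate
    model_input = convert_size(fused)                # S270: match model input size
    return caption_model(model_input)                # S280: generate the caption


seen_inputs = []


def stub_caption_model(vector):
    """Placeholder for the trained caption generation model."""
    seen_inputs.append(vector)
    return "caption from a %d-dimensional input" % len(vector)


caption = generate_caption(
    "image.jpg",
    True,
    detect=lambda image: {1, 4, 5},             # stub detector
    extract_features=lambda image: [0.5] * 4,   # stub CNN features
    convert_size=lambda vector: vector[:10],    # stub for the learned MLP
    caption_model=stub_caption_model,
    num_types=8,
)
```

Passing the components as arguments keeps the sketch independent of any particular detector, CNN, or caption model, which the description leaves open.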

The method and system for generating image captions conditioned on object information have been described above in detail with reference to preferred embodiments.

In the above embodiments, when a caption is generated for an image or video, the presence or absence of objects is learned together as a condition, so that caption information that takes the characteristics of the objects into account can be derived.

Because information about the objects in the image is additionally received and used when a video or image caption is generated, the above embodiments make it possible to derive an image caption that takes the objects into account.

Meanwhile, the technical idea of the present invention can of course also be applied to a computer-readable recording medium containing a computer program that performs the functions of the apparatus and method according to the present embodiments. The technical ideas according to the various embodiments of the present invention may also be implemented as computer-readable code recorded on a computer-readable recording medium. The computer-readable recording medium may be any data storage device that can be read by a computer and can store data, for example ROM, RAM, CD-ROM, magnetic tape, a floppy disk, an optical disk, or a hard disk drive. In addition, computer-readable code or programs stored on a computer-readable recording medium may be transmitted over a network connecting computers.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can of course be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or perspective of the present invention.

110: input unit
120: object informatization module
121: object detection engine
122: object information embedding module
130: caption generation module
131: feature extraction module
132: fusion module
133: captioning module

Claims

An automatic caption generation method comprising:
extracting an object information vector from visual data;
extracting a feature vector from the visual data;
fusing the extracted object information vector with the feature vector; and
generating a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption,
wherein the object information vector extraction step includes:
detecting an object in the visual data; and
converting object information about the detected object into the object information vector,
wherein the object information includes at least one of object type, object count, and object location,
wherein the object information vector is a vector listing indices that each indicate whether the corresponding object type is present in the visual data, and
wherein, in the object information vector extraction step,
when control information instructing that object information be considered in caption generation is input, the object information vector is extracted from the visual data, and
when control information instructing that object information not be considered in caption generation is input, dummy data is recorded in the indices of the object information vector.

delete

The method of claim 1,
wherein, in the fusion step,
the feature vector is concatenated after the object information vector to fuse the two into a single vector.

The method of claim 6,
wherein, in the fusion step,
the size of the fusion vector is converted to match the input size of the caption generation model.

An automatic caption generation method comprising:
extracting an object information vector from visual data;
extracting a feature vector from the visual data;
fusing the extracted object information vector with the feature vector; and
generating a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption,
wherein the object information vector extraction step includes:
detecting an object in the visual data; and
converting object information about the detected object into the object information vector,
wherein the object information includes at least one of object type, object count, and object location,
wherein the object information vector is a vector listing indices that each indicate whether the corresponding object type is present in the visual data, and
wherein, in the fusion step,
the feature vector is concatenated after the object information vector to fuse the two into a single vector,
the size of the fusion vector is converted to match the input size of the caption generation model, and
the size of the fusion vector is converted by inputting the fusion vector into a vector conversion model, which is an artificial intelligence model trained to receive the fusion vector and convert it to match the input size of the caption generation model.

The method of claim 8,
wherein the vector conversion model
is trained so as to reduce the difference between the ground truth (GT) caption and the caption generated by inputting the size-converted fusion vector output by the vector conversion model into the caption generation model.

An automatic caption generation system comprising:
an object informatization module that extracts an object information vector from visual data;
a feature extraction module that extracts a feature vector from the visual data;
a fusion module that fuses the extracted object information vector with the feature vector; and
a captioning module that generates a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption,
wherein the object informatization module
detects an object in the visual data and
converts object information about the detected object into the object information vector,
wherein the object information includes at least one of object type, object count, and object location,
wherein the object information vector is a vector listing indices that each indicate whether the corresponding object type is present in the visual data, and
wherein the object informatization module
extracts the object information vector from the visual data when control information instructing that object information be considered in caption generation is input, and
records dummy data in the indices of the object information vector when control information instructing that object information not be considered in caption generation is input.

An automatic caption generation method comprising:
fusing an object information vector extracted from visual data with a feature vector; and
generating a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption,
wherein the object information vector is extracted by converting object information about an object detected in the visual data,
wherein the object information includes at least one of object type, object count, and object location,
wherein the object information vector is a vector listing indices that each indicate whether the corresponding object type is present in the visual data,
wherein the object information vector is extracted from the visual data when control information instructing that object information be considered in caption generation is input, and
wherein dummy data is recorded in the indices when control information instructing that object information not be considered in caption generation is input.

An automatic caption generation system comprising:
a fusion module that fuses an object information vector extracted from visual data with a feature vector; and
a captioning module that generates a caption by inputting the fusion vector into a caption generation model, which is an artificial intelligence model trained to receive the fusion vector and generate a caption,
wherein the object information vector is extracted by converting object information about an object detected in the visual data,
wherein the object information includes at least one of object type, object count, and object location,
wherein the object information vector is a vector listing indices that each indicate whether the corresponding object type is present in the visual data,
wherein the object information vector is extracted from the visual data when control information instructing that object information be considered in caption generation is input, and
wherein dummy data is recorded in the indices when control information instructing that object information not be considered in caption generation is input.