KR20230062430A

KR20230062430A - Method, apparatus and system for determining story-based image sequence

Info

Publication number: KR20230062430A
Application number: KR1020220141376A
Authority: KR
Inventors: 김건우; 조시현; 김성우
Original assignee: 서울대학교산학협력단
Priority date: 2021-10-29
Filing date: 2022-10-28
Publication date: 2023-05-09

Abstract

The present invention relates to a story-based image sequence determination method, apparatus, and image sequence determination system, and more particularly, to a story-based image sequence determination method, apparatus, and image sequence determination system that select images depicting a text while taking into account both the context of its sentences and the visual context.

Description

Method, apparatus and system for determining story-based image sequence

The present invention relates to a story-based image sequence determination method, apparatus, and system, and more particularly, to a story-based image sequence determination method, apparatus, and system that select an image sequence depicting a text while taking into account both the context of its sentences and the visual context.

Most existing image retrieval systems have a single-input, single-output structure: they receive a single sentence or a single image as a query and retrieve the one image that is semantically or visually most similar.

Korean Patent Registration No. 10-173254 discloses a configuration comprising "a computer-implemented method for searching a plurality of images, the method comprising: receiving a search query that includes an image; identifying, by a computing device, at least one first descriptor identifier based on the search query, wherein the at least one first descriptor identifier corresponds to a descriptor describing a point of interest in the image; searching a plurality of indexed images by comparing the at least one first descriptor identifier with one or more second descriptor identifiers associated with each of the plurality of indexed images; and ranking one or more of the indexed images based on the comparison result."

Korean Patent Publication No. 10-2021-009347 discloses a configuration comprising "obtaining a user query and search information including at least one of a search image and an image address based on search results of an image search engine; detecting a keyword-category combination based on the type of the obtained user query; determining whether cache data matching the detected keyword-category combination exists; when no cache data matching the keyword-category combination exists, generating an object-category combination through artificial intelligence-based object detection on the obtained search information; performing matching between the obtained keyword-category combination and the object-category combination and, when they match, displaying the image in which the object-category combination was detected; and mapping the keyword-category combination to the search information and storing the result as new cache data, wherein the artificial intelligence technology is at least one of a natural language processing technology and an image object recognition technology."

Most image retrieval systems of these prior inventions have a single-input, single-output structure that receives a single sentence or a single image as a query and retrieves the one image that is most similar in meaning or appearance.

Korean Patent Registration No. 10-173254
Korean Patent Publication No. 10-2021-009347

Accordingly, an object of the present invention is to provide a story-based image sequence determination method, apparatus, and image sequence determination system that, based on a trained artificial intelligence model, select an image sequence depicting a story while taking into account the context of its sentences and the visual context.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To achieve the above object, a story-based image sequence determination method according to an embodiment of the present invention is a method executed by an image sequence determination model and may include: encoding, using a story encoder, a current sentence of a text-based story based on a degree of association, computed per token of the current sentence, between the current sentence and at least one previous sentence in the story; searching a given image database for a plurality of first candidate images associated with the current sentence based on the similarity between the encoded current sentence and pre-stored images; searching the image database for a plurality of second candidate images associated with the optimal image of the sentence immediately preceding the current sentence based on scene graph similarity between images; and determining the optimal image of the current sentence based on the plurality of first candidate images and the plurality of second candidate images.

An apparatus according to another embodiment of the present invention includes a communication unit that receives a text-based story, a memory that stores an image sequence determination model trained to determine images depicting the text-based story, and a processor that controls the memory and the communication unit. The processor may: encode, using a story encoder, a current sentence of the text-based story based on a degree of association, computed per token of the current sentence, between the current sentence and at least one previous sentence in the story; search a given image database for a plurality of first candidate images associated with the current sentence based on the similarity between the encoded current sentence and pre-stored images; search the image database for a plurality of second candidate images associated with the optimal image of the sentence immediately preceding the current sentence based on scene graph similarity between images; and determine the optimal image of the current sentence based on the plurality of first candidate images and the plurality of second candidate images.

A story-based image sequence determination system according to another embodiment of the present invention may include: a user device that transmits a text-based story composed of a plurality of sentences; an image sequence determination server that, based on a trained image sequence determination model, selects the image that depicts the current sentence of the story received from the user device with the highest semantic similarity, while considering the contextual relevance between the current sentence and the previous sentences and, at the same time, maintaining the visual context of the optimal image selected for the immediately preceding sentence; and a database that stores the images used for inference by the image sequence determination model, each mapped to a predefined scene graph.

According to the story-based image sequence determination method, apparatus, and system of an embodiment of the present invention, the effect arises from configuring a pipeline in which image determination models are combined to suit the purpose of story-based image sequence retrieval.

That is, according to the story-based image sequence determination method, apparatus, and system of an embodiment of the present invention, an image sequence that best depicts a story can be produced by considering both the context of the sentences that make up the story and the visual context of the images.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a block diagram showing the configuration of a story-based image sequence determination system according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating the concept of a story-based image sequence determination system according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of an image sequence determination apparatus according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram for explaining the training process of an image sequence determination model according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram illustrating the concept of a story encoder according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a story-based image sequence determination method according to an embodiment of the present invention.
FIG. 7 is a flowchart explaining in detail part of a method of training a story-based image sequence determination model according to an embodiment of the present invention.

The objects and effects of the present invention, and the technical configurations for achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. In describing the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their structure, role, and function in the present invention, and may vary according to the intention or custom of users and operators.

However, the present invention is not limited to the embodiments disclosed below and may be implemented in a variety of different forms. These embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains; the present invention is defined solely by the scope of the claims. The definitions of the terms should therefore be based on the content of this specification as a whole.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a story-based image sequence determination system according to an embodiment of the present invention, FIG. 2 is a conceptual diagram illustrating the concept of the story-based image sequence determination system, FIG. 3 is a block diagram showing the configuration of an image sequence determination apparatus according to an embodiment of the present invention, and FIG. 5 is a conceptual diagram illustrating the concept of a story encoder according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, a story-based image sequence determination system 10 according to an embodiment of the present invention may include a user device 100, a database 200, and an image sequence determination server 300. Hereinafter, the image sequence determination server 300 may also be referred to as the image sequence determination apparatus.

The user device 100 may be a device for inputting a text-based story composed of a plurality of sentences. For example, the user device 100 may be the terminal of a user who wants to retrieve the image sequence that most closely depicts, in both meaning and appearance, a multi-sentence story that the user enters.

The user device 100 may be, for example, at least one of a smartphone, a tablet personal computer (tablet PC), a mobile phone, a video phone, an e-book reader, a desktop PC, a laptop PC, a netbook computer, a workstation, a server, a personal digital assistant (PDA), a portable multimedia player (PMP), an MP3 player, a mobile medical device, a camera, or a wearable device, but is not limited thereto.

The database 200 may store images that depict sentences, each mapped in advance to the corresponding sentence, and may store a predefined scene graph mapped to each image.

The information stored in the database 200 may be stored in the form of mapping tables. The mapping tables may form a relational database in which the tables operate in relation to one another. A relational database is a collection of data items organized into a set of formally described tables. Because relational databases are well-known technology, a detailed description is omitted.
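As a minimal sketch of how such mapping tables might look, the following uses Python's built-in `sqlite3` module. The table names, column names, and the JSON encoding of the scene graph are all illustrative assumptions; the source does not specify a schema.

```python
import json
import sqlite3

# Hypothetical schema: every table and column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE images (
    image_id INTEGER PRIMARY KEY,
    path     TEXT NOT NULL
);
CREATE TABLE image_sentences (
    image_id INTEGER NOT NULL REFERENCES images(image_id),
    sentence TEXT NOT NULL
);
CREATE TABLE scene_graphs (
    image_id   INTEGER PRIMARY KEY REFERENCES images(image_id),
    graph_json TEXT NOT NULL
);
""")


def add_image(image_id, path, sentence, scene_graph):
    """Store one image with the sentence it depicts and its scene graph."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO images VALUES (?, ?)", (image_id, path))
        conn.execute(
            "INSERT INTO image_sentences VALUES (?, ?)", (image_id, sentence)
        )
        conn.execute(
            "INSERT INTO scene_graphs VALUES (?, ?)",
            (image_id, json.dumps(scene_graph)),
        )


def load_scene_graph(image_id):
    """Return the stored scene graph for an image, or None if absent."""
    row = conn.execute(
        "SELECT graph_json FROM scene_graphs WHERE image_id = ?", (image_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping the scene graph in its own table keyed by `image_id` mirrors the one-graph-per-image mapping described in the text; a production schema would likely normalize vertices and edges further.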

The image sequence determination server 300 may access the database 200 through a communication network. When the database 200 is accessed through a communication network, the database 200 may be a server having one specific Internet Protocol (IP) address, or a plurality of servers connected through the network and having various different IP addresses.

However, the configuration is not limited thereto; for example, the database 200 may be a component included in the image sequence determination server 300.

In one embodiment, the image sequence determination server 300 may be entirely hardware, or partly hardware and partly software. For example, the image sequence determination server 300 may collectively refer to a device for exchanging data of a specific format and content by electronic communication, together with the related software. In this specification, terms such as "unit", "apparatus", and "terminal" are intended to refer to a combination of hardware and the software driven by that hardware. For example, the hardware may be a data processing device including a CPU or another processor, and the software driven by the hardware may refer to a running process, an object, an executable file, a thread of execution, a program, and the like.

Based on a trained image sequence determination model, the image sequence determination server 300 may select the image that depicts the current sentence with the highest semantic similarity, while considering the contextual relevance between the current sentence and the previous sentences and, at the same time, maintaining the visual context of the optimal image selected for the immediately preceding sentence.

To this end, the image sequence determination model is first trained.

Accordingly, as shown in FIG. 3, the learning apparatus 300 for the story-based image sequence determination model may include a communication unit 310, a memory 320, a learning unit 330, and a processor 340.

Here, the units that make up the story-based image sequence determination apparatus 300 are not necessarily intended to refer to physically separate components. That is, although the units 310 to 340 of the story-based image sequence determination apparatus 300 are shown in FIG. 3 as separate blocks, this is a functional division of the apparatus 300 according to the operations it performs.

Depending on the embodiment, some or all of the units 310 to 340 may be integrated into a single device, or one or more units may be implemented as separate devices physically distinct from the others. For example, the units of the story-based image sequence determination apparatus 300 may be components communicatively connected to one another in a distributed computing environment.

Using a story encoder, the processor 340 may encode the current sentence of a text-based story based on a degree of association, computed per token of the current sentence, between the current sentence and at least one previous sentence in the story.

To this end, the processor 340 may compute the attention between each token of the current sentence and the at least one previous sentence, and thereby determine the degree of association for each token of the current sentence.

The processor 340 may search the given image database 200 for a plurality of first candidate images associated with the current sentence, based on the similarity between the encoded current sentence and the pre-stored images.

Specifically, the processor 340 may vectorize the current sentence to generate a latent vector, compare the latent vectors of the images stored in the image database 200 with the latent vector of the current sentence using cosine similarity, and select a predetermined number of images stored in the image database 200, in descending order of similarity, as the plurality of first candidate images.
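The cosine-similarity ranking described above can be sketched as follows. This is an illustrative sketch in pure Python, not the patented implementation: the function names, the plain-list vector representation, and the dictionary of image vectors are assumptions, and a real system would use the latent vectors produced by the trained encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def top_k_candidates(query_vec, image_vecs, k):
    """Rank stored images by similarity to the query vector and keep k.

    image_vecs maps a (hypothetical) image id to its latent vector.
    Returns a list of (image_id, similarity), most similar first.
    """
    scored = [
        (image_id, cosine_similarity(query_vec, vec))
        for image_id, vec in image_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The same ranking routine applies whether the query vector comes from an encoded sentence or from an image, provided both live in the same latent space.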

Here, the story encoder reflects the contextual relevance between the current sentence and the previous sentences, from the N-th back to the first, and may obtain a shared latent space vector of the current sentence through encoding (N is a natural number of 2 or more).

The story encoder is a new sentence encoder that encodes the current sentence while reflecting its semantic relationship with the previously input sentences. As shown in FIG. 5, to encode the current sentence, the story encoder first splits it into tokens, which are word-level units of meaning.

Each token passes through a bidirectional long short-term memory network (Bi-LSTM), a type of recurrent neural network, to obtain a corresponding primary token vector.

Each primary token vector is then converted, through an attention algorithm applied against the previous sentences, into a secondary token vector that reflects its semantic relationship with those sentences.

Attention is an algorithm that learns, for a query (the target), how much of each key (the items being compared) to reflect, according to its relevance to the query, in order to produce a value (the new output).

What distinguishes the story encoder from existing sentence encoders is that it computes attention against the previous sentences for every token, and therefore considers contextual relevance to the previous sentences at the token level.

A single primary token vector (the query) is passed, together with each of the vector representations of the previous sentences (the keys), through a single neural network layer (a softmax layer) and converted into scores between 0 and 1. Each score expresses the degree of association between the token and one previous sentence; during training, the score for the sentence most likely to contain content related to the token approaches 1.

Finally, the weighted sum of the primary token vector and the scores is defined as the secondary token vector (the value). In other words, the secondary token vector is a value that comprehensively expresses how strongly one token of the current sentence is associated with each of the previous sentences.
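The token-level attention step can be illustrated with a small sketch. Two points here are assumptions rather than details from the source: a plain dot product stands in for the learned scoring layer, and the weighted sum is taken over the previous-sentence vectors (the usual attention form), which is one possible reading of the description. The token vector and the sentence vectors are assumed to have the same dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(token_vec, prev_sentence_vecs):
    """Return (scores, secondary_vec) for one primary token vector.

    Assumption: a dot product replaces the learned scoring layer, and the
    secondary vector is the score-weighted sum of the previous-sentence
    vectors.
    """
    logits = [
        sum(t * s for t, s in zip(token_vec, sent))
        for sent in prev_sentence_vecs
    ]
    scores = softmax(logits)  # one score in (0, 1) per previous sentence
    dim = len(token_vec)
    secondary = [
        sum(w * sent[i] for w, sent in zip(scores, prev_sentence_vecs))
        for i in range(dim)
    ]
    return scores, secondary
```

In a trained model the scoring layer has learnable weights, so the scores shift toward the previous sentence most relevant to each token instead of following raw dot products.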

The primary and secondary token vectors are then concatenated and passed through a new long short-term memory network (LSTM) to obtain a tertiary token vector. This process is repeated for every token in the current sentence, and the mean of the resulting tertiary token vectors is finally defined as the vector representation of the current sentence.

The processor 340 may search the image database for a plurality of second candidate images associated with the optimal image of the sentence immediately preceding the current sentence, based on scene graph similarity between images.

Here, a scene graph is a data structure whose vertices are the types of the one or more objects contained in an image and whose edges are the relationships between those objects.

For example, a scene graph may be generated using an artificial neural network-based object detection model pre-trained on large-scale digital image data. When an image is input to the object detection model, the model locates each object by a bounding box and XY coordinates and predicts, as a probability between 0 and 1 relative to the pre-training data, which type the object belongs to. For each object, the predicted type name and the predicted probability may be mapped as a key:value pair and stored in a dictionary data structure. The relationship between two objects may be predicted based on the relationship statistics of Visual Genome, the seed data for scene graphs. Because feeding a scene graph into a graph neural network yields an additional representation that implicitly encodes the objects in the image and their relationships, the scene graph can be used for the inter-image similarity comparison of the present invention.
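A scene graph of this kind could be held in a structure like the one sketched below. The class name, field names, and the `(x, y, width, height)` box layout are hypothetical choices for illustration; the source only states that vertices are object types, edges are relationships, and each object carries a predicted type name and probability in dictionary form.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # vertex id -> {"label": type name, "prob": probability, "box": (x, y, w, h)}
    objects: dict = field(default_factory=dict)
    # edges as (subject id, relation name, object id)
    relations: list = field(default_factory=list)

    def add_object(self, obj_id, label, prob, box):
        """Add a detected object as a vertex."""
        if not 0.0 <= prob <= 1.0:
            raise ValueError("prob must be in [0, 1]")
        self.objects[obj_id] = {"label": label, "prob": prob, "box": box}

    def add_relation(self, subj_id, relation, obj_id):
        """Add a directed edge between two existing vertices."""
        if subj_id not in self.objects or obj_id not in self.objects:
            raise KeyError("both endpoints must be existing objects")
        self.relations.append((subj_id, relation, obj_id))

    def triples(self):
        """Readable (subject label, relation, object label) triples."""
        return [
            (self.objects[s]["label"], r, self.objects[o]["label"])
            for s, r, o in self.relations
        ]
```

For example, an image of a dog chasing a ball would hold two vertices and one `chasing` edge; the triples view is the form typically fed to a graph encoder.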

Specifically, the processor 340 may generate a scene graph of the optimal image of the immediately preceding sentence, pass it through a graph neural network to obtain a latent vector, compare the latent vectors of the images stored in the image database 200 with the latent vector of that optimal image using cosine similarity, and select a predetermined number of images stored in the image database 200, in descending order of similarity, as the plurality of second candidate images.

The processor 340 may determine the optimal image of the current sentence based on the plurality of first candidate images and the plurality of second candidate images.

Specifically, to determine the optimal image of the current sentence, the processor 340 compares the latent vectors of the plurality of first candidate images with the latent vector of the optimal image of the immediately preceding sentence and computes a similarity for each of the first candidate images.

The processor 340 likewise compares the latent vectors of the plurality of second candidate images with the latent vector of the optimal image of the immediately preceding sentence and computes a similarity for each of the second candidate images.

The processor 340 may then average the similarities of the first candidate images and the similarities of the second candidate images, and determine the image with the highest resulting similarity as the optimal image, that is, the image that best depicts the current sentence.
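The source does not spell out exactly how the two sets of similarities are averaged, so the following is only one possible reading, sketched under stated assumptions: each candidate set is given as a dictionary from a hypothetical image id to its similarity, a candidate's final score is the mean of its two entries, and an image missing from one set contributes `missing` (0.0 by default) for that set. The function name and the tie-breaking rule are also assumptions.

```python
def select_best_image(first_scores, second_scores, missing=0.0):
    """Pick the candidate with the highest averaged similarity.

    first_scores / second_scores map image ids to the similarity computed
    for the first and second candidate sets. Assumption: an image absent
    from one set is scored `missing` for that set.
    Returns (image_id, averaged_score).
    """
    candidates = set(first_scores) | set(second_scores)
    if not candidates:
        raise ValueError("no candidate images were provided")

    def averaged(image_id):
        a = first_scores.get(image_id, missing)
        b = second_scores.get(image_id, missing)
        return (a + b) / 2.0

    # sorted() makes tie-breaking deterministic (lowest id wins a tie)
    best = max(sorted(candidates), key=averaged)
    return best, averaged(best)
```

With `missing=0.0`, an image that appears in both candidate sets is favored over one that appears in only one, which matches the stated goal of satisfying sentence meaning and visual continuity at once.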

The processor 340 may determine the optimal images by executing the above process for each sentence in order, until the current sentence is the last sentence of the text-based story. For example, five optimal images may be determined for a story composed of five sentences.

When the current sentence is the first sentence of the text-based story, the processor 340 vectorizes the first sentence to generate a latent vector and compares the latent vectors of the images stored in the image database 200 with the latent vector of the first sentence using cosine similarity. The image stored in the image database 200 with the highest similarity may then be determined as the optimal image.

The learning unit 330 may train the image sequence determination model described above.

Specifically, the learning unit 330 may hold a training dataset composed of text-based stories, each including a plurality of sentences. The plurality of sentences are input to the image sequence determination model to obtain a plurality of optimal images.

The plurality of optimal images are then restored into a text-based story using a BiLSTM and a decoder, and an arithmetic difference is calculated by comparing the restored story with the original story. The image sequence determination model may then be trained by backpropagation based on this arithmetic difference.

FIG. 4 is a conceptual diagram for explaining the training process of an image sequence determination model according to an embodiment of the present invention. An example of this training process is described below with reference to FIG. 4.

The story-based image sequence determination model according to an embodiment of the present invention may be developed through a training process consisting broadly of five stages: 'story encoding - primary classification of related images - acquisition of a graph representation of the optimal image for the immediately preceding sentence - secondary classification and final search of related images - transfer to the caption generator'.

As shown in FIG. 4, for a text story composed of three sentences S₁, S₂, and S₃, the story is first encoded in order to search for images whose visual context is semantically related to the sentence context. Because sentences and images are different kinds of cognitive information, they must be encoded into a shared latent space of the same dimension before the relevance between the two can be compared quantitatively.

In the present invention, each sentence is mapped into the shared latent space at every time step by a newly developed story encoder that combines a recurrent neural network with attention, a recent deep learning operation that calculates semantic relevance within the same information. The story encoder has been described in detail above with reference to FIG. 5.

To detect the image corresponding to the sentence applied at the current time step, all images in the training data are also mapped to vector representations by an image encoder. During actual training, the images may be mapped and stored in advance for computational efficiency. By calculating the cosine similarity between the latent vector of the sentence and the latent vectors of the images, a primary search can retrieve, for example, up to 100 related images.
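The primary search described above can be sketched as follows. This is a minimal illustration that assumes the sentence and image latent vectors already exist as plain lists of floats; the encoders themselves are not shown. The names `build_index` and `top_k`, the image identifiers, and the vectors are hypothetical. Image vectors are normalized once in advance, mirroring the remark that images may be mapped and stored ahead of time, so that each query reduces to dot products.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(image_vectors):
    """Precompute unit-length image vectors once, before any query is made."""
    return {image_id: normalize(vec) for image_id, vec in image_vectors.items()}


def top_k(query_vec, index, k=100):
    """Return up to k (image_id, cosine_similarity) pairs, most similar first."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), image_id)
        for image_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:k]]


# Hypothetical three-image database and a sentence vector pointing toward "beach".
index = build_index({"beach": [1.0, 0.0], "forest": [0.0, 1.0], "coast": [1.0, 1.0]})
results = top_k([2.0, 0.0], index, k=2)
```

For the first sentence of a story, the same routine with `k=1` yields the single most similar image.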

For the sentence of the first time step, that is, the first sentence of the input story, the single image with the highest similarity in the primary search may be selected as the final search result.

The secondary search, which is based on structural similarity between images, is applied from the second time step onward. A scene graph is generated for the image finally selected in the previous step, and, as before, 100 images corresponding to the sentence of the current time step are retrieved in the primary search. Graph latent representations are then obtained for these 100 images, and each image is ranked by the cosine similarity between its graph latent representation and that of the previous step. This ranking is averaged with the primary search ranking to produce a new ranking. This use of graph latent representations constitutes the secondary classification, and the image ranked highest by the secondary classification is finally selected.
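The rank averaging in the secondary classification can be sketched as follows. This is an illustrative reading of the paragraph above, not the patented implementation: `rerank_by_average_rank` and its inputs are hypothetical names, and breaking ties in favor of the primary search rank is an assumption, since the specification does not say how ties are resolved.

```python
def rerank_by_average_rank(text_ranked, graph_scores):
    """Combine the primary (text) ranking with a graph-similarity ranking.

    text_ranked: image ids ordered by primary search similarity, best first.
    graph_scores: image id -> cosine similarity between that image's graph
        latent representation and the one from the previous time step.
    Returns the image ids ordered by the average of the two ranks, best first.
    """
    text_rank = {img: rank for rank, img in enumerate(text_ranked, start=1)}

    # Rank the same candidates by graph similarity (rank 1 = most similar).
    graph_order = sorted(text_ranked, key=lambda img: graph_scores[img], reverse=True)
    graph_rank = {img: rank for rank, img in enumerate(graph_order, start=1)}

    average = {img: (text_rank[img] + graph_rank[img]) / 2 for img in text_ranked}

    # Lower average rank is better; ties fall back to the text rank (assumption).
    return sorted(text_ranked, key=lambda img: (average[img], text_rank[img]))


# Hypothetical candidates: "a" is best by text, "b" is best by graph similarity.
new_order = rerank_by_average_rank(["a", "b", "c"], {"a": 0.1, "b": 0.9, "c": 0.5})
best_image = new_order[0]
```

In this example "b" has ranks 2 and 1 (average 1.5), "a" has ranks 1 and 3 (average 2.0), and "c" has ranks 3 and 2 (average 2.5), so "b" is selected.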

By repeating the above process up to the third time step, three images that reflect both the sentence context and the structural similarity between images can finally be obtained for the story of three sentences.

Finally, the three selected images are input to a caption generator to restore the originally input story. The caption generator may use the architecture of GLACNet (2018).

The higher the similarity among the three selected images, the easier it is for the caption generator to restore the original story. The difference between the story restored by the caption generator and the original story is calculated arithmetically through a probabilistic and statistical comparison.
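One common way to compute such a difference is the mean negative log-likelihood (cross-entropy) of the original tokens under the caption generator's predicted token distributions. The specification only refers to a "probabilistic and statistical comparison", so the choice of cross-entropy here is an assumption, and `reconstruction_loss` and the toy distributions are hypothetical.

```python
from math import log


def reconstruction_loss(predicted_dists, original_tokens, eps=1e-12):
    """Mean negative log-likelihood of the original story tokens.

    predicted_dists: one dict per token position, mapping token -> probability,
        standing in for the caption generator's output (hypothetical format).
    original_tokens: the tokens of the original story, in order.
    """
    if not original_tokens or len(predicted_dists) != len(original_tokens):
        raise ValueError("need one predicted distribution per original token")

    total = 0.0
    for dist, token in zip(predicted_dists, original_tokens):
        # eps guards against log(0) when the original token got no probability.
        total += -log(max(dist.get(token, 0.0), eps))
    return total / len(original_tokens)


# The generator is certain about "the" and undecided between "dog" and "cat".
loss = reconstruction_loss(
    [{"the": 1.0}, {"dog": 0.5, "cat": 0.5}],
    ["the", "dog"],
)
```

The loss is zero only when every original token receives probability 1, and it grows as the restored story drifts away from the original.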

Based on this calculation, a loss function for training the image sequence determination model of the present invention can be generated from the arithmetic difference between the secondary classification result and the actual image sequence. Since the image sequence determination model is based on an artificial neural network, training may be performed by backpropagation based on this loss function.

FIG. 6 is a flowchart for explaining a story-based image sequence determination method according to an embodiment of the present invention.

The story-based image sequence determination method according to an embodiment of the present invention may be carried out in substantially the same configuration as the story-based image sequence determination apparatus 300 of FIG. 3. Accordingly, components identical to those of the apparatus 300 of FIG. 3 are given the same reference numerals, and repeated descriptions are omitted.

The story-based image sequence determination method according to this embodiment may also be executed by software (an application).

Referring to FIG. 6, the story-based image sequence determination method executed by an artificial intelligence model-based image sequence determination apparatus according to an embodiment proceeds as follows.

First, using a story encoder, the current sentence of a text-based story may be encoded, token by token, based on the degree of association between the current sentence and at least one previous sentence in the story (S110).

To this end, the method may include calculating attention between each token of the current sentence and the at least one previous sentence to determine the degree of association for each token of the current sentence.
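The token-level attention in step S110 can be illustrated with a small sketch. The specification does not give the attention formula, so scaled dot-product attention is assumed here, with each current-sentence token acting as a query and each previous-sentence vector acting as both key and value. The function names and the toy vectors are hypothetical.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_previous(token_vecs, prev_sentence_vecs):
    """Scaled dot-product attention from current tokens to previous sentences.

    token_vecs: one vector per token of the current sentence (queries).
    prev_sentence_vecs: one vector per previous sentence (keys and values).
    Returns (weights, contexts): per-token association weights over the previous
    sentences, and the weighted sum of previous-sentence vectors per token.
    """
    dim = len(prev_sentence_vecs[0])
    weights, contexts = [], []
    for tok in token_vecs:
        scores = [
            sum(t * p for t, p in zip(tok, prev)) / sqrt(dim)
            for prev in prev_sentence_vecs
        ]
        w = softmax(scores)
        weights.append(w)
        contexts.append(
            [sum(wi * prev[i] for wi, prev in zip(w, prev_sentence_vecs)) for i in range(dim)]
        )
    return weights, contexts


# One token that aligns with the first of two previous sentences.
weights, contexts = attend_to_previous([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Each row of `weights` plays the role of the per-token degree of association; a story encoder would then combine `contexts` with the token representations, for example inside the recurrent network.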

Next, a plurality of first candidate images associated with the current sentence may be retrieved from a given image database based on the similarity between the encoded current sentence and the pre-stored images (S120).

To this end, the image sequence determination apparatus may execute the steps of vectorizing the current sentence to generate a latent vector, comparing the latent vectors of the images stored in the image database with the latent vector of the current sentence using cosine similarity, and selecting a predetermined number of images from the image database, in descending order of similarity, as the plurality of first candidate images.

Next, a plurality of second candidate images associated with the optimal image for the sentence immediately preceding the current sentence may be retrieved from the image database based on scene graph similarity between images (S130).

To this end, the image sequence determination apparatus may execute the steps of generating a scene graph of the optimal image for the immediately preceding sentence and passing it through a graph neural network to obtain a latent vector, comparing this latent vector with the latent vectors of the images stored in the image database using cosine similarity, and selecting a predetermined number of images from the image database, in descending order of similarity, as the plurality of second candidate images.

Next, the optimal image for the current sentence may be determined based on the plurality of first candidate images and the plurality of second candidate images (S140).

Specifically, this may include the steps of comparing the latent vector of each of the plurality of first candidate images with the latent vector of the optimal image for the immediately preceding sentence to calculate a similarity for each first candidate image, comparing the latent vector of each of the plurality of second candidate images with the latent vector of the optimal image for the immediately preceding sentence to calculate a similarity for each second candidate image, and averaging the similarities of the first candidate images and of the second candidate images to determine the image with the highest similarity as the optimal image.

Meanwhile, when the current sentence is the first sentence of the text-based story, the first sentence is vectorized to generate a latent vector, this vector is compared with the latent vectors of the images stored in the image database using cosine similarity, and the image with the highest similarity among the stored images may be determined as the optimal image.

The above steps (S110 to S140) are then repeated sequentially for each sentence until the current sentence is the last sentence of the text-based story, so that one or more optimal images depicting the story can be determined.
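The overall control flow of steps S110 to S140 can be sketched as a loop over the sentences of a story. Only the sequencing comes from the description above; the four callables (`encode`, `search_by_text`, `search_by_graph`, `pick_best`) are hypothetical placeholders for the story encoder, the two retrieval steps, and the final selection, and the stub implementations in the usage example exist only to make the sketch runnable.

```python
def determine_image_sequence(sentences, encode, search_by_text, search_by_graph, pick_best):
    """Select one image per sentence, in story order.

    encode(sentence, previous_sentences) -> latent vector of the sentence   (S110)
    search_by_text(vector) -> first candidate image ids, best first         (S120)
    search_by_graph(previous_image) -> second candidate image ids           (S130)
    pick_best(first, second, previous_image) -> chosen image id             (S140)
    """
    selected = []
    for i, sentence in enumerate(sentences):
        vector = encode(sentence, sentences[:i])
        first = search_by_text(vector)

        if i == 0:
            # First sentence: no previous image exists, so take the top text match.
            selected.append(first[0])
            continue

        second = search_by_graph(selected[-1])
        selected.append(pick_best(first, second, selected[-1]))
    return selected


# Stub components (placeholders only) to show the call pattern.
sequence = determine_image_sequence(
    ["s1", "s2", "s3"],
    encode=lambda sentence, previous: sentence,
    search_by_text=lambda vector: [vector + "_img", "fallback_img"],
    search_by_graph=lambda previous_image: [previous_image + "_neighbor"],
    pick_best=lambda first, second, previous_image: first[0],
)
```

Passing the components as arguments keeps the loop independent of any particular encoder or similarity measure, which matches the way the description treats each step as a separate stage.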

FIG. 7 is a flowchart for explaining a method of training an image sequence determination model according to an embodiment of the present invention.

Referring to FIG. 7, a training dataset composed of text-based stories, each including a plurality of sentences, may first be provided (S210).

Next, the plurality of sentences may be input to the image sequence determination model to obtain a plurality of optimal images (S220).

Next, the plurality of optimal images may be restored into a text-based story using a BiLSTM and a decoder (S230).

Next, the restored story may be compared with the original story to calculate an arithmetic difference, and the image sequence determination model may be trained by backpropagation based on this arithmetic difference (S240).

According to the story-based image sequence determination model training method, apparatus, and determination system of the embodiments of the present invention, the effect is achieved by constructing a pipeline that combines previously published image determination models to suit the purpose of story-based image sequence retrieval.

The story-based image sequence determination model training method, apparatus, and determination system described above may be implemented by a computing device that includes at least some of a processor, a memory, a user input device, and a presentation device. The memory is a medium that stores computer-readable software, applications, program modules, routines, instructions, and/or data coded to perform a specific task when executed by the processor. The processor may read and execute the computer-readable software, applications, program modules, routines, instructions, and/or data stored in the memory.

The user input device may be a means by which the user inputs a command that causes the processor to execute a specific task, or inputs data required to execute that task. The user input device may include a physical or virtual keyboard or keypad, key buttons, a mouse, a joystick, a trackball, a touch-sensitive input means, or a microphone. The presentation device may include a display, a printer, a speaker, or a vibration device.

The computing device may be any of various devices such as a smartphone, tablet, laptop, desktop, server, or client. The computing device may be a single stand-alone device, or may include multiple computing devices operating in a distributed environment in which they cooperate with one another over a communication network.

The story-based image sequence determination method, apparatus, and determination system described above may also be executed by a computing device that has a processor and a memory storing computer-readable software, applications, program modules, routines, instructions, and/or data structures coded to perform the story-based image sequence determination method and retrieval method using an artificial intelligence model when executed by the processor.

The embodiments described above may be implemented by various means. For example, they may be implemented in hardware, firmware, software, or a combination thereof.

In the case of a hardware implementation, the image sequence determination method using an artificial intelligence model according to the embodiments may be implemented by one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), processors, controllers, microcontrollers, or microprocessors.

For example, the story-based image sequence determination method and training method according to the embodiments may be implemented using an artificial intelligence semiconductor device in which the neurons and synapses of a deep neural network are implemented as semiconductor elements. The semiconductor elements may be currently used elements such as SRAM, DRAM, or NAND, next-generation elements such as RRAM, STT-MRAM, or PRAM, or a combination of these.

When the story-based image sequence determination method according to the embodiments is implemented using an artificial intelligence semiconductor device, the result (weights) of training the artificial intelligence model in software may be transferred to synapse-mimicking elements arranged in an array, or training may be performed on the artificial intelligence semiconductor device itself.

In the case of a firmware or software implementation, the image sequence determination method according to the embodiments may be implemented in the form of a device, procedure, or function that performs the functions or operations described above. The software code may be stored in a memory unit and run by a processor. The memory unit may be located inside or outside the processor and may exchange data with the processor by various known means.

In addition, terms such as "system", "processor", "controller", "component", "module", "interface", "model", and "unit" used above may generally refer to computer-related entity hardware, a combination of hardware and software, software, or software in execution. For example, such a component may be, but is not limited to, a process run by a processor, a processor, a controller, a control processor, an object, a thread of execution, a program, and/or a computer. For example, both an application running on a controller or processor and the controller or processor itself can be components. One or more components may reside within a process and/or thread of execution, and components may be located on one device (for example, a system or a computing device) or distributed across two or more devices.

Meanwhile, another embodiment provides a computer program, stored on a computer recording medium, that performs the image sequence determination method described above. Yet another embodiment provides a computer-readable recording medium on which a program for realizing the image sequence determination method and retrieval method described above is recorded.

The program recorded on the recording medium can execute the steps described above by being read, installed, and run on a computer. For the computer to read the program recorded on the recording medium and execute the functions implemented by the program, the program may include code written in a computer language such as C, C++, Java, or machine language that the computer's processor (CPU) can read through the computer's device interface.

Such code may include functional code related to the functions that define the operations described above, and may include control code related to the execution procedure that the computer's processor needs in order to execute those operations in a predetermined order.

Such code may further include memory-reference code indicating the location (address) in the computer's internal or external memory at which additional information or media needed by the processor to execute the above functions should be referenced.

In addition, when the computer's processor needs to communicate with another remote computer or server in order to execute the above functions, the code may further include communication-related code specifying how the processor should communicate with the remote computer or server using the computer's communication module, and what information or media should be transmitted and received during that communication.

Examples of computer-readable recording media on which such a program is recorded include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical media storage devices.

The computer-readable recording medium may also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner.

Functional programs for implementing the present invention, and the code and code segments related to them, may be readily inferred or modified by programmers in the technical field to which the present invention belongs, taking into account the system environment of the computer that reads the recording medium and executes the program.

The image sequence determination method may also be implemented in the form of a recording medium containing computer-executable instructions, such as an application or program module executed by a computer. Computer-readable media can be any available media that can be accessed by a computer, and include volatile and non-volatile media as well as removable and non-removable media. Computer-readable media may also include all computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The image sequence determination method described above may be executed by an application installed on a terminal by default (which may include a program included in a platform or operating system preinstalled on the terminal), or by an application (that is, a program) that the user installs directly on the master terminal through an application providing server such as an application store server or a web server related to the application or the corresponding service. In this sense, the image sequence determination method described above may be implemented as an application (that is, a program) installed on the terminal by default or installed directly by the user, and may be recorded on a computer-readable recording medium of the terminal or the like.

이상, 본 발명의 특정 실시예에 대하여 상술하였다. 그러나, 본 발명의 사상 및 범위는 이러한 특정 실시예에 한정되는 것이 아니라, 본 발명의 요지를 변경하지 않는 범위 내에서 다양하게 수정 및 변형이 가능하다는 것을 본 발명이 속하는 기술 분야에서 통상의 지식을 가진 자라면 이해할 것이다.In the above, specific embodiments of the present invention have been described in detail. However, the spirit and scope of the present invention is not limited to these specific embodiments, and it is common knowledge in the technical field to which the present invention belongs that various modifications and variations are possible without changing the gist of the present invention. Anyone who has it will understand.

Therefore, the embodiments described above are provided to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains. They should be understood as illustrative in all respects and not restrictive, and the present invention is defined only by the scope of the claims.

Claims

A story-based image sequence determination method executed by an artificial intelligence model-based image sequence determination apparatus, the method comprising:
encoding, using a story encoder, a current sentence of a text-based story based on a degree of association, in units of tokens of the current sentence, between the current sentence and at least one previous sentence in the story;
searching a given image database for a plurality of first candidate images associated with the current sentence, based on a similarity between the encoded current sentence and pre-stored images;
searching the image database for a plurality of second candidate images associated with an optimal image of the sentence immediately preceding the current sentence, based on a scene graph similarity between images; and
determining an optimal image of the current sentence based on the plurality of first candidate images and the plurality of second candidate images.

The method of claim 1,
wherein encoding the current sentence comprises calculating attention between each token of the current sentence and the at least one previous sentence, and determining the degree of association in units of tokens of the current sentence.
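As an illustration of the token-level association recited in the claim above, the following is a minimal sketch. It assumes scaled dot-product attention, one summary vector per previous sentence, and a simple additive mix of context into each token; none of these choices is specified by the claim, and the names `token_level_association` and `encode_with_context` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_association(
    token_vecs: List[Vector],          # one embedding per token of the current sentence
    prev_sentence_vecs: List[Vector],  # one embedding per previous sentence (assumption)
) -> List[List[float]]:
    """For each token, return attention weights over the previous sentences."""
    dim = len(token_vecs[0])
    weights = []
    for tok in token_vecs:
        # Scaled dot product between the token and each previous sentence.
        scores = [
            sum(t * p for t, p in zip(tok, prev)) / math.sqrt(dim)
            for prev in prev_sentence_vecs
        ]
        weights.append(softmax(scores))
    return weights


def encode_with_context(
    token_vecs: List[Vector], prev_sentence_vecs: List[Vector]
) -> List[Vector]:
    """Add to each token the attention-weighted mix of previous-sentence vectors."""
    weights = token_level_association(token_vecs, prev_sentence_vecs)
    encoded = []
    for tok, w in zip(token_vecs, weights):
        context = [
            sum(w_j * prev[d] for w_j, prev in zip(w, prev_sentence_vecs))
            for d in range(len(tok))
        ]
        encoded.append([t + c for t, c in zip(tok, context)])
    return encoded
```

Each row of the returned weight matrix sums to one, so a token that is strongly related to one earlier sentence draws most of its context from that sentence.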

The method of claim 1,
wherein searching for the plurality of first candidate images comprises:
generating a latent vector by vectorizing the current sentence;
analyzing, based on cosine similarity, latent vectors of images stored in the image database and the latent vector of the current sentence; and
selecting, from among the images stored in the image database, a predetermined number of images as the plurality of first candidate images in descending order of similarity according to a result of the analysis.
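The retrieval step in the claim above can be sketched as a cosine-similarity ranking followed by a top-k cut. This is a minimal sketch, assuming the sentence and image latent vectors already live in a shared space and are held in an in-memory dictionary; the names `cosine_similarity` and `top_k_candidates` and the dictionary layout are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_candidates(
    query_vec: List[float],
    image_vecs: Dict[str, List[float]],  # image id -> latent vector (assumed layout)
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k images most similar to the query, highest similarity first."""
    scored = [
        (image_id, cosine_similarity(query_vec, vec))
        for image_id, vec in image_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Here `k` plays the role of the "predetermined number" of candidates; a production system would likely replace the linear scan with an approximate nearest-neighbor index.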

The method of claim 1,
wherein the scene graph is a data structure having at least one type of object included in an image as a vertex and an action relationship between the objects as an edge.
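To make the scene graph definition above concrete, here is a minimal sketch of such a data structure, with objects as vertices and action relationships as labeled, directed edges. The class name `SceneGraph`, its fields, and the example labels are assumptions for illustration and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SceneGraph:
    """Objects in an image (vertices) and action relations between them (edges)."""

    objects: Set[str] = field(default_factory=set)
    # Each edge is (subject, action, object), e.g. ("girl", "holding", "umbrella").
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_relation(self, subject: str, action: str, obj: str) -> None:
        """Add an action edge, registering both endpoints as vertices."""
        self.objects.update((subject, obj))
        self.relations.append((subject, action, obj))

    def neighbors(self, name: str) -> List[Tuple[str, str]]:
        """Return (action, object) pairs for edges that start at `name`."""
        return [(action, obj) for subj, action, obj in self.relations if subj == name]


graph = SceneGraph()
graph.add_relation("girl", "holding", "umbrella")
graph.add_relation("girl", "walking on", "street")
```

In this example the vertices are `girl`, `umbrella`, and `street`, and the two edges record what the girl is doing with respect to the other objects.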

The method of claim 1,
wherein searching for the plurality of second candidate images comprises:
generating a scene graph of the optimal image of the previous sentence and passing the scene graph through a graph neural network to obtain a latent vector;
analyzing, based on cosine similarity, latent vectors of images stored in the image database and the latent vector of the optimal image of the previous sentence; and
selecting, from among the images stored in the image database, a predetermined number of images as the plurality of second candidate images in descending order of similarity according to a result of the analysis.
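The step of turning a scene graph into a single latent vector can be illustrated with one round of message passing followed by mean pooling. This is only a stand-in under stated assumptions: a real graph neural network would use learned weights and typically several layers, edge labels are ignored here, and the function name `graph_latent_vector` and its inputs are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def graph_latent_vector(
    node_feats: Dict[str, Vector],  # object name -> feature vector (assumed given)
    edges: List[Tuple[str, str]],   # (subject, object) pairs; action labels omitted
) -> Vector:
    """One untrained message-passing round, then mean pooling over all nodes."""
    # Treat edges as undirected when collecting neighbors.
    neighbors: Dict[str, List[str]] = {name: [] for name in node_feats}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    updated = []
    for name, feat in node_feats.items():
        messages = [node_feats[n] for n in neighbors[name]]
        aggregated = mean(messages) if messages else [0.0] * len(feat)
        # Blend each node's own features with its aggregated neighborhood.
        updated.append([0.5 * (f + a) for f, a in zip(feat, aggregated)])

    # Graph-level readout: average the updated node vectors.
    return mean(updated)
```

The resulting vector would then be compared with the latent vectors of database images by cosine similarity, as the claim recites, to pick the second candidate images.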

The method of any one of claims 3 and 5,
wherein determining the optimal image of the current sentence comprises:
calculating a similarity for each of the plurality of first candidate images by performing a similarity analysis between the latent vector of each first candidate image and the latent vector of the optimal image of the previous sentence;
calculating a similarity for each of the plurality of second candidate images by performing a similarity analysis between the latent vector of each second candidate image and the latent vector of the optimal image of the previous sentence; and
determining, as the optimal image, the image having the highest similarity obtained by averaging the similarity of each of the plurality of first candidate images and the similarity of each of the plurality of second candidate images.
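The claim above leaves open exactly how the two sets of similarities are averaged. The sketch below shows one possible reading, stated as an assumption: each candidate carries the score computed in its own branch, an image found by both branches receives the mean of its two scores, and the image with the highest resulting score is chosen. The function name `select_optimal_image` and the dictionary inputs are hypothetical.

```python
from typing import Dict


def select_optimal_image(
    first_scores: Dict[str, float],   # first candidates -> similarity to previous optimal image
    second_scores: Dict[str, float],  # second candidates -> similarity to previous optimal image
) -> str:
    """Average the available scores per image and return the best image id."""
    combined: Dict[str, float] = {}
    for image_id in set(first_scores) | set(second_scores):
        scores = [
            table[image_id]
            for table in (first_scores, second_scores)
            if image_id in table
        ]
        combined[image_id] = sum(scores) / len(scores)
    # Pick the image whose averaged similarity is highest.
    return max(combined, key=combined.get)
```

For example, with first-branch scores `{"a": 0.6, "b": 0.8}` and second-branch scores `{"b": 0.9, "c": 0.7}`, image `b` is selected because its averaged score of 0.85 exceeds the single-branch scores of `a` and `c`.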

The method of claim 1, further comprising,
when the current sentence is the first sentence of the text-based story:
generating a latent vector by vectorizing the first sentence;
analyzing, based on cosine similarity, latent vectors of images stored in the image database and the latent vector of the first sentence; and
determining, as the optimal image, the image having the highest similarity according to a result of the analysis among the images stored in the image database.

The method of claim 1,
wherein determining the optimal image of the current sentence comprises determining the optimal image for each sentence sequentially until the current sentence is the last sentence of the text-based story.
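The sentence-by-sentence procedure in the claim above can be sketched as a simple driver loop. This is a minimal sketch, assuming the per-sentence selection logic is supplied as two callables; `determine_image_sequence`, `pick_first`, and `pick_next` are hypothetical names, and images are represented by string identifiers for brevity.

```python
from typing import Callable, List


def determine_image_sequence(
    sentences: List[str],
    pick_first: Callable[[str], str],
    pick_next: Callable[[str, List[str], str], str],
) -> List[str]:
    """Select one image per sentence, in order, until the last sentence.

    pick_first(sentence) handles the first sentence, which has no previous image.
    pick_next(sentence, previous_sentences, previous_image) handles every later
    sentence, using the story so far and the image chosen for the prior sentence.
    """
    images: List[str] = []
    for index, sentence in enumerate(sentences):
        if index == 0:
            images.append(pick_first(sentence))
        else:
            images.append(pick_next(sentence, sentences[:index], images[-1]))
    return images
```

Passing the previously chosen image into each later step is what lets the selection keep visual context from one sentence to the next.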

The method of claim 1,
further comprising training the artificial intelligence model,
wherein training the artificial intelligence model comprises:
providing a training dataset composed of text-based stories each including a plurality of sentences;
acquiring a plurality of optimal images by inputting the plurality of sentences to the artificial intelligence model;
restoring the plurality of optimal images into a text-based story using a BiLSTM and a decoder;
comparing the restored story with the original story to calculate an arithmetic difference; and
training the image sequence determination model by backpropagation based on the arithmetic difference.
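The training procedure in the claim above can be summarized symbolically. The notation is an assumption introduced here for illustration: $S$ is the original story, $f_\theta$ is the image sequence determination model, $g_\phi$ is the BiLSTM and decoder that restore a story from the selected images, $d$ is a difference measure that the claim describes only as an "arithmetic difference", and $\eta$ is a learning rate.

```latex
\hat{S} = g_\phi\bigl(f_\theta(S)\bigr), \qquad
\mathcal{L}(\theta, \phi) = d\bigl(\hat{S},\, S\bigr), \qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta, \phi)
```

The claim does not state how gradients pass through the discrete image selection, so that detail is left open in this summary.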

A story-based image sequence determination apparatus comprising:
a communication unit configured to receive a text-based story;
a memory configured to store an artificial intelligence model trained to determine an image depicting a text-based story; and
a processor configured to control the memory and the communication unit,
wherein the processor is configured to:
encode, using a story encoder, a current sentence of the text-based story based on a degree of association, in units of tokens of the current sentence, between the current sentence and at least one previous sentence in the story;
search a given image database for a plurality of first candidate images associated with the current sentence, based on a similarity between the encoded current sentence and pre-stored images;
search the image database for a plurality of second candidate images associated with an optimal image of the sentence immediately preceding the current sentence, based on a scene graph similarity between images; and
determine an optimal image of the current sentence based on the plurality of first candidate images and the plurality of second candidate images.

The apparatus of claim 10,
wherein the processor is configured to calculate attention between each token of the current sentence and the at least one previous sentence to determine the degree of association in units of tokens of the current sentence.

The apparatus of claim 10,
wherein the processor is configured to:
vectorize the current sentence to generate a latent vector;
analyze, based on cosine similarity, latent vectors of images stored in the image database and the latent vector of the current sentence; and
select, from among the images stored in the image database, a predetermined number of images as the plurality of first candidate images in descending order of similarity according to a result of the analysis.

The apparatus of claim 10,
wherein the scene graph is a data structure having at least one type of object included in an image as a vertex and an action relationship between the objects as an edge.

The apparatus of claim 10,
wherein the processor is configured to:
generate a scene graph of the optimal image of the previous sentence and pass the scene graph through a graph neural network to obtain a latent vector;
analyze, based on cosine similarity, latent vectors of images stored in the image database and the latent vector of the optimal image of the previous sentence; and
select, from among the images stored in the image database, a predetermined number of images as the plurality of second candidate images in descending order of similarity according to a result of the analysis.

The apparatus of any one of claims 13 and 15,
wherein the processor is configured to:
calculate a similarity for each of the plurality of first candidate images by performing a similarity analysis between the latent vector of each first candidate image and the latent vector of the optimal image of the previous sentence;
calculate a similarity for each of the plurality of second candidate images by performing a similarity analysis between the latent vector of each second candidate image and the latent vector of the optimal image of the previous sentence; and
determine, as the optimal image, the image having the highest similarity obtained by averaging the similarity of each of the plurality of first candidate images and the similarity of each of the plurality of second candidate images.

The apparatus of claim 10,
wherein the processor is configured to,
when the current sentence is the first sentence of the text-based story:
vectorize the first sentence to generate a latent vector;
analyze, based on cosine similarity, latent vectors of images stored in the image database and the latent vector of the first sentence; and
determine, as the optimal image, the image having the highest similarity according to a result of the analysis among the images stored in the image database.

The apparatus of claim 10,
wherein the processor is configured to determine the optimal image for each sentence sequentially until the current sentence is the last sentence of the text-based story.

The apparatus of claim 10,
further comprising a learning unit configured to train the artificial intelligence model,
wherein the learning unit is configured to:
receive a training dataset composed of text-based stories each including a plurality of sentences;
input the plurality of sentences to the artificial intelligence model to obtain a plurality of optimal images;
restore the plurality of optimal images into a text-based story using a BiLSTM and a decoder;
compare the restored story with the original story to calculate an arithmetic difference; and
train the image sequence determination model by backpropagation based on the arithmetic difference.

A story-based image sequence determination system comprising:
a user device configured to transmit a story composed of a plurality of text-based sentences;
an image sequence determination server configured to select, based on a trained image sequence determination model, the image that depicts the meaning of a current sentence of the story input from the user device with the highest similarity, by considering the contextual relevance between the current sentence and previous sentences while maintaining visual context with the optimal image selected for the previous sentence; and
a database configured to store, in a mapped manner, images used for inference of the image sequence determination model and a predefined scene graph for each image.