KR20210085694A - Apparatus for image captioning and method thereof

Info

Publication number: KR20210085694A
Application number: KR1020190179027A
Authority: KR
Inventors: 류은석; 박은수
Original assignee: 한국전력공사; 성균관대학교산학협력단
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2021-07-08

Abstract

The present invention relates to an image captioning apparatus, including: a word embedding hierarchy unit for storing learned words for image captioning; an image feature extraction unit for receiving an image to extract a feature of the image according to a previously learned feature extraction scheme; a behavior recognition model unit for identically receiving the image to classify a behavior feature from the image according to a previously learned behavior recognition scheme; a caption driving unit for extracting associated words corresponding to the feature extracted by the image feature extraction unit from the word embedding hierarchy unit to perform captioning; a morphology analysis unit for analyzing a part-of-speech for each word in a sentence captioned by the caption driving unit; and a verb changing unit for checking whether a verb in the captioned sentence is a verb matching a behavior of the image extracted by the behavior recognition model unit, and changing the verb in the captioned sentence into the verb corresponding to the behavior to obtain a final captioning sentence when a specified check condition is not satisfied.

Description

Image captioning apparatus and method {APPARATUS FOR IMAGE CAPTIONING AND METHOD THEREOF}

The present invention relates to an image captioning apparatus and method, and more particularly, to an image captioning apparatus and method specialized for describing a user's recognized behavior on the basis of natural language processing.

Recently, as the use of high-performance graphics processing units (GPUs) has greatly increased the amount of computation that can be handled, deep learning techniques, which require a large amount of computation to recognize patterns, have been studied continuously.

For example, with the development of neural networks such as convolutional neural networks (CNNs), image processing research such as object recognition and image classification is progressing rapidly. As deep learning has been applied in fields such as healthcare, research on understanding images that contain people and on situation recognition has advanced in depth, and the importance of deep-learning-based image captioning has come to the fore.

Here, image captioning is a technique that extracts features from an input image through a convolutional neural network and maps them to a learned word feature space to produce a description of the input image. It is one of the research areas closest to image understanding and situation recognition.

However, an existing image captioning model can have great difficulty predicting a behavior from a single image. For example, with an existing model, the same behavior may be predicted differently. As shown in FIG. 1, when an image of two people arm wrestling at a table is input to an image captioning apparatus, the image may be captioned inconsistently as either "Two men are doing arm wrestling" or "Two men are shaking hands".

The background art of the present invention is disclosed in Korean Patent Laid-Open Publication No. 10-2017-0007747 (published January 20, 2017, natural language image search technique).

According to one aspect, the present invention was devised to solve the above problem, and an object thereof is to provide an image captioning apparatus and method specialized for describing a user's recognized behavior on the basis of natural language processing.

An image captioning apparatus according to one aspect of the present invention includes: a word embedding layer unit that stores learned words for image captioning; an image feature extraction unit that receives an image and extracts features of the image according to a previously learned feature extraction scheme; a behavior recognition model unit that receives the same image and classifies behavior features in the image according to a previously learned behavior recognition scheme; a caption driving unit that performs captioning by extracting, from the word embedding layer unit, associated words corresponding to the features extracted by the image feature extraction unit; a morphology analysis unit that analyzes the part of speech of each word in the sentence captioned by the caption driving unit; and a verb changing unit that checks whether a verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit and, when a specified check condition is not satisfied, changes the verb in the captioned sentence to the verb corresponding to the behavior to obtain a final captioning sentence.

In the present invention, the caption driving unit includes a plurality of long short-term memory (LSTM) units.

In the present invention, the verb changing unit checks whether the loss value of a specified word is smaller than a specified value, whether the part of speech of the specified word is a verb, and whether the behavior recognition accuracy obtained by the behavior recognition model unit is greater than a specified criterion. When all of the check conditions are satisfied, it changes the verb in the final caption to the word corresponding to the behavior recognized by the behavior recognition model unit.

In the present invention, in order to change or add words corresponding to a plurality of parts of speech in the captioning sentence, the apparatus further includes: a face recognition model that recognizes a face in the image; and a location data model that recognizes a location using location data of the camera that captured the image. The apparatus is implemented to add the name of the face-recognized user at a predetermined position in the captioning sentence and to further add the name of the location data (Location) at a predetermined position in the captioning sentence.

An image captioning method according to another aspect of the present invention includes: storing learned words for image captioning in a word embedding layer unit; receiving, by an image feature extraction unit, an image and extracting features of the image according to a previously learned feature extraction scheme; receiving, by a behavior recognition model unit, the same image and classifying behavior features in the image according to a previously learned behavior recognition scheme; performing captioning, by a caption driving unit, by extracting from the word embedding layer unit associated words corresponding to the features extracted by the image feature extraction unit; analyzing, by a morphology analysis unit, the part of speech of each word in the sentence captioned by the caption driving unit; and checking, by a verb changing unit, whether a verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit and, when a specified check condition is not satisfied, changing the verb in the captioned sentence to the verb corresponding to the behavior to obtain a final captioning sentence.

In the present invention, in the step of checking whether the verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit, the verb changing unit checks whether the loss value of a specified word is smaller than a specified value, whether the part of speech of the specified word is a verb, and whether the behavior recognition accuracy obtained by the behavior recognition model unit is greater than a specified criterion. When all of the check conditions are satisfied, it changes the verb in the final caption to the word corresponding to the behavior recognized by the behavior recognition model unit.

In the present invention, in order to change or add words corresponding to a plurality of parts of speech in the captioning sentence, the method further includes: recognizing a face in the image through a face recognition model; and recognizing a location through a location data model using location data of the camera that captured the image. The method is implemented to add the name of the face-recognized user at a predetermined position in the captioning sentence and to further add the name of the location data (Location) at a predetermined position in the captioning sentence.

According to one aspect, the present invention relates to an image captioning apparatus and method specialized for describing a user's recognized behavior on the basis of natural language processing, and has the effect of improving the accuracy of image captioning.

FIG. 1 is an exemplary diagram illustrating the problem of conventional image captioning.
FIG. 2 is an exemplary diagram showing a schematic configuration of an image captioning apparatus according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating the verb replacement process of the verb changing unit in FIG. 2.
FIG. 4 is an exemplary diagram showing an algorithm that checks whether the conditions for changing the verb in FIG. 3 are satisfied.
FIG. 5 is an exemplary diagram showing the types of verb tags of the natural language processing module (NLTK) used for part-of-speech analysis in FIG. 2.
FIG. 6 is an exemplary diagram showing a schematic configuration of a captioning apparatus that can change or add words corresponding to a plurality of parts of speech according to another embodiment of the present invention.

Hereinafter, an embodiment of an image captioning apparatus and method according to the present invention will be described with reference to the accompanying drawings.

In this description, the thickness of lines and the size of components shown in the drawings may be exaggerated for clarity and convenience of explanation. In addition, the terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of a user or operator. The definitions of these terms should therefore be based on the content of this specification as a whole.

As described above, a conventional image captioning process that takes a single image as input has difficulty inferring elements of human behavior that require a temporal component (i.e., actions carried out over time). To solve this problem, the present embodiment performs captioning with the additional help of a behavior recognition model, thereby improving the accuracy of image captioning.

FIG. 2 is an exemplary diagram showing a schematic configuration of an image captioning apparatus according to an embodiment of the present invention.

As shown in FIG. 2, the image captioning apparatus according to this embodiment includes a word embedding layer unit 110, an image feature extraction unit 120, a behavior recognition model unit 130, a caption driving unit 140, a morphology analysis unit 150, and a verb changing unit 160.

The word embedding layer unit 110 stores the words used for captioning.

For example, the word embedding layer unit 110 corresponds to a kind of database in which learned words are stored.

The image feature extraction unit 120 extracts features of the input image.

The image feature extraction unit 120 extracts the image features using a previously learned feature extraction scheme.

The behavior recognition model unit 130 receives the same image that is input to the image feature extraction unit 120 and classifies behavior features in the image using a previously learned behavior recognition scheme.

The caption driving unit 140 performs image captioning by extracting, from the word embedding layer unit 110, associated words corresponding to the features extracted by the image feature extraction unit 120. In other words, it expresses the features of the image in the form of a sentence.

The caption driving unit 140 includes a plurality of long short-term memory (LSTM) units, and the actual captioning operation is performed in the LSTMs.

The morphology analysis unit 150 analyzes the part of speech of each word in the sentence captioned by the caption driving unit 140. For example, it distinguishes the subject, verb, object, and complement in the captioned sentence.

The verb changing unit 160 checks whether a verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit 130 and, when a specified condition is not satisfied, changes the verb in the captioned sentence to the verb corresponding to the behavior to obtain the final captioning sentence (see FIG. 3).

To aid understanding, the operation of the caption driving unit 140 is described in more detail below.

The caption driving unit 140 encodes the features of the image using a convolutional neural network (CNN), and the encoded image features are input to a decoding stage in which word features are embedded. In the decoding stage, the caption driving unit 140 uses the input feature vector space to extract words that match the features of the image. Because the caption driving unit 140 must estimate several kinds of image features, including behavior and situation, from a single input image with a single CNN, its accuracy may drop considerably.

Therefore, in this embodiment, the weak part (e.g., behavior-related features) is supplemented by the behavior recognition model unit 130, which is trained in advance on a behavior recognition dataset. Since the weak part corresponds to the verb, the morphology analysis unit 150 analyzes the part of speech (POS) of each word in the caption output by the image captioning (i.e., the captioned sentence) in order to locate the word corresponding to the verb.

The verb changing unit 160 then determines whether to replace the verb on the basis of the loss values of the generated caption (i.e., the captioned sentence).

Here, to prevent errors that arise when the loss value of a part of speech other than the verb is low, the verb is changed only when predetermined conditions are satisfied, using the algorithm shown in FIGS. 3 and 4.

FIG. 3 is an exemplary diagram illustrating the verb replacement process of the verb changing unit in FIG. 2, and FIG. 4 is an exemplary diagram showing an algorithm that checks whether the conditions for changing the verb in FIG. 3 are satisfied.

For reference, in FIG. 3, S_n (n = 0, 1, ...) denotes a word and W_e denotes the word embedding. The words have already been learned, and using the space in which the words were learned is called word embedding. In FIG. 3, W_e S_n denotes the process of feeding an input word into the word embedding so that the LSTM can output the next word, and the value output through the softmax function is repeatedly input to the next LSTM step. That is, when S₀ is input and S₁ is output, S₁ is in turn input to the next LSTM step, and this process is repeated.
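The feedback loop described for FIG. 3 can be sketched as a greedy decoding loop. This is a minimal illustration under assumptions, not the implementation in the patent: `greedy_decode`, `toy_step`, and the four-word vocabulary are hypothetical names, and `toy_step` is a stand-in for one LSTM step (embedding lookup plus recurrent update) that returns unnormalized scores.

```python
import math


def softmax(scores):
    """Convert unnormalized scores to probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(step_fn, start_id, end_id, initial_state=None, max_len=20):
    """Feed each predicted word back in as the next input until end_id appears.

    step_fn(token_id, state) -> (scores, new_state) stands in for one LSTM step;
    initial_state would carry the encoded image features in a real captioner.
    """
    token_ids = []
    state = initial_state
    current = start_id
    for _ in range(max_len):
        scores, state = step_fn(current, state)
        probs = softmax(scores)
        # Pick the most probable next word (greedy choice).
        current = max(range(len(probs)), key=probs.__getitem__)
        if current == end_id:
            break
        token_ids.append(current)
    return token_ids


# Hypothetical toy vocabulary and step function, for illustration only.
VOCAB = ["<start>", "two", "men", "<end>"]


def toy_step(token_id, state):
    """Always favors the word that follows token_id in VOCAB."""
    scores = [0.0] * len(VOCAB)
    scores[min(token_id + 1, len(VOCAB) - 1)] = 5.0
    return scores, state


decoded = greedy_decode(toy_step, start_id=0, end_id=3)
caption = " ".join(VOCAB[i] for i in decoded)
```

With the toy step function, the loop emits the second and third vocabulary entries and stops at the end token; a trained LSTM would replace `toy_step`.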

In FIG. 3, the conditions for changing the verb are checked as follows: whether the loss value of a specified word (e.g., S₂) is smaller than a specified value (e.g., -3), whether the part of speech of the specified word (e.g., S₂) is a verb (VB: base form), and whether the behavior recognition accuracy obtained by the behavior recognition model unit 130 is greater than a specified criterion (e.g., 80%).

When all of the specified conditions are satisfied, the verb changing unit 160 changes the verb in the final caption (i.e., the captioned sentence) to the word corresponding to the behavior recognized by the behavior recognition model unit 130.
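The three-part check can be written down as a small predicate. The sketch below is illustrative only: `CaptionWord`, `should_replace_verb`, and `replace_verbs` are hypothetical names, the thresholds are the example values quoted in the description (-3 and 80%), and the set of accepted verb tags is an assumption that combines the "verb" condition with the exclusion of VBP and VBZ mentioned in the description.

```python
from dataclasses import dataclass

# Example values from the description; real thresholds would be tuned.
LOSS_THRESHOLD = -3.0
ACTION_CONFIDENCE_THRESHOLD = 0.80
# Assumption: tense-related verb tags only (VBP and VBZ are left out).
REPLACEABLE_VERB_TAGS = frozenset({"VB", "VBD", "VBG", "VBN"})


@dataclass
class CaptionWord:
    text: str
    pos_tag: str
    loss: float  # per-word loss value reported by the captioning model


def should_replace_verb(word, action_confidence):
    """Return True only when all three conditions hold."""
    return (
        word.loss < LOSS_THRESHOLD
        and word.pos_tag in REPLACEABLE_VERB_TAGS
        and action_confidence > ACTION_CONFIDENCE_THRESHOLD
    )


def replace_verbs(words, action_verb, action_confidence):
    """Swap qualifying verbs for the recognized action and rebuild the caption."""
    return " ".join(
        action_verb if should_replace_verb(w, action_confidence) else w.text
        for w in words
    )


# Hypothetical caption with made-up loss values, for illustration only.
example = [
    CaptionWord("a", "DT", -0.2),
    CaptionWord("man", "NN", -0.5),
    CaptionWord("is", "VBZ", -0.1),
    CaptionWord("running", "VBG", -3.6),
    CaptionWord("outside", "RB", -4.0),
]
final_caption = replace_verbs(example, "walking", action_confidence=0.92)
```

Note that "outside" has a low loss value but is not a verb, so it is left alone; this is the kind of error the part-of-speech condition is meant to prevent.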

For reference, the morphology analysis unit 150 analyzes the parts of speech of the image caption (i.e., the sentence describing the image features) using a natural language processing module, the Natural Language Toolkit (NLTK). With NLTK's pos_tag method, each word in the caption can be labeled with the tag that NLTK maps to that word in advance. NLTK has a total of six verb-related tags, as shown in FIG. 5. Because the classes of the behavior recognition model contain no person-inflected verbs, the verb tags other than VBP and VBZ (i.e., the tags corresponding to tense) can be used.
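The tag filtering step can be sketched as follows. The example works on hand-written `(word, tag)` pairs so that it runs without NLTK's tagger data; in practice such pairs would come from `nltk.pos_tag(nltk.word_tokenize(caption))`. The function name `candidate_verb_positions` is hypothetical.

```python
# The six Penn Treebank verb tags produced by NLTK's default English tagger.
VERB_TAGS = ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")
# Present-tense tags tied to grammatical person, excluded per the description.
PERSON_INFLECTED_TAGS = {"VBP", "VBZ"}


def candidate_verb_positions(tagged_caption):
    """Return indices of words whose tag is a verb tag other than VBP/VBZ."""
    allowed = set(VERB_TAGS) - PERSON_INFLECTED_TAGS
    return [i for i, (_, tag) in enumerate(tagged_caption) if tag in allowed]


# Hand-tagged example caption (tags written by hand, not tagger output).
tagged = [
    ("Two", "CD"),
    ("men", "NNS"),
    ("are", "VBP"),
    ("shaking", "VBG"),
    ("hands", "NNS"),
]
positions = candidate_verb_positions(tagged)
```

Here only "shaking" is a replacement candidate; "are" is skipped because of its VBP tag.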

The embodiment described so far increases the accuracy of image captioning by changing the word corresponding to the verb in a caption to the word corresponding to the behavior determined by a pre-trained behavior recognition model.

However, this method is not limited to verbs and can also be applied to the other parts of speech that make up a sentence.

FIG. 6 is an exemplary diagram showing a schematic configuration of a captioning apparatus that can change or add words corresponding to a plurality of parts of speech according to another embodiment of the present invention. The apparatus further includes a face recognition model 210, a behavior recognition model 220, and a location data model 230. In an already captioned sentence, it improves captioning accuracy by adding the name of the face-recognized user at a position designated through learning, changing the verb designated through learning to a word that better matches the action (Action), and further adding the name of the location (or space) data (Location) at a position designated through learning.

As shown in FIG. 6, more accurate captioning can be performed by analyzing the part of speech of the words output from each LSTM and replacing them with the outputs of the matching neural network models 210, 220, and 230 (e.g., the face recognition model, the behavior recognition model, and the location data model). For example, if a word output from the LSTM is a noun, it can be replaced with the user's name output by the face recognition model, and the location data of the camera device can be used to caption the behavior of a user working at a given site.
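The multi-model replacement of FIG. 6 can be illustrated with a small slot-filling sketch. Everything here is an assumption made for illustration: the function `personalize_caption`, the rule of replacing the first noun and first tense verb, dropping a determiner before the inserted name, and appending the location with "at" are not specified in the description, which only says the positions are designated through learning.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN"}  # VBP/VBZ left untouched


def personalize_caption(tagged_caption, user_name=None, action_verb=None,
                        location_name=None):
    """Fill the subject, verb, and location slots of a POS-tagged caption."""
    words = []
    prev_tag = None
    name_used = False
    verb_used = False
    for word, tag in tagged_caption:
        if user_name and not name_used and tag in NOUN_TAGS:
            # Assumed rule: drop an article directly before the replaced noun.
            if words and prev_tag == "DT":
                words.pop()
            words.append(user_name)
            name_used = True
        elif action_verb and not verb_used and tag in VERB_TAGS:
            words.append(action_verb)
            verb_used = True
        else:
            words.append(word)
        prev_tag = tag
    if location_name:
        # Assumed rule: the location is appended as a trailing phrase.
        words.extend(["at", location_name])
    return " ".join(words)


# Hypothetical model outputs, for illustration only.
tagged = [("a", "DT"), ("man", "NN"), ("is", "VBZ"), ("running", "VBG")]
result = personalize_caption(
    tagged, user_name="Kim", action_verb="welding", location_name="Substation A"
)
```

Each argument is optional, so a caption can be passed through unchanged when a model produces no output.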

As described above, this embodiment enables captioning that describes behavior more accurately, which is a weakness of existing image captioning. As a result, information on the behavior of field workers, for example, can be collected more accurately, and if the behavior recognition model is trained with a focus on dangerous behaviors, workers' dangerous behaviors can be described, which makes hazardous situations easier to manage.

For reference, in this embodiment the dataset for behavior recognition was built by using the natural language processing module to extract only the verbs from the image description data of an existing dataset (e.g., the Flickr 8k dataset). The dataset built by collecting only the verbs has 1523 classes in total, with roughly 10 images per verb class. Person-inflected verbs (e.g., is, are) were excluded. Experiments showed that, when training on an image captioning dataset alone, the behavior recognition part is learned with a very large number of verb classes and only a few images per class. In this embodiment, it was confirmed that the accuracy of image captioning increases when a pre-trained behavior recognition model handles the behavior recognition part, which an existing image captioning model has difficulty judging accurately.
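The dataset construction described above can be sketched as grouping image identifiers by the verbs found in their captions. This is an illustrative assumption of how such a grouping might look: `build_verb_classes` and the sample records are hypothetical, the captions are hand-tagged rather than tagger output, and no lemmatization is applied.

```python
from collections import defaultdict

# Verb tags kept as classes; VBP/VBZ (e.g., "are", "is") are excluded.
CLASS_VERB_TAGS = {"VB", "VBD", "VBG", "VBN"}


def build_verb_classes(tagged_captions):
    """Map each verb to the set of image IDs whose captions contain it.

    tagged_captions: iterable of (image_id, [(word, tag), ...]) pairs.
    """
    classes = defaultdict(set)
    for image_id, tagged in tagged_captions:
        for word, tag in tagged:
            if tag in CLASS_VERB_TAGS:
                classes[word.lower()].add(image_id)
    return dict(classes)


# Hand-tagged sample records, for illustration only.
samples = [
    ("img1", [("a", "DT"), ("dog", "NN"), ("is", "VBZ"), ("running", "VBG")]),
    ("img2", [("two", "CD"), ("dogs", "NNS"), ("are", "VBP"), ("running", "VBG")]),
    ("img3", [("a", "DT"), ("boy", "NN"), ("is", "VBZ"), ("jumping", "VBG")]),
]
verb_classes = build_verb_classes(samples)
images_per_class = {verb: len(ids) for verb, ids in verb_classes.items()}
```

Counting the images per class in this way makes the imbalance noted in the description (many classes, few images each) easy to inspect.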

The present invention has been described above with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. The technical scope of protection of the present invention should therefore be determined by the claims below. The implementations described herein may be realized as, for example, a method or process, an apparatus, a software program, a data stream, or a signal. Even if a feature is discussed only in the context of a single form of implementation (e.g., only as a method), it may also be implemented in other forms (e.g., as an apparatus or a program). An apparatus may be implemented in suitable hardware, software, firmware, and the like. A method may be implemented in an apparatus such as a processor, which generally refers to a processing device including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices that facilitate the communication of information between end users, such as computers, cell phones, portable/personal digital assistants (PDAs), and other devices.

110: word embedding layer unit
120: image feature extraction unit
130: behavior recognition model unit
140: caption driving unit
150: morphology analysis unit
160: verb changing unit

Claims

1. An image captioning apparatus comprising:
a word embedding layer unit that stores learned words for image captioning;
an image feature extraction unit that receives an image and extracts features of the image according to a previously learned feature extraction scheme;
a behavior recognition model unit that receives the same image and classifies behavior features in the image according to a previously learned behavior recognition scheme;
a caption driving unit that performs captioning by extracting, from the word embedding layer unit, associated words corresponding to the features extracted by the image feature extraction unit;
a morphology analysis unit that analyzes the part of speech of each word in the sentence captioned by the caption driving unit; and
a verb changing unit that checks whether a verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit and, when a specified check condition is not satisfied, changes the verb in the captioned sentence to the verb corresponding to the behavior to obtain a final captioning sentence.

2. The image captioning apparatus of claim 1, wherein the caption driving unit
includes a plurality of long short-term memory (LSTM) units.

3. The image captioning apparatus of claim 1, wherein the verb changing unit
checks whether the loss value of a specified word is smaller than a specified value, whether the part of speech of the specified word is a verb, and whether the behavior recognition accuracy obtained by the behavior recognition model unit is greater than a specified criterion, and, when all of the check conditions are satisfied, changes the verb in the final caption to the word corresponding to the behavior recognized by the behavior recognition model unit.

4. The image captioning apparatus of claim 1, further comprising, in order to change or add words corresponding to a plurality of parts of speech in the captioning sentence:
a face recognition model that recognizes a face in the image; and a location data model that recognizes a location using location data of the camera that captured the image,
wherein the apparatus is implemented to add the name of the face-recognized user at a predetermined position in the captioning sentence and to further add the name of the location data (Location) at a predetermined position in the captioning sentence.

5. An image captioning method comprising:
storing learned words for image captioning in a word embedding layer unit;
receiving, by an image feature extraction unit, an image and extracting features of the image according to a previously learned feature extraction scheme;
receiving, by a behavior recognition model unit, the same image and classifying behavior features in the image according to a previously learned behavior recognition scheme;
performing captioning, by a caption driving unit, by extracting from the word embedding layer unit associated words corresponding to the features extracted by the image feature extraction unit;
analyzing, by a morphology analysis unit, the part of speech of each word in the sentence captioned by the caption driving unit; and
checking, by a verb changing unit, whether a verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit and, when a specified check condition is not satisfied, changing the verb in the captioned sentence to the verb corresponding to the behavior to obtain a final captioning sentence.

6. The image captioning method of claim 5, wherein, in the step of checking whether the verb in the captioned sentence matches the behavior of the image extracted by the behavior recognition model unit,
the verb changing unit
checks whether the loss value of a specified word is smaller than a specified value, whether the part of speech of the specified word is a verb, and whether the behavior recognition accuracy obtained by the behavior recognition model unit is greater than a specified criterion, and, when all of the check conditions are satisfied, changes the verb in the final caption to the word corresponding to the behavior recognized by the behavior recognition model unit.

7. The image captioning method of claim 5, further comprising, in order to change or add words corresponding to a plurality of parts of speech in the captioning sentence:
recognizing a face in the image through a face recognition model; and recognizing a location through a location data model using location data of the camera that captured the image,
wherein the method is implemented to add the name of the face-recognized user at a predetermined position in the captioning sentence and to further add the name of the location data (Location) at a predetermined position in the captioning sentence.