KR20130078663A

KR20130078663A - Synchronized text editing method and apparatus based on image data

Info

Publication number: KR20130078663A
Application number: KR1020110147730A
Authority: KR
Inventors: 이인권; 이선영
Original assignee: 연세대학교 산학협력단; 포항공과대학교 산학협력단
Priority date: 2011-12-30
Filing date: 2011-12-30
Publication date: 2013-07-10
Also published as: KR101378493B1

Abstract

PURPOSE: A method and apparatus for setting text data synchronized to image data are provided so that, when the audio cannot be heard clearly, a viewer can easily tell which actor is speaking a displayed subtitle. CONSTITUTION: A text data setting apparatus groups image data based on a characteristic value or feature points included in the image data (S100). The apparatus maps text data to the grouped image data along the time axis of the image data (S200). The apparatus determines image coordinates from the feature points of the grouped image data using a preset weight function and generates position information for the text data based on the determined image coordinates (S300). [Reference numerals] (AA) Start; (BB) Finish; (S100) Group image data; (S200) Map text data; (S300) Determine text position

Description

Method and apparatus for setting text data synchronized to image data {SYNCHRONIZED TEXT EDITING METHOD AND APPARATUS BASED ON IMAGE DATA}

The present invention relates to a method and apparatus for setting text data and, more particularly, to a method or apparatus for setting display information of text data synchronized to image data based on feature points and the like included in the image data.

Recently, services such as K-Player and Pooq have made terrestrial broadcasts playable on PCs and smartphones, and the smart media field is developing rapidly, as shown by the commercialization of LTE in the mobile sector.

In particular, as video consumption on smart devices grows, users spend more playback time in the smart media environment, content production for that environment is in active demand, and there is a need to develop applications that satisfy the users of video playback applications.

As one kind of video playback application, systems that generate and render speech bubbles for videos containing subtitles are being developed. A speech bubble is a kind of text box configured to be displayed on an image, and it is used in various ways, for example for hearing-impaired viewers, for children, or for language education. The speech bubble must be placed close to the face of the speaker, such as the actor who is talking, and is subject to the constraint that it must not cover another person's face or any other important region.

In particular, an automatic face recognition algorithm can locate an actor's face, but a subtitle file contains only timing and dialogue text and provides no information about which actor speaks each line.

The present invention aims to map text data to a video automatically and to compute and provide an optimized position for a text box rendered as a speech bubble or the like.

To solve the above technical problem, one embodiment of the present invention provides a text data setting method comprising: grouping image data based on a characteristic value or feature points included in the image data; mapping text data to the grouped image data along the time axis of the image data; and a text positioning step of determining image coordinates from the feature points of the grouped image data using a preset weight function and generating position information for the text data based on the determined image coordinates.

In addition, the text positioning step may comprise: extracting at least one of a center point of an object, an area of an overlapped object, or saliency using the feature points of the grouped image data; calculating a result value using a preset weight function based on at least one of the extracted center point, overlapped-object area, or saliency; and determining image coordinates using the calculated result value and generating position information for the text data based on the determined image coordinates.

Preferably, calculating the result value may comprise: calculating a distance result value by applying a preset distance weight to the distance between the extracted center point of the object and the text data; calculating an area result value by applying a preset area weight to the ratio of the area of the overlapped object to the area used to display the text data; calculating a saliency result value by applying a preset saliency weight to the saliency at the position of the text data; and calculating a weighted result value based on at least one of the distance result value, the area result value, or the saliency result value.

In addition, calculating the weighted result value may consist of summing the distance result value, the area result value, and the saliency result value to generate the weighted result value.

In addition, calculating the saliency result value may use the FAST feature point extraction method to calculate the saliency result value of the text data.

In addition, grouping the image data may comprise: a scene change step of setting a data transition point where a characteristic value, defined by the distribution of pixel values in the image data, changes by at least a preset threshold, and grouping the time-axis sections of the video over which the characteristic value changes by no more than the preset threshold; and an object grouping step of grouping the time-axis sections of the video that share feature points identifying an object included in the image data.

Preferably, the object grouping step may comprise: extracting feature points of the image from the image data using Haar-like features; and grouping the extracted feature points using a face recognition technique based on principal component analysis (PCA).

In addition, mapping the text data may consist of using voice recognition to map the text data to the grouped image data along the time axis of the image data.

The method may further comprise storing the position information of the text data as metadata.

To solve the above technical problem, one embodiment of the present invention provides a text data setting apparatus comprising: an image grouping unit that groups image data based on a characteristic value or feature points included in the image data; a text data mapping unit that maps text data to the grouped image data along the time axis of the image data; and a text position determining unit that determines image coordinates from the feature points of the grouped image data using a preset weight function and generates position information for the text data based on the determined image coordinates.

The prior art merely provides subtitles fixed at a predetermined position on the video. According to the present invention, by contrast, setting the text data makes it easy to determine which actor is speaking a displayed subtitle when the sound cannot be heard accurately, for example for hearing-impaired viewers or in noisy outdoor settings.

FIG. 1 is a flowchart illustrating a method of setting text data synchronized to image data according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating an apparatus for setting text data synchronized to image data according to an embodiment of the present invention.
FIGS. 3 and 4 are reference diagrams for explaining a method of setting text data synchronized to image data according to an embodiment of the present invention.

Hereinafter, some embodiments of the present invention are described in detail with reference to the accompanying drawings. In describing the present invention, detailed descriptions of related well-known configurations or functions are omitted where they would obscure the gist of the invention.

The following embodiments combine the elements and features of the present invention in predetermined forms. Each element or feature may be considered optional unless explicitly stated otherwise. Each element or feature may be implemented without being combined with other elements or features. Some elements and/or features may also be combined to form an embodiment of the present invention. The order of the operations described in the embodiments may be changed. Some configurations or features of one embodiment may be included in another embodiment, or may be replaced with the corresponding configurations or features of another embodiment.

Embodiments of the present invention may be implemented through various means. For example, they may be implemented in hardware, firmware, software, or a combination thereof.

In a hardware implementation, the method according to embodiments of the present invention may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.

In a firmware or software implementation, the method according to embodiments of the present invention may be implemented in the form of modules, procedures, or functions that perform the functions or operations described above. The software code may be stored in a memory unit and executed by a processor. The memory unit may be located inside or outside the processor and may exchange data with the processor by various well-known means.

The specific terms used in the following description are provided to aid understanding of the present invention, and they may be replaced with other forms without departing from the technical spirit of the present invention.

A text data setting method according to an embodiment of the present invention is described with reference to FIG. 1. The method of setting text data using the text data setting apparatus comprises grouping image data (S100), mapping text data to the image data (S200), and a text positioning step (S300).

In step S100, the text data setting apparatus groups the image data based on a characteristic value or feature points included in the image data.

The characteristic value is a quantity defined by the distribution of pixel values in the image data. It can be computed by plotting the distribution of the RGB values or brightness values in the image; for example, the characteristic value may be implemented as a histogram. When the characteristic value changes by at least a preset threshold, the text data setting apparatus judges that a scene change has occurred and sets a data transition point, and it groups the time-axis sections of the video over which the characteristic value changes by no more than the preset threshold. The preset threshold is the value by which the apparatus decides that a scene change has occurred; preferably it is 85 when the image is processed in grayscale.
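The histogram-based scene split described above could be sketched as follows. The source does not say how the change between two histograms is measured, so this sketch assumes a one-dimensional earth mover's distance expressed in gray levels, which makes a threshold such as 85 meaningful on a 0 to 255 scale. The function names `gray_histogram`, `histogram_shift`, and `split_scenes` are illustrative, and frames are assumed to be flat sequences of gray values with equal pixel counts.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def gray_histogram(frame: Sequence[int], bins: int = 256) -> List[int]:
    """Count how many pixels of a grayscale frame fall on each gray level."""
    hist = [0] * bins
    for value in frame:
        hist[value] += 1
    return hist


def histogram_shift(h1: Sequence[int], h2: Sequence[int]) -> float:
    """Assumed change measure: 1-D earth mover's distance, in gray levels."""
    pixels = sum(h1)
    cdf1 = accumulate(h1)
    cdf2 = accumulate(h2)
    return sum(abs(a - b) for a, b in zip(cdf1, cdf2)) / pixels


def split_scenes(frames: Sequence[Sequence[int]],
                 threshold: float = 85.0) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index ranges, end exclusive, one per scene."""
    if not frames:
        return []
    hists = [gray_histogram(f) for f in frames]
    scenes = []
    start = 0
    for i in range(1, len(frames)):
        # A change at or above the threshold marks a data transition point.
        if histogram_shift(hists[i - 1], hists[i]) >= threshold:
            scenes.append((start, i))
            start = i
    scenes.append((start, len(frames)))
    return scenes
```

With two dark frames followed by two bright frames, `split_scenes` yields two ranges, one per scene.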

A feature point is a point that allows an object included in the image data to be identified; for example, it may be a point that identifies the face of a person shown in the video. Thus, according to an embodiment of the present invention, the time-axis sections of the video that share feature points can be formed into one image group. Specifically, feature points are extracted from the image data using Haar-like features. For example, the Haar-like-feature-based detector provided by OpenCV achieves a throughput of about 3 frames per second. Detections that are not faces are preferably removed by the user.

Grouping by feature points means that, along the time axis of the image data, the sections showing an object with the same feature points are set as one group. A group may be set by the user for each object, or it may be set automatically through an object recognition technique such as face recognition. For example, the extracted feature points may be grouped using a face recognition technique based on principal component analysis (PCA).
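As a minimal sketch of the time-axis grouping, suppose an upstream face recognizer (not shown, and not specified by the source) has already assigned one identity label per frame, with `None` where no face was found. Consecutive frames with the same label can then be merged into time intervals. The name `group_intervals` and the per-frame label format are assumptions made for illustration.

```python
from itertools import groupby
from typing import Hashable, List, Optional, Sequence, Tuple


def group_intervals(
    frame_labels: Sequence[Optional[Hashable]], fps: float
) -> List[Tuple[Hashable, float, float]]:
    """Merge runs of identical labels into (label, start_sec, end_sec) tuples."""
    intervals = []
    index = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        if label is not None:  # frames without a recognized face form no group
            intervals.append((label, index / fps, (index + length) / fps))
        index += length
    return intervals
```

Each resulting interval plays the role of one grouped section of the video for the later mapping step.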

In step S200, the text data setting apparatus maps the text data to the grouped image data along the time axis of the image data. In the present invention, the text data preferably carries time information or interval information along the time axis, and includes subtitle files in which timing and text information are present, such as the SMI file format, or structured text data.

If the text data contains a sentence delimiter preset by the user, the time axis can be divided so that the sentences are displayed sequentially even when they fall within the same time interval. For example, if the text data is "Hello/Long time no see" and the user has preset "/" as the sentence delimiter, the apparatus displays "Hello" first and then maps "Long time no see" to the video.
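The delimiter handling can be illustrated with a short sketch. The source does not state how the original interval is divided among the sentences, so this example assumes an equal split; `split_cue` is a hypothetical helper name.

```python
from typing import List, Tuple


def split_cue(start: float, end: float, text: str,
              delimiter: str = "/") -> List[Tuple[float, float, str]]:
    """Split one subtitle cue into sequential cues at a user-set delimiter.

    Assumption: the cue's time interval is divided equally among sentences.
    """
    parts = [p.strip() for p in text.split(delimiter) if p.strip()]
    if not parts:
        return []
    step = (end - start) / len(parts)
    return [(start + i * step, start + (i + 1) * step, part)
            for i, part in enumerate(parts)]
```

For a cue running from 10 s to 14 s with the text "Hello/Long time no see", this yields "Hello" for the first two seconds and "Long time no see" for the last two.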

According to an embodiment of the present invention, because the image data is grouped in step S100, a playback start time and a playback end time are calculated from the playback time of the grouped image data. These times determine the playback interval of the grouped image data, and the time information of the corresponding text data is matched against that interval. When grouped image data and text data match, the matching text data is mapped to the grouped image data. According to an embodiment of the present invention, the mapping may be set by the user, or it may be configured so that voice recognition automatically maps the text data to the grouped image data along the time axis of the image data.
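One simple way to realize the time matching is sketched below. The source says only that the cue's time information is matched against each group's playback interval; choosing the group with the largest temporal overlap is an assumption of this sketch, and `overlap_seconds` and `map_cues_to_groups` are illustrative names.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Cue = Tuple[float, float, str]         # (start_sec, end_sec, text)
Group = Tuple[Hashable, float, float]  # (label, start_sec, end_sec)


def overlap_seconds(a_start: float, a_end: float,
                    b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, or 0 if disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def map_cues_to_groups(
    cues: Sequence[Cue], groups: Sequence[Group]
) -> List[Tuple[str, Optional[Hashable]]]:
    """Pair each cue with the group it overlaps most (None if no overlap)."""
    mapping = []
    for c_start, c_end, text in cues:
        best_label, best_shared = None, 0.0
        for label, g_start, g_end in groups:
            shared = overlap_seconds(c_start, c_end, g_start, g_end)
            if shared > best_shared:
                best_label, best_shared = label, shared
        mapping.append((text, best_label))
    return mapping
```

Cues that fall outside every group are left unmapped, which is where a manual setting by the user or voice recognition, as the text describes, would take over.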

In step S300, the text data setting apparatus determines image coordinates from the feature points of the grouped image data using a preset weight function and generates position information for the text data based on the determined image coordinates. Step S300 may comprise: extracting at least one of a center point of an object, an area of an overlapped object, or saliency using the feature points of the grouped image data; calculating a result value using a preset weight function based on at least one of the extracted center point, overlapped-object area, or saliency; and determining image coordinates using the calculated result value and generating position information for the text data based on the determined image coordinates. Saliency is the degree to which something stands out visually from the surrounding image regions or objects; the higher the saliency, the more easily the user notices it.

Calculating the result value comprises: calculating a distance result value by applying a preset distance weight to the distance between the extracted center point of the object and the text data; calculating an area result value by applying a preset area weight to the ratio of the area of the overlapped object to the area used to display the text data; calculating a saliency result value by applying a preset saliency weight to the saliency at the position of the text data; and calculating a weighted result value based on at least one of the distance result value, the area result value, or the saliency result value.

The step of calculating the distance result value first calculates the distance between the center point of the extracted object and the text data. The center point of an object is the coordinate at the center of the feature-point coordinates of the object identified by those feature points, and the distance between the object's center point and the text data is the distance to the center coordinate of the text box in which the text data is displayed. Because this is an absolute difference, a non-negative value is preferably obtained through squaring or a similar operation. The preset distance weight expresses how much importance the user places on the distance between the object and the text. Because the text should be close to the object, the distance weight is set in the range 0 to 1 to keep the text from drifting too far from the object and to keep the text display area from growing excessively; a value of 0.2 is preferred, at which user preference is high.

The step of calculating the area result value by applying a preset area weight to the ratio of the area of the overlapped object to the area used to display the text data computes the area of the display region, such as a text box, and then computes the proportion of that area that overlaps other objects, so that covering other objects can be avoided. The preset area weight expresses how much importance is placed on the text data not covering another object. To prevent other objects from being covered, the area weight is set in the range 0 to 100, and preferably to 100, at which user preference is high.

The step of calculating the saliency result value by applying a preset saliency weight to the saliency at the position of the text data produces a value that serves as the criterion for whether the text data can be displayed clearly in the video. The saliency weight is a weight used to compute the region in which the text data can be displayed with visual clarity (the important region); it is set in the range 0 to 1, and preferably to 1, at which the text is displayed clearly to the user and user preference is high. According to an embodiment of the present invention, the saliency result value of the text data can be calculated using the FAST feature point extraction method [see E. Rosten et al., 2010].

The step of calculating the weighted result value based on at least one of the distance result value, the area result value, or the saliency result value uses the individually calculated values to optimize the position at which the text data is displayed. According to an embodiment of the present invention, for any of the distance, area, or saliency result values for which the user has supplied no input or which has not been calculated, the weight is set to 0 when the weighted result value is calculated. For example, the Nelder-Mead simplex optimization method can be used to find the point at which the weighted result value is minimized, and that point can be taken as the position information.
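The weighted result value and its minimization might look like the sketch below, which rests on several assumptions not stated in the source: coordinates are normalized to the unit frame, the saliency term is the fraction of given salient points (for example FAST corners) that the box would cover, and the speaker's own face is passed in the list of regions to avoid so that the bubble does not land on it. A coarse grid search is used here in place of the Nelder-Mead simplex method named in the text, purely to keep the example dependency-free. All function names are illustrative.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height (top-left origin)
Point = Tuple[float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(0.0, w) * max(0.0, h)


def energy(box: Box, speaker_center: Point, avoid: Sequence[Box],
           salient_points: Sequence[Point],
           w1: float = 0.2, w2: float = 100.0, w3: float = 1.0) -> float:
    """Weighted sum of distance, overlap-ratio, and saliency terms."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    # Squared distance between the box center and the speaker's center point.
    distance = (cx - speaker_center[0]) ** 2 + (cy - speaker_center[1]) ** 2
    # Ratio of covered object area to the text box area.
    covered = sum(intersection_area(box, region) for region in avoid) / (w * h)
    # Assumed saliency term: fraction of salient points hidden by the box.
    inside = sum(1 for px, py in salient_points
                 if x <= px <= x + w and y <= py <= y + h)
    saliency = inside / len(salient_points) if salient_points else 0.0
    return w1 * distance + w2 * covered + w3 * saliency


def best_position(size: Tuple[float, float], speaker_center: Point,
                  avoid: Sequence[Box], salient_points: Sequence[Point],
                  steps: int = 20) -> Optional[Box]:
    """Grid search (stand-in for Nelder-Mead) for the lowest-energy box."""
    w, h = size
    best_box, best_energy = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            box = (i * (1 - w) / steps, j * (1 - h) / steps, w, h)
            e = energy(box, speaker_center, avoid, salient_points)
            if e < best_energy:
                best_box, best_energy = box, e
    return best_box
```

The default weights 0.2, 100, and 1 are the preferred values given in the description. With the area weight so much larger than the others, any candidate that covers a face is effectively ruled out, and the distance term then picks the nearest free spot.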

According to an embodiment of the present invention, the step of calculating the weighted result value may generate the weighted result value by summing the distance result value, the area result value, and the saliency result value.

The generation of the position information of the text data is described with reference to [Equation 1] as follows.

The goal is to obtain the position information (x) of the text data (WB). The position information (x) may be set as coordinates and comprises a horizontal coordinate (coord_x), a vertical coordinate (coord_y), a width, and a height. The difference between the object's center point (fi) and the position of the text data is squared to produce a non-negative value, and the distance weight (w1) is applied to it to give the distance result value. The area weight (w2) is applied to the ratio of the area of the overlapped object (overlap) to the area of the region used to display the text data (Area(WB)) to give the area result value. For saliency (Saliency), the saliency weight (w3) is used to compute the region in which the text data is presented effectively to the user (the important region), giving the saliency result value. The weighted result value (E(x)) is the sum of the distance result value, the area result value, and the saliency result value.
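Equation 1 itself is not reproduced in this text. Based only on the term-by-term description above, one plausible form is the following. It is a reconstruction under assumptions, not the published equation: c(WB) is introduced here to denote the center of the text box, the index i is taken to denote the object associated with the text, and the exact form of the saliency term is not recoverable from the description.

```latex
x = (\mathit{coord\_x},\ \mathit{coord\_y},\ \mathit{width},\ \mathit{height})

E(x) = w_1 \,\lVert c(WB) - f_i \rVert^2
     + w_2 \,\frac{\mathit{overlap}}{\mathrm{Area}(WB)}
     + w_3 \,\mathrm{Saliency}(WB)
```

The position information is then the x that minimizes E(x).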

According to an embodiment of the present invention, the method may further comprise the text data setting apparatus storing, as metadata, the position information of the text data calculated according to the weighted result value. For example, for a text box that can be displayed as a speech bubble, the position, width, and height of the text box are stored as metadata in a file in the SMI subtitle format, so that other users can be given an SMI subtitle file whose metadata causes the text box to be overlaid as the video plays.

A text data setting apparatus according to an embodiment of the present invention is described with reference to FIG. 2. Descriptions identical to those given above for the text data setting method are not repeated.

The text data setting apparatus comprises an image grouping unit (100), a text data mapping unit (200), and a text position determining unit (300).

The image grouping unit (100) groups image data based on a characteristic value or feature points included in the image data. It comprises a scene change setting unit (110), which sets a data transition point where the characteristic value, defined by the distribution of pixel values in the image data, changes by at least a preset threshold and groups the time-axis sections of the video over which the characteristic value changes by no more than the preset threshold, and an object grouping unit, which groups the time-axis sections of the video that share feature points identifying an object included in the image data.

The text data mapping unit (200) maps text data to the grouped image data along the time axis of the image data.

The text position determining unit (300) determines image coordinates from the feature points of the grouped image data using a preset weight function and generates position information for the text data based on the determined image coordinates. The text position determining unit (300) comprises an extraction unit (310), which extracts at least one of a center point of an object, an area of an overlapped object, or saliency using the feature points of the grouped image data; a calculation unit (320), which calculates a result value using a preset weight function based on at least one of the extracted center point, overlapped-object area, or saliency; and a position information generating unit (330), which determines image coordinates using the calculated result value and generates position information for the text data based on the determined image coordinates.

The calculator 320 includes a distance result calculator (not shown), an area result calculator (not shown), a saliency result calculator (not shown), and a weighted result calculator (not shown). The distance result calculator calculates a distance result value by applying a preset distance weight to the distance between the center point of the extracted object and the text data. The area result calculator calculates an area result value by applying a preset area weight to the ratio between the area of the overlap object and the area for displaying the text data. The saliency result calculator calculates a saliency result value by applying a preset saliency weight to the saliency at the position of the text data. The weighted result calculator calculates a weighted result value based on at least one of the distance result value, the area result value, or the saliency result value. Preferably, the weighted result calculator generates the weighted result value by summing the distance result value, the area result value, and the saliency result value.
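
The weighted sum can be illustrated with a short sketch. Several points are assumptions made for this example and are not stated in this document: the weighted result is treated as a cost, so the candidate text box with the lowest value is selected; the overlap ratio is taken relative to the text box area; and saliency is supplied by a caller-provided function. The names `weighted_result` and `best_position` are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def box_area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def overlap_area(a: Box, b: Box) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def weighted_result(text_box: Box, object_center: Point, object_box: Box,
                    saliency_at: Callable[[Box], float],
                    w_distance: float, w_area: float, w_saliency: float) -> float:
    """Sum of the weighted distance, area, and saliency result values."""
    text_center = ((text_box[0] + text_box[2]) / 2, (text_box[1] + text_box[3]) / 2)
    distance = math.hypot(text_center[0] - object_center[0],
                          text_center[1] - object_center[1])
    text_area = box_area(text_box)
    ratio = overlap_area(text_box, object_box) / text_area if text_area else 0.0
    return w_distance * distance + w_area * ratio + w_saliency * saliency_at(text_box)


def best_position(candidates: List[Box], object_center: Point, object_box: Box,
                  saliency_at: Callable[[Box], float],
                  w_distance: float, w_area: float, w_saliency: float) -> Box:
    """Candidate box with the lowest weighted result (assumed selection rule)."""
    return min(candidates, key=lambda box: weighted_result(
        box, object_center, object_box, saliency_at, w_distance, w_area, w_saliency))


object_center = (50.0, 50.0)
object_box = (40.0, 40.0, 60.0, 60.0)
on_object = (40.0, 45.0, 60.0, 55.0)    # covers the object
below_object = (40.0, 62.0, 60.0, 72.0)  # close to the object, no overlap
far_corner = (0.0, 0.0, 20.0, 10.0)      # no overlap, far away
chosen = best_position([on_object, below_object, far_corner], object_center,
                       object_box, lambda box: 0.0, 1.0, 100.0, 0.0)
```

With these example weights, the candidate just below the object is chosen: it stays close to the object while avoiding the overlap penalty.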

According to an embodiment of the present invention, the apparatus may further include a metadata storage unit (not shown), which stores the position information of the text data as metadata.
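
One possible way to store the position information as metadata is sketched below. JSON and the field names (`captions`, `start`, `end`, `text`, `position`) are assumptions chosen for illustration; this document does not specify a metadata format.

```python
import json
from typing import Dict, List


def build_metadata(entries: List[Dict]) -> str:
    """Serialize caption position records to a JSON string (hypothetical schema)."""
    records = [
        {
            "start": entry["start"],              # seconds on the time axis
            "end": entry["end"],
            "text": entry["text"],
            "position": {"x": entry["x"], "y": entry["y"]},  # image coordinates
        }
        for entry in entries
    ]
    return json.dumps({"captions": records}, ensure_ascii=False, indent=2)


metadata = build_metadata([
    {"start": 1.0, "end": 3.0, "text": "Hello", "x": 120, "y": 340},
])
restored = json.loads(metadata)
```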

The step of mapping text data in the text data setting method according to an embodiment of the present invention will be described with reference to FIGS. 3 and 4.

A part of the image data 410 grouped by the image grouping step is displayed to the user (420). The text data setting apparatus may present the text data 430 that matches the grouped image data to the user together with the image, and may receive from the user a synchronization setting of the text data for the displayed image.

In addition, objects in the image that are determined to share feature points are marked so that the user can recognize them (440), and the objects may be grouped by receiving a Merge, Add, or Delete command from the user for the marked objects. The text data that matches the grouped image data is mapped to the grouped image (450).
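
The Merge, Add, and Delete commands can be sketched as operations on a table of object groups. The semantics used here are assumptions made for this example: each group is a named list of time-axis sections, Merge folds one group into another, Add attaches a section to a group, and Delete removes a section. The group names and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]      # (start_seconds, end_seconds)
Groups = Dict[str, List[Interval]]  # object group name -> time-axis sections


def merge_groups(groups: Groups, keep: str, absorb: str) -> None:
    """Merge: fold the sections of `absorb` into `keep` and drop `absorb`."""
    groups[keep] = sorted(groups[keep] + groups.pop(absorb))


def add_interval(groups: Groups, name: str, interval: Interval) -> None:
    """Add: attach a section to a group, creating the group if needed."""
    groups.setdefault(name, []).append(interval)
    groups[name].sort()


def delete_interval(groups: Groups, name: str, interval: Interval) -> None:
    """Delete: remove a section from a group and drop the group if it becomes empty."""
    groups[name].remove(interval)
    if not groups[name]:
        del groups[name]


object_groups: Groups = {
    "face_a": [(0.0, 2.0)],
    "face_b": [(5.0, 7.0)],
    "face_c": [(9.0, 10.0)],
}
merge_groups(object_groups, keep="face_a", absorb="face_b")
add_interval(object_groups, "face_a", (3.0, 4.0))
delete_interval(object_groups, "face_c", (9.0, 10.0))
```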

Embodiments of the present invention can be written as a computer program. The code and code segments constituting the computer program can be easily inferred by a computer programmer skilled in the art. The computer program is stored in a computer-readable information storage medium and implements the embodiments when read and executed by a computer. The information storage medium includes magnetic recording media, optical recording media, and carrier wave media.

The present invention has been described above with reference to preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense. The scope of the present invention is set forth in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

1. A text data setting method comprising:
grouping image data based on characteristic values or feature points included in the image data;
mapping text data to the grouped image data based on a time axis of the image data; and
a text positioning step of determining image coordinates from the feature points of the grouped image data using a preset weight function, and generating position information of the text data based on the determined image coordinates.

2. The method of claim 1, wherein the text positioning step comprises:
extracting at least one of a center point of an object, an area of an overlap object, or saliency by using the feature points of the grouped image data;
calculating a result value using a preset weight function based on at least one of the extracted center point of the object, the area of the overlap object, or the saliency; and
determining image coordinates by using the calculated result value, and generating position information of the text data based on the determined image coordinates.

3. The method of claim 2, wherein the calculating of the result value comprises:
calculating a distance result value by applying a preset distance weight to the distance between the center point of the extracted object and the text data;
calculating an area result value by applying a preset area weight to a ratio of the area of the overlap object to an area for displaying the text data;
calculating a saliency result value by applying a preset saliency weight to the saliency at a position of the text data; and
calculating a weighted result value based on at least one of the distance result value, the area result value, or the saliency result value.

4. The method of claim 3, wherein the calculating of the weighted result value comprises
generating the weighted result value by summing the distance result value, the area result value, and the saliency result value.

5. The method of claim 3, wherein the calculating of the saliency result value comprises
calculating the saliency result value of the text data using the FAST feature point extraction method.

6. The method of claim 1, wherein the grouping of the image data comprises:
a scene change step of setting a data transition point where a characteristic value, defined as the distribution of pixel values included in the image data, changes by a preset threshold or more, and grouping the time-axis sections of the image in which the characteristic value changes by no more than the preset threshold; and
an object grouping step of grouping the time-axis sections of the image that share feature points capable of identifying an object included in the image data.

7. The method of claim 6, wherein the object grouping step comprises:
extracting feature points of the image from the image data using Haar-like features; and
grouping the extracted feature points using a PCA-based face recognition technique.

8. The method of claim 1, wherein the mapping of the text data comprises
mapping the text data to the grouped image data based on the time axis of the image data using voice recognition.

9. The method of claim 1, further comprising
storing the position information of the text data as metadata.

10. A text data setting apparatus comprising:
an image grouping unit configured to group image data based on characteristic values or feature points included in the image data;
a text data mapping unit configured to map text data to the grouped image data based on a time axis of the image data; and
a text position determiner configured to determine image coordinates from the feature points of the grouped image data using a preset weight function, and to generate position information of the text data based on the determined image coordinates.

11. The apparatus of claim 10, wherein the text position determiner comprises:
an extraction unit configured to extract at least one of a center point of an object, an area of an overlap object, or saliency by using the feature points of the grouped image data;
a calculator configured to calculate a result value using a preset weight function based on at least one of the extracted center point of the object, the area of the overlap object, or the saliency; and
a location information generator configured to determine image coordinates using the calculated result value and to generate position information of the text data based on the determined image coordinates.

12. The apparatus of claim 11, wherein the calculator comprises:
a distance result calculator configured to calculate a distance result value by applying a preset distance weight to the distance between the center point of the extracted object and the text data;
an area result calculator configured to calculate an area result value by applying a preset area weight to a ratio of the area of the overlap object to an area for displaying the text data;
a saliency result calculator configured to calculate a saliency result value by applying a preset saliency weight to the saliency at a position of the text data; and
a weighted result calculator configured to calculate a weighted result value based on at least one of the distance result value, the area result value, or the saliency result value.

13. The apparatus of claim 12, wherein the weighted result calculator
generates the weighted result value by summing the distance result value, the area result value, and the saliency result value.

14. The apparatus of claim 10, wherein the image grouping unit comprises:
a scene change setting unit configured to set a data transition point where a characteristic value, defined as the distribution of pixel values included in the image data, changes by a preset threshold or more, and to group the time-axis sections of the image in which the characteristic value changes by no more than the preset threshold; and
an object grouping unit configured to group the time-axis sections of the image that share feature points capable of identifying an object included in the image data.

15. The apparatus of claim 10, further comprising
a metadata storage unit configured to store the position information of the text data as metadata.

16. A text data setting method comprising:
mapping structured caption data (text data) to image data; and
a text positioning step of generating attribute information of the structured caption data using feature points extracted to identify an object in the image data.

17. The method of claim 16, wherein the text positioning step comprises:
extracting at least one of a center point of the object, an area of an overlap object, or saliency using the feature points of the image data;
calculating a result value using a preset weight function based on at least one of the extracted center point of the object, the area of the overlap object, or the saliency; and
determining image coordinates based on the calculated result value, and generating attribute information of the structured caption data based on the determined image coordinates.

18. The method of claim 17, wherein the calculating of the result value comprises:
calculating a distance result value by applying a preset distance weight to the distance between the center point of the extracted object and the structured caption data;
calculating an area result value by applying a preset area weight to a ratio of the area of the overlap object to an area for displaying the structured caption data;
calculating a saliency result value by applying a preset saliency weight to the saliency at a position of the structured caption data; and
calculating a weighted result value based on at least one of the distance result value, the area result value, or the saliency result value.

19. A computer readable recording medium having recorded thereon the text data setting method of any one of claims 1 to 9 and 16 to 18 so as to be executable by a computer.