KR20100000256A

KR20100000256A - Method for displaying caption of moving picture

Info

Publication number: KR20100000256A
Application number: KR1020080059686A
Authority: KR
Inventors: 박승보; 조근식; 오경진
Original assignee: 인하대학교 산학협력단
Priority date: 2008-06-24
Filing date: 2008-06-24
Publication date: 2010-01-06
Also published as: KR100977079B1

Abstract

PURPOSE: A method for displaying a caption of a moving picture is provided that displays the caption near the speaker, according to the determined location of the speaker, thereby increasing the viewer's visual immersion in the moving picture. CONSTITUTION: Time information indicating when a caption is to be displayed is obtained (S710). Faces are detected in the image frames corresponding to the time information (S720). The speaker in those image frames is determined (S730). The range of the speaker's face area is determined using the detected face (S740). The area in which the caption is to be displayed is adaptively determined, and the caption is displayed in the determined area.

Description

Method for Displaying Caption of Moving Picture

The present invention relates to video display and, more particularly, to a method of displaying a caption of a video.

With recent advances in science and technology and rising economic standards, communication networks such as high-speed Internet have become widespread and the number of their users has grown rapidly. This growth has in turn made it possible to distribute a wide variety of video content over such networks.

Such video content includes videos in specialized fields such as education, medicine, and science, as well as videos in entertainment and culture such as broadcasts and music videos.

When such a video is produced in a foreign language, captions are displayed in the video to aid the viewer's understanding, and through these captions viewers can readily follow the conversations of the speakers appearing in the video.

More recently, captions are sometimes displayed not only in videos produced in a foreign language but also in videos produced in the viewer's own language, in order to make the video more entertaining.

Conventionally, however, captions have been displayed at a fixed part of the screen, for example at the bottom, regardless of where the speaker appears in the video. To understand what a speaker is saying, the viewer therefore has no choice but to alternate between watching the video and reading the caption at the bottom of the screen. This lowers the viewer's visual immersion in the video and, as a result, the viewer's understanding of it.

The present invention is intended to solve the above problem. Its technical object is to provide a method of displaying a caption of a video that can adaptively determine the position at which the caption is output, according to the location of the speaker as determined from the audio information and image information constituting the video.

To achieve this object, a method of displaying a caption of a video according to one aspect of the present invention includes: obtaining, from a caption to be displayed on the video, time information indicating when the caption is to be displayed; detecting faces in those image frames of the video that correspond to the time information; determining the speaker in the image frames corresponding to the time information, based on whether a feature point of a face detected in each of those image frames changes; determining the range of the speaker's face region within the image frames corresponding to the time information, using the face of the determined speaker; and adaptively determining the region in which the caption is to be displayed according to the range of the speaker's face region, and displaying the caption in the determined region.

In one embodiment, the caption displaying step includes: setting, among regions within a predetermined distance of the range of the speaker's face region, candidate regions in which the caption may be displayed; and determining, as the region in which the caption is to be displayed, the candidate region having the smallest color deviation and the highest saturation.

In this case, in the caption displaying step, the caption is displayed in a color complementary to the color of the region in which the caption is to be displayed.

The method may further include, before the caption displaying step, analyzing the audio information constituting the video to determine the spatial position at which the voice originates. In the caption displaying step, the spatial position of the voice is then used together with the face region range to determine the region in which the caption is to be displayed.

In this case, the spatial position of the voice is determined using at least one of the difference between the voice signals output on the respective channels and the voice frequency band.

According to the present invention, the caption of a video is displayed near the speaker, according to the location of the speaker as determined from the image information and audio information constituting the video, which increases the viewer's visual immersion in the video.

In addition, according to the present invention, even a viewer with a hearing impairment can easily follow the conversation when two or more speakers talk in the video.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a video playback device according to an embodiment of the present invention. The video playback device 100 displays a video and its captions on a display (not shown) and, as illustrated, includes a receiver 110, a time information detector 120, an image analyzer 130, a speaker determiner 140, a caption position determiner 150, and a video output unit 160.

The receiver 110 receives a video and its caption data from a storage medium (not shown), separates the received video into image information and audio information, provides the image information to the image analyzer 130 described below, and provides the caption data to the time information detector 120 described below.

The storage medium may be separate from the video playback device 100 or may be included in it. Although the receiver 110 is described above as receiving the video and its caption data from a storage medium, in a modified embodiment the video and its caption data may instead be downloaded over the Internet in streaming form.

The time information detector 120 detects, from the caption data provided by the receiver 110, the time information or image frame information indicating when each caption is to be displayed, and provides it to the image analyzer 130. The caption data may be in a file format such as smi, idx, sub, srt, psb, ssa, ass, or usf. An example of captions authored in the smi file format is given below.

Mulder! Look at this.

What's wrong, Scully?

In the example above, the caption "Mulder! Look at this." is displayed starting at 46.085 seconds, and the caption "What's wrong, Scully?" is displayed starting at 48.341 seconds. From this caption data, the time information detector 120 obtains the time information for the caption "Mulder! Look at this.", namely the interval from 46.085 seconds to 48.341 seconds.
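The interval extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the smi markup of the example did not survive in this text, so the `SYNC`/`P` tag layout and the `KRCC` class name in `SMI_SAMPLE` are assumptions, and only the two start times (46085 ms and 48341 ms) come from the description. The helper names `caption_intervals` and `frame_range` are hypothetical.

```python
import re

# Hypothetical smi fragment. The tag layout and class name are assumptions;
# only the start times 46085 ms and 48341 ms come from the description.
SMI_SAMPLE = """
<SYNC Start=46085><P Class=KRCC>Mulder! Look at this.
<SYNC Start=48341><P Class=KRCC>What's wrong, Scully?
"""

SYNC_RE = re.compile(r"<SYNC\s+Start=(\d+)>\s*<P[^>]*>([^<]*)", re.IGNORECASE)


def caption_intervals(smi_text):
    """Return (start_s, end_s, text) tuples; end_s is None for the last caption."""
    cues = [(int(ms) / 1000.0, text.strip()) for ms, text in SYNC_RE.findall(smi_text)]
    intervals = []
    for i, (start, text) in enumerate(cues):
        # A caption stays on screen until the next caption starts.
        end = cues[i + 1][0] if i + 1 < len(cues) else None
        intervals.append((start, end, text))
    return intervals


def frame_range(start_s, end_s, fps):
    """Frame indices covered by the interval [start_s, end_s) at a given frame rate."""
    return range(int(start_s * fps), int(end_s * fps))
```

Calling `frame_range` on the first interval gives the image frames that the image analyzer would need to inspect for that caption.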

Although time information is described above as being extracted from the caption data, when the caption data contains the frame numbers at which a caption is to be displayed, those frame numbers may be extracted directly and provided to the image analyzer 130.

The image analyzer 130 detects faces and face positions in each image frame corresponding to the time information provided by the time information detector 120, and provides them to the speaker determiner 140. The image analyzer 130 may detect faces using various methods. In one embodiment, it may detect faces using a skin color extraction method.

When the skin color extraction method is used, the image analyzer 130 marks in white those regions of the image frame whose color deviation from skin color is within a threshold, marks the remaining regions in black, and then looks for eyes within the white regions. Because the eyes are geometrically recessed in the face and differ markedly in color from the skin, they are easy to find within a white region. With this method, a white region containing two black dots is determined to be a face.
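A minimal sketch of the white/black marking step follows. It assumes a frame given as rows of `(r, g, b)` tuples; the reference skin color `skin_rgb`, the threshold value, the per-channel deviation measure, and both function names are illustrative assumptions rather than values or methods taken from the patent.

```python
def skin_mask(frame, skin_rgb=(200, 150, 130), threshold=60):
    """Mark skin-like pixels 255 (white) and all others 0 (black).

    frame: list of rows, each a list of (r, g, b) tuples.
    skin_rgb and threshold are assumed values for illustration only.
    """
    def deviation(pixel):
        # Largest per-channel distance from the reference skin color.
        return max(abs(pixel[i] - skin_rgb[i]) for i in range(3))

    return [[255 if deviation(p) <= threshold else 0 for p in row] for row in frame]


def dark_spots_inside(mask):
    """Return (y, x) of black pixels inside the bounding box of the white region.

    In a real detector these would be grouped into blobs and checked for an
    eye-like arrangement; here they are only listed.
    """
    white = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v == 255]
    if not white:
        return []
    ys = [y for y, _ in white]
    xs = [x for _, x in white]
    return [
        (y, x)
        for y in range(min(ys), max(ys) + 1)
        for x in range(min(xs), max(xs) + 1)
        if mask[y][x] == 0
    ]
```

A region whose bounding box contains exactly two such dark spots would then be a face candidate in the sense described above.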

To compensate for the weaknesses of the skin color extraction method (for example, it may fail to detect a face in profile or a face wearing glasses), it may be combined with a method that finds the contour of the face using motion-based edge difference images, or with a method that detects a face from the blinking of the eyes.

Besides these, various other methods may be used to detect faces in each image frame, such as methods based on local features of the face or template-based methods that use the overall shape of the face.

The speaker determiner 140 uses the faces detected in each image frame by the image analyzer 130 to determine the speaker in each image frame corresponding to the caption. A video usually consists of 24 to 30 image frames per second, and characters other than the speaker may also appear in each image frame, so not every face detected by the image analyzer 130 is the speaker's face. The speaker determiner 140 therefore identifies the speaker's face among the faces detected by the image analyzer 130, and thereby determines who is speaking in the current frame.

The way in which the speaker determiner 140 determines the speaker from the faces detected in each image frame by the image analyzer 130 is described in more detail with reference to FIG. 2.

Among the faces detected in each image frame by the image analyzer 130, the speaker determiner 140 determines the face whose feature points change to be the speaker's face, and thereby determines the speaker for that image frame. In one embodiment, the feature points may be parts of the face below the eyes, for example the mouth 210 or the area 212 around the chin. This is because the mouth and the area 212 around the chin generally move when a person speaks.

For example, as shown in FIG. 2, when the image frames corresponding to the time information of the caption are frame n and frame n+1, the feature points 210 and 212 of the face 200 have changed in frame n+1, so the face 200 of FIG. 2 is determined to be the speaker's face.

For example, when the face detected in frame n is 200 × 200 in width and height, the detected face is divided evenly into a grid of 30 columns by 30 rows, and the face detected in frame n+1 is divided in the same way. When the two are overlaid, a face for which the colors of the cells below the eyes in frame n+1 differ from the colors of the same cells in frame n is determined to be the speaker.
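The grid comparison above can be sketched as follows, assuming each face crop is a grayscale image stored as a list of rows and that the crops for frames n and n+1 have the same size. The eye line position `eye_row`, the thresholds `diff_thresh` and `min_cells`, and all function names are assumptions made for this illustration. The patent compares colors, while this sketch compares mean gray levels per cell to stay short.

```python
def cell_means(face, grid):
    """Mean intensity of each cell when the face crop is split into grid x grid cells.

    face: list of rows of gray levels; height and width must be at least grid.
    """
    h, w = len(face), len(face[0])
    means = []
    for gy in range(grid):
        y0, y1 = gy * h // grid, (gy + 1) * h // grid
        row = []
        for gx in range(grid):
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            vals = [face[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(vals) / len(vals))
        means.append(row)
    return means


def lower_face_changed(face_n, face_n1, grid=30, eye_row=0.5, diff_thresh=20, min_cells=5):
    """True if enough cells below the assumed eye line differ between the two frames."""
    a, b = cell_means(face_n, grid), cell_means(face_n1, grid)
    first = int(grid * eye_row)  # first grid row treated as "below the eyes"
    changed = sum(
        1
        for gy in range(first, grid)
        for gx in range(grid)
        if abs(a[gy][gx] - b[gy][gx]) > diff_thresh
    )
    return changed >= min_cells


def pick_speakers(faces_n, faces_n1, **kwargs):
    """faces_*: dict mapping a face id to its crop. Returns ids whose lower face changed."""
    return [
        fid
        for fid in faces_n
        if fid in faces_n1 and lower_face_changed(faces_n[fid], faces_n1[fid], **kwargs)
    ]
```

With the default `grid=30`, a 200 × 200 crop yields cells of 6 or 7 pixels on a side, which matches the division described above.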

Once the speaker has been determined, the speaker determiner 140 provides the position of the face determined to be the speaker's to the caption position determiner 150 described below.

The caption position determiner 150 uses the speaker's face position, as determined by the speaker determiner 140, to determine where the caption is to be displayed. Specifically, the caption position determiner 150 first uses the speaker's face position to determine candidate regions in which the caption may be displayed. In one embodiment, as shown in FIG. 3, a candidate region is a region within a predetermined distance of the speaker's face region range 310 detected from the image frames in which the caption is to be displayed, for example a region that lies no more than one line height away from the face region range 310 vertically, or no more than two character widths away from it horizontally.

Here, the speaker's face region range 310 means the area over which the speaker's face region, extracted from each image frame in which the caption is to be displayed, varies from frame to frame. In FIG. 3, the caption spans three image frames, and the area over which the speaker's face varies across those three frames is determined to be the speaker's face region range 310.

When no candidate region 320 satisfies these conditions, as shown in FIG. 4, an area to the left or right of the speaker's face region range 310 in which the caption text can be displayed with at least one line break, without encroaching on the face region range 310, may be set as a candidate region.

Among these candidate regions, any region in which the displayed caption would run off the screen or encroach on the speaker may be excluded.
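The face region range and the candidate filtering described above can be sketched as follows. Boxes are `(x0, y0, x1, y1)` in pixels. Placing exactly four candidates (above, below, left, right) at a single `gap` distance is an assumption of this sketch; the patent only requires candidates to lie within a predetermined distance of the face region range. All function names are hypothetical.

```python
def face_range(boxes):
    """Union bounding box of the speaker's face boxes across the caption's frames."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def overlaps(a, b):
    """True if two boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def candidate_regions(face, cap_w, cap_h, frame_w, frame_h, gap=0):
    """Candidate caption boxes around the face range, minus invalid ones.

    A candidate is dropped if the caption would leave the screen or
    overlap the face range.
    """
    fx0, fy0, fx1, fy1 = face
    cx = (fx0 + fx1 - cap_w) // 2  # caption centered horizontally on the face
    cy = (fy0 + fy1 - cap_h) // 2  # caption centered vertically on the face
    raw = {
        "above": (cx, fy0 - gap - cap_h, cx + cap_w, fy0 - gap),
        "below": (cx, fy1 + gap, cx + cap_w, fy1 + gap + cap_h),
        "left": (fx0 - gap - cap_w, cy, fx0 - gap, cy + cap_h),
        "right": (fx1 + gap, cy, fx1 + gap + cap_w, cy + cap_h),
    }
    keep = {}
    for name, r in raw.items():
        on_screen = r[0] >= 0 and r[1] >= 0 and r[2] <= frame_w and r[3] <= frame_h
        if on_screen and not overlaps(r, face):
            keep[name] = r
    return keep
```

In practice `gap` would be bounded by the one-line-height and two-character-width limits given in the description.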

The caption position determiner 150 then determines the caption display region from among the candidate regions. In one embodiment, it may select the candidate region with the smallest color deviation and the highest saturation as the caption display region. If the saturation of the selected caption display region is at or below a threshold, the caption color may be set to the color complementary to the average color of that region.
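One way to read the selection rule above is sketched below. The patent does not say how color deviation and saturation are combined, so ranking by deviation first and breaking ties by saturation is an assumption here, as is measuring deviation as the summed per-channel variance. The complementary color is taken as the RGB inverse; a 180° hue rotation would be another reasonable reading. Function names are hypothetical.

```python
import colorsys
from statistics import mean, pvariance


def region_stats(pixels):
    """Return (color_deviation, mean_saturation, mean_rgb) for a list of (r, g, b) pixels."""
    deviation = sum(pvariance([p[i] for p in pixels]) for i in range(3))
    saturation = mean(
        colorsys.rgb_to_hsv(p[0] / 255, p[1] / 255, p[2] / 255)[1] for p in pixels
    )
    average = tuple(round(mean(p[i] for p in pixels)) for i in range(3))
    return deviation, saturation, average


def choose_region(candidates):
    """candidates: dict name -> pixel list.

    Assumed ranking: smallest deviation first, ties broken by higher saturation.
    """
    def rank(name):
        deviation, saturation, _ = region_stats(candidates[name])
        return (deviation, -saturation)

    return min(candidates, key=rank)


def complementary(rgb):
    """RGB inverse of a color, used here as its complement (an assumption)."""
    return tuple(255 - c for c in rgb)
```

A caller would compute `complementary(region_stats(pixels)[2])` for the chosen region to obtain the caption color.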

In the embodiment above, the caption position determiner 150 determines the caption position using the range of the speaker's face region determined by the speaker determiner 140. In a modified embodiment, the caption position may instead be determined using the audio information constituting the video together with the face region range, or using the audio information alone. To this end, the video playback device 100 may further include a voice analyzer 155.

The voice analyzer 155 receives the audio information constituting the video from the receiver 110 and receives the time information of the caption from the time information detector 120. From the audio information of the image frames corresponding to that time information, it determines the spatial position at which the voice originates, and provides the determined spatial position to the speaker determiner 140 described above.

Specifically, the voice analyzer 155 first analyzes the format of the audio information provided by the receiver 110. In one embodiment, the audio format may be one in which a different signal is output on each channel, such as stereo, Dolby Digital, Dolby Surround, Dolby Pro Logic, Digital Surround, or DTS, and the voice analyzer 155 determines which of these formats the audio of the video uses.

The voice analyzer 155 then determines the spatial position at which the voice originates, using the difference between the voice signals of the channels or the voice frequency band. The audio of a video is generally stored in two (left and right) or more channels so that it matches the position of the sound source within the image frames, so the spatial position of the voice can be inferred from the differences between the channels.

More specifically, when the difference between the voice signals is used, suppose the analyzed audio format is stereo or Dolby Digital. In these formats the sound is recorded from different directions with separate microphones, so the loudness of the voice necessarily differs from channel to channel, and there is necessarily a difference in the time the voice takes to reach each microphone (a phase difference). The voice analyzer 155 uses these differences to determine whether the voice originates on the left or the right. In one embodiment, the voice analyzer 155 determines the left-right position from the time difference of the voice signal in the low-frequency band, and from the loudness difference in the high-frequency band.
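The two left-right cues described above can be sketched as follows for a stereo pair given as equal-length lists of samples. This is a minimal illustration under stated assumptions: the band splitting into low and high frequencies is omitted, the 1 dB margin and the lag search window are arbitrary, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def level_side(left, right, margin_db=1.0):
    """Side suggested by the loudness difference between channels."""
    eps = 1e-12  # avoids log of zero for silent channels
    diff_db = 20 * math.log10((rms(left) + eps) / (rms(right) + eps))
    if diff_db > margin_db:
        return "left"
    if diff_db < -margin_db:
        return "right"
    return "center"


def best_lag(left, right, max_lag):
    """Lag in samples that maximizes cross-correlation.

    A positive lag means the right channel is delayed relative to the left.
    """
    def corr(lag):
        return sum(
            left[i] * right[i + lag]
            for i in range(len(left))
            if 0 <= i + lag < len(right)
        )

    return max(range(-max_lag, max_lag + 1), key=corr)


def delay_side(left, right, max_lag=20):
    """Side suggested by the arrival-time difference between channels."""
    lag = best_lag(left, right, max_lag)
    if lag > 0:
        return "left"   # sound reached the left channel first
    if lag < 0:
        return "right"
    return "center"
```

Following the embodiment above, `delay_side` would be applied to a low-pass filtered copy of the signal and `level_side` to a high-pass filtered copy.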

When the voice frequency band is used, the voice analyzer 155 uses the spectral characteristics of the voice to determine whether the voice originates from the upper or lower part of the frame.

When the spatial position of the voice determined in this way is provided to the caption position determiner 150, the caption position determiner 150 also takes it into account when determining the caption display region.

When the caption position determiner 150 cannot determine the caption display region by the methods above, it may use a default position (one side of the screen, such as the right side or the bottom) as the caption position. This applies, for example, when the speaker does not appear in the image frames corresponding to the time information, so that the speaker's face and face position cannot be detected; when the speaker appears but the image frames are too dark for the speaker to be recognized; or when the audio is in a format unsuitable for determining the spatial position of the voice.
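The fallback order described above can be sketched as a small decision function. The priority used here (face region first, then audio side, then the default) and the return labels are assumptions for illustration; the patent leaves the exact combination of image and audio cues open.

```python
from typing import Optional, Tuple, Union

Box = Tuple[int, int, int, int]


def decide_position(
    face_region: Optional[Box],
    audio_side: Optional[str],
    default: str = "bottom",
) -> Tuple[str, Union[Box, str]]:
    """Pick a caption placement strategy from whatever cues are available.

    face_region: speaker's face region range, or None if no speaker was found
                 (not on screen, or frame too dark).
    audio_side:  "left" or "right" from audio analysis, or None if the audio
                 format does not allow localization.
    """
    if face_region is not None:
        return ("near_face", face_region)
    if audio_side in ("left", "right"):
        return ("screen_side", audio_side)
    return ("default", default)
```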

The caption positions determined by the caption position determiner 150 across these various situations are summarized in the table of FIG. 5.

Referring again to FIG. 1, the video output unit 160 outputs the video and its captions on a display (not shown), with each caption output in the caption display region determined by the caption position determiner 150. That is, as shown in FIG. 6, the video output unit 160 causes the caption to be output near the speaker within the image frame.

The method of displaying a caption of a video is described below with reference to FIG. 7.

FIG. 7 is a flowchart showing a method of displaying a caption of a video according to an embodiment of the present invention. First, when a video and its caption data are received (S700), the time information indicating when each caption is to be displayed is obtained from the caption data (S710). In one embodiment, the time information may be the time at which the caption is to be displayed, or the frame numbers at which it is to be displayed.

Next, faces and face positions are detected in those image frames of the video that correspond to the time information of the caption (S720). In one embodiment, face detection may be performed using the skin color method, optionally combined with a method that finds the contour of the face using motion-based edge difference images, or a method that detects a face from the blinking of the eyes.

Then, by judging whether the feature points of the detected faces change, the speaker in each image frame corresponding to the time information is determined (S730). The speaker must be determined because characters other than the speaker may also appear in each image frame, so not every face detected in S720 is the speaker's face. Identifying the speaker's face among the detected faces determines who is speaking in the current frame.

In determining the speaker, a detected face whose feature points change may be determined to be the speaker's face, where the feature points may be parts of the face below the eyes, for example the mouth or the area around the chin. More specifically, when the image frames corresponding to the time information of the caption are frame n and frame n+1, the difference between the faces detected in frame n and frame n+1 is computed, and a face for which this difference indicates that the feature-point area has changed is determined to be the speaker's face.

Next, using the speaker's face determined in S730, the range of the speaker's face region within the image frames corresponding to the time information of the caption is determined (S740). In one embodiment, the overall face area obtained by overlaying the faces detected in each of the image frames is determined to be the speaker's face region range.

Then, regions within a predetermined distance of the speaker's face region range determined in S740 are set as candidate regions in which the caption may be displayed (S750), for example regions that lie no more than one line height away from the face region range vertically, or no more than two character widths away from it horizontally. When no region satisfies these conditions, one side of the screen (for example, the right side or the bottom) may be set as the candidate region.

Next, the caption display region is selected from among the candidate regions (S760). In one embodiment, the candidate region with the smallest color deviation and the highest saturation may be selected as the caption display region.

In the embodiment above, the caption position is determined using the speaker's face detected in each image frame. In a modified embodiment, the spatial position of the voice may be determined by analyzing the audio information of the image frames corresponding to the time information (S770), and the caption display region may be determined using this spatial position as well. The spatial position of the voice may be determined by analyzing the format of the audio signal and then, according to that format, using the difference between the voice signals of the channels and the frequency spectrum characteristics of the voice signal.

Finally, the video and its caption are output, with the caption output in the caption display area determined in step S770 (S780). The caption may be output in a color complementary to the color of the caption display area.
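The complementary-color rule can be sketched as follows. The specification does not define how the color of the caption display area is measured or how its complement is computed, so this example assumes the area color is the average RGB value of its pixels and takes the complement by inverting each 8-bit channel, which is one common definition. The function name is hypothetical.

```python
def complementary_color(pixels):
    """Return the RGB complement of the average color of `pixels`.

    `pixels` is a non-empty list of (r, g, b) tuples with components
    in 0..255. Averaging and channel inversion are assumptions made
    for this sketch.
    """
    n = len(pixels)
    avg = tuple(round(sum(p[i] for p in pixels) / n) for i in range(3))
    return tuple(255 - c for c in avg)
```

For a pure red area this gives cyan, so the caption text contrasts with the background behind it.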

The caption display method described above may also be implemented as a program that can be executed by various computer means. In this case, the program for performing the caption display method is stored on a computer-readable recording medium such as a hard disk, CD-ROM, DVD, ROM, RAM, or flash memory.

Those skilled in the art to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical spirit or essential features.

Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

FIG. 1 is a schematic block diagram of a video playback device according to an embodiment of the present invention.

FIG. 2 is a view showing a method of determining the speaker based on whether the feature points of a detected face change.

FIGS. 3 and 4 are views showing caption display candidate areas.

FIG. 5 is a view showing caption positions according to audio information and video information.

FIG. 6 is a view showing a video in which captions are displayed according to an embodiment of the present invention.

FIG. 7 is a flowchart showing a method of displaying a caption of a video according to an embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

100: video playback device

110: receiver

120: time information detector

130: image analyzer

140: speaker determination unit

150: caption position determination unit

155: voice analysis unit

160: video output unit

Claims

A method of displaying a caption of a video, the method comprising:

Obtaining, from the caption to be displayed in the video, time information indicating when the caption is to be displayed;

Detecting a face from the image frames corresponding to the time information among the image frames constituting the video;

Determining a speaker for the image frames corresponding to the time information based on whether a feature point of the face detected in each of those image frames changes;

Determining a range of the speaker's face area in each of the image frames corresponding to the time information by using the determined speaker's face; and

Adaptively determining an area in which the caption is to be displayed according to the range of the speaker's face area, and displaying the caption in the determined area.

The method of claim 1, wherein displaying the caption comprises:

Setting, as candidate areas in which the caption may be displayed, areas within a predetermined distance of the range of the speaker's face area; and

Determining, as the area in which the caption is to be displayed, the candidate area having the smallest color deviation and the highest saturation among the candidate areas.

The method of claim 1, wherein, in displaying the caption,

The caption is displayed in a color complementary to the color of the area in which the caption is to be displayed.

The method of claim 1,

Further comprising, before displaying the caption, analyzing the audio information of the video to determine the spatial position at which the voice is generated,

Wherein the area in which the caption is to be displayed is determined using the spatial position of the voice as well.

The method of claim 4, wherein

The spatial position of the voice is determined using at least one of the difference between the voice signals output for each channel and the voice frequency band.