KR20120029247A

KR20120029247A - Apparatus and method for detecting output error of audiovisual information of video contents

Info

Publication number: KR20120029247A
Application number: KR1020100091221A
Authority: KR
Inventors: 장성환; 권재철; 이강식; 진영민
Original assignee: 주식회사 케이티
Priority date: 2010-09-16
Filing date: 2010-09-16
Publication date: 2012-03-26
Also published as: KR101462249B1

Abstract

PURPOSE: An apparatus and a method for detecting output errors in the audiovisual information of video content are provided to reduce synchronization errors between subtitles and audio. CONSTITUTION: A voice extraction unit (220) checks the start time and the end time of a voice signal in the audio signal of the video content. A subtitle detection unit (230) detects a character string in a designated subtitle area of the video content. A subtitle synchronization determination unit (240) compares the start time with a start flag and the end time with an end flag.

Description

Apparatus and method for detecting audio visual information output error of video content {APPARATUS AND METHOD FOR DETECTING OUTPUT ERROR OF AUDIOVISUAL INFORMATION OF VIDEO CONTENTS}

The present invention relates to an apparatus and method for detecting errors in which the audiovisual information of video content is output abnormally. More particularly, it relates to an apparatus and method for detecting a mismatch between the timing at which subtitles and voice are output from video content, and a mismatch between the timing at which video and voice are output.

Information delivery media are becoming increasingly diverse and abundant in today's rapidly changing society. In the past, such media relied on text-centered presentation, but today video, which stimulates the user's sight and hearing at the same time, is the most popular means of delivering information.

In particular, as technologies such as TV, computers, and the Internet have become widespread, it has become very easy to encounter information provided in the form of video. Recently, as video encoding technology has advanced, even high-quality video files of tens to hundreds of megabytes have become readily accessible.

In addition, equipment for capturing video content has also advanced, so that mobile phone cameras, digital cameras, camcorders, and the like are widely used even by non-experts. As a result, a wide variety of video content such as UCC and movies is produced continuously, and this trend is expected to accelerate. At the same time, data storage media such as hard disk drives and memory cards are becoming larger in capacity and cheaper.

The media playback terminals on which video content can be watched have also diversified to include mobile phones, smartphones, PMPs, mobile DMB terminals, laptops, and car navigation systems. Video content used to be delivered only by terrestrial TV broadcasting, but today it is delivered in various ways, including IPTV, satellite broadcasting, and digital cable broadcasting.

Meanwhile, before produced video content is served through a medium such as IPTV, the content itself must be inspected for defects. This is because, when video content contains a subtitle or lip-sync error, viewers direct their complaints at the content service provider rather than at the content producer.

FIG. 1 is a diagram illustrating a conventional method of detecting audiovisual output information errors in video content.

As shown in FIG. 1, errors in subtitle information and voice information, which are the audiovisual output information of video content, are generally detected by inspection that relies mostly on the human eye. However, it is practically impossible for a company that provides video content to detect every error in the hundreds of titles that are distributed each day.

Moreover, even if the video content is played back at high speed to check for errors, the conventional approach requires a great deal of time and manpower and is costly. In addition, because the errors are detected manually, results vary widely with the inspector's ability, and errors are not completely eliminated even after the visual inspection is finished.

To solve the above problems of the prior art, some embodiments of the present invention provide an apparatus and method for detecting subtitle display errors by detecting whether a subtitle is present in the region of the video content where subtitles are located and by extracting the voice.

In addition, some embodiments of the present invention provide an apparatus and method for detecting voice output errors by using face recognition technology to analyze the mouth shape of a face in the video content and by extracting the voice.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the following description.

As a technical means for achieving the above technical object, a first aspect of the present invention may provide a subtitle display error detection apparatus including: a voice extraction unit that checks the start time and end time of a voice signal using frequency information from the audio signal of the video content; a subtitle detection unit that detects whether a character string is present in a predetermined subtitle area of the video content, sets a start flag on the N-th video frame in which the character string is first detected, and, from the (N+1)-th video frame onward, detects whether the character string in the subtitle area changes relative to the N-th video frame and sets an end flag on the first video frame in which the character string is no longer detected in the subtitle area; and a subtitle synchronization determination unit that compares the start time with the start flag and the end time with the end flag, respectively, to determine whether the difference exceeds a predetermined first subtitle error setting time.

In addition, a second aspect of the present invention may provide a voice sync error detection apparatus including: a voice extraction unit that checks the start time and end time of the voice signal using frequency information in the audio signal of the video content; an image recognition unit that identifies, by face recognition technology, the mouth shape of a speaker appearing in the video content and checks the start time of the first video frame in which the mouth opens; and a voice synchronization determination unit that compares the start time of the voice signal extracted by the voice extraction unit with the start time of the first video frame checked by the image recognition unit to determine whether the difference exceeds a predetermined first voice error setting time.

To achieve the above object, a third aspect of the present invention may provide a subtitle display error detection method including: (a) extracting a voice signal using frequency information from the audio signal of the video content; (b) extracting the start time and end time of the voice signal extracted from the video content; (c) checking whether a character string exists in a predetermined subtitle area of the N-th video frame; (d) if a character string exists in the subtitle area, setting a start flag on the N-th video frame and setting an end flag on the (N+M)-th video frame, in which the character string disappears from the subtitle area; and (e) comparing the start time and end time of the voice signal with the start flag and end flag, respectively, to identify a subtitle display error.

In addition, a fourth aspect of the present invention may provide a voice sync error detection method including: (a) extracting a voice signal using frequency information from the audio signal of the video content; (b) identifying, by face recognition technology, the mouth shape of a speaker appearing in the N-th video frame and checking the start time of the first video frame in which the mouth opens; and (c) comparing the start time of the voice signal with the checked start time of the first video frame to identify a voice sync error.

Specific details for achieving the above objects will become clear with reference to the embodiments described in detail below together with the accompanying drawings.

However, the present invention is not limited to the embodiments disclosed below and may be embodied in various different forms. These embodiments are provided so that the disclosure of the present invention is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which the invention pertains.

According to one of the above means for solving the problem, the audiovisual output information error detection apparatus of the present invention can detect subtitle display errors and voice sync errors in video content.

In addition, according to the above means for solving the problem, subtitle display errors and voice sync errors in video content can be detected without dedicated inspection personnel, or in less time, which reduces cost.

In addition, according to the above means for solving the problem, subtitle display errors and voice sync errors in video content can be detected more accurately, so the invention can also be used for content integrity verification and quality verification.

In addition, according to the above means for solving the problem, the invention can be widely used in the video content encoding, transcoding, and format conversion processes currently performed by video content service providers, so that more video content can be produced and distributed.

FIG. 1 is a diagram illustrating a conventional method of detecting audiovisual output information errors in video content.
FIG. 2(a) is a block diagram illustrating a subtitle display error detection apparatus according to one embodiment of the present invention, and FIG. 2(b) is a block diagram illustrating a voice sync error detection apparatus according to another embodiment of the present invention.
FIG. 3 is a block diagram illustrating an audiovisual output information error detection apparatus according to yet another embodiment of the present invention.
FIGS. 4(a) and 4(b) are diagrams illustrating the area in which subtitles of video content are located according to one embodiment of the present invention.
FIG. 5 is a diagram illustrating an example in which the video frame itself contains subtitle-like text according to one embodiment of the present invention.
FIG. 6 is a diagram illustrating an example in which a subtitle has been added to a video frame according to one embodiment of the present invention.
FIG. 7 is a diagram illustrating a method of detecting subtitle errors according to one embodiment of the present invention.
FIG. 8 is a diagram illustrating a mouth shape recognition method using face recognition technology according to another embodiment of the present invention.
FIG. 9 is a diagram illustrating a method of detecting errors between the video and voice of video content according to another embodiment of the present invention.
FIGS. 10 and 11 are flowcharts illustrating a subtitle display error detection method according to one embodiment of the present invention.
FIG. 12 is a flowchart illustrating a voice sync error detection method according to another embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. In the drawings, parts irrelevant to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.


Hereinafter, specific details for practicing the present invention are described with reference to the accompanying block diagrams and process flowcharts.

FIG. 2(a) is a block diagram illustrating a subtitle display error detection apparatus according to one embodiment of the present invention, and FIG. 2(b) is a block diagram illustrating a voice sync error detection apparatus according to another embodiment of the present invention.

As shown in FIG. 2(a), the subtitle display error detection apparatus 200 according to one embodiment of the present invention may include a video content input unit 210, a voice extraction unit 220, a subtitle detection unit 230, and a subtitle synchronization determination unit 240.

In the present invention, a "subtitle display error" means that the subtitle corresponding to a particular utterance is not displayed on the screen at the appropriate moment. For example, the native-language or foreign-language subtitle for some of the lines spoken by a speaker may have been inadvertently omitted while the video is being output.

For example, subtitle display errors include the case where no subtitle is displayed when the mouth of a speaker in the video moves; the case where no subtitle is displayed when the speaker's voice is output; the case where a subtitle, despite being long, disappears too quickly for the viewer to finish reading it; and the case where a subtitle stays on screen even after a viewer of ordinary reading ability would have finished reading it. The term also covers the case where the subtitles are not merely early or late but were never entered at all, that is, where the video content carries no subtitles; this case may instead be managed separately as "no subtitle" rather than being treated as a subtitle display error.

Turning to each component, the content input unit 210 may receive video content data from a device connected to the subtitle display error detection apparatus 200 by wire or wirelessly. The input video content may be stored in a database (not shown) for use in practicing the present invention.

Here, video content refers to ordinary video material composed of scenes of various kinds. Speakers of a language, such as foreign-language speakers or Korean speakers, may appear in the video content. Subtitles may also be provided in the video content for a particular purpose. For example, Korean subtitles may be provided for each scene so that hearing-impaired viewers can follow the content; Korean subtitles may be provided so that a dialect that is hard for standard-language speakers to understand can be followed easily; or suitably summarized Korean subtitles may be provided when a speaker discusses specialized or difficult material.

The voice extraction unit 220 may extract a voice signal using the frequency information of the video content received by the content input unit 210, and may check the start time and end time for each frame of the video content. The extracted voice data is needed to determine whether, in the portion where that voice is output, the subtitle appears at the proper time or with a time offset.

If voice is output but no subtitle exists, the case may be classified as "no subtitle" and managed separately rather than being treated as a subtitle display error. A portion managed separately in this way may go through subtitle insertion and then be submitted again to the subtitle display error detection process of the present invention.

Here, the voice is an audio signal extracted from the video content data and may be human voice data. In addition, for video content in which a sound-making subject other than a human appears, such as an animal, alien, or robot in an animal documentary, animation, or science fiction film, voice data can be extracted from that subject and used, whatever the subject is. The description below uses voice data extracted from a human as the example.

The audio data is extracted by exploiting the fact that its frequency differs from that of other data. For example, the voice frequency band produced by a typical speaker's vocal cords lies between about 300 Hz and about 3.5 kHz. Therefore, extracting signals of this kind in this frequency band yields the audio data needed by the invention. The voice extraction method is not limited to a particular method, and various methods may be used. Furthermore, the content of the extracted voice data need not be analyzed; it suffices to determine whether voice is present and when it is present.
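The band-based extraction described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of the specification: it assumes mono samples in a Python list, measures per frame how much spectral energy falls in the 300 Hz to 3.5 kHz band with a naive DFT, and treats frames above a hypothetical threshold as voiced. The names `band_energy_ratio` and `voice_intervals`, the frame length, and both thresholds are assumptions introduced here.

```python
import cmath
import math

VOICE_BAND_HZ = (300.0, 3500.0)  # band quoted in the text


def band_energy_ratio(frame, sample_rate, band=VOICE_BAND_HZ):
    """Share of the frame's spectral energy that lies inside `band` (naive DFT)."""
    n = len(frame)
    total = 0.0
    in_band = 0.0
    for k in range(1, n // 2):  # skip DC; positive-frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power = abs(coeff) ** 2
        total += power
        if band[0] <= k * sample_rate / n <= band[1]:
            in_band += power
    return in_band / total if total > 0 else 0.0


def voice_intervals(samples, sample_rate, frame_len=256,
                    ratio_threshold=0.6, min_power=1e-4):
    """Return (start_s, end_s) pairs for runs of frames judged to contain voice.

    ratio_threshold and min_power are hypothetical tuning values.
    """
    intervals = []
    start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        voiced = (power >= min_power and
                  band_energy_ratio(frame, sample_rate) >= ratio_threshold)
        t = f * frame_len / sample_rate
        if voiced and start is None:
            start = t          # voice begins in this frame
        elif not voiced and start is not None:
            intervals.append((start, t))  # voice ended before this frame
            start = None
    if start is not None:
        intervals.append((start, n_frames * frame_len / sample_rate))
    return intervals
```

Only the presence and timing of voice are reported, which matches the remark that the content of the voice need not be analyzed. A production system would more likely use an FFT library and a dedicated voice activity detector.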

The subtitle detection unit 230 may detect whether a character string is present in a predetermined subtitle area of the video content data, set a start flag on the first video frame in which a character string is first detected, detect the subtitle area from the next video frame onward using the difference from the previous video frame, and set an end flag on the video frame in which no character string is present in the subtitle area.
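The flag-setting behaviour can be sketched as below. The sketch assumes that some text detector, which is outside its scope, has already produced one string per frame for the subtitle area (an empty string when nothing is detected). The function name `find_subtitle_spans` is hypothetical, and closing a span when the text changes is an interpretation adopted here rather than something the specification spells out.

```python
def find_subtitle_spans(text_per_frame):
    """Return (start_frame, end_frame) pairs for each subtitle.

    text_per_frame[i] is the string detected in the subtitle area of frame i,
    or "" when nothing was detected. start_frame plays the role of the start
    flag; end_frame is the first frame in which that string is no longer
    present and plays the role of the end flag.
    """
    spans = []
    start = None
    current = ""
    for index, text in enumerate(text_per_frame):
        if start is None:
            if text:
                start, current = index, text      # string first detected
        elif text != current:
            spans.append((start, index))          # string gone or replaced
            if text:
                start, current = index, text      # a new subtitle starts here
            else:
                start, current = None, ""
    if start is not None:
        # Subtitle still on screen when the clip ends.
        spans.append((start, len(text_per_frame)))
    return spans
```

For instance, `find_subtitle_spans(["", "Hi", "Hi", ""])` yields a single span starting at frame 1 and ending at frame 3.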

Here, the subtitle area is the part of the screen where subtitles are displayed. It may be anywhere, such as the bottom, top, right side, or left side of the screen, but specifying the area in which subtitles appear increases the subtitle recognition rate. This embodiment is described with the subtitle area located in the lower quarter of the image, which is where subtitles are usually displayed in ordinary video content on media such as film, TV, Blu-ray, DVD, CD, and VHS tape.

FIGS. 4(a) and 4(b) are diagrams illustrating the area in which subtitles of video content are located according to one embodiment of the present invention.

FIG. 4(a) is an example of an SD (Standard-Definition) screen 410 used in video display devices such as standard-definition digital televisions. Since an SD screen usually has a resolution of 720 x 480, the region extending about 120 pixels up from the bottom of the screen, which is one quarter of the vertical size, may be set as the first subtitle area 415.

FIG. 4(b) is an example of an HD (High-Definition) screen 420 used in video display devices such as high-definition televisions. Since an HD screen usually has a resolution of 1920 x 1080, the region extending about 270 pixels up from the bottom of the screen, which is one quarter of the vertical size, may be set as the second subtitle area 425.
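The arithmetic behind the two subtitle areas (480 / 4 = 120 and 1080 / 4 = 270) can be written as a small helper. The function name `subtitle_region` and the (left, top, right, bottom) convention with the origin at the top-left corner are assumptions made for this sketch.

```python
def subtitle_region(width, height, fraction=0.25):
    """Return (left, top, right, bottom) of the bottom `fraction` of a frame.

    Coordinates are in pixels with the origin at the top-left corner.
    """
    region_height = round(height * fraction)
    return (0, height - region_height, width, height)


sd_area = subtitle_region(720, 480)     # bottom 120 rows of an SD frame
hd_area = subtitle_region(1920, 1080)   # bottom 270 rows of an HD frame
```

Keeping `fraction` as a parameter reflects the statement that the subtitle area could equally be defined elsewhere or with a different size.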

FIG. 5 is a diagram illustrating an example in which the video frame itself contains subtitle-like text according to one embodiment of the present invention.

FIG. 5 shows a case in which a text-shaped picture 510 reading "Arrived in Uljin, Gyeongbuk, the home of snow crab!" has been inserted into the video frame itself. Unlike the case described above, in which a subtitle is inserted in the subtitle area of the image, this is an example in which the video frame itself contains another image in the form of a subtitle. In such a case, therefore, when checking whether the subtitle area contains a subtitle, the subtitle detection unit 230 may determine that there is no subtitle and perform no subtitle detection. Such a text-shaped picture 510 is also distinct from a subtitle that represents the speaker's voice in the subtitle area and happens to be inserted in lettering as large as the picture 510. Other examples of such pictures include phrases that summarize what a person is saying, or the situation, in an accessible and entertaining way.

To determine whether a subtitle is present in each of the subtitle areas of FIGS. 4(a) and 4(b), the subtitle detection unit 230 may first distinguish whether the text on screen is lettering that belongs to the video content itself or a subtitle inserted separately into the image. There are various ways to distinguish a subtitle inserted into the image (hereinafter, "inserted subtitle") from a text-shaped picture (hereinafter, "picture subtitle").

First, the position at which the text appears on screen can be used to judge whether it is an inserted subtitle or a picture subtitle. For example, the apparatus may check whether the text is present only at the bottom of the screen and, if it is instead in the middle or upper part of the screen, judge it to be a picture subtitle. Even when the text is at the bottom, the judgment can be made more accurate by additionally checking whether the characters are laid out in neat alignment or arranged irregularly.

The size of the text on screen can also be used to judge whether it is an inserted subtitle or a picture subtitle. For example, text may be regarded as an inserted subtitle if each character fits within 25 x 25 pixels for SD or within 50 x 50 pixels for HD. However, if the text on screen is too small or its size is not uniform, it may be regarded as not being an inserted subtitle.
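The size criterion can be sketched as a simple predicate over per-character bounding boxes. The 25 and 50 pixel limits come from the text; the lower bound `min_size` and the uniformity tolerance `max_spread` are hypothetical values, since the specification gives no numbers for "too small" or "not uniform". The function name is likewise an assumption.

```python
MAX_CHAR_SIZE = {"SD": 25, "HD": 50}  # per-character limits quoted in the text


def looks_like_inserted_subtitle(char_boxes, video_class,
                                 min_size=8, max_spread=0.25):
    """Size-based check. char_boxes is a list of (width, height) in pixels.

    min_size and max_spread are hypothetical thresholds for "too small"
    and "size not uniform".
    """
    if not char_boxes:
        return False
    limit = MAX_CHAR_SIZE[video_class]
    if any(w > limit or h > limit for w, h in char_boxes):
        return False                      # larger than an inserted subtitle
    heights = [h for _, h in char_boxes]
    if min(heights) < min_size:
        return False                      # too small
    mean_height = sum(heights) / len(heights)
    # Relative spread of character heights as a crude uniformity measure.
    return (max(heights) - min(heights)) / mean_height <= max_spread
```

Character height rather than width is used for the lower bound and the uniformity test because narrow glyphs and punctuation vary widely in width even within one font.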

Whether the text is an inserted subtitle or a picture subtitle can also be judged from its temporal similarity, that is, from how strongly it correlates over time. Specifically, the characters displayed in the subtitle area must stay in the same position, without moving, for at least 1 second (30 frames) to be judged an inserted subtitle.
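A minimal sketch of the persistence test follows. It assumes the text's bounding box is available per frame (or `None` when no text is found) and that the frame rate is 30 fps, as implied by "1 second (30 frames)". The function name and the pixel tolerance are assumptions; comparing box positions is a simplification of the correlation measure mentioned in the text.

```python
def is_temporally_stable(boxes_per_frame, fps=30, min_seconds=1.0,
                         tolerance_px=2):
    """True if the text box stays put for at least `min_seconds`.

    boxes_per_frame holds (left, top, right, bottom) per frame, or None when
    no text was detected. tolerance_px is a hypothetical jitter allowance.
    """
    needed = int(round(fps * min_seconds))
    run = 0
    anchor = None  # box at the start of the current stationary run
    for box in boxes_per_frame:
        if box is None:
            run, anchor = 0, None
            continue
        if anchor is not None and all(abs(a - b) <= tolerance_px
                                      for a, b in zip(anchor, box)):
            run += 1
        else:
            anchor, run = box, 1   # text moved: start a new run here
        if run >= needed:
            return True
    return False
```

Each box is compared with the first box of the run rather than with its predecessor, so slowly scrolling text is not mistaken for a stationary subtitle.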

How far the text on screen is tilted relative to the horizontal direction can also be used to judge whether it is an inserted subtitle or a picture subtitle. This method exploits the fact that inserted subtitles appear repeatedly in the predetermined subtitle area and are therefore generally set with the characters upright, at a right angle to the horizontal direction, so that viewers can read them comfortably.
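One way to quantify the tilt is to fit a straight line through the centres of the detected characters and measure its angle to the horizontal. The specification does not say how the tilt is measured, so the least-squares fit, the function names, and the 3 degree limit below are all assumptions made for illustration.

```python
import math


def baseline_tilt_degrees(char_centers):
    """Angle between the horizontal and the least-squares line through the
    character centres, given as (x, y) pixel coordinates."""
    n = len(char_centers)
    if n < 2:
        raise ValueError("need at least two characters to estimate tilt")
    mean_x = sum(x for x, _ in char_centers) / n
    mean_y = sum(y for _, y in char_centers) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in char_centers)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in char_centers)
    if sxx == 0:
        return 90.0  # characters stacked vertically
    return math.degrees(math.atan2(sxy, sxx))


def is_horizontal(char_centers, max_tilt_deg=3.0):
    """max_tilt_deg is a hypothetical tolerance, not a value from the text."""
    return abs(baseline_tilt_degrees(char_centers)) <= max_tilt_deg
```

Text that passes this check would still need the position, size, and persistence checks described above before being treated as an inserted subtitle.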

The criteria described above may be set by the producer according to the characteristics of the video content; they are not applied uniformly to all video content and may vary from case to case.

In addition, the caption area may be detected in the present invention by various methods and techniques, such as using an edge map, stroke filtering, or using color and luminance values, and the invention is not limited to any specific caption area detection technique.

FIG. 6 is a diagram illustrating an example in which a caption is added to a video frame according to an embodiment of the present invention.

As shown in FIG. 6, a typical inserted caption appears in the caption area 425 at the bottom of the screen. The figure shows a case where the inserted caption is small; a caption character is usually 25 to 50 pixels in both width and height.

The caption synchronization determination unit 240 compares the start time and end time of the voice signal extracted by the voice extraction unit 220 with the start flag and end flag, respectively, of the character string in the caption area detected by the caption detection unit 230, and can determine whether the difference exceeds a predetermined first caption error setting time.

First, to detect a caption error in video content, an algorithm may be used that performs character detection in a predetermined portion at the bottom of the image, for example the region covering one quarter of the vertical size, and compares the result with the voice extracted from the audio data. FIG. 7 illustrates such a method in detail.
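
Restricting character detection to the bottom quarter of the image can be sketched as follows. This is only an illustration: the function names are hypothetical, and a frame is assumed to be a sequence of pixel rows ordered from top to bottom.

```python
def caption_region_rows(frame_height, fraction=0.25):
    """Return (top, bottom) row indices covering the bottom `fraction` of a frame."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    top = frame_height - int(frame_height * fraction)
    return top, frame_height


def crop_caption_region(frame, fraction=0.25):
    """Return the rows of `frame` that fall inside the assumed caption region."""
    top, bottom = caption_region_rows(len(frame), fraction)
    return frame[top:bottom]
```

Character detection would then run only on the cropped rows, which reduces the search area to the part of the image where inserted captions are expected.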

FIG. 7 is a diagram illustrating a method of detecting a caption error according to an embodiment of the present invention.

FIG. 7 shows a method of comparing the voice signal extracted by the voice extraction unit 220 with the start and end of the caption detected by the caption detection unit 230. A start flag is set on the first frame in which the caption area is detected, and from the next frame onward the caption area can be detected using the difference from the previous image. When the caption area ends, an end flag is set, and the algorithm compares the start flag and end flag with the start and end, respectively, of the voice signal extracted by the voice extraction unit 220.

The caption displayed between the caption start flag 710 and the caption end flag 720 coincides with the time during which the voice is output, so section A is an example of a section in which the caption is displayed normally. That is, the start and end of the voice nearly match the start and end of the caption. To distinguish normal captions from erroneous ones, an error tolerance for the start time and end time of the voice can be set in seconds, for example a difference of about 0.5 seconds (15 frames).

Here, the start and end points of the voice signal can usually be found using the PTS (Presentation Time Stamp) of the audio file. The method of calculating the start and end times of the voice signal applicable to the present invention is not limited to a specific method, and various methods may be used.
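
As an illustration of how a PTS value maps to a time that can be compared with frame-based caption flags, the sketch below assumes MPEG-2 systems timestamps, which are 33-bit counters in units of a 90 kHz clock. The text does not name a container format, and other formats use other time bases, so the constants and function names here are assumptions.

```python
PTS_CLOCK_HZ = 90_000   # MPEG-2 systems PTS tick rate (assumed stream type)
PTS_MODULUS = 1 << 33   # PTS is a 33-bit counter and wraps around


def pts_to_seconds(pts, first_pts=0):
    """Seconds elapsed since `first_pts`, tolerating a single wraparound."""
    return ((pts - first_pts) % PTS_MODULUS) / PTS_CLOCK_HZ


def seconds_to_frame(seconds, fps=30.0):
    """Index of the video frame being displayed at `seconds`."""
    return int(seconds * fps)
```

With these helpers, the 0.5 second tolerance mentioned above corresponds to frame index 15 at 30 frames per second.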

Sections B and C are examples of caption error sections, in which erroneous captions appear.

Section B is the case where a voice signal is output but there is no caption. The caption synchronization determination unit 240 can therefore detect a caption error if the time during which no caption is displayed, despite the presence of a voice signal, continues beyond a predetermined threshold. This threshold time can be set to about 2 seconds, for example, and may be set differently depending on the type of video content. In other words, an algorithm may be used that, when a predetermined time difference (for example, 2 seconds) arises between the start of the extracted voice signal and the start flag of the caption area, determines a caption display error in which the voice signal is present but no caption is displayed.

Section C, in contrast to section B, is an example in which there is no voice signal but a caption is displayed: a caption is shown between the caption start flag 730 and the caption end flag 740, yet no voice signal is output. This case can also be treated as a caption error. Such caption errors occur with a probability of roughly less than 1% in a piece of video content and, in terms of caption volume, amount to one or two caption phrases. It can therefore be efficient in some cases to use an algorithm that gives more weight to section-B caption errors, which occur more frequently, because otherwise more caption errors may be detected than actually exist. Accordingly, in the present invention the algorithm for detecting section-B caption errors is included, while whether to include detection of section-C caption errors may be decided depending on the situation.
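
The comparisons for sections A, B, and C can be sketched as below. This is an illustrative reading of the description, not the patented algorithm: segments are assumed to be `(start, end)` pairs in seconds, all function names are hypothetical, and matching on start times alone is a simplification. The 0.5 second tolerance and 2 second threshold are the example values from the text, and section-C detection is optional, as the text suggests.

```python
def is_normal(speech, caption, tolerance=0.5):
    """Section A: both start and end of speech and caption agree within tolerance."""
    return (abs(speech[0] - caption[0]) <= tolerance
            and abs(speech[1] - caption[1]) <= tolerance)


def unmatched(segments, others, threshold=2.0):
    """Segments whose start has no start in `others` within `threshold` seconds."""
    return [seg for seg in segments
            if all(abs(seg[0] - other[0]) > threshold for other in others)]


def find_caption_errors(speech, captions, threshold=2.0, include_section_c=False):
    """Return ("B", segment) for speech without a caption and, optionally,
    ("C", segment) for a caption without speech."""
    errors = [("B", seg) for seg in unmatched(speech, captions, threshold)]
    if include_section_c:
        errors += [("C", seg) for seg in unmatched(captions, speech, threshold)]
    return errors
```

Keeping `include_section_c` off by default mirrors the suggestion that section-C detection be enabled only when the situation calls for it.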

Next, the voice sync error detection apparatus 250 according to an embodiment of the present invention is described with reference to FIG. 2(b). Components already described for FIG. 2(a) are given the same reference numerals, and redundant description is omitted.

As shown in FIG. 2(b), the voice sync error detection apparatus 250 according to another embodiment of the present invention may include a video content input unit 210, a voice extraction unit 220, an image recognition unit 235, and a voice synchronization determination unit 245.

In the present invention, a voice sync error means that the voice is not output as sound at the proper moment; a lip synchronization (lip sync) error is one such error. Examples include the case where the voice is not output at the same time as the speaker's mouth moves in the video, the case where the voice is output before or after the speaker's mouth moves, the case where no voice is output even though the speaker's mouth moves, and the case where the voice is output even though the speaker's mouth does not move.

When a voice sync error occurs, the video and audio of the video content are out of sync, so once synchronization is lost at one point, the video and audio that follow are affected as well. That is, the error persists from the moment synchronization is lost until the video content ends. Voice sync errors are generally not easy to detect, but an error can be detected if the timing at which the voice starts to be output (the moment speech begins) can be determined accurately.

The content input unit 210 can receive video content data from a device connected to the voice sync error detection apparatus 250 by wire or wirelessly. The voice extraction unit 220 can extract the voice signal using frequency information of the video content received by the content input unit 210, and can check the start time and end time for each frame of the video content.

The image recognition unit 235 can recognize the shape of the speaker's mouth from the video content data using face recognition technology, and can check the start time and end time of the video frame in which the mouth first opens. To determine from the image data whether the voice is output at the proper time, a procedure for analyzing the face of the speaker in the image data is therefore required. This is described with reference to FIG. 8.

FIG. 8 is a diagram illustrating a mouth shape recognition method using face recognition technology according to another embodiment of the present invention.

As shown in FIG. 8, the image recognition unit 235 can recognize the speaker's face in the whole image, distinguishing it from the background and other body parts, and can separately recognize the shape of the mouth within the face. The face and mouth can be distinguished from the rest of the image by comparing the graphics in the image against a human face template (for example, a circle with eye, nose, and mouth shapes) and typical human skin color. Such technology is widely used in digital cameras, CCTV, face recognition software, and the like, so various methods can be used in the present invention without being limited to a specific face recognition technique.

For example, face recognition identifies the speaker's face 810 and the speaker's mouth 820. The moment at which the speaker's mouth, initially closed in a straight line (shaped like the character 一), opens can be taken as the moment speech begins. The frame in which the mouth opens is regarded as the speech start timing, and it can be compared with the timing at which the voice signal extracted from the corresponding audio data begins. A method of comparing these timings is shown in FIG. 9.

FIG. 9 is a diagram illustrating a method of detecting an output error between the image and the voice of video content according to another embodiment of the present invention.

The voice synchronization determination unit 245 compares the start time and end time of the voice signal extracted by the voice extraction unit 220 with the start time and end time, respectively, of the video frame detected by the image recognition unit 235, and can determine whether the difference exceeds a predetermined first voice error setting time.

As described above, an algorithm can be built that recognizes the face and mouth in the image, detects the time at which the mouth, initially closed in a straight line, begins to open, detects the start time of the extracted voice signal, and determines a voice sync error when the difference between the two falls outside a certain reference time (for example, 165 msec, or 5 frames).

FIG. 9 shows the offset between the start time of the extracted voice signal and the speaker's speech start time. If the voice leads or lags by more than a predetermined reference time, that is, the threshold time, it can be regarded as a voice sync error. The threshold time may be, for example, plus or minus 165 msec (5 frames).

Section D 910 is the case where the difference between the start time of the extracted voice and the speech start time exceeds the threshold time mentioned above but is less than a predetermined time, for example 2 seconds (60 frames). In this case the voice synchronization determination unit 245 determines a voice sync error. In section E 920, by contrast, the difference is 2 seconds (60 frames) or more, which is too large to be called a voice sync error. In this case the voice synchronization determination unit 245 can judge it more likely that the extracted voice signal is the voice of someone other than the speaker than that a voice sync error has occurred.

In addition, since it is also possible that the voice synchronization determination unit 245 has misrecognized the speaker's speech start time, a difference of 2 seconds or more may be excluded from voice sync errors. That is, an algorithm may be implemented that does not determine a voice sync error when the first moment the mouth begins to move, that is, the speech start time, and the start time of the extracted voice signal differ by a predetermined reference time, for example 2 seconds, or more.
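
The two-threshold decision for sections D and E can be sketched as follows. The function name and the returned labels are hypothetical, and the default values are the examples from the text. Note that 5 frames at 30 frames per second is about 166.7 msec, which the text rounds to 165 msec.

```python
SYNC_TOLERANCE_S = 0.165   # example first threshold (about 5 frames at 30 fps)
MAX_OFFSET_S = 2.0         # example second threshold (60 frames at 30 fps)


def classify_voice_offset(mouth_open_time, voice_start_time,
                          tolerance=SYNC_TOLERANCE_S, max_offset=MAX_OFFSET_S):
    """Classify the offset between mouth opening and voice start, in seconds.

    Returns "in_sync" within the tolerance, "sync_error" for section D, and
    "not_counted" for section E, where the voice is more likely another
    person's or the speech start was misrecognized.
    """
    magnitude = abs(voice_start_time - mouth_open_time)
    if magnitude <= tolerance:
        return "in_sync"
    if magnitude < max_offset:
        return "sync_error"
    return "not_counted"
```

Taking the absolute value treats a voice that leads the mouth movement the same as one that lags it, matching the plus or minus threshold described above.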

In addition, the face recognition technology described above works only within a limited range of face angles, so its recognition rate is limited, and the rate inevitably drops further once the speaker starts speaking. A voice sync error, however, is not a one-off error: once it occurs, it affects every voice signal until playback of the video content ends, so all voice sync is off. Therefore, even if the recognition rate is low at the moment the speaker begins to speak, the problem can be solved by recognizing the face and mouth shape in any later image, after the voice sync error occurs, in which face recognition is easy.

FIG. 3 is a block diagram illustrating an audiovisual output information error detection apparatus 300 according to still another embodiment of the present invention.

Combining the caption display error detection apparatus 200 and the voice sync error detection apparatus 250 described above yields the audiovisual output error detection apparatus 300 shown in FIG. 3. The video content received by the content input unit 210 is divided into voice data, caption data, and image data, which are analyzed by the voice extraction unit 220, the caption detection unit 230, and the image recognition unit 235, respectively. The caption synchronization determination unit 240 uses the data extracted by the voice extraction unit 220 and the caption detection unit 230 to determine whether the captions are displayed in step with the voice, and the voice synchronization determination unit 245 uses the data extracted by the voice extraction unit 220 and the image recognition unit 235 to determine whether the voice is in sync with the image. The caption synchronization determination unit 240 and the voice synchronization determination unit 245 can then output corrected video content in which the captions and the image are output at the proper timing relative to the voice signal that both units use. Each component is as described above, so redundant description is omitted.

FIGS. 10 and 11 are flowcharts illustrating a caption display error detection method according to an embodiment of the present invention.

First, in step S1010, the caption display error detection apparatus 200 receives video content, and in step S1020 it extracts a voice signal from the received video content.

In step S1020, the caption display error detection apparatus 200 starts playback from the N-th frame of the video content and checks whether a caption is detected in the caption area of that frame. If no caption is detected, it moves to the next frame, the (N+1)-th frame (S1045); if a caption is detected, it sets to 1 a flag indicating that a caption starts at that frame (S1050).

In step S1060, the frame on which the caption start is set is compared with the next frame in sequence. If the caption in the caption area has not changed, the apparatus moves to the next frame and again compares that frame with the one following it. Captions can be detected by repeating this comparison. Because captions usually carry dialogue, they can persist for several seconds, and typical video content runs at 30 frames per second. A caption, once detected, usually stays unchanged for about 30 to 150 frames.

The presence or absence of a caption can be detected while comparing consecutive frames, as described above, by computing the difference between the caption area of the N-th frame and that of the (N+1)-th frame and then applying any of various algorithms.

In step S1080, after detecting that the caption has ended, the caption display error detection apparatus 200 sets a flag to 1 to indicate the end of the caption. Through the iteration described above, the caption is taken to end at the (N+M)-th video frame, in which the character string disappears from the caption area, and the end flag can be set there. The flags marking the start and end of the caption are needed for comparison with the start and end times of the voice extracted from the audio.
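
The frame-by-frame flag setting of steps S1050 through S1080 can be sketched as a single pass over per-frame detection results. The sketch assumes a hypothetical upstream detector that has already produced, for each frame, the character string found in the caption area (or `None` when there is none). It also assumes that a direct change from one string to another ends one caption and starts the next on the same frame, a case the text does not spell out.

```python
def find_caption_flags(frame_texts):
    """Return (start_frame, end_frame, text) for each caption run.

    start_frame is the first frame showing the string (start flag);
    end_frame is the first frame where it is no longer shown (end flag).
    """
    flags = []
    current = None   # string currently on screen, if any
    start = None     # frame index where `current` first appeared
    for index, text in enumerate(frame_texts):
        text = text or None          # treat "" the same as no caption
        if text != current:
            if current is not None:
                flags.append((start, index, current))
            current, start = text, index
    if current is not None:          # caption still showing at the last frame
        flags.append((start, len(frame_texts), current))
    return flags
```

Dividing the returned frame indices by the frame rate gives times that can be compared with the start and end of the extracted voice.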

In step S1090, the caption display error detection apparatus 200 checks whether the current frame is the last frame. If it is, playback of the video content ends (S1100); otherwise the apparatus moves on to the next frame (S1095) and determines whether a caption is detected in the caption area of that frame.

In step S1110, playback of the extracted voice signal and the flagged video content starts simultaneously. The apparatus then checks whether a flag indicating the presence of a caption is set within a predetermined time after the voice signal starts (S1120). If no flag is set, a synchronization error is checked at that position (S1135), and playback continues, checking whether a caption flag is set within the given time after each voice signal starts.

In step S1130, when a caption flag is set within the given time after the voice signal starts, the caption display error detection apparatus 200 determines whether a voice start signal occurs within a certain time after the caption flag is found. If the voice start signal lies within that time, a synchronization error is checked at that position (S1135); if it does not, it is determined whether playback of the video content is complete (S1140). If playback is not complete, it continues and the process returns to step S1120; if it is complete, playback ends.

FIG. 12 is a flowchart illustrating a voice sync error detection method according to another embodiment of the present invention.

In step S1210, the voice sync error detection apparatus 250 receives video content, and it extracts a voice signal from the video content (S1220). The extracted voice signal and the video content then start playing simultaneously (S1230), and the time at which the speaker shown on screen begins to speak, that is, the speech start time, is checked (S1240).

In step S1250, it is checked whether the difference between the speech start time and the start time of the voice signal exceeds a predetermined first voice error setting time. If it does, it is further checked whether the difference is within a predetermined second voice error setting time (S1260).

In step S1270, if the difference between the speech start time and the start time of the voice signal is within the second voice error setting time, the voice sync error detection apparatus 250 marks a synchronization error at that position. Playback of the video content is then stopped (S1280).

In step S1290, when the difference between the speech start time and the start time of the voice signal exceeds the predetermined first voice error setting time, the voice sync error detection apparatus 250 determines whether playback of the video content is complete. If playback is not complete, it continues and the process is executed again from step S1240; if it is complete, execution ends.

A method of detecting audiovisual output information errors that performs both the caption display error detection method and the voice sync error detection method is not shown in the drawings. However, how to derive it by combining the two methods is obvious to those skilled in the art, so its description is omitted.

Each embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. Computer-readable media can be any available media that can be accessed by a computer and include volatile and nonvolatile media as well as removable and non-removable media. Computer-readable media may also include both computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically include computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave, or another transport mechanism, and include any information delivery media.

Although the methods and apparatus of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present invention is illustrative, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

200: caption display error detection apparatus
210: content input unit
220: voice extraction unit
230: caption detection unit 235: image recognition unit
240: caption synchronization determination unit 245: voice synchronization determination unit
250: voice sync error detection apparatus
300: audiovisual output error detection apparatus

Claims

An apparatus for detecting a caption display error of video content, the apparatus comprising:
a voice extraction unit which checks a start time and an end time of a voice signal using frequency information from an audio signal of the video content;
a caption detection unit which detects whether a character string is present in a predetermined caption area of the video content, sets a start flag on the N-th video frame in which the character string is first detected, compares the caption area of the N-th video frame with that of the (N+1)-th video frame to detect whether the character string has changed, and sets an end flag on the first video frame in which the character string is no longer detected in the caption area; and
a caption synchronization determination unit which compares the start time with the start flag and the end time with the end flag to determine whether a first caption error setting time is exceeded.

The apparatus of claim 1,
wherein the caption detection unit classifies the caption area according to the position at which the character string is displayed.

The apparatus of claim 1,
wherein the subtitle detection unit classifies the subtitle area according to the size of the character string.

The apparatus of claim 1,
wherein the subtitle detection unit classifies the subtitle area according to the temporal similarity with which the character string is displayed.

The apparatus of claim 1,
wherein the subtitle detection unit classifies the subtitle area according to the angle at which the character string is inclined with respect to the horizontal direction.

The apparatus of claim 1,
wherein the subtitle synchronization determination unit determines a subtitle error in which the character string is absent when the time difference between the start time of the voice signal and the start flag of the subtitle area exceeds the first subtitle error setting time.

An apparatus for detecting a voice sync error of video content, the apparatus comprising:
a voice extraction unit which identifies a start time and an end time of a voice signal using frequency information in the audio signal of the video content;
an image recognition unit which identifies the mouth shape of a speaker appearing in the video content using face recognition technology and identifies the start time of the first video frame in which the mouth is open; and
a voice synchronization determination unit which compares the start time of the voice signal extracted by the voice extraction unit with the start time of the first video frame identified by the image recognition unit to determine whether a first predetermined voice error setting time is exceeded.

The apparatus of claim 7,
wherein the voice synchronization determination unit does not determine a voice error when the time difference between the start time of the voice signal and the start time of the first video frame exceeds the first voice error setting time.

A method of detecting a subtitle display error of video content, the method comprising:
(a) extracting a voice signal from the audio signal of the video content using frequency information;
(b) extracting a start time and an end time of the voice signal extracted from the video content;
(c) checking whether a character string exists in a predetermined subtitle area of the N-th video frame;
(d) setting a start flag at the N-th video frame when the character string exists in the subtitle area, and setting an end flag at the (N+M)-th video frame in which the character string disappears from the subtitle area; and
(e) checking for a subtitle display error by comparing the start time and the end time of the voice signal with the start flag and the end flag, respectively.
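
Step (e) of the method above can be sketched in Python as a comparison of two pairs of timestamps. This is a hedged illustration only: the function name `subtitle_display_error`, the conversion of flag frames to seconds through a frame rate `fps`, the single tolerance `max_offset` (standing in for a subtitle error setting time), and the decision to report an error when a flag is missing are assumptions not stated in the claim.

```python
from typing import Optional


def subtitle_display_error(
    voice_start: float,
    voice_end: float,
    start_flag_frame: Optional[int],
    end_flag_frame: Optional[int],
    fps: float,
    max_offset: float,
) -> bool:
    """Return True when subtitle timing deviates from the voice timing.

    Times are in seconds; flags are frame indices (hypothetical units).
    """
    if start_flag_frame is None or end_flag_frame is None:
        # Assumption: speech without a matching subtitle flag is an error.
        return True
    subtitle_start = start_flag_frame / fps  # time of the start flag
    subtitle_end = end_flag_frame / fps      # time of the end flag
    start_late = abs(subtitle_start - voice_start) > max_offset
    end_late = abs(subtitle_end - voice_end) > max_offset
    return start_late or end_late


in_sync = subtitle_display_error(1.0, 3.0, 12, 31, fps=10.0, max_offset=0.5)
late_start = subtitle_display_error(1.0, 3.0, 25, 31, fps=10.0, max_offset=0.5)
```

Here `in_sync` covers a subtitle that starts 0.2 s after the voice, within the tolerance, while `late_start` covers one that starts 1.5 s after it.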

The method of claim 9,
wherein step (d) comprises:
(d1) determining a synchronization error at a position where the start flag is absent within a first predetermined subtitle error setting time after the start time of the voice signal.

The method of claim 9,
wherein step (d) comprises:
(d2) determining a synchronization error at a position where the end flag is absent within a first predetermined subtitle error setting time after the start time of the voice signal.

The method of claim 9,
wherein the subtitle detection unit classifies the subtitle area according to the position at which the character string is displayed.

The method of claim 9,
wherein the subtitle detection unit classifies the subtitle area according to the size of the character string.

The method of claim 9,
wherein the subtitle detection unit classifies the subtitle area according to the temporal similarity (correlation) with which the character string is displayed.

The method of claim 9,
wherein the subtitle detection unit classifies the subtitle area according to the angle at which the character string is inclined relative to the horizontal direction.

A method of detecting a voice sync error of video content, the method comprising:
(a) extracting a voice signal from the audio signal of the video content using frequency information;
(b) identifying the mouth shape of a speaker appearing in the N-th video frame using face recognition technology and identifying the start time of the first video frame in which the mouth shape occurs; and
(c) checking for a voice sync error by comparing the start time of the voice signal with the start time of the identified first video frame.
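
The comparison in step (c) above can be sketched as follows. The sketch assumes that the face recognition stage has already produced one boolean per video frame indicating whether the speaker's mouth is open; that input format, the function name `voice_video_offset`, and the frame rate parameter `fps` are hypothetical and do not come from the claims.

```python
from typing import Optional, Sequence


def voice_video_offset(
    voice_start: float, mouth_open: Sequence[bool], fps: float
) -> Optional[float]:
    """Return voice start minus the time of the first open-mouth frame.

    A positive value means the voice starts after the mouth opens.
    Returns None when no frame shows an open mouth, in which case no
    comparison is possible (assumption: such clips are left undecided).
    """
    for index, is_open in enumerate(mouth_open):
        if is_open:
            return voice_start - index / fps
    return None


offset = voice_video_offset(1.5, [False, False, True, True], fps=2.0)
```

With two frames per second the mouth first opens at 1.0 s, so a voice starting at 1.5 s gives an offset of 0.5 s. A caller could then compare the magnitude of this offset with a voice error setting time.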

The method of claim 16,
wherein step (c) comprises:
(c1) when the first voice error setting time is exceeded after the start time of the voice signal, determining whether the difference between the conversation start time and the start time of the voice signal is less than or equal to a second predetermined voice error setting time.

The method of claim 16,
wherein step (c1) further comprises:
(c-1) determining a synchronization error at the corresponding position of the video content if the difference between the conversation start time and the start time of the voice signal does not exceed the second voice error setting time.
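
The two-threshold logic of steps (c1) and (c-1) above can be read as a three-way classification, sketched below in Python. The function name `classify_voice_offset`, the returned labels, and in particular the "undetermined" outcome for differences beyond the second setting time are assumptions for illustration; the claims state only that a synchronization error is determined when the first setting time is exceeded and the second is not.

```python
def classify_voice_offset(
    voice_start: float,
    conversation_start: float,
    first_limit: float,
    second_limit: float,
) -> str:
    """Classify the gap between voice start and conversation start.

    first_limit and second_limit stand in for the first and second
    voice error setting times (seconds), with first_limit < second_limit.
    """
    difference = abs(conversation_start - voice_start)
    if difference <= first_limit:
        return "in_sync"        # within tolerance, no error
    if difference <= second_limit:
        return "sync_error"     # exceeds first limit but not the second
    # Assumption: beyond the second limit the two events are not paired.
    return "undetermined"


result = classify_voice_offset(10.0, 11.0, first_limit=0.2, second_limit=2.0)
```

In this example the gap is 1.0 s, which exceeds the first limit but not the second, so the sketch reports a synchronization error.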