KR100828166B1

KR100828166B1 - Method of extracting metadata from result of speech recognition and character recognition in video, method of searching video using metadata and record medium thereof

Info

Publication number: KR100828166B1
Application number: KR1020070057478A
Authority: KR
Inventors: 고한석; 정석영; 박수인; 윤종성; 김동준; 박성춘; 하명환; 김건희
Original assignee: 고려대학교 산학협력단; 한국방송공사
Priority date: 2007-06-12
Filing date: 2007-06-12
Publication date: 2008-05-08

Abstract

A method of extracting metadata through speech recognition and caption recognition of a video, a video search method using the metadata, and a recording medium recording the same are provided. Metadata is extracted from the speech recognition result and the open-caption recognition result, which reduces the time spent on manual archiving and, by automating content management and indexing for large volumes of broadcast data, helps broadcast producers create high-quality content. The method of extracting metadata comprises the steps of: inputting a video including metadata, and extracting a start frame and screen-change frames of the input video (110); displaying the extracted start frame and screen-change frames as thumbnail images, and storing the displayed thumbnail images and their time information (120); recognizing a speaker's speech according to the phonemes of the speech included in the input video, converting the recognized speech data into text data, and extracting keywords from the converted text data (130); extracting captions from the input video through caption recognition (140); when a user designates a start shot and an end shot among the displayed start frame and screen-change frames, extracting metadata and a title from the keywords and captions contained in the start shot, the end shot, and the thumbnail images between them (150); and displaying the extracted metadata, the time information of the start shot, the time information of the end shot, and the title (160).

Description

Method of extracting metadata from result of speech recognition and character recognition in video, method of searching video using metadata and record medium thereof

FIG. 1 is a flowchart of a method of extracting metadata through speech recognition and caption recognition of a video according to an embodiment of the present invention.

FIG. 2 illustrates the continuous speech recognition search space applied to the present invention.

FIG. 3 is a detailed flowchart of the step of extracting keywords from the speech of the video (step 130) of FIG. 1.

FIG. 4 is a detailed flowchart of the step of extracting captions through caption recognition (step 140) of FIG. 1.

FIG. 5 is a detailed flowchart of the step of extracting the start frame and screen-change frames of the video (step 110) of FIG. 1.

FIG. 6 is a detailed flowchart of the step of extracting the metadata and the title (step 150) of FIG. 1.

FIG. 7 is a flowchart of a video search method using metadata according to another embodiment of the present invention.

FIG. 8 illustrates a graphical user interface environment for video data stored as an XML document.

FIG. 9 illustrates a search window for video data stored as an XML document.

The present invention relates to data extraction, and more particularly to a method of extracting metadata through speech recognition and caption recognition of a video, applied to a general video or a real-time broadcast in units of a predetermined section or of frames grouped by subject, to a video search method using the metadata, and to a recording medium on which these methods are recorded.

In today's flood of information, text-oriented materials, which are highly readable and quick to process, are increasingly giving way to multimedia data. Consequently, processing large volumes of multimedia data requires considerable time and effort.

Moreover, as more users feel burdened by multimedia information that demands both eyes and ears at once, the need for appropriate summary information has emerged, and techniques for providing such summaries have been proposed.

For example, digital video content can be played and watched in full, but in some cases a highlight video, summarized so that the content can be understood without watching the whole video, is supplied by the program provider or generated automatically by the user's system.

A highlight video plays back only part of a stored video and is characterized by representing the corresponding video stream. It stores or plays back specific sections of the video stream. When a user with limited time wants to choose one of several videos stored on a digital video recorder, playing only the highlight video of each stream saves the time needed to find the desired content. Besides summarizing a stored video stream, highlights can also provide a preview function, usable for example in a program guide device from which the user selects videos to record on a digital video storage device.

However, because a portion meaningful enough to represent the content of the video must be extracted separately for the user, choosing the section from which to generate a highlight is a considerably difficult task.

Korean Patent Publication No. 0404322 discloses a method of dividing a news video supplied with closed captions into news-article units and compressing the news video per article unit, using speech recognition based on the closed-caption data. In that patent, the closed captions from caption broadcasting are used to prevent degradation of the speech recognizer's recognition rate. However, because a stenographic transcript for caption broadcasting must be prepared first, the method cannot be applied directly to ordinary recorded broadcast video for which no caption broadcast was generated or stored.

Korean Patent Publication No. 0540735 discloses a method that extracts and recognizes captions frame by frame and indexes each frame's caption recognition result to that frame so that video frames can be searched. It links a caption to the video frame containing it so that the position of the captioned video can be found. Although it detects and recognizes captions in the video, it limits indexing to single video frames, so a frame whose caption is unrelated to the overall article content can still be returned as a search result whenever the caption matches.

Furthermore, the 1999 IEEE conference paper "Automated generation of News content hierarchy by integrating audio, video and text information" (pp. 3025-3028, Qian Huang and four others) proposes providing search information for news through closed-caption detection, face recognition, anchor-shot detection, and speech recognition in video, but it is limited to closed captions.

As described above, conventional methods of providing summary information have several problems. A portion meaningful enough to represent the video content must be extracted separately, which makes setting the highlight section difficult. Methods that detect and recognize captions in the video limit indexing to single video frames, so captions unrelated to the overall article content can still be matched by a search and yield incorrect summary information. Methods that rely on closed captions require a stenographic transcript for caption broadcasting to be prepared first, so they cannot be applied directly to ordinary recorded broadcast video for which no caption broadcast was generated or stored.

Accordingly, the first technical problem to be solved by the present invention is to provide a method of extracting metadata through speech recognition and caption recognition of a video, which extracts key words not only from the speech recognition result but also from captions recognized in the image information of the video, regardless of whether closed captions are present.

The second technical problem to be solved by the present invention is to provide a video search method using metadata to which the above method of extracting metadata through speech recognition and caption recognition of a video is applied.

To solve the first technical problem, the present invention provides

a method of extracting metadata through speech recognition and caption recognition of a video, comprising the steps of: inputting a video including metadata, and extracting a start frame and screen-change frames of the input video; displaying the extracted start frame and screen-change frames as thumbnail images, and storing the displayed thumbnail images and the time information of the thumbnail images; recognizing a speaker's speech according to the phonemes of the speech included in the input video, converting the recognized speech data into text data, and extracting keywords from the converted text data; detecting captions in the input video, and extracting the captions from the detected captions through caption recognition; when a user designates a start shot and an end shot among the displayed start frame and screen-change frames of the video, extracting metadata and a title from the keywords and captions contained in the designated start shot, the end shot, and the thumbnail images between the start shot and the end shot; and displaying the extracted metadata, the time information of the start shot, the time information of the end shot, and the title.

To solve the second technical problem, the present invention provides

a video search method using metadata, comprising the steps of: inputting a video, extracting a start frame and screen-change frames of the input video, displaying the start frame and screen-change frames as thumbnail images, and storing the displayed thumbnail images and the time information of the thumbnail images; recognizing a speaker's speech according to the phonemes of the speech included in the input video, converting the recognized speech data into text data, and extracting keywords from the converted text data; detecting captions in the input video, and extracting the captions from the detected captions through caption recognition; designating a start shot and an end shot among the displayed start frame and screen-change frames of the video, extracting metadata and a title from the keywords and captions contained in the designated video between the start shot and the end shot, and displaying video data including the extracted metadata, the time information of the start shot, the time information of the end shot, and the title; storing the displayed video data as an XML document; and, when a user enters a search term, outputting the storage time, title, playback length, and metadata of the video stored in the XML document.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

The embodiments illustrated below may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

The present invention automatically extracts metadata usable for search from videos that feature many different people and a large amount of audiovisual information, using image and speech recognition technology.

It plays the video, extracts keywords separately through speech recognition and caption recognition of the video, and then extracts metadata by fusing the extracted key words.

The extracted metadata is then stored together with the time information of the video in an XML file, where it can be used as information for searching articles in the video.
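As an illustration of storing segment metadata with time information in XML and searching it, the following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`video`, `segment`, `title`, `metadata`, `keyword`, `start`, `end`) and the sample data are hypothetical; the patent does not specify an XML schema.

```python
import xml.etree.ElementTree as ET


def segment_to_xml(title, start_sec, end_sec, keywords):
    """Build one <segment> element; tag and attribute names are hypothetical."""
    seg = ET.Element("segment", start=str(start_sec), end=str(end_sec))
    ET.SubElement(seg, "title").text = title
    meta = ET.SubElement(seg, "metadata")
    for word in keywords:
        ET.SubElement(meta, "keyword").text = word
    return seg


def search_segments(root, query):
    """Return (title, start, length, keywords) for segments matching the query."""
    hits = []
    for seg in root.iter("segment"):
        title = seg.findtext("title", default="")
        words = [k.text for k in seg.iter("keyword")]
        if query in title or query in words:
            start = float(seg.get("start"))
            length = float(seg.get("end")) - start
            hits.append((title, start, length, words))
    return hits


# Made-up sample data: two news segments stored in one document.
archive = ET.Element("video", name="news_sample")
archive.append(segment_to_xml("Pesticide found in greens", 12.0, 95.0,
                              ["pesticide", "greens"]))
archive.append(segment_to_xml("Weather", 95.0, 140.0, ["rain", "typhoon"]))

xml_text = ET.tostring(archive, encoding="unicode")   # what would be saved
results = search_segments(ET.fromstring(xml_text), "pesticide")
```

Here the playback length is derived from the stored start and end times rather than stored separately, which is one possible design choice.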

The core technologies required by the present invention are shot detection; speaker-independent large-vocabulary continuous speech recognition; caption detection and character recognition for images; and the extraction and fusion of key words from the image and speech recognition results.

The present invention begins with playback of a video. When the video is played, shot detection presents the start scene of the video and the main scenes at which the screen changes as thumbnail images on the screen, together with the time information of each shot.

Shot detection basically fetches one video frame every 30 frames and compares its histogram distribution with that of the previously fetched frame; if the change exceeds a given value, the frame is judged to be a screen-change frame. The detected screen-change frames serve as boundary markers for each segment of audiovisual material containing various videos, so that by clicking the thumbnail images shown on the screen the user can designate the start shot and the last shot of a segment and thereby specify a section of the video.

For example, news broadcasts display a title caption for an article at the upper right of its first scene when the article begins. Using this property, when the user designates the start shot and last shot of a news article after speech and caption recognition have completed, the time information of the start shot is retrieved, the caption recognition result for the news title caption contained in the start shot is looked up, and it is shown on the screen as the title of the article. With the main shots and the title shown on the screen, the content of the article can be grasped at a glance.

The core speech recognition technologies required are speaker-independent speech recognition and continuous keyword recognition (keyword spotting). Speaker-independent speech recognition is needed to recognize unspecified sentences in the audio information used in broadcasts, and continuous keyword recognition is needed to extract, from those sentences, the key vocabulary to be used as metadata. To build a more reliable keyword recognizer, the present invention applies a large vocabulary continuous speech recognizer and detects key words by post-processing the transcription data it produces.

To obtain reliable speech recognition performance on video data such as news and entertainment programs, the present invention uses speaker-independent speech recognition capable of recognizing the speech of unspecified speakers. To this end, an acoustic model is generated from speech data uttered by 1,000 people of various birthplaces, so that speech from any speaker can be recognized. In addition, to reflect context-dependent coarticulation effects and improve recognition performance, the basic recognition unit is the tri-phone, a context-dependent model that includes the phonemes preceding and following the current phoneme. The recognition algorithm is the continuous density hidden Markov model (continuous density HMM), a high-performance speech recognition algorithm without compression loss that is suited to large-vocabulary recognition systems.

In addition, to extract metadata from video data, large-vocabulary continuous speech recognition is developed to convert sentence-level speech data into transcription data, that is, text data.

Referring to FIG. 1, a video including metadata is first input, and a start frame and screen-change frames of the input video are extracted (step 110).

The video to which the present invention is applied may be a video containing speech information and text information. When the video is input, its start frame is extracted first. A video frame is then stored at every frame interval preset by the user.

The preset frame interval may be set to 30 frames, and it can be set and changed according to the quality of the video, the playback environment, and the user's implementation.

A video frame is thus stored at each preset frame interval, and the histogram distribution of the stored frame is compared with that of the frame stored immediately before it. If the change in the histogram distribution exceeds a threshold preset by the user, the stored frame is extracted as a screen-change frame.

For example, in a video containing various news items, the histogram distribution of the shots showing the anchor's commentary differs from that of the footage accompanying the commentary, and when the content of the video differs, the corresponding histogram distributions differ as well.

Therefore, loading a video frame at each frame interval and extracting screen-change frames by comparing histogram distributions with the previously stored frame makes it possible to extract information separately according to the nature of the footage.
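The sampling and histogram comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: frames are assumed to be flat lists of 8-bit gray levels, and the bin count, the distance measure (sum of absolute bin differences), and the default threshold are assumptions, since the text does not specify them.

```python
def gray_histogram(frame, bins=16):
    """Normalized histogram of a frame given as a flat list of 0-255 gray levels."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // 256] += 1
    total = len(frame)
    return [c / total for c in counts]


def histogram_change(hist_a, hist_b):
    """Sum of absolute bin differences (0 = identical, 2 = fully disjoint)."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def detect_screen_changes(frames, step=30, threshold=0.5):
    """Return the index of the start frame plus every sampled frame whose
    histogram differs from the previously sampled frame by more than threshold."""
    if not frames:
        return []
    selected = [0]                      # the start frame is always kept
    previous = gray_histogram(frames[0])
    for index in range(step, len(frames), step):
        current = gray_histogram(frames[index])
        if histogram_change(previous, current) > threshold:
            selected.append(index)
        previous = current              # compare against the last stored sample
    return selected


# Toy video: 60 dark frames followed by 60 bright frames, 4 pixels each.
video = [[10, 20, 30, 40]] * 60 + [[200, 210, 220, 230]] * 60
changes = detect_screen_changes(video)
```

Because only every 30th frame is examined, a screen change is reported at the first sampled frame after it occurs, not at the exact frame of the cut.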

Meanwhile, the preset threshold may be set by comparing the screen-change times and histogram change values of past videos such as news, entertainment programs, and dramas, and taking the average of the histogram change values observed at those screen-change times.
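A minimal sketch of this calibration, assuming past videos have been labeled so that each measured histogram change value carries a flag saying whether a screen change actually occurred there. The function name and the sample values are hypothetical.

```python
def calibrate_threshold(change_values, known_change_flags):
    """Average the histogram change values observed at labeled screen changes."""
    at_changes = [v for v, flag in zip(change_values, known_change_flags) if flag]
    if not at_changes:
        raise ValueError("no labeled screen changes to calibrate from")
    return sum(at_changes) / len(at_changes)


# Made-up measurements from past videos and their screen-change labels.
past_changes = [0.1, 1.0, 0.2, 1.5, 0.5]
labels = [False, True, False, True, True]
threshold = calibrate_threshold(past_changes, labels)
```

Note that with a plain average, screen changes whose change value falls below the mean would not exceed the threshold; whether the average is used directly or adjusted is left open in the text.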

Next, the extracted start frame and screen-change frames are displayed as thumbnail images, and the displayed thumbnail images and the time information of the thumbnail images are stored (step 120).

That is, the start frame of the video and the screen-change frames determined from the histogram change values are displayed as thumbnail images.

Each displayed thumbnail image is stored with its start time and with its end time set equal to the start time of the next thumbnail image.
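The time bookkeeping for thumbnails can be sketched as below. The `Thumbnail` record, the frame rate of 30 frames per second, and the rule that the last thumbnail ends at the end of the video are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thumbnail:
    frame_index: int
    start_sec: float
    end_sec: float


def build_thumbnails(change_indices, total_frames, fps=30.0):
    """Give each thumbnail a start time and an end time equal to the next start."""
    thumbs = []
    for pos, index in enumerate(change_indices):
        if pos + 1 < len(change_indices):
            next_index = change_indices[pos + 1]
        else:
            next_index = total_frames   # assumed: last shot runs to the video end
        thumbs.append(Thumbnail(index, index / fps, next_index / fps))
    return thumbs


thumbs = build_thumbnails([0, 60, 150], total_frames=300)
```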

Storing time information with the thumbnail images in this way makes it possible to separate the video into individual segments.

Next, the speaker's speech is recognized according to the phonemes of the speech included in the input video, the recognized speech data is converted into text data, and keywords are extracted from the converted text data (step 130).

In practice, the language appearing in videos requires a large vocabulary ranging from everyday conversation to technical terms. In speech recognition, the choice of recognition unit determines the size of the space to be searched and the search time required. For continuous speech recognition with a large vocabulary, such as broadcast news, the search space and search time depend heavily on which recognition unit is used.

For speaker-independent speech recognition, it is necessary to estimate probability distributions over the speech features of various speakers and to use them in the recognition process. The algorithm most commonly used for probability-based acoustic modeling is the hidden Markov model (HMM).

There are various kinds of HMM (DHMM, SCHMM, GMM, etc.). When the implemented system is not heavily constrained by memory or CPU performance, as an embedded system would be, the continuous density hidden Markov model (CDHMM), which uses continuous probability distributions, can be used to maximize recognition performance. Because it represents the probability distribution within each state with continuous Gaussian distributions (a Gaussian mixture), a probability can be computed for any input feature value.
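To make the last point concrete, the following sketch computes the log emission probability of one HMM state modeled as a Gaussian mixture. Diagonal covariances and the sample parameters are assumptions for illustration; the patent does not give the model dimensions or parameters.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_emission(x, weights, means, variances):
    """log b(x) = log sum_k c_k N(x; mu_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# One state with a two-component mixture over 2-dimensional features (made up).
state = dict(
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [3.0, 3.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
near = log_emission([0.1, -0.2], **state)
far = log_emission([10.0, 10.0], **state)
```

Unlike a discrete HMM, which must quantize each feature vector to a codebook entry, this density yields a finite score for any real-valued feature vector, including ones far from every mixture component.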

In the present invention, the acoustic model is generated with the CDHMM algorithm so that the statistical acoustic model of speakers is represented in more detail, without quantization error. To model speaker-independent statistical characteristics, the acoustic model may be generated in tri-phone units, which account for contextual coarticulation effects, from a speech database uttered by multiple speakers.

In addition, the speaker's speech can be recognized sentence by sentence, using the fact that a temporal gap exists between sentences.
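One simple way to exploit the gap between sentences is to split the audio wherever the signal stays quiet for long enough. The sketch below works on a list of per-frame energies; the energy representation, the silence level, and the minimum gap length are all assumptions, as the text does not describe how gaps are detected.

```python
def split_on_silence(energies, silence_level=0.1, min_gap=3):
    """Return (start, end) frame ranges of speech separated by long silences.

    `end` is exclusive. A silence shorter than `min_gap` frames is treated
    as a pause inside a sentence and does not split it.
    """
    segments = []
    start = None       # first frame of the current speech segment
    quiet_run = 0      # consecutive low-energy frames seen so far
    for i, energy in enumerate(energies):
        if energy > silence_level:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_gap:
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        segments.append((start, len(energies) - quiet_run))
    return segments


energies = [0.0, 0.8, 0.9, 0.0, 0.7, 0.0, 0.0, 0.0, 0.6, 0.5]
sentences = split_on_silence(energies)
```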

For example, referring to FIG. 2, when the unit sentence '취나물에서 이 농약이 기준치보다 육백 배가 넘게 나왔습니다' ("This pesticide was found in chwinamul at more than six hundred times the reference level") is recognized, each word is decomposed into a phoneme string of subword units, and each subword phoneme is replaced with the CDHMM parameters of the already trained acoustic model to form tri-phone-based subword nodes. A word-to-word network is then built by applying the connections between words and the bi-gram values, the bi-gram being the probability of each word pair.

In this way, each word in the word network is expanded into tri-phone units, the subword model, with reference to the pronunciation dictionary and the acoustic model, producing the full search space.
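The two ingredients of this search space, tri-phone expansion of each word and bi-gram links between words, can be sketched as follows. The `left-center+right` label format, the `sil` boundary symbol, the toy pronunciations, and the unsmoothed maximum-likelihood bi-gram estimate are assumptions for illustration only.

```python
from collections import Counter


def to_triphones(phonemes):
    """Expand a phoneme list into left-center+right context units."""
    units = []
    for i, center in enumerate(phonemes):
        left = phonemes[i - 1] if i > 0 else "sil"
        right = phonemes[i + 1] if i + 1 < len(phonemes) else "sil"
        units.append(f"{left}-{center}+{right}")
    return units


def bigram_probabilities(sentences):
    """Maximum-likelihood P(next | previous) from tokenized sentences."""
    pair_counts = Counter()
    first_counts = Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            pair_counts[(prev, nxt)] += 1
            first_counts[prev] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy pronunciation dictionary (not real phonetic transcriptions) and corpus.
lexicon = {"pesticide": ["p", "e", "s"], "found": ["f", "au", "n", "d"]}
corpus = [["pesticide", "found"], ["pesticide", "found"], ["pesticide", "level"]]

bigrams = bigram_probabilities(corpus)
search_space = {word: to_triphones(phones) for word, phones in lexicon.items()}
```

In a full recognizer each tri-phone label would index a trained CDHMM, and the bi-gram values would weight the transitions between word-end and word-start nodes.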

Meanwhile, to extract keywords from the speech recognition result, a database of keyword candidate words may be constructed, and keywords may be extracted according to the term frequency (TF) and inverse document frequency (IDF) of each word, an established way of judging the importance of words.

To this end, for example, a database of about 30,000 keyword candidate words is built from past speech transcription data (speech converted to text). The value obtained for a word from this database (its inverse document frequency, following the TF and IDF scheme just described), multiplied by the term frequency computed from the speech recognition result of the user-designated section, may then be used as the measure of the word's importance.
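A minimal sketch of this scoring, assuming the common `log(N / document count)` form of IDF, which the text does not specify. The past transcripts stand in for the candidate-word database, and all sample words are made up.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """IDF per word over past transcripts: log(N / number of docs containing it)."""
    doc_count = Counter()
    for words in documents:
        doc_count.update(set(words))
    total = len(documents)
    return {word: math.log(total / n) for word, n in doc_count.items()}


def rank_keywords(section_words, idf, top_n=3):
    """Score candidate words in the section by TF * IDF, highest first."""
    tf = Counter(w for w in section_words if w in idf)  # only database words
    scores = {w: count * idf[w] for w, count in tf.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


past = [
    ["the", "pesticide", "level"],
    ["the", "weather", "today"],
    ["the", "rain", "level"],
    ["the", "market", "today"],
]
idf = inverse_document_frequency(past)
section = ["the", "pesticide", "level", "the", "pesticide", "the"]
keywords = rank_keywords(section, idf, top_n=2)
```

A word that appears in every past transcript gets an IDF of zero, so frequent function words drop to the bottom of the ranking even when their term frequency in the section is high.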

Next, captions are detected in the input video, and the captions are extracted from the detected captions through caption recognition (step 140).

In practice, captions used in video often contain a summary or the subject of the corresponding frames, so most captions are likely to be usable as metadata.

Caption detection and caption recognition are therefore performed on a restricted region of the frame to obtain the caption recognition result.

Next, when the user designates a start shot and an end shot from among the displayed start frame and scene-change frames of the video, metadata and a title are extracted from the keywords and captions contained in the designated start shot, the end shot, and the thumbnail images between them (step 150).

That is, when the user designates the start shot and end shot of a desired section from among the start frame and scene-change frames extracted from the displayed video, metadata and a title are extracted from the keywords and captions obtained from the designated start shot, the end shot, and the thumbnail images between them.

Specifically, a word that appears in both the keywords and the captions is likely to be important as metadata, so such matching words can be extracted as metadata first.

That is, when a word appears in both the keywords and the captions, a weight is assigned to that word and weighted words are extracted as metadata first; when several words carry the same weight, they can be extracted in Korean alphabetical ('가나다') order.
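
The weighting and tie-breaking rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `rank_metadata` and the weight values are assumptions, since the description does not state how large the weight is or how unmatched words are ordered.

```python
def rank_metadata(keywords, caption_words, match_weight=2.0, base_weight=1.0):
    """Order candidate metadata words: matched words first, ties in 가나다 order.

    match_weight and base_weight are hypothetical values chosen for illustration.
    """
    keyword_set = set(keywords)
    caption_set = set(caption_words)
    weights = {}
    for word in keyword_set | caption_set:
        matched = word in keyword_set and word in caption_set
        weights[word] = match_weight if matched else base_weight
    # Sort by descending weight, then by string order. For precomposed Hangul
    # syllables, Unicode code-point order coincides with 가나다 order.
    return sorted(weights, key=lambda w: (-weights[w], w))


ranked = rank_metadata(["태풍", "피해", "복구"], ["피해", "태풍", "현장"])
print(ranked)
```

Plain string comparison is used for the tie-break because it needs no locale support; text that mixes Hangul with other scripts would need a proper collation library.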

Meanwhile, the title can be set to the caption contained in the designated start shot.

This is because the caption in the start shot often contains the key words of the actual video; the title is therefore set to the caption contained in the designated start shot, and the user can modify it as needed.

Finally, the extracted metadata, the time information of the start shot, the time information of the end shot, and the title are displayed, providing the user with information about the specified section of the video (step 160).

First, the speaker's speech is recognized from the input video (step 331).

Speaker-independent speech recognition and continuous keyword recognition can be applied as the speech recognition used in the present invention.

Speaker-independent recognition is required to recognize the large, unspecified set of sentences in the audio of the video, and continuous keyword recognition is required to extract, from those sentences, the keywords to be used as metadata.

In the present invention, to extract more reliable metadata, keywords are first detected from the transcription data obtained through large vocabulary continuous speech recognition (LVCSR).

To build the search space for continuous recognition, a bi-gram based search space is generated; the bi-grams are produced from vocabulary extracted from text transcribed from existing videos.
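
As a rough illustration of how bi-gram statistics could be collected from transcribed text, the sketch below counts adjacent word pairs and converts them to conditional probabilities. The function name, the sentence boundary markers `<s>` and `</s>`, and the absence of smoothing are all assumptions; the description does not specify how the bi-grams are estimated.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Estimate P(current | previous) from tokenized transcript sentences."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]  # hypothetical boundary markers
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[prev][cur] += 1
    return {
        prev: {cur: count / sum(counter.values()) for cur, count in counter.items()}
        for prev, counter in pair_counts.items()
    }


transcripts = [["태풍", "피해", "발생"], ["태풍", "북상"]]  # made-up sample data
probs = bigram_probabilities(transcripts)
print(probs["태풍"])
```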

For large-vocabulary recognition, memory that would otherwise be created and allocated at run time is converted to static allocation to minimize unnecessary hardware access time, and the exponential and logarithm functions essential to probability calculation are approximated with precomputed values through a table look-ahead scheme, which reduces computation time.
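
One common way to replace repeated exponential and logarithm calls with a precomputed table is the log-domain addition used when summing probabilities. The sketch below is an assumption about what such a table might look like; the table size, range, and function names are not taken from the description.

```python
import math

TABLE_SIZE = 1000          # hypothetical resolution
MAX_DIFF = 10.0            # beyond this the smaller term is ignored
STEP = MAX_DIFF / TABLE_SIZE

# Precompute log(1 + exp(-d)) once for d = 0, STEP, ..., MAX_DIFF.
LOG_ADD_TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(TABLE_SIZE + 1)]


def log_add(a, b):
    """Approximate log(exp(a) + exp(b)) with a table lookup instead of exp/log."""
    hi, lo = (a, b) if a >= b else (b, a)
    diff = hi - lo
    if diff >= MAX_DIFF:
        return hi
    return hi + LOG_ADD_TABLE[int(round(diff / STEP))]


approx = log_add(math.log(0.2), math.log(0.3))
print(approx, math.log(0.5))
```

The table trades a small, bounded approximation error for the cost of evaluating transcendental functions inside the decoding loop.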

Next, speech recognition result sections are set according to the extracted scene-change frames (step 332).

As described above, once a scene-change frame is extracted, it represents the video content up to the point where the next scene-change frame is extracted, so the interval between consecutive scene-change frames can be set as a speech recognition result section.

In this way, keywords are extracted from the speech in each speech recognition result section defined by the extracted scene-change frames.
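
A minimal sketch of this sectioning step, assuming the recognizer returns each word with a timestamp in seconds and that scene-change times are sorted in ascending order (both are assumptions; the description does not define the data format):

```python
from bisect import bisect_right


def group_words_by_section(scene_change_times, timed_words):
    """Assign each (time, word) pair to the section that starts at the latest
    scene-change time not later than the word's time."""
    sections = [[] for _ in scene_change_times]
    for time_sec, word in timed_words:
        idx = bisect_right(scene_change_times, time_sec) - 1
        if idx >= 0:  # words before the first scene change are ignored
            sections[idx].append(word)
    return sections


scene_changes = [0.0, 12.5, 40.0]  # made-up times
words = [(1.2, "오늘은"), (13.0, "태풍"), (20.4, "피해"), (41.1, "복구")]
sections = group_words_by_section(scene_changes, words)
print(sections)
```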

Finally, keywords are extracted from the speech information of the set speech recognition result section (step 333).

In the present invention, keyword extraction from speech is based on the statistical frequency of each word. A word that appears often in a given speech recognition result section can generally be judged to be important. However, expressions such as '오늘은' ('today') or '∼기자입니다' ('this is reporter ∼') appear routinely in almost every section, so if keywords are extracted by frequency alone, such unneeded words are detected along with the real keywords.

TF-IDF (Term Frequency and Inverse Document Frequency) takes both considerations into account and can be expressed as Equation 1 below.

In Equation 1, N is the total number of words in the constructed database, n is the frequency of a specific word in the database, p is the total number of words in the current video section from which keywords are to be extracted, and t is the frequency of the specific word in the current video section.

That is, according to Equation 1, the inverse document frequency is obtained by taking the reciprocal of the word's frequency in the constructed database, and it is multiplied by the word's frequency in the current video section. Words that are rare in other videos but frequent in the current section are thus extracted as keywords by TF-IDF.
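
Equation 1 itself is not reproduced in this text, so the sketch below uses one plausible form that is consistent with the definitions of N, n, p, and t: (t / p) * log(N / n). The logarithm, the handling of words missing from the database, and the sample counts are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_scores(segment_words, db_counts, db_total):
    """Score each word of one video section.

    t: count of the word in the section, p: number of words in the section,
    n: count of the word in the database, N (db_total): words in the database.
    Assumed form: (t / p) * log(N / n).
    """
    p = len(segment_words)
    scores = {}
    for word, t in Counter(segment_words).items():
        n = db_counts.get(word, 1)  # assumption: unseen words are counted once
        scores[word] = (t / p) * math.log(db_total / n)
    return scores


# Hypothetical database counts: routine words are frequent, topical words are rare.
db_counts = {"오늘은": 500, "기자입니다": 400, "태풍": 5, "피해": 20}
segment = ["오늘은", "태풍", "피해", "태풍", "기자입니다"]
scores = tfidf_scores(segment, db_counts, db_total=1000)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these made-up counts the routine words score lowest even though they occur in the section, which is the behaviour the paragraph above describes.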

Referring to FIG. 4, first, the input image is preprocessed with Gaussian filtering and filtering in the color domain, and caption candidate regions are detected (step 441).

That is, to detect frames that contain captions, caption candidate regions are detected in sampled frames by exploiting two properties of captions: their high-frequency components and their strong contrast with the background.

Here, a gradient operator is applied to each frame to extract strong high-frequency components, and to avoid losing color information the operator is applied to each of the red, green, and blue (RGB) components of the frame.

Using a color gradient operator makes it possible to extract the high-frequency components of the captions while preserving their strong contrast.

Compared with converting to a gray-level image and applying a gradient operator to that single channel, this approach preserves and uses the information of the three-channel color image, and is therefore more robust in extracting the high-frequency components of captions.

From the gradient values obtained in each of the three (R, G, B) channels, components with strong contrast are extracted through a vector operation; the gradient and vector operations together yield a feature image (edge image) from which caption components can be extracted while background components are suppressed as far as possible.
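
The description does not name the gradient operator or the vector operation. As one possible reading, the sketch below takes central differences in each of the R, G, and B channels and combines them as the Euclidean norm of all six gradient components. The function name and the choice of operator are assumptions.

```python
import math


def color_edge_image(image):
    """image: list of rows, each row a list of (r, g, b) tuples.

    Returns a grid of edge strengths of the same size; border pixels stay 0.
    """
    height, width = len(image), len(image[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = 0.0
            for c in range(3):  # R, G, B handled separately, then combined
                gx = image[y][x + 1][c] - image[y][x - 1][c]
                gy = image[y + 1][x][c] - image[y - 1][x][c]
                total += gx * gx + gy * gy
            edges[y][x] = math.sqrt(total)
    return edges


red, green = (255, 0, 0), (0, 255, 0)
frame = [[red, red, green, green] for _ in range(3)]  # red/green boundary
edges = color_edge_image(frame)
print(edges[1])
```

A red/green boundary like this one can nearly vanish after gray-level conversion, while the per-channel gradient keeps a strong response.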

Because backgrounds with caption-like features may exist, the gradient operation is applied again to the feature image, and horizontal and vertical coordinates are obtained by considering the pairing of rising and falling edges (edge pairs) in the feature image and the spacing characteristic of captions.

Next, the caption candidate regions are verified (step 442).

Verification is needed because a caption candidate region may contain text that belongs to the background, or background with properties similar to those of captions. Its purpose is to classify only the normal regions (actual superimposed captions), separating regions that can be extracted as real captions from background text and other regions, so that only true caption regions are extracted. This improves the caption detection rate.

For this verification, the problem is treated not as a two-class problem that separates caption regions from non-caption regions, but as a one-class problem in which the class over the data domain is built from actual caption regions only.

That is, a model of normal caption regions is built through training, and any region in the image frame that falls outside this model is regarded as an outlier; caption candidate regions are verified on this basis.
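
The description does not say which one-class classifier is used. The sketch below substitutes a deliberately simple stand-in: it fits a per-feature mean and standard deviation from caption-only samples and rejects candidates whose normalized distance exceeds a threshold. The feature choice (edge density, contrast), the threshold, and the function names are hypothetical.

```python
import math


def train_one_class(samples):
    """Fit per-feature mean and standard deviation from caption-only samples."""
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / len(samples) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((s[d] - means[d]) ** 2 for s in samples) / len(samples)
        stds.append(math.sqrt(var) or 1.0)  # fall back to 1.0 for constant features
    return means, stds


def is_caption(features, model, max_distance=3.0):
    """Accept a candidate if it lies close to the caption class; else outlier."""
    means, stds = model
    dist = math.sqrt(sum(((f - m) / s) ** 2 for f, m, s in zip(features, means, stds)))
    return dist <= max_distance


# Hypothetical (edge density, contrast) features measured on real caption regions.
caption_samples = [(0.30, 0.80), (0.35, 0.85), (0.28, 0.78), (0.32, 0.82)]
model = train_one_class(caption_samples)
print(is_caption((0.31, 0.81), model), is_caption((0.05, 0.20), model))
```

Only positive (caption) examples are needed for training, which is the property the one-class formulation is chosen for.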

This verification has the advantage of considerably reducing the amount of training data required, and it can distinguish superimposed captions from text regions in the background that closely resemble them.

Finally, the caption is recognized through binarization of the detected caption region, character segmentation within the caption region, classification of the Hangul consonant-vowel type and grapheme separation, and grapheme-level recognition (step 443).

Because character recognition performance depends on binarization performance, precise binarization is performed by extracting the luminance component of the caption region and analyzing its histogram. For the histogram analysis, smoothing with a dynamic variance and a dynamic filter size is applied, and a first dynamic threshold is obtained from the two peaks associated with the background and the caption and the valley between them.
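
A simplified sketch of the peak-and-valley idea follows. It uses a fixed moving-average window instead of the dynamic variance and filter size mentioned above, and the minimum peak separation is an arbitrary assumption, so it illustrates the principle rather than the described procedure.

```python
def smooth(hist, radius=2):
    """Moving-average smoothing with a fixed window (an assumed simplification)."""
    out = []
    for i in range(len(hist)):
        lo, hi = max(0, i - radius), min(len(hist), i + radius + 1)
        out.append(sum(hist[lo:hi]) / (hi - lo))
    return out


def valley_threshold(hist, radius=2, min_separation=10):
    """Return the bin of the lowest smoothed value between the two main peaks."""
    h = smooth(hist, radius)
    first = max(range(len(h)), key=h.__getitem__)
    candidates = [i for i in range(len(h)) if abs(i - first) >= min_separation]
    second = max(candidates, key=h.__getitem__)
    left, right = sorted((first, second))
    return min(range(left, right + 1), key=h.__getitem__)


# Synthetic luminance histogram: dark background near 40, bright caption near 200.
hist = [0] * 256
for i in range(30, 51):
    hist[i] = 100 - 5 * abs(i - 40)
for i in range(190, 211):
    hist[i] = 60 - 3 * abs(i - 200)
threshold = valley_threshold(hist)
print(threshold)
```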

Then, for accurate binarization at caption boundaries, a dynamic mask is generated in consideration of the caption height, and a second dynamic threshold is generated by analyzing the caption and background values within the mask.

Next, the binarized caption candidate region is segmented into characters. The height-to-width ratio of the characters is analyzed, and the vertical projection values near character boundaries are analyzed to separate the characters of the caption.

Because the consonant and vowel of a single character may be separated and split into two characters under vertical projection, the height-to-width ratio is analyzed and the presence of vertical components is checked, so that characters are segmented accurately within the region.
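
The projection-based segmentation can be sketched as below. Columns without ink separate runs, and a run that is much narrower than the line height is merged with its right-hand neighbour, on the assumption that it is a detached consonant or vowel. The ratio of 0.6 and the function names are hypothetical, and the check for vertical components is not modelled.

```python
def split_columns(binary):
    """Return [start, end) column ranges that contain at least one ink pixel."""
    width = len(binary[0])
    projection = [sum(row[x] for row in binary) for x in range(width)]
    runs, start = [], None
    for x, value in enumerate(projection + [0]):  # trailing 0 closes the last run
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x))
            start = None
    return runs


def merge_narrow_runs(runs, height, min_ratio=0.6):
    """Merge a run into the next one when it is too narrow for a full character."""
    merged = []
    for run in runs:
        if merged and (merged[-1][1] - merged[-1][0]) < min_ratio * height:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


# One text line, 10 pixels high: a split character (4 + 2 columns) and a whole one.
row = [1] * 4 + [0] + [1] * 2 + [0] * 2 + [1] * 9
binary = [row[:] for _ in range(10)]
runs = split_columns(binary)
characters = merge_narrow_runs(runs, height=10)
print(runs, characters)
```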

After character segmentation, features such as the maximum, minimum, and mean of the vertical and horizontal projection values are extracted as a feature vector to classify the Hangul type. Grapheme separation is then performed using the vowel component obtained from the consonant-vowel type classification and the presence or absence of a final consonant. Taking the vowel component of each type into account, the maximum, minimum, and rate of change of the projection values are analyzed to separate the graphemes.

A center point is computed for each separated consonant and vowel, the grapheme is projected to a normalized size, and the distribution of distance and direction components from the center point is used as a feature vector to recognize each grapheme with a neural network. The grapheme recognition results are then combined to recognize the caption character by character, and the recognized caption is extracted.

FIG. 5 is a detailed flowchart of the step of extracting the start frame and scene-change frames of the input video (step 110) in FIG. 1.

Referring to FIG. 5, first, the input video is played (step 511).

The input video may be a real-time video or a recorded video, and preferably may include video conforming to the MPEG-2 standard.

MPEG-2 is a high-quality video standard targeting a wide range of applications such as television broadcasting, communications, and audio/video equipment. It supports bit rates of 4 to 100 Mbps with high-definition television (HDTV) picture quality in mind, and is widely used for video transmission, for digital storage media such as digital video discs (DVD), and for software production, so it can readily be combined with the present invention.

Next, the start frame of the input video is displayed as a thumbnail image (step 512), and a video frame is stored at every frame interval set by the user (step 513).

The frame interval can be set freely by the user according to how dynamically the picture changes and how the system is used; preferably, it is set to 30 frames, which captures the motion of the video in finer detail.

Next, histogram distribution information is obtained for the stored video frame (step 514). A histogram is a graphical representation of the distribution of brightness values in an image. The horizontal axis shows brightness levels (0 to 255), with dark pixels to the left and bright pixels to the right; the vertical axis shows the proportion of pixels at each brightness level. Inspecting the histogram of a frame therefore reveals the overall exposure of the image: if the histogram is high toward the left, the image consists mostly of dark pixels and appears dark. The histogram distribution thus characterizes the image.

Next, the histogram distribution of the stored frame is compared with that of the previously stored frame to determine whether the change exceeds a threshold preset by the user (step 515). If the change in the histogram distribution is greater than the threshold, the stored frame is displayed as a thumbnail image (step 516); if it is smaller, the next frame at the user-set interval is loaded and stored (step 513).
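
Steps 513 to 516 can be sketched as the loop below. The distance between histograms is taken as the sum of absolute bin differences, which is an assumption because the description does not define how the change value is computed; the threshold of 0.5 and the function names are likewise illustrative.

```python
def histogram(frame, bins=256):
    """Normalized brightness histogram of a gray frame (rows of 0..255 ints)."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    return [h / total for h in hist]


def detect_scene_changes(frames, interval=30, threshold=0.5):
    """Return indices of the start frame and of sampled frames whose histogram
    differs from the previously sampled frame by more than the threshold."""
    changes = [0]  # the start frame is always kept
    prev = histogram(frames[0])
    for idx in range(interval, len(frames), interval):
        cur = histogram(frames[idx])
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            changes.append(idx)
        prev = cur
    return changes


dark = [[20] * 4 for _ in range(4)]
bright = [[220] * 4 for _ in range(4)]
frames = [dark] * 60 + [bright] * 60  # synthetic clip with one cut at frame 60
changes = detect_scene_changes(frames)
print(changes)
```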

In practice, in a video containing various news items as described above, the histogram distribution differs between footage of the anchor speaking and the footage that accompanies the anchor's report, and it also differs whenever the content of the video changes.

Referring to FIG. 6, first, the user designates a start shot and an end shot from among the start frame and scene-change frames of the video displayed as thumbnail images (step 651).

That is, from among the scene-change frames extracted from histogram changes during playback and the start frame extracted from the displayed video, the user designates the start shot and end shot of the section from which metadata is to be extracted.

Next, once the start shot and end shot of the desired section are designated, the time information of the designated start shot and end shot is obtained (step 652).

With this time information, the data of the scene-change frames that fall between the designated start shot and end shot is loaded as well.

Next, the caption recognition result of the designated start shot is obtained (step 653). The caption recognition result can be produced by the caption extraction and recognition process described above.

Finally, the caption recognition result of the start shot is set as the title, and the metadata for the designated start shot and end shot is extracted (step 654).

As described above, the caption in the start shot often contains the key words of the actual video, so the title is set to the caption contained in the designated start shot; the user can modify the title as needed.

Meanwhile, the metadata is generated by fusing the captions and the keywords; when a caption is extracted as a full sentence, the nouns in the caption can be extracted as the caption terms.

In general, the nouns in a sentence convey its meaning by naming the things whose actions or states are described, so the main meaning of a sentence is concentrated in its nouns. The nouns of a sentence therefore contain most of its key words. To extract them, a sentence-form caption is subjected to morphological analysis, and the words judged to be nouns are extracted as the caption terms.

In addition, while the nouns are extracted, unnecessary words can be removed from the extracted caption components.

In the present invention the captions are Korean captions, and words that can follow a number or a special symbol can be prevented from being output as key words.

That is, a 'unit expression' that follows a number, which is a noun by part of speech but is not meaningful as a key word, can be excluded from the caption components.
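
The filtering rule can be sketched on tokens that have already been tagged by a morphological analyzer. The analyzer itself is outside this sketch; the tag names (`NOUN`, `NUM`, `SYM`) and the list of unit words are hypothetical and only serve the example.

```python
UNIT_WORDS = {"명", "개", "원", "년", "월", "일", "시", "분"}  # hypothetical unit list


def caption_terms(tagged_tokens):
    """tagged_tokens: list of (surface, pos) pairs from a morphological analyzer.

    Keep nouns, but drop unit nouns that directly follow a number or symbol.
    """
    terms = []
    prev_pos = None
    for surface, pos in tagged_tokens:
        follows_number = prev_pos in ("NUM", "SYM")
        if pos == "NOUN" and not (follows_number and surface in UNIT_WORDS):
            terms.append(surface)
        prev_pos = pos
    return terms


tokens = [("태풍", "NOUN"), ("피해", "NOUN"), ("3", "NUM"), ("명", "NOUN"), ("사망", "NOUN")]
terms = caption_terms(tokens)
print(terms)
```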

First, a video is input, the start frame and scene-change frames of the input video are extracted and displayed as thumbnail images, and the displayed thumbnail images and their time information are stored (step 710).

Next, the speaker's speech is recognized according to the phonemes of the speech in the input video, the recognized speech data is converted into text data, and keywords are extracted from the converted text data (step 720).

Next, captions are detected in the input video, and the caption text is extracted from the detected captions through caption recognition (step 730).

Next, a start shot and an end shot are designated from among the displayed start frame and scene-change frames, metadata and a title are extracted from the keywords and captions contained in the video section between the designated start shot and end shot, and video data comprising the extracted metadata, the time information of the start shot, the time information of the end shot, and the title is displayed (step 740).

Steps 710 to 740 are the same as described above, so a repeated description is omitted.

Next, the displayed video data is stored as an XML document (step 750).

The displayed video data, comprising the extracted metadata, the time information of the start shot, the time information of the end shot, and the title, can be saved as an XML document by the user through a graphical user interface (GUI).

The XML document may also be an XML document conforming to the MPEG-7 standard.
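
As an illustration of storing one segment as XML, the sketch below uses Python's standard `xml.etree.ElementTree`. The element names (`VideoSegment`, `Title`, `StartTime`, `EndTime`, `Metadata`, `Keyword`) are invented for this example and do not follow the MPEG-7 schema; an MPEG-7 conformant document would use the descriptors defined by that standard.

```python
import xml.etree.ElementTree as ET


def segment_to_xml(title, start_time, end_time, metadata):
    """Serialize one video segment to an XML string (hypothetical element names)."""
    root = ET.Element("VideoSegment")
    ET.SubElement(root, "Title").text = title
    ET.SubElement(root, "StartTime").text = start_time
    ET.SubElement(root, "EndTime").text = end_time
    keywords = ET.SubElement(root, "Metadata")
    for word in metadata:
        ET.SubElement(keywords, "Keyword").text = word
    return ET.tostring(root, encoding="unicode")


xml_text = segment_to_xml("태풍 피해 속출", "00:01:10", "00:02:45", ["태풍", "피해"])
print(xml_text)
```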

Referring also to FIG. 8, the video data comprising the extracted metadata, the time information of the start shot, the time information of the end shot, and the title can be checked on the graphical user interface (GUI) and easily imported into an XML document by the user.

Finally, when a user enters a search term, the storage time, title, playback length, and metadata of the videos stored as XML documents are output (step 760).

Referring also to FIG. 9, when a user enters a search term, information on the videos whose metadata contains a word identical to the search term is output in a web browser. The user can thus learn about the matching videos in advance and can play the video corresponding to the output information.
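
The search step can be sketched as an exact match of the search term against the stored metadata words. The XML layout and element names below are hypothetical and chosen only for this example; the description does not define the document structure or the matching rule beyond "identical word".

```python
import xml.etree.ElementTree as ET


def search_segments(xml_documents, query):
    """Return summary records for stored segments whose metadata contains query."""
    hits = []
    for doc in xml_documents:
        root = ET.fromstring(doc)
        words = [k.text for k in root.iter("Keyword")]
        if query in words:
            hits.append({
                "title": root.findtext("Title"),
                "start": root.findtext("StartTime"),
                "end": root.findtext("EndTime"),
                "metadata": words,
            })
    return hits


stored = [  # made-up stored documents
    "<VideoSegment><Title>태풍 피해 속출</Title><StartTime>00:01:10</StartTime>"
    "<EndTime>00:02:45</EndTime><Metadata><Keyword>태풍</Keyword>"
    "<Keyword>피해</Keyword></Metadata></VideoSegment>",
    "<VideoSegment><Title>증시 상승</Title><StartTime>00:05:00</StartTime>"
    "<EndTime>00:06:30</EndTime><Metadata><Keyword>증시</Keyword></Metadata>"
    "</VideoSegment>",
]
results = search_segments(stored, "태풍")
print(results)
```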

Those of ordinary skill in the art will readily appreciate that each of the above steps according to the present invention can be implemented in various ways in software or hardware using common programming techniques.

Some steps of the present invention can also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system. Examples include ROM, RAM, CD-ROM, CD-RW, magnetic tape, floppy disks, HDDs, optical disks, and magneto-optical storage devices, as well as implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over networked computer systems so that the computer-readable code is stored and executed in a distributed manner.

Those of ordinary skill in the art to which the present invention pertains will understand that the invention can be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in a descriptive sense and not for purposes of limitation. The examples included in the above description are given to aid understanding of the present invention and do not limit its spirit and scope. It will be apparent to those of ordinary skill in the art that various embodiments other than the above examples are possible. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

As described above, the present invention automatically extracts the metadata contained in broadcast content from the speech recognition result and from the recognition result of the open captions in the video itself, which shortens the time needed to archive new broadcast material by hand. Content management and indexing of the vast amount of past broadcast data can be performed automatically, enabling broadcast producers to create high-quality content and saving the time and cost of producing content from archived material. In addition, by linking with widely used Internet search sites, the invention enables searching of multimedia material for content produced in real time and makes past material easy to browse, thereby making content more convenient for users.

Claims

Inputting a video including metadata and extracting a start frame and a screen switching frame of the input video;

Displaying the extracted start frame and the screen switching frame as a nail image and storing time information of the displayed nail image and the nail image;

Recognizing a speaker's voice according to a phoneme of a voice included in the input video, converting the recognized voice data into text data, and extracting a keyword from the converted text data;

Extracting a caption from the input video through caption recognition;

When the user designates a start shot and an end shot among the displayed start frame and the screen transition frame of the displayed video, metadata from the keywords and subtitles included in the designated start shot, the end shot, and the nail image between the start shot and the end shot is displayed. Extracting a title; And

Displaying the extracted metadata, the time information of the start shot, the time information of the end shot, and the title.

The method of claim 1,

wherein the speaker's voice is recognized in units of sentences.

The method of claim 1,

wherein extracting the metadata and title comprises, when a keyword coincides with a caption, assigning a weight to the matching word and extracting the weighted word as metadata first.
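The weighting step in this claim can be illustrated with a minimal sketch. The claim does not specify the weight value or the scoring rule, so the function name `rank_metadata`, the multiplier `caption_weight`, and the count-times-weight score are all assumptions made for illustration only.

```python
from collections import Counter

def rank_metadata(keywords, caption_words, caption_weight=2.0):
    """Rank speech keywords; words that also occur in captions get a boost.

    caption_weight is a hypothetical multiplier, not a value from the patent.
    """
    caption_set = set(caption_words)
    counts = Counter(keywords)
    scores = {}
    for word, count in counts.items():
        weight = caption_weight if word in caption_set else 1.0
        scores[word] = count * weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda w: (-scores[w], w))

speech_keywords = ["election", "economy", "election",
                   "weather", "economy", "economy"]
caption_words = ["election", "results"]
ranked = rank_metadata(speech_keywords, caption_words)
print(ranked)
```

In this example "economy" is spoken more often, but "election" also appears in a caption, so the boost moves it to the front of the list.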

The method of claim 1,

wherein the title is set to the caption of the start shot.

The method of claim 1,

wherein extracting the caption through the caption recognition comprises:

Detecting a caption candidate region from a frame of the input video and verifying the detected caption candidate region; and

Performing caption recognition on the verified caption candidate region.

The method of claim 5, wherein

the caption candidate region is detected by applying Gaussian filtering to the input video and then using the derivative values produced by a color differential operator in the color domain.
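A rough sketch of this kind of detection is shown below. The claim does not define the color differential operator, so this example assumes a 3x3 Gaussian kernel followed by central differences summed over the R, G, and B channels; the function names and the threshold of 200 are hypothetical.

```python
def gaussian_blur3(channel):
    """Smooth one channel with a 3x3 Gaussian kernel (edge pixels replicated)."""
    kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # weights sum to 16
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + 1][dx + 1] * channel[yy][xx]
            out[y][x] = acc / 16.0
    return out

def color_gradient(image):
    """Gradient magnitude per pixel, summed over the three color channels.

    image is a list of rows; each row is a list of (r, g, b) tuples.
    """
    h, w = len(image), len(image[0])
    channels = [gaussian_blur3([[px[c] for px in row] for row in image])
                for c in range(3)]
    grad = [[0.0] * w for _ in range(h)]
    for ch in channels:
        for y in range(h):
            for x in range(w):
                gx = ch[y][min(x + 1, w - 1)] - ch[y][max(x - 1, 0)]
                gy = ch[min(y + 1, h - 1)][x] - ch[max(y - 1, 0)][x]
                grad[y][x] += abs(gx) + abs(gy)
    return grad

def candidate_mask(image, threshold=200.0):
    """Mark pixels whose color gradient reaches the (assumed) threshold."""
    return [[g >= threshold for g in row] for row in color_gradient(image)]

# Toy frame: black background with a white vertical stroke in columns 3-4.
black, white = (0, 0, 0), (255, 255, 255)
frame = [[white if x in (3, 4) else black for x in range(8)] for _ in range(6)]
mask = candidate_mask(frame)
```

Pixels next to the stroke edges are flagged as candidates, while the flat background at the left and right borders is not. A real detector would then group flagged pixels into regions and verify them, as the parent claim describes.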

The method of claim 5, wherein

performing the caption recognition comprises:

Binarizing the caption candidate region using a first dynamic threshold value and a second dynamic threshold value; and

Separating the captions in the binarized caption candidate region into characters and then into phonemes, recognizing the separated phonemes, and recognizing the captions character by character from the recognized phonemes.
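The claim does not state how the two dynamic thresholds are computed. Purely as an illustration of a two-stage, data-dependent binarization, the sketch below assumes the first threshold is the mean brightness of the region and the second is the mean of the pixels above the first. This is a stand-in technique, not the method of the patent, and the name `two_pass_binarize` is hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def two_pass_binarize(region):
    """Binarize a grayscale region (rows of 0-255 values) in two passes.

    Assumed rule: t1 is the region mean; t2 is the mean of pixels above t1.
    Returns the 0/1 image together with both thresholds.
    """
    pixels = [p for row in region for p in row]
    t1 = mean(pixels)
    bright = [p for p in pixels if p > t1]
    t2 = mean(bright) if bright else t1
    binary = [[1 if p >= t2 else 0 for p in row] for row in region]
    return binary, t1, t2

region = [[10, 20, 200, 250],
          [15, 120, 240, 30]]
binary, t1, t2 = two_pass_binarize(region)
```

Because both thresholds are derived from the region itself, they adapt to the brightness of each caption candidate rather than relying on a fixed global cutoff.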

The method of claim 1,

wherein extracting the metadata and title comprises extracting the metadata by fusing the captions and the keywords and, when a caption is extracted in sentence form, extracting the nouns in that caption as the caption.

The method of claim 1,

wherein extracting a keyword from the converted text data comprises:

Constructing a database of keyword candidate words used to determine the importance of words; and

Extracting the keyword according to the frequency of the converted text data in the video and its frequency in the constructed database.
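One common way to combine an in-video frequency with a database frequency is a TF-IDF-style score. The claim does not give a formula, so the sketch below is an assumption: the names `extract_keywords`, `db_doc_freq`, and `db_size` are hypothetical, and the logarithmic discount is a standard choice rather than the patent's stated rule.

```python
import math
from collections import Counter

def extract_keywords(transcript_words, db_doc_freq, db_size, top_n=2):
    """Score candidate words by in-video frequency, discounted by how
    common each word is across the database.

    db_doc_freq maps each candidate word to the number of database
    entries containing it (assumed to be between 1 and db_size).
    Words absent from the database are not treated as candidates.
    """
    tf = Counter(w for w in transcript_words if w in db_doc_freq)
    scores = {
        word: count * math.log(db_size / db_doc_freq[word])
        for word, count in tf.items()
    }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]

db_doc_freq = {"today": 900, "election": 50, "candidate": 100, "said": 950}
transcript = ["today", "election", "candidate", "said",
              "election", "today", "vote"]
keywords = extract_keywords(transcript, db_doc_freq, db_size=1000)
print(keywords)
```

Here "today" is spoken as often as "election", but it appears in most database entries, so its score is low and it is not selected.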

The method of claim 1,

wherein the video comprises a video conforming to the MPEG-2 standard.

Inputting a video, extracting a start frame and a screen switching frame of the input video, displaying the start frame and the screen switching frame as thumbnail images, and storing the displayed thumbnail images together with their time information;

Detecting a caption from the input video and extracting the caption from the detected caption through caption recognition;

Designating a start shot and an end shot among the displayed start frame and screen switching frames, extracting metadata and a title from the keywords and captions included in the video between the designated start shot and end shot, and displaying video data comprising the extracted metadata, the time information of the start shot, the time information of the end shot, and the title;

Storing the displayed video data as an XML document; and

When a user enters a search word, outputting the storage time, title, playback length, and metadata of the video stored as the XML document.
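The storage and search steps of this claim can be sketched with Python's standard `xml.etree.ElementTree` module. The element names (`Clip`, `Title`, `StoredAt`, `StartTime`, `EndTime`, `Keyword`) are invented for this example and do not follow the MPEG-7 schema; the matching rule (exact keyword match or substring of the title) is likewise an assumption.

```python
import xml.etree.ElementTree as ET

def clip_to_xml(title, start_s, end_s, stored_at, metadata):
    """Serialize one clip to an XML string (hypothetical element names)."""
    clip = ET.Element("Clip")
    ET.SubElement(clip, "Title").text = title
    ET.SubElement(clip, "StoredAt").text = stored_at
    ET.SubElement(clip, "StartTime").text = str(start_s)
    ET.SubElement(clip, "EndTime").text = str(end_s)
    meta = ET.SubElement(clip, "Metadata")
    for word in metadata:
        ET.SubElement(meta, "Keyword").text = word
    return ET.tostring(clip, encoding="unicode")

def search(xml_docs, query):
    """Return storage time, title, playback length, and metadata of
    every stored clip whose keywords or title match the query."""
    hits = []
    for doc in xml_docs:
        clip = ET.fromstring(doc)
        title = clip.findtext("Title")
        keywords = [k.text for k in clip.iter("Keyword")]
        if query in keywords or query in title:
            length = (float(clip.findtext("EndTime"))
                      - float(clip.findtext("StartTime")))
            hits.append({
                "stored_at": clip.findtext("StoredAt"),
                "title": title,
                "length_s": length,
                "metadata": keywords,
            })
    return hits

docs = [
    clip_to_xml("Election night", 12.0, 95.5, "2007-06-12T09:00:00",
                ["election", "results"]),
    clip_to_xml("Weather report", 0.0, 30.0, "2007-06-12T09:05:00",
                ["weather", "rain"]),
]
hits = search(docs, "election")
```

The playback length is derived from the stored start and end times of the designated shots, so it does not need to be stored separately.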

The method of claim 11,

wherein extracting the caption through the caption recognition comprises:

Setting a specific region of a frame of the video as a caption candidate region;

Detecting characters included in the image frame within the set caption candidate region; and

Extracting a caption from the detected characters and extracting the nouns from the extracted caption.

The method of claim 11,

wherein extracting a keyword from the converted text data comprises extracting the keyword according to the frequency of the converted text data in the video and its frequency in the constructed database.

The method of claim 11,

wherein the XML document comprises a document conforming to the MPEG-7 standard.

The method of claim 11,

wherein the title is set to the caption of the start shot.

A computer-readable recording medium having recorded thereon a program for executing the method of any one of claims 1 to 15.