201039149

Application No. 098112787 · Form No. A0101

VI. Description of the Invention:

[Technical Field of the Invention]

[0001] The present invention relates to a robust method for automatic extraction of text information from video and for question-answering retrieval over that text. Text information is extracted from videos, and an input interface lets users pose questions in natural language; the system returns the answer together with the corresponding video and its playback start time, so that the user can watch that segment of the video.

[Prior Art]

[0002] The present invention concerns a robust method for automatic video text extraction and question-answering retrieval. The problem in fact spans several fields of knowledge: (1) multimedia processing — image processing, text localization and extraction, and optical character recognition (OCR); and (2) document and text processing — information retrieval, and Chinese/English text
Often, the system is ported to another language or domain, and the external resources used primarily are re-prepared, and not every domain or language has as many external resources as English. So this will be a very difficult task for transplanting Chinese. Moreover, the problem of broken (sub)words in Chinese has not yet been resolved. In fact, Chinese word-breaking results often affect the quality of the keywords that are extracted. Once an unknown word or a broken word occurs, the keyword will never be found. The most important point is that the past ο research found that it is not to solve Chinese. In the past five years, the inventor has designed the world's first cross-language (English to Chinese) film subtitle automatic question answering system. The main problem to be solved is the answer to the answer on the proper noun of the specific person's time. Through the system, the user can retrieve the Chinese film in English. The technology behind it is a self-developed paragraph retrieval and answer algorithm. The study uses a machine-automatic translation mechanism to convert all Chinese subtitles into English, followed by a proper noun recognizer in English to indicate possible answers. Obviously, this method is not suitable for Chinese-Chinese query in pure Chinese language, and the development of the paragraph search method is still 098112787 Form No. A0101 Page 5 / 19 pages 0982021335-0 201039149 Based on words, so for In Chinese, the problem of word breaking is still the first element to be overcome. Although these academically published publications are very similar to the objectives of the present invention, there are still significant differences in nature. The most important of these is that the past study used the film as a broadcast news, ignoring the subtitles that appeared on the screen. In addition, although some videos have subtitle files, there is no need to extract this information. 
For the vast majority of today's videos, and of video broadcast on television, however, that is quite infeasible. Video and news content in the Taiwan region in particular conveys its message mostly through subtitles, so extracting the caption text or scene text from these videos closely matches what the videos are actually meant to express.

Beyond that, the past literature focused on how to transfer existing techniques, adopting ordinary off-the-shelf document retrieval to search for the video information a user needs. Document-retrieval technology is indeed very mature and is widely applied in education, research, and industry, but it was originally designed to search documents typed by people — articles with almost no typographical errors. Applying document retrieval directly, with no adjustment at all, merely shows how well generic prior techniques fare on such recognized text. Unlike conventional documents, the text recognized from video — whether extracted from captions or from the audio — often differs in format from ordinarily typed documents; worse, many character errors are unavoidable, since current optical character recognition reaches only about 90% accuracy even under good conditions, and speech recognition reaches about 70% only when the same speaker is talking. Using traditional document retrieval directly on such text is clearly problematic, because many words cannot be found or are recognized incorrectly. And for Chinese, segmenting words out of the text is itself a great challenge.

[Summary of the Invention]

[0003] The purpose of the present invention is to propose a novel, robust automatic video-content question-answering system.
With the text-information extraction technique proposed by the present invention, caption information in many types of video — news, advertisements, drama series, and so on — is recognized, together with the time at which each caption appears. The automatic video question-answering technique proposed by the present invention then parses the query question or keywords, and finally returns the videos that can answer the question, or are related to it, to the user.

A further object of the present invention is to propose an automatic video-content question-answering system that lets users query in keywords or natural language, obtain the answer with the corresponding video and its playback start time, and watch that segment of the video.

A further object of the present invention is to propose an automatic video-content question-answering system that lets users query in keywords or natural language and directly search the recognized caption text of the videos.

A further object of the present invention is to propose an automatic video-content question-answering system that lets users watch the video segments related to their query, with a precise playback start time, and operate directly on the video: fast-forward, rewind, pause, stop, and play.

A further object of the present invention is to propose an automatic video-content question-answering system that integrates the video caption recognition technique developed in the present invention with analyses such as retrieval usage behavior, so that users can search and browse video captions and text with the same behavior they commonly use with search engines.
[Embodiments]

[0004] Please refer to Fig. 1, which shows the overall operation and the operating flow of the present invention. Fig. 1 contains the several core modules of the invention, together with a conceptual diagram of user input and system output. The operating flow of an embodiment of the present invention is explained as follows.

From the viewpoint of the overall system, (100)-(103) are the user-related modules: (100) is the question or keyword the user enters into the system; (101) denotes the automatic video question-answering system as a whole; (102) is the answer video the system automatically determines should be returned to the user; and (103), the video library, is the prepared database of source videos.

First, the practice of the invention is explained through character recognition over the video library. (130) denotes the video text recognition system as a whole. To recognize the contents of the video library (103), one image is first sampled from every 11 frames of the video. Text block detection and cutting (131) is then performed on that image, detecting all text blocks and cutting out their exact positions in the image. Each block then passes through the non-text block filtering method (132) developed in the present invention, which filters out once more the blocks that are not text. Because frames are consecutive, the text tracking module (133) is responsible for tracking how many frames each piece of text appears in, and merges occurrences of the same text content. Text block binarization (134) extracts the text color from the color glyph image, turning the text color to black and the background to white.
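The frame-sampling and text-tracking steps (131-134) above can be sketched in simplified form. The sketch below is a hypothetical illustration, not the patent's implementation: it assumes each sampled frame's caption has already been recognized as a string, whereas the actual module (per the description) tracks and merges pixel regions before recognition. The function name `merge_caption_frames` and its parameters are assumptions for illustration.

```python
def merge_caption_frames(frame_texts, fps=30.0, step=11):
    """Merge identical caption text seen on consecutive sampled frames.

    frame_texts: list of (frame_index, text) for every sampled frame,
    in temporal order. Returns a list of (text, start_sec, end_sec).
    This mirrors the idea of steps 133-134: the same caption persisting
    across frames is collapsed to one entry with its time span.
    """
    merged = []
    for idx, text in frame_texts:
        if merged and merged[-1][0] == text:
            merged[-1][2] = idx          # extend the current caption's run
        else:
            merged.append([text, idx, idx])  # a new caption begins here
    # convert frame indices to playback times using the video frame rate
    return [(t, start / fps, end / fps) for t, start, end in merged]
```

With frames sampled every 11 frames at 30 fps, two captions spanning several samples each would collapse into two entries carrying their start and end times — the playback start time the system later reports to the user.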
The recognition module (135) then recognizes the characters in the binarized glyph images it receives, and the result is the recognized text file (137). The final stage merges the per-frame text into paragraphs, using the paragraph cutting module (136).

The second part is Chinese word extraction (110), whose main task is to separate the phrase units of a Chinese character string, because Chinese, unlike English, has no spaces as word boundaries. Here, Chinese word segmentation (111) is a segmentation step; in the present invention, fixed 1-character (112), 2-character (113), and 3-character (114) units are used as the cutting units, so no dictionary needs to be prepared, and the string is segmented directly in these three ways.

The third part of the present invention is this step: the initial retrieval system (120) uses a modified BM-25 as the main weighting scheme to realize a simple retrieval model (121) that extracts the Top-N passages (122) as answer candidates.

Finally, in the answer matching module (140), the present invention first designs a best-match method (141) that lists all possible matching patterns; dynamic programming then selects the matching result (142), finding the best one-to-one pairing between the query string and the passage. For this matched combination, the string is re-cut (143); last, the finely cut strings, combined with the weighting scheme of the present invention, undergo passage weight assignment and detection (144). All results are re-ranked by this weight and presented to the user, together with the corresponding answer video, the time at which the answer appears within the video, and the video text recognition result.

[Brief Description of the Drawings]

[0005] Fig. 1 is a schematic diagram of the system architecture of the present invention;
Fig. 2 is a schematic diagram of the video text recognition module;
Fig. 3 is a schematic diagram of the answer matching module;
Fig. 4 shows the system query interface of the present invention;
Fig. 5 shows the system browsing interface of the present invention;
Fig. 6 shows the video viewing interface of the present invention;
Fig. 7 shows the retrieval view of the web interface of the present invention.

[Description of Main Component Symbols]

[0006] 101~144: system step flow; 201~206: answer-matching algorithm step flow

[0007]
100 user query question or keyword
101 automatic video question-answering system
102 answer video
103 video library
110 Chinese word extraction
111 Chinese word segmentation
112 1-character units
113 2-character units
114 3-character units
120 initial retrieval system
121 simple retrieval model
122 Top-N passages
130 video text recognition system
131 text block detection and cutting
132 non-text block filtering
133 text tracking
134 text block binarization
135 recognition
136 paragraph cutting
137 recognized text
140 answer matching
141 best-match method
142 matching result
143 re-cutting
144 passage weight assignment and detection

[0008]
201 Top-N passages
202 fast string-matching algorithm
203 best matching-position combination algorithm
204 re-cutting algorithm
205 weight assignment algorithm
206 overall passage ranking score computation
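As a worked illustration of the dictionary-free segmentation and initial retrieval described in the embodiment (steps 110-122), the following is a minimal sketch. The patent specifies only a "modified BM-25" without disclosing the variant, so the standard BM25 formula and the parameter values k1 = 1.2, b = 0.75 below are assumptions, not the patent's actual scheme.

```python
import math
from collections import Counter

def char_ngrams(s, n_values=(1, 2, 3)):
    """Dictionary-free segmentation (steps 112-114): every overlapping
    1-, 2-, and 3-character unit of the string. Spaces are skipped."""
    s = s.replace(" ", "")
    return [s[i:i + n] for n in n_values for i in range(len(s) - n + 1)]

def bm25_rank(query, passages, k1=1.2, b=0.75):
    """Rank passages for a query using plain BM25 over character n-grams.
    Returns passage indices, best first. A stand-in for the patent's
    'modified BM-25'; the exact modification is not disclosed."""
    docs = [Counter(char_ngrams(p)) for p in passages]
    avgdl = sum(sum(d.values()) for d in docs) / len(docs)
    N = len(docs)
    scores = []
    for d in docs:
        dl = sum(d.values())
        score = 0.0
        for term in set(char_ngrams(query)):
            df = sum(1 for doc in docs if term in doc)
            if df == 0:
                continue  # term appears in no passage
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            tf = d[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
        scores.append(score)
    return sorted(range(N), key=lambda i: -scores[i])
```

Because the units are overlapping character n-grams rather than dictionary words, an unknown word or a single OCR character error in a passage degrades only the n-grams that touch it, instead of making the whole keyword unfindable — the robustness property the description emphasizes.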
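The answer matching module of the embodiment (steps 140-142) pairs the query string against each candidate passage by dynamic programming. The patent does not disclose its scoring at this level of detail, so the sketch below uses a longest-common-subsequence alignment as a hedged stand-in; it captures the basic idea that characters corrupted by OCR should only cost the match those characters, not the whole query.

```python
def align_score(query, passage):
    """Best in-order, one-to-one character alignment between query and
    passage, by dynamic programming (LCS length). An OCR error in the
    passage loses only the corrupted character's contribution."""
    m, n = len(query), len(passage)
    # dp[i][j] = best alignment of query[:i] against passage[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if query[i - 1] == passage[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1   # match this character
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

For example, if OCR misreads 北 as the similar glyph 匕, the query 台北市長 still aligns three of its four characters against the corrupted passage, so the passage can still be retrieved and ranked (after the re-cutting and weight assignment of steps 143-144, which this sketch does not cover).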