TW201039149A

TW201039149A - Robust algorithms for video text information extraction and question-answer retrieval

Info

Publication number: TW201039149A
Application number: TW98112787A
Authority: TW
Inventors: Yu-Chieh Wu
Original assignee: Yu-Chieh Wu
Priority date: 2009-04-17
Filing date: 2009-04-17
Publication date: 2010-11-01

Abstract

This invention proposes a robust algorithm for video closed caption extraction and retrieval. the algorithm can automatically detect text information, localize text areas, segment text fragments, track text in multiple frames, and recognize the text. by means of setting reasonable video frame size, it is able to identify the text which is in reasonable size and is also above certain threshold of height and width. also the first appearing time stamp of each recognized text will be record for further retrieving. on the basis of video text extraction algorithm and the stored time stamp information, a fixed size passage segmentation (grouping) method is applied to group neighboring sentences. each g consecutive sentences will be merged in order to form a passage. then, a robust retrieval algorithm is designed to search the recognized video ocr transcripts. regardless of the chinese word segmentation, this algorithm tends to give higher weight to the passage that contains dense and longer n-gram matching patterns.

Description

201039149 六、發明說明：【發明所屬之技術領威】 [0001] 本發明係關於/種強健性自動影片文字資訊萃取與問答檢索方法，用以對影片之中萃取其文字資訊，提供輸入介面’讓使用者能以自然語言方式查詢問題，得到問題答案在對應的影片與其播放起始時間資訊，並得以觀看該段落之影片内容。【先前技術】 [0002] 本發明係關於一種強健性自動影片文字資訊萃取與問答檢索方法’實際上此間題奪涉到幾個領域知識如：（1)多媒體處理：影像處理（Image processing)，影像文字定位、擷取（Text L0caiizatiori and Extraction) ，光學字元辨識（Optical character Recognition) ，（2)文件訊息處理：資訊檢索（Inf〇rmati〇n Re_ trieval )中英文件處理（Chinese/Engl ish Text201039149 VI. Description of the Invention: [Technical Leadership of the Invention] [0001] The present invention relates to a robust and automatic film text information extraction and question-and-answer retrieval method for extracting text information from a movie and providing an input interface. Allows the user to query the question in a natural language, get the answer to the question in the corresponding movie and its playback start time information, and be able to view the video content of the paragraph. [Previous Technology] [0002] The present invention relates to a robust automatic film text information extraction and question-and-answer retrieval method. In fact, this topic involves several areas of knowledge such as: (1) multimedia processing: image processing (Image processing), Text L0caiizatiori and Extraction, Optical Character Recognition, (2) Document Message Processing: Information Retrieval (Inf〇rmati〇n Re_ trieval) Chinese and English Document Processing (Chinese/Engl ish Text

Processing)等項目。就整體架構設計精神而論，是過去國内外技術所能作到的技術能力皆具一定水平，但尚未正&為系統!·生架構而且也未針對自然語言輸入之問題解決，大多是以關鍵字檢索為主之技術W本發明之作者之過去發表著作外），但對於此項整合性發明與其 ^ ^自動問答次算法之設計是未曾見過。以下是以國際觀點來談論過去的相關技術與研究報告。 f年來對於元片之中的内容檢索的相關學術研究已陸續提出例如使用影片先學文字識別（V ideoOCR)、語音識別技術，以支持影片上之文字檢索。早期，國内林川傑博士等人）曾、林 098112787 表單編號A0101 簡早的同彙加權方法應用在影片第4頁/共19頁 0982〇21335-〇 201039149 ❹ 檢索上。他們只對影片中之”白色”字幕進行辨識，並且手動建立關鍵字列表的詞彙以提高加權值。在2003年，國外學者提出了一個相當複雜且只針對英文新聞影片的自動問答系統，名為VideoQA。在他們的系統中集合了非常龐大外部資源，如WordNet關連詞典、淺層句法解析器（shallow syntactic parser)，專有名詞辨識器 (Named entity tagger)、網際網路上之電子新聞資料、人為規則、與自建的Ontology本體系統等。如同前文作法，他們僅是以關鍵詞權重方法來找出最有可能出現之答案。通常，將該系統移植到另一種語言或領域，這些主要所使用之外部資源都得重新準備，而且不是每個領域或語系都有像英文那麼多可用之外部資源。所以對移植中文而言這將是一項非常難鉅的任務。而且中文仍有斷（分）詞問題尚未解決。實際上中文斷詞結果往往會影響所擷取出之關鍵字品質，一旦未知詞或斷錯詞發生時，關鍵字將永遠找不到。最重要的一點是，過去的 ο 研究發現，是不去解決中文過去五年内，本發明人曾設計了全世界第一個跨語言（英文到中文）影片字幕自動問答系統。其主要解決之問題為特定人事時地物的專有名詞上之答案回覆，透過該系統，使用者之能以英文問題去檢索出中文影片。而其背後所使用之技術為自行發展之段落檢索與答案對應演算法。該研究是以一個機器自動翻譯之機制，將所有中文字幕轉換為英文，接著再以英文之專有名詞辨識器，去指出可能答案。明顯地，這方法並不適用於純中文語系下的中文對中文查詢，而且其發展之段落檢索法仍是 098112787 表單編號A0101 第5頁/共19頁 0982021335-0 201039149 以詞為主，所以對於中文而言，斷詞問題仍是須克服之第一要素。雖然這些已在學術上發表過的刊物與本發明之目標十分類似，但本質上仍有著極大差異。其中最重要的幾個地方是，過去的研究是將影片視為廣播新聞，忽略其在畫面上出現之字幕訊息。另外雖然有些影片是有字幕檔，不須額外抽取這些資訊，不過對現在絕大部分的影片或在電視上放送之Video而言，是十分不可行。尤其是在台灣地區的影片或新聞内容，大多是以字幕來傳達影片訊息，所以抽取這些影片字幕或場景中之文字，是非常符合現在影片所要表達的意念。除此之外，過去的文獻所著眼的是如何移轉現有之技術，並且採用一般現有的文件檢索方式來搜尋使用者所須之影片資訊。然而，文件檢索之技術雖然十分成熟，且廣為應用在教育研究、與產經業界，其原先之設計方式就是針對人為所打出的文件或是幾乎沒有錯字的文章進行檢索。很明顯地，直接使用文件檢索之技術只能看出一般現有技術對這些辨識出之文字之檢閱率如何，單純地不作任何調整就應用在此領域之中。有別於以往傳統文件，影片之中所辨識出的文字（不論是由字幕抽取或聲音中抽離），往往格式與一般人類打好的文件不同；更甚者，許多錯字是無法避免的，因為現在的光學字元辨識在良好環境下僅達90%準確度，而語音辨識則只能在相同人說話時到達70%之效果。在這樣的文件下直接使用傳統文件檢索之方法顯然是會有問題的，因為有許多的詞是找不到或辨識錯誤。而且對中文來說，由文章之中 098112787 表單編號A0101 第6頁/共19頁 0982021335-0 201039149 分出詞彙也是一個極大的挑戰》【發明内容】 [0003] Ο 〇 098112787 本發明之目的乃是在於提出一個新穎的強健性自動影片内容問答系統，以本發明所提出的文字資訊萃取技術，將新聞、廣告、連續劇等多類型影片中的字幕資訊辨識出’並留下其相對應出現所在時間資訊，再以本發明所提出的自動影片問答技術，解析查詢問題或關鍵字，最後將能回答問題或相關之影片回應給使用者。本發明之目的也在於能提出一個自動影片内容問答系統 ’以供使用者以關鍵字或自然語言方式查詢問題，得到問題答案在對應的影片與其播放起始時間:資訊，並得以觀看該段落之影片内容。广：… '丨 '' '觀屬本發明之目的也在於能提出一個自動影片内容問答系統 ’以供使用者以關鍵字或自然語言方式查詢問題，並能直接檢索其影片字幕辨識之結果内容。本發明之目的也在於能提出一個自動影片内容問答系統 ’能讓使用者觀看其與查詢相關之影片段落，與精確的播放起始時間，並能直接對該影片作快轉、倒帶、暫停、停止、播放等操作動作。本發明之目的也在於能提出一個自動影片内容問答系統 ’以供使用者以關鍵字或自然語言方式查詢問題，並能直接檢索其影片字幕辨識之結果内容。本發明之目的也在於能提出一個自動影片内容問答系統 ’整合影片文字字幕辨識技術（本發明所發展）、檢索使用行為等分析’使用者能以常見使用搜尋引擎的行為方式，對影片字幕或文字進行檢索瀏覽。表單編號A_1 第7頁/共19頁 0982021335-0 201039149 【實施方式】 [〇〇〇4]請參考第一圖，該圖為本發明之整體運作圖以及操作流程圖。第一圖包含本發明中幾個重要核心模組、與使用者輸入系統輪出概念圖。本發明實施例之操作流程將說明如下。以整體系統觀點而言，（100~103)是與使用者相關之模組，（100)為使用者所查詢問題或關鍵字輸入至系統，而（101)代表的是整個影片自動問題系統之代碼； (10 2)是系統自動判別出所要回應給使用者的答案影片，至於（103)影片庫則是事先準備好的影片來源資料庫。首先先就影片庫之文字辨識來說明本發明之實施方法。（130)代表的是影片文字辨識系統之整體系統。當要辨識（103)影片庫内容時，會先將影片中每11張1?1^1^的影像抽出，再對該影像進行（131)文字區塊偵測與切割，將所有文字區塊偵測並精確切割出其所在影像位置。接著將每一個區塊都經由本發明所發展孓（132)過濾非文字區塊方法’將不是文字的區塊再行一次過滤。由於影格是連續的’（133)文字追蹤模組負責追蹤每—文字出現在多少張Frames ’並將這些出現的同一文字内容予以合併（134)文子區塊二值化則是對這彩色字圖作抽取文字顏色之動作，將文字顏色轉為黑色，背景轉為白色。而（135)辨識肋將所輸人之二值化後的字圖，作文字之辨認’所識別出之結果為（137)韻文字檔1最後一階段則是合併Fr繼之文字為段落，所採用的是（ι 切割模組。 098112787 第二部分則是⑴0)中文字詞擷取，其主要是將中文表單編號 A0101 H R "S' / α 1 a 0982021335-0 201039149 Ο 字串的詞組部分分離’因為中文不像英文，有空白作為詞之邊界。當令（111)中文詞切分是一個斷詞方法，再本發明之中，是以固定的（112) 1字詞（113) 2字詞（114) 3字詞為切割單位，不須準備字典，而直接對字串作這三種方式的斷詞。本發明於第三部分是在這步驟中（120)初始檢索系統，以變形化ΒΜ-25為主要權重配分方式，達成一個 (121)簡易檢索模型以取出（122) ΤορΝ個段落，作為答案候選者。最後在（140)答案比對模組上’本發明先設計出—個 (141)最佳匹配法，列出所有可能之比對模式，接著再〇以動態規劃法（142)匹配結果，找出最佳一對一的查詢字串與文章段落配對組合，接下來對此匹配組合，作（143) 再切割字串，最後則是將所細切出之字串，配合本發明之權重配置計分法（144)段落權重配分與檢出，將所有結果依此權重重新排名，並呈現給予使用者除了相對應之答案影片 '出現於影片内之時間資訊、以及影片文字辨識結果。 [0005] 【圖式簡單說明】第一圖為本發明之系統架構示意圖第二圖為影片文字辨識模組之示意圖第二圖為答案比對模組之示意圖第四圖為本發明之系統查詢介面圖第五圖為本發明之系統瀏覽介面圖第六圖為本發明之影片觀看介面圖 098112787 第七圖為本發明網路介面之檢余圖表單編號Α0101 第9頁/共19頁 0982021335-0 201039149 【主要元件符號說明】 [0006] 1 0 1〜1 44 :系統步驟流程 2 01〜2 0 6 :答案比對演算法步驟流程 [0007] 100 使用者查詢問題或關鍵字 101 影片自動問題系統 102 答案影片 103 影片庫 110 中文字詞擷取 111 中文詞切分 112 1字詞 113 2字詞 114 3字詞 120 初始檢索系統 121 簡易檢索模型 122 ΤορΝ個段落 130 影片文字辨識系統 131 文字區塊偵測與切割 132 過濾非文字區塊 133 文字追蹤 134 文字區塊二值化 135 辨識 136 段落切割 137 辨識文字 140 答案比對 141 最佳匹配法表單編號A0101 第10頁/共19頁 098112787 0982021335-0 201039149 142匹配結果 143再切割 144段落權重配分與檢出 [0008] 201 ΤορΝ個段落 202快速匹配字串演算法 203尋找最佳匹配位置組合演算法 204重切割演算法 205權重配置演算法 Ο 206計算整體段落排名分數 Ο 098112787 表單編號Α0101 第11頁/共19頁 0982021335-0Processing) and other projects. As far as the overall architectural design spirit is concerned, in the past, the technical capabilities that can be achieved by domestic and foreign technologies have a certain level, but they have not been positively developed for the system and have not been solved for natural language input. The keyword-based technology is the author of the author of the present invention, but the design of this integrated invention and its automatic question-and-answer algorithm has never been seen. The following is an international perspective on past related technologies and research reports. In the past years, relevant academic research on content retrieval in meta-films has been proposed, for example, using video vocabulary recognition (V ideoOCR) and speech recognition technology to support text retrieval on video. Early, domestic Dr. Lin Chuanjie et al.) Zeng, Lin 098112787 Form No. A0101 The simple same weighting method applied in the film Page 4 of 19 0982〇21335-〇 201039149 检索 Search. They only identify the "white" subtitles in the movie and manually create the vocabulary of the keyword list to increase the weighting. In 2003, foreign scholars proposed a fairly complex automated question answering system for English news videos called VideoQA. They have a huge collection of external resources in their systems, such as the WordNet relational dictionary, the shallow syntactic parser, the Named entity tagger, the electronic news materials on the Internet, artificial rules, And the self-built Ontology ontology system. As in the previous approach, they only use the keyword weighting method to find the most likely answer. Often, the system is ported to another language or domain, and the external resources used primarily are re-prepared, and not every domain or language has as many external resources as English. So this will be a very difficult task for transplanting Chinese. Moreover, the problem of broken (sub)words in Chinese has not yet been resolved. In fact, Chinese word-breaking results often affect the quality of the keywords that are extracted. Once an unknown word or a broken word occurs, the keyword will never be found. The most important point is that the past ο research found that it is not to solve Chinese. In the past five years, the inventor has designed the world's first cross-language (English to Chinese) film subtitle automatic question answering system. The main problem to be solved is the answer to the answer on the proper noun of the specific person's time. Through the system, the user can retrieve the Chinese film in English. The technology behind it is a self-developed paragraph retrieval and answer algorithm. The study uses a machine-automatic translation mechanism to convert all Chinese subtitles into English, followed by a proper noun recognizer in English to indicate possible answers. Obviously, this method is not suitable for Chinese-Chinese query in pure Chinese language, and the development of the paragraph search method is still 098112787 Form No. A0101 Page 5 / 19 pages 0982021335-0 201039149 Based on words, so for In Chinese, the problem of word breaking is still the first element to be overcome. Although these academically published publications are very similar to the objectives of the present invention, there are still significant differences in nature. The most important of these is that the past study used the film as a broadcast news, ignoring the subtitles that appeared on the screen. In addition, although some videos have subtitle files, there is no need to extract this information. However, it is very infeasible for most of the current videos or videos that are broadcast on TV. In particular, the film or news content in Taiwan is mostly based on subtitles to convey the video message. Therefore, the subtitles or texts in the scenes are very consistent with the ideas expressed in the current film. In addition, the literature in the past focuses on how to transfer existing technologies and uses the usual existing document retrieval methods to search for video information required by users. However, although the technology of document retrieval is very mature and widely used in educational research and the industry of origin, its original design method is to search for documents written by humans or articles with few typos. Obviously, the technique of directly using document retrieval can only see how the review rate of these recognized texts in the prior art is applied in this field without any adjustment. Different from the traditional documents in the past, the words recognized in the film (whether it is extracted by subtitles or extracted from the sound) are often different in format from the files that are generally played by human beings; even more, many typos are unavoidable. Because the current optical character recognition is only 90% accurate in a good environment, speech recognition can only reach 70% of the effect when the same person speaks. The direct use of traditional file retrieval methods under such documents is clearly problematic because there are many words that cannot be found or identified. Moreover, for Chinese, the vocabulary of 098112787 Form No. A0101 Page 6/19 pages 0982021335-0 201039149 is also a great challenge. [Invention] [0003] Ο 〇 098112787 The purpose of the present invention is It is to propose a novel robust automatic video content question answering system. With the text information extraction technology proposed by the present invention, the subtitle information in multiple types of movies such as news, advertisements, serials and the like can be identified as 'and the corresponding time is left. Information, and then use the automatic film question and answer technology proposed by the present invention to analyze query questions or keywords, and finally respond to questions or related videos to the user. The object of the present invention is also to provide an automatic video content question answering system 'for the user to query the question by keyword or natural language, to get the answer to the question in the corresponding movie and its playing start time: information, and to view the paragraph Video content.广:... '丨'' 'View' The purpose of the present invention is also to provide an automatic video content question answering system' for users to query questions in a keyword or natural language, and to directly retrieve the results of their movie caption recognition. . The object of the present invention is also to provide an automatic video content question answering system that allows the user to view the video passages related to the query, and the precise playback start time, and can directly fast, rewind, pause the movie. , stop, play, etc. It is also an object of the present invention to provide an automatic video content question answering system ’ for users to query questions in a keyword or natural language manner and to directly retrieve the result content of their movie caption recognition. The object of the present invention is also to provide an automatic video content question answering system 'integrated film text subtitle identification technology (developed by the present invention), search usage behavior and the like analysis 'users can use the common search engine behavior, subtitles or The text is searched and browsed. Form No. A_1 Page 7 of 19 0982021335-0 201039149 [Embodiment] [〇〇〇4] Please refer to the first figure, which is an overall operation diagram and operation flow chart of the present invention. The first figure contains several important core modules of the present invention, and a conceptual diagram of the user input system. The operational flow of the embodiment of the present invention will be explained as follows. From the perspective of the overall system, (100~103) is a module related to the user, (100) is input to the system for the question or keyword queryed by the user, and (101) represents the entire automatic problem system of the movie. Code; (10 2) is the system automatically determines the answer to the user's answer to the video, as for (103) the film library is a prepared source database. First, the implementation of the present invention will be described with respect to the character recognition of the video library. (130) represents the overall system of the film text recognition system. When you want to identify (103) the contents of the video library, you will first extract every 11 images of 1?1^1^ in the movie, and then perform (131) text block detection and cutting on the image, and block all the text blocks. Detect and accurately cut the position of the image where it is located. Each of the blocks is then filtered by the non-text block method of the development of the present invention (132). Since the frame is continuous '(133) the text tracking module is responsible for tracking how many frames each text appears in and combining the same text content that appears (134) the binning of the text block is the color word map The action of extracting the color of the text turns the text color to black and the background to white. And (135) the identification rib will be the binarized word map of the input person, for the recognition of the text 'the result identified is (137) rhyme text file 1 the last stage is the combination of Fr followed by the text as a paragraph, The word used is (ι cutting module. 098112787, the second part is (1) 0) Chinese character word retrieval, which is mainly the Chinese form number A0101 HR "S' / α 1 a 0982021335-0 201039149 Ο string phrase Partial separation 'Because Chinese is not like English, there is a blank as the boundary of the word. When the (111) Chinese word segmentation is a word-breaking method, in the present invention, the fixed (112) 1 word (113) 2 word (114) 3 words are used as the cutting unit, and the dictionary is not required to be prepared. And directly make a three-word break word for the string. In the third part of the present invention, in the step (120), the initial retrieval system uses the deformed ΒΜ-25 as the main weighting division method to achieve a (121) simple retrieval model to extract (122) ΤορΝ paragraphs as candidates for the answer. By. Finally, on the (140) answer comparison module, the present invention first designs a (141) best matching method, lists all possible comparison modes, and then uses the dynamic programming method (142) to match the results. The best one-to-one query string is paired with the article paragraph, and then the matching combination is made (143) and then the string is cut, and finally the finely cut string is matched with the weight allocation of the present invention. The scoring method (144) paragraph weights are divided and detected, and all results are re-ranked according to this weight, and the time information given to the user in addition to the corresponding answer film 'appears in the video, and the video text recognition result are presented. BRIEF DESCRIPTION OF THE DRAWINGS [Fig. 1 is a schematic diagram of a system architecture of the present invention. The second diagram is a schematic diagram of a video text recognition module. The second diagram is a schematic diagram of an answer comparison module. The fourth diagram is a system query of the present invention. The fifth diagram of the interface is the system browsing interface diagram of the present invention. The sixth diagram is the video viewing interface diagram of the present invention. 098112787. The seventh diagram is the remnant diagram of the network interface of the present invention. Form number Α0101, page 9/19 pages 0992021335- 0 201039149 [Main component symbol description] [0006] 1 0 1~1 44 : System step flow 2 01~2 0 6 : Answer comparison algorithm step flow [0007] 100 User query question or keyword 101 Video auto problem System 102 Answer Video 103 Video Library 110 Chinese Words 111 Chinese Words Segment 112 1 Words 113 2 Words 114 3 Words 120 Initial Search System 121 Simple Search Model 122 ΤορΝ Paragraph 130 Video Character Recognition System 131 Text Area Block Detection and Cutting 132 Filtering Non-Text Blocks 133 Text Tracking 134 Text Block Binarization 135 Identification 136 Section Cutting 137 Identification Text 140 Answer Alignment 141 Best Match Method Form No. A0101 Page 10 / Total 19 Page 098112787 0982021335-0 201039149 142 Match Result 143 Re-cut 144 Paragraph Weight Assignment and Detection [0008] 201 ΤορΝParagraph 202 Quick Match String Algorithm 203 Search Best matching position combination algorithm 204 re-cutting algorithm 205 weighting algorithm Ο 206 Calculating the overall paragraph ranking score Ο 098112787 Form number Α 0101 Page 11 / Total 19 page 0982021335-0

Claims

201039149 VII. Patent application scope: 1. An automatic film text information extraction and retrieval system, which is a process of extracting the text appearing in the film and searching it, including: 2. The film text recognition module, including: a text block detection Measure and cut: detect and cut the range of the text block into a unit. Filter the non-text block: For the unit that is cut but not the text block, make another discriminant and filter a text track: the same text The blocks are successively combined on multiple images and tracked. Binding of a text block: color segmentation of the text block, so that the text color and the background color are completely different. A character recognition: the final character recognition of the binarized text area is performed by a segment cutting module: each All recognized words in the frame are treated as one sentence, each K sentence is a unit, the previous L sentence is repeated, multiple sentences are gathered into the segment, and the time of occurrence is recorded. 3. The Chinese character word capture module, including: one Chinese Word segmentation module: will query questions or strings, according to three units of 1 word, 2 words, 3 words, etc. 4. An initial paragraph retrieval module, including: a simple retrieval model: establish a multi-word library, It includes a 1-word index library, a 2-word index library, a 3-word index library, etc., and the input question and the recognized character string are decomposed into 1 word to match the N candidate paragraphs. 5. The answer comparison module , including: an optimal string matching algorithm: according to matching N candidate paragraphs, find the best multiple words and the most dense words of each 098112787 form number A0101 page 12 / 19 pages 0982021335-0 201039149 candidate paragraphs Match result 〇 repeatedly cutting the module: according to the matching result, and then according to the 1 word index library, 2 word index library, 3 word index library, the word string is further cut into a paragraph weight distribution and detection module: the matching paragraph According to the corresponding 1 word, 2 word, 3 word weight, give the final weight value of the whole paragraph, and rank the paragraph according to the score, and search. 6. 2. An automatic film text information extraction and retrieval system as described in item 1 of the patent application, wherein the text block detection and cutting is performed by using the overall Otsu method after the extracted image is detected by the Canny edge. After filtering the non-obvious edge points, the local non-obvious edge points are filtered by the local (20xl6) Entropy method; then the first time the text is cut by the horizontal projection method, the vertical projection method allows the vertical text areas of 7 discrete pixels. Block cutting; then repeat the above steps once to cut out all possible text blocks. 3. An automatic film text information extraction and retrieval system as described in claim 1, wherein the non-text block is filtered, and the non-text block is removed by the following rule. Height < 9 pixels, the level is zero < 9 horizontal peaks < 7 times to obtain all text blocks. 4. An automatic film text information extraction and retrieval system according to item 1 of the patent application scope, wherein the text tracking is performed by using a vertical edge point projection amount as a vector, tolerating each of the left and right lattice errors, and performing tracking ratios of multiple frames. Yes, if the two vector Euclidean distance is less than a certain threshold, it is regarded as the same text, and the record is saved; and the merge system will have all the pixels in the text block in multiple sheets 098112787 Form No. A0101 Page 13 of 19 Page 0982021335-0 201039149 frames When the probability of occurrence must be higher than a certain percentage, it is retained or treated as a background and deleted. 5. An automatic film text information extraction and retrieval system as described in claim 1, wherein the text block is binarized and is a two-stage process: The first stage first finds the overall threshold by the following formula: Threshold = 0.75 *0tsu+0.25^Entropy and use this threshold to binarize the text block. The second stage is to extend a certain proportion from the outer corner of the text block, such as 5pixels' and connect the skin from the outside to the inside. The connected pixels are removed. 6. If you apply for a patent scope! The automatic film text information extraction and retrieval system described in the item, wherein the character recognition is a multi-stage method, including text cutting, size normalization, characteristic spike, word group _, word recognition, and identification result correction. As shown in Figure 4. 7. An automatic film text information extraction and retrieval system as described in item i of the patent application scope, wherein the paragraph cutting is a frame-based sentence and records the appearance time of the sentence, and then merges successively {{sentences, and The former sentence is the last sentence of the previous paragraph. 8. An automatic film text information extraction and retrieval system as described in item i of the patent application scope, wherein the Chinese character word retrieval is based on multiple N-grams, and according to the work words, Three units, such as 2 words and 3 words, cut the input string. In addition, if English and numbers are included, they are not taken in this way, but are cut in units of words. 098112787 9. An automatic film text information extraction and retrieval system as described in Patent Application No. β, wherein the simple retrieval model is based on a word, and then the 司ορΝ is first searched by the weight of βΜ 25 Related paragraphs. Form No. 1010101 Page 14/19 pages 0992021335-0 201039149 10. An automatic film text information extraction and retrieval system as described in claim 1 of the patent application, wherein the best string matching algorithm is based on matching N Write candidate paragraphs to find the best multi-word and most dense word matching results for each candidate paragraph. The matching method is based on dynamic programming to find a best one-to-one pairing group instead of the traditional computing word frequency. 11. An automatic film text information extraction and retrieval system according to item 1 of the patent application scope, wherein the repeated cutting module is based on the matching result, and further according to the 1 word index database, the 2 word index database, and the 3 word words. The index library will slash the paragraph and then cut its string into the result of a maximum of 3 words. 〇12. An automatic film text information extraction and retrieval system as described in item 1 of the patent application scope, wherein the paragraph weight distribution and the detection module, the matched paragraphs, according to the corresponding 1 word, 2 The weight of words and 3 words is given to the final weight value of the whole paragraph, and the paragraphs are ranked according to the score and retrieved. The weights are calculated as follows: Passage _ Score(P) = max{AX QW _ Density(0, P,) + (I - λ) X QW _ Weight(0, P,), 〇λ X QW _ Densily(C? , P2) + (I - λ) X QW _ Weight(0, P2)} 0982021335-0 098112787 Form No. A0101 Page 15 of 19