TWI288335B - Method to automatically summarize Chinese digital documents

Info

Publication number: TWI288335B
Application number: TW94105191A
Authority: TW
Inventors: Yuen-Hsien Tseng
Original assignee: Webgenie Information Ltd; Yuen-Hsien Tseng
Priority date: 2005-02-22
Filing date: 2005-02-22
Publication date: 2007-10-11
Also published as: TW200630825A

Abstract

A method for automatically summarizing Chinese digital documents is provided. It evaluates the importance of each sentence in the digital document, measures the similarity between the document title and each sentence, and combines the title with a shortened sentence to form a summary candidate whose character count is below an assigned limit. It then sorts the summary candidates by their character-count ratio and similarity and presents the result for the user to select. The invention simultaneously considers the length after summarization, the coverage of important content, readability, and coherence of meaning. It effectively represents the key points of the original document within a limited number of characters and greatly reduces human labor.

Description

IX. Description of the Invention

[Technical Field]

The present invention relates to a method for summarizing documents, and in particular to an automatic summarization method for Chinese digital documents.

[Prior Art]

With the widespread use of digital technology, the volume of digital documents has grown enormously in recent years. A single news website alone produces a large number of news items every day, and once instant news updated every few tens of minutes is added, the volume easily causes information overload for an individual. Research into, and use of, automated processing of digital documents has therefore become indispensable.

News websites give people a fast channel for obtaining information, and mobile news, carried everywhere, offers even more immediate convenience. Unlike web news, mobile news cannot carry the full text of a story, so secondary and repeated content must be trimmed further. A typical mobile short message is limited to 69 full-width characters (with a separate limit for half-width characters), and some phones limit it to 45 full-width characters. Figure 1 shows an example of instant news; besides its title, it has only two sentences.

The conventional method first segments the news text into words and tags each word with its part of speech (verb, noun, conjunction, and so on), then cuts long sentences into clauses (Meaningful Units, MU) using hand-compiled rules. For example, for "fragments" separated by commas: if a fragment begins with a word such as 「而且」 or 「且」 ("and"), it is merged into the preceding fragment; if it begins with a verb, its subject must lie in the preceding fragment, so it is likewise merged into the preceding fragment.

Take the second sentence of the example. By this method, the last fragment, 「且如願吃到頂級的佐賀牛肉壽喜燒」 ("and got to eat top-grade Saga beef sukiyaki as hoped"), must be joined to the fragment before it. But that fragment, 「還住在一晚高達6萬日幣的飯店裡」 ("also stayed in a hotel costing up to 60,000 yen a night"), cannot itself begin a clause and must be joined further back. The fragment before that, 「專訪職棒明星王貞治」 ("interviewed professional baseball star Sadaharu Oh"), begins with the verb 「專訪」 ("interview"), so the joining continues back to 「除了能親眼目睹日本職棒」 ("besides seeing Japanese professional baseball in person"). Proceeding this way yields three clauses: 『黃子佼為訪問王貞治前往福岡巨蛋欣賞日本職棒比賽』 (23 characters), 『雖然2天行程緊，但佼佼此行可說是「頂級豪華之旅」』 (24 characters), and 『除了能親眼目睹日本職棒，專訪職棒明星王貞治，還住在一晚高達6萬日幣的飯店裡，且如願吃到頂級的佐賀牛肉壽喜燒』 (53 characters).

After the document is split into clauses, a summary of the predetermined length must be assembled. Because the predetermined length is 69 or 45 characters, it can hold only one or two clauses. To convey the gist of the document and give the extracted summary a finishing touch, the title can be chosen as one of the clauses, with the remaining characters filled by other clauses. The remaining clause should be the longest one that does not exceed the total length, so as to match the given limit as closely as possible. In the example, combining the last clause with the title in the original order gives a 67-character summary: 『佼佼訪王貞治，豪華日本行，除了能親眼目睹日本職棒，專訪職棒明星王貞治，還住在一晚高達6萬日幣的飯店裡，且如願吃到頂級的佐賀牛肉壽喜燒。』 Being close to 69 characters, it can be called the best summary. For a 45-character summary, the title can be combined with the second-longest clause to give the 37-character 『佼佼訪王貞治，豪華日本行，雖然2天行程緊，但佼佼此行可說是「頂級豪華之旅」』, which is better than the 35-character summary obtained with the other short clause, 『佼佼訪王貞治，豪華日本行，…為訪問王貞治前往福岡巨蛋欣賞日本職棒比賽』 (because its length is closer to 45 characters and its content is more complementary).

The approach above can be summarized as follows: segment the document into words, tag parts of speech, and split it into clauses by rule; then select the clause that, merged with the title, gives the greatest length not exceeding the predetermined length, and join it to the title as the extracted summary. This approach, however, raises several problems.

First, correct word segmentation, part-of-speech tagging, and clause splitting are not easy. Unknown words and unforeseen constructions abound, and without complete prior analysis the resulting clauses cannot be guaranteed to be semantically complete.

Second, the selected clause is not optimal. For the 45-character summary in the example above, the title and the last two fragments of the document make a better 45-character summary: 『佼佼訪王貞治，豪華日本行，還住在一晚高達6萬日幣的飯店裡，且如願吃到頂級的佐賀牛肉壽喜燒。』 (because its length is closer to 45 characters and its content is more complementary).

Third, long documents are hard to handle. The longer the document while the summary length stays fixed and this short, the harder summarization becomes, and the more the summaries written by different people diverge.

[Summary of the Invention]

The object of the present invention is to provide an automatic summarization method for digital documents that, based on the document's title, finds sentences in the document similar to the title, combines each of them with the title into a summary candidate not exceeding a specified number of characters, and offers the candidates for selection after sorting.

The invention proposes an automatic summarization method for Chinese digital documents, where the digital document contains multiple sentences. The method first evaluates the importance of each sentence in the digital document and takes the n most important sentences as candidate sentences. Then, in each candidate sentence, it finds an appropriate joining point at which to combine the sentence with the title, producing summary candidates. Finally, it sorts the summary candidates by their character-count ratio and similarity and outputs them in descending order.

In the automatic summarization method of the preferred embodiment, the importance of each sentence is evaluated as follows. First, the maximal repeated strings in the document are found, that is, the repeated strings that are longest or occur most frequently. Next, the document is segmented into words using these maximal repeated strings and a lexicon, stop words are filtered out, and the remaining repeated words (those occurring more than once) are taken as the document's keywords. Finally, the importance of each keyword is computed, the sentences are ranked in descending order by the number and importance of the keywords they contain, and the top n sentences are taken as candidate sentences.

In the preferred embodiment, summary candidates are generated as follows. First, dynamic programming is used to compare the edit distance between the title and each candidate sentence, finding the sentence segment with the smallest edit distance (the segment most similar to the title) and the punctuation position. The title then replaces that segment, and the punctuation position serves as the joining point at which the title is combined with the remainder of the candidate sentence, yielding a summary candidate. If the joined summary candidate exceeds a preset limit on summary length, the joined clauses are shortened, and the title is connected to the most clauses that do not exceed the limit. If the joined summary candidate is shorter than the limit, the method tries to lengthen the joined clauses, again connecting the title to the most clauses that do not exceed the limit.
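The length adjustment described above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: it assumes the candidate sentence has already been split at punctuation into clauses and that the joining point moves only across clause boundaries, so the kept text is always a trailing run of clauses. The names `fit_to_limit`, `title`, and `clauses` are hypothetical; the sample text is the news example quoted in the prior-art discussion.

```python
def fit_to_limit(title: str, clauses: list[str], limit: int) -> str:
    """Join the title with as many trailing clauses as fit within `limit` characters.

    Assumption (not from the source): the joining point only moves across
    clause boundaries, so the kept text is always a suffix of `clauses`.
    """
    # Try the longest suffix first, then drop one leading clause at a time.
    for start in range(len(clauses)):
        candidate = title + "".join(clauses[start:])
        if len(candidate) <= limit:
            return candidate
    # Nothing fits: fall back to the title alone.
    return title


title = "佼佼訪王貞治，豪華日本行，"
clauses = [
    "除了能親眼目睹日本職棒，",
    "專訪職棒明星王貞治，",
    "還住在一晚高達6萬日幣的飯店裡，",
    "且如願吃到頂級的佐賀牛肉壽喜燒。",
]

print(len(fit_to_limit(title, clauses, 69)))  # 67
print(len(fit_to_limit(title, clauses, 45)))  # 45
```

With a 69-character limit the whole sentence fits (67 characters); with a 45-character limit only the last two clauses are kept, which reproduces the 45-character summary discussed in the prior-art section.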

In the automatic summarization method of the preferred embodiment, the summary candidates are sorted as follows. The edit distance of each summary candidate is first converted into a similarity. If the body of the original document exceeds a specific number of sentences, the similarity of each sentence after the second is multiplied by a specific coefficient to give that sentence's final similarity. The length of the combined text is then divided by the specified summary length, converting it into a character-count ratio between 0 and 1. Finally, a preset rule determines the final order of the summary candidates.

In the preferred embodiment, the preset rule has three steps. First, the first summary candidate (the one with the highest similarity) is compared with the second summary candidate (the one with the highest character-count ratio). If the first summary candidate is the second summary candidate, it is output and deleted from the set of summary candidates. Second, if the similarity of the first summary candidate is greater than a specific multiple of the similarity of the second, and the character-count ratio of the first is greater than another specific multiple of the character-count ratio of the second, the first summary candidate is output; if not, the second summary candidate is output. The output candidate is deleted from the set of summary candidates. Third, the first two steps are repeated until no summary candidate remains in the set.

In the preferred embodiment, ways of determining the final order of the summary candidates include, for example, preset rules and rules found by machine learning.

In the preferred embodiment, the title may be given in advance. If it is not, the keywords of the digital document are extracted by the automatic key-feature extraction method for digital documents, several keywords are then taken according to their importance, and finally they are combined in the order in which they appear in the digital document to produce the title.

In summary, the automatic summarization method of the invention applies to general Chinese digital documents and is particularly suited to automatic summarization for mobile news short messages. Its distinguishing feature is that the summaries it produces account simultaneously for the length after summarization, coverage of key content, readability, and coherence of meaning. They effectively express the key content of the original document in the specified number of characters and greatly reduce human labor.

To make the above and other objects, features, and advantages of the invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

[Embodiments]

Figure 2 is a flowchart of the automatic summarization method for Chinese digital documents according to a preferred embodiment of the invention. In this embodiment, a Chinese digital document is first obtained, and it is decided as appropriate whether to limit the summary length (step S210). For example, if the summary is to be used as a mobile news short message, its length may be limited to the usual 45 or 69 characters. Next, it is determined whether the Chinese digital document includes a title (step S220); if not, a title is derived from the digital document (step S230). One way to derive the title is to extract the document's keywords with the previously invented automatic key-feature extraction method for digital documents (Republic of China Patent Publication No. 153789), take the top 3 to 6 keywords by importance, and combine them in the order in which they appear in the document. This particular way of obtaining keywords is only an example and is not limiting; those skilled in the art may choose a different keyword selection method.

Once the title is available, the importance of each sentence in the Chinese digital document is evaluated, the sentences are ranked from most to least important, and the n most important sentences are taken as candidate sentences (step S240). An appropriate joining point is found in each candidate sentence and combined with the title to form a summary candidate (step S250). Finally, the character-count ratio and similarity of each summary candidate are computed and used to sort the candidates, which are output in descending order together with their character-count ratio and similarity so that the user can choose among them (step S260).

Figure 3 is a flowchart of candidate sentence generation according to a preferred embodiment of the invention. First, the maximal repeated strings in the Chinese digital document are found with the previously invented automatic key-feature extraction method for digital documents (step S310). A document's topical vocabulary recurs, but not every repeated string is a useful keyword. Only the longest or most frequent repeated strings are therefore called maximal repeated strings, and the maximal repeated strings that survive stop-word filtering are treated as keywords of the document. Because the title is quite important in news, the title is segmented into words with a 120,000-word lexicon, and after stop words are filtered out its words are also treated as keywords of the document. Keywords are thus obtained from two sources, the title and the body (step S320). The importance of these keywords is then computed with the importance formula of this embodiment, and after sorting from largest to smallest, the top n sentences are taken for subsequent processing (step S330). In that formula, tf_w is the frequency of keyword w in the digital document, max_tf is the frequency of the keyword that occurs most often in the document, and n may be regarded as the number of candidate sentences specified by the user.
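The sentence-ranking step (S330) can be sketched as below. This is a hedged illustration, not the inventor's code. It assumes that each keyword w is weighted 0.5 + 0.5 × tf_w / max_tf, that a sentence's score is the sum of the weights of the distinct keywords it contains, and that keywords are matched by plain substring search rather than by the lexicon-based segmentation the text describes. The function names and the sample sentences are hypothetical.

```python
def keyword_weights(text: str, keywords: list[str]) -> dict[str, float]:
    """Weight each keyword found in `text` by 0.5 + 0.5 * tf_w / max_tf."""
    tf = {w: text.count(w) for w in keywords if w in text}
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {w: 0.5 + 0.5 * count / max_tf for w, count in tf.items()}


def top_sentences(sentences: list[str], keywords: list[str], n: int) -> list[str]:
    """Return the n highest-scoring sentences (ties keep document order)."""
    weights = keyword_weights("".join(sentences), keywords)

    def score(sentence: str) -> float:
        # Assumption: each distinct keyword counts once per sentence.
        return sum(wt for w, wt in weights.items() if w in sentence)

    # sorted() is stable, so equally scored sentences stay in document order.
    return sorted(sentences, key=score, reverse=True)[:n]


sentences = [
    "王貞治接受專訪。",
    "今天天氣很好。",
    "王貞治談日本職棒，職棒迷很興奮。",
]
keywords = ["王貞治", "職棒", "專訪"]

for sentence in top_sentences(sentences, keywords, 2):
    print(sentence)
```

Here 「王貞治」 and 「職棒」 each occur twice and get weight 1.0, while 「專訪」 occurs once and gets 0.75, so the third sentence (score 2.0) ranks ahead of the first (score 1.75).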

In other words, n is the number of summary candidates offered for the user to choose from once automatic summarization is complete. The automatic key-feature extraction method and the 120,000-word lexicon are used only in this embodiment. In practice the invention is not limited to this particular method, lexicon, or importance formula, and users may choose different methods as needed to obtain the summary candidates they require.

\[ \text{importance} = \sum_{w} \left( 0.5 + 0.5 \times \frac{tf_w}{max\_tf} \right) \]

For the extracted summary to be consistent and coherent in content, the sentence joined to the title should be related to the title in content, that is, sufficiently similar to it. The highest possible similarity, where the two are identical, would of course not be appropriate. But titles written by news editors are generally not copied verbatim from sentences in the document; they are more condensed, more concise fragments. As a result, a sentence similar to the title naturally complements the title in content. Beyond that, one must also know from which fragment to begin joining the title. It is therefore not enough to find the most similar sentence; one must also find where it is most similar. To meet both needs at once, the edit distance between the title and the sentence is compared by dynamic programming.

Figure 4 is a flowchart of summary candidate generation according to a preferred embodiment of the invention. First, dynamic programming is used to compare the title with each candidate sentence, finding the sentence segment with the smallest edit distance (the segment most similar to the title) and the punctuation position (step S410). The title then replaces that segment, and the punctuation position serves as the joining point at which the title is combined with the remaining segment of the candidate sentence, yielding a summary candidate (step S420). If the summary is to be used as a mobile news short message, a summary length limit is imposed as appropriate; common examples are 45 and 69 characters. Given the limit, it is determined whether the joined summary candidate exceeds it (step S430). If so, the joining point is moved backward to shorten the joined clause (step S450); otherwise, the method tries to move the joining point forward to lengthen the joined sentence (step S440). In either case the combined sentence must not exceed the summary length limit.

Figure 5 is a flowchart of summary candidate sorting according to a preferred embodiment of the invention. To make the similarities of sentences of different lengths comparable, the edit distances are converted into similarities sim by the following formula (step S510), where exp is the natural exponential, edit is the edit distance, m is the number of characters in the title, and n is the number of characters in the candidate sentence. This particular formula is only an example; users may use a different formula as needed to obtain the similarity.

\[ sim = \exp(-\,\ldots) \]

Because news is written in pyramid style, descriptions become more detailed further into the story; conversely, the earlier the text, the more it resembles a summary. To strengthen the similarity of the earlier sentences, the first few sentences should be given priority. It is therefore next determined whether the body of the original document exceeds three sentences (step S520). If so, the similarity of each sentence after the second is multiplied by a specific factor, for example 0.85, to give the final similarity (step S530). Once the similarity is determined, the length of the combined text is divided by the specified summary length, converting it into a character-count ratio between 0 and 1 (step S540). Finally, all summary candidates are sorted by a preset rule for the user to choose from (step S550).
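The scoring and ordering of Figure 5 (steps S520 to S550) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the patented implementation. Similarities are taken as already computed from the edit distance. The rule repeatedly compares the most similar candidate A with the longest candidate B, outputs A when the two coincide or when A beats B by the given multiples, and otherwise outputs B. The 0.85 factor and the 0.75 ratio multiple come from the text; the similarity multiple `sim_mult=1.2` is a placeholder chosen for the example, and all class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str          # title joined with part of a candidate sentence
    similarity: float  # similarity to the title, already derived from edit distance
    position: int      # 1-based index of the source sentence in the document


def final_similarity(c: Candidate, total_sentences: int, factor: float = 0.85) -> float:
    """Discount sentences after the second when the body exceeds three sentences."""
    if total_sentences > 3 and c.position > 2:
        return c.similarity * factor
    return c.similarity


def length_ratio(c: Candidate, limit: int) -> float:
    """Character-count ratio; lies in [0, 1] when len(c.text) <= limit."""
    return len(c.text) / limit


def rank(cands, total_sentences, limit, sim_mult=1.2, ratio_mult=0.75):
    """Order candidates by the preset rule. sim_mult is a placeholder value."""
    def sim(c):
        return final_similarity(c, total_sentences)

    def ratio(c):
        return length_ratio(c, limit)

    pool, ordered = list(cands), []
    while pool:
        a = max(pool, key=sim)    # most similar to the title
        b = max(pool, key=ratio)  # closest to the length limit
        if a is b or (sim(a) > sim_mult * sim(b) and ratio(a) > ratio_mult * ratio(b)):
            pick = a
        else:
            pick = b
        ordered.append(pick)
        pool = [c for c in pool if c is not pick]
    return ordered


cands = [
    Candidate("x" * 45, 0.5, 1),
    Candidate("y" * 36, 0.9, 2),
    Candidate("z" * 40, 0.8, 4),
]
order = rank(cands, total_sentences=5, limit=45)
print([c.text[0] for c in order])  # ['y', 'z', 'x']
```

In the sample run, the candidate from the fourth sentence has its similarity discounted from 0.8 to 0.68 because the body has five sentences. The most similar candidate wins each round as long as it is not much shorter than the longest one.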

The preset rule works as follows. From the summary candidates, the candidate A with the highest similarity and the candidate B with the largest character-count ratio are selected (step S551), and it is determined whether A is the same as B (step S552). If so, A is appended to the sorted results and deleted from the set of summary candidates (step S553), and the flow proceeds directly to step S556. If not, it is then determined whether A's similarity is greater than a specific multiple of B's similarity and A's character-count ratio is greater than another specific multiple of B's character-count ratio (step S554); for example, whether A's similarity exceeds B's by a given factor and A's character-count ratio is greater than 0.75 times B's. If so, step S553 is performed as described above; if not, B is appended to the sorted results and deleted from the set of summary candidates (step S555). It is then determined whether any summary candidate has not yet been output (step S556). If so, the sorting steps are repeated; if not, the sorting procedure ends (step S557), and the output gives the sorted summary candidates. The specific numbers and multiples used in these sorting steps belong to one embodiment of the invention and are not limiting; users may raise or lower them as needed and still achieve the effect of the invention.

Figure 6 is an example of computing the edit distance by dynamic programming according to a preferred embodiment of the invention. In dynamic programming, the edit distance is the minimum number of steps (or the minimum computational cost) needed to turn one string into another using the operations "insert", "delete", and "substitute". The dynamic programming method used in this embodiment is as follows. Suppose there are two strings A and B of lengths n and m.

Comparing the two strings from the start, the edit distance upon reaching the i-th character of A (written A[i]) and the j-th character of B (written B[j]) is:

\[ d[i,j] = \min\bigl( d[i-1,j] + w(A[i],0),\; d[i-1,j-1] + w(A[i],B[j]),\; d[i,j-1] + w(0,B[j]) \bigr) \]

where min(X, Y, Z) is the smallest of the three numbers X, Y, and Z, and the initial values are:

\[ d[0,0] = 0 \]
\[ d[i,0] = d[i-1,0] + w(A[i],0), \quad 1 \le i \le n \]
\[ d[0,j] = 0, \quad 1 \le j \le m \]

The function w(X, Y) has the following meaning:

w(A[i], B[j]): the cost of substituting B[j] for A[i]
w(A[i], 0): the cost of inserting A[i]
w(0, B[j]): the cost of deleting B[j]

Taking a short title string A and the candidate sentence B = decadecf as an example, and assuming that substitution, insertion, and deletion each cost 1, d[i, j] can be represented as row i, column j of a matrix; the edit distances computed by the formula above are shown in Figure 6. Scanning the last row of the matrix from the end, the first position found with the smallest edit distance marks the part of B most similar to A. When joining, the characters of B up to that point are replaced by A to form the summary candidate.

After the dynamic programming comparison is complete and the last row has been scanned from the end to find the first position with the smallest edit distance, the title must still join the candidate sentence at a punctuation mark to preserve readability. Starting from that position, the nearest punctuation marks to the left and right are therefore scanned to find the punctuation position with the smallest edit distance, which is used as the joining point at which the title and the candidate sentence are combined to produce a summary candidate. The dynamic programming method used in this embodiment is only one example of computing the edit distance, and its detailed specifications and parameters are not limiting.
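The recurrence for d[i, j] can be sketched in code as follows. This is a sketch under stated assumptions, not the patent's implementation: insertion and deletion cost 1, substitution costs 1 unless the two characters match (in which case it costs 0), string A is the title, and string B is the candidate sentence. The function name and the sample strings are hypothetical. Because d[0, j] = 0, the title may start matching anywhere in B, so the last row gives the cost of the best match ending at each position of B.

```python
def best_match_end(a: str, b: str) -> tuple[int, int]:
    """Return (smallest edit distance, end position in b) of the best match of a in b.

    Positions are 1-based: a result (k, j) means some substring of b ending
    at b[j-1] can be turned into a with k edits.
    """
    n, m = len(a), len(b)
    # d[0][j] = 0 for all j: the match may begin anywhere in b.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + 1  # unmatched characters of a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,          # w(A[i], 0)
                d[i - 1][j - 1] + subst,  # w(A[i], B[j])
                d[i][j - 1] + 1,          # w(0, B[j])
            )
    last_row = d[n]
    best = min(last_row)
    # Scanning the last row from the end, the first minimum is the rightmost one.
    end = max(j for j in range(m + 1) if last_row[j] == best)
    return best, end


print(best_match_end("abc", "xxabcxx"))  # (0, 5)
print(best_match_end("abc", "abxc"))     # (1, 4)
```

In the method described above, the position returned here would then be moved to the nearest punctuation mark before the title is joined to the rest of the candidate sentence; that step is omitted from the sketch.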
Those skilled in the art may use similar formulas and parameters, as circumstances require, to compute the edit distance.

In summary, the automatic summarization method for Chinese digital documents provided by the invention finds the sentences most similar to the title, combines them with the title to generate summaries, and offers them to the user for selection after sorting. The summaries it selects effectively express the key content of the original document in the specified number of characters and greatly reduce human labor.

Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

[Brief Description of the Drawings]

Figure 1 shows an example of instant news.

Figure 2 is a flowchart of the automatic summarization method for Chinese digital documents according to a preferred embodiment of the invention.

Figure 3 is a flow chart of candidate sentence generation according to a preferred embodiment of the present invention.
Figure 4 is a flow chart of summary candidate sentence generation according to a preferred embodiment of the present invention.
Figure 5 is a flow chart of summary candidate sentence sorting according to a preferred embodiment of the present invention.
Figure 6 is an example of computing the edit distance by dynamic programming according to a preferred embodiment of the present invention.

[Description of Main Element Symbols]

S210: Given a Chinese digital document and a summary character limit
S220: Determine whether the Chinese digital document has a title
S230: Derive the title from the digital document
S240: Evaluate the importance of the sentences in the Chinese digital document and take the first n sentences as candidate sentences
[…]: Find an appropriate joining point in the candidate sentences and combine it with the title to form summary candidate sentences
[…]: Compute the character-count ratio and similarity of the summary candidate sentences and sort the summary candidate sentences
S310: Find the maximal repeated strings using the automatic key-feature extraction method
S320: Perform word segmentation and stop-word filtering
S330: Compute importance and take the first n sentences as candidate sentences
S410: Find the sentence segment with the smallest edit distance and the punctuation mark position
S420: Replace the sentence segment with the title and join the candidate sentence at the joining point
S430: Determine whether the character count of the summary candidate sentence exceeds the summary character limit
S440: Move the joining point backward to shorten the joined clause
S450: Move the joining point forward to lengthen the joined clause
S510: Compute the similarity
S520: Determine whether the original digital document has more than three sentences
S530: Multiply the similarity of the sentences after the first sentence by a specific coefficient
S540: Compute the character-count ratio
S550: Sort all summary candidate sentences according to a preset rule
S551: Select the summary candidate sentence A with the highest similarity and the summary candidate sentence B with the largest character-count ratio
S552: Determine whether summary candidate sentence A is the same as summary candidate sentence B
S553: Output summary candidate sentence A and delete it from the set of summary candidate sentences
S554: Determine whether the similarity of summary candidate sentence A is greater than 1.25 times the similarity of summary candidate sentence B and the character-count ratio of A is greater than 0.75 times the character-count ratio of B
S555: Output summary candidate sentence B and delete it from the set of summary candidate sentences
S556: Determine whether any summary candidate sentences remain
S557: End sorting
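Steps S551 to S557 describe a loop that can be sketched as follows. This is an illustrative sketch only: the function name `rank_candidates` and the tuple layout (text, similarity, character-count ratio) are assumptions made for the example, while the multiples 1.25 and 0.75 come from step S554.

```python
def rank_candidates(candidates, sim_factor=1.25, ratio_factor=0.75):
    """Order summary candidates following steps S551 to S557.

    candidates: list of (text, similarity, ratio) tuples (hypothetical layout).
    """
    remaining = list(candidates)
    ordered = []
    while remaining:                                   # S556: any left?
        a = max(remaining, key=lambda c: c[1])         # S551: highest similarity
        b = max(remaining, key=lambda c: c[2])         # S551: largest ratio
        if a is b:                                     # S552: same candidate
            chosen = a                                 # S553: output A
        elif a[1] > sim_factor * b[1] and a[2] > ratio_factor * b[2]:
            chosen = a                                 # S554 holds: output A
        else:
            chosen = b                                 # S555: output B
        ordered.append(chosen)
        remaining.remove(chosen)                       # delete from the set
    return ordered                                     # S557: end sorting


if __name__ == "__main__":
    demo = [("s1", 0.9, 0.5), ("s2", 0.6, 0.95), ("s3", 0.3, 0.7)]
    for text, sim, ratio in rank_candidates(demo):
        print(text, sim, ratio)
```

In the demo data, s1 has the highest similarity but its character-count ratio is too small relative to the best-ratio candidate, so the rule prefers the candidates that use the available length better.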

Claims


X. Scope of the patent application:

1. An automatic summarization method for a Chinese digital document, the digital document including a plurality of sentences, the automatic summarization method comprising: finding, in the digital document, the maximal repeated strings that are longest or occur most frequently; using the maximal repeated strings and any existing lexicon to perform word segmentation on the digital document and, after stop-word filtering, taking the repeated words (that is, words occurring more than once) as a plurality of key words of the digital document; computing the importance of these key words; sorting the sentences according to the number of key words they contain and the corresponding importance and, after sorting from largest to smallest, taking the first n sentences as a plurality of candidate sentences; finding an appropriate joining point in the candidate sentences and combining it with a title to generate a plurality of summary candidate sentences; and ranking the summary candidate sentences according to their character-count ratio and similarity and outputting them in order from high to low.

2. The automatic summarization method for a Chinese digital document as described in claim 1, wherein the step of generating the summary candidate sentences comprises: comparing the title with the candidate sentences by dynamic programming to compute an edit distance, and finding the sentence segment with the smallest edit distance and a punctuation mark position; and replacing the sentence segment with the title and joining it, at the joining point, with the remaining segments of the candidate sentences to form the summary candidate sentences.

3. The automatic summarization method for a Chinese digital document as described in claim 2, wherein when the character count of a summary candidate sentence exceeds a summary character limit, the joined clause is shortened so that the longest clause not exceeding the summary character limit is joined to the title; and when the character count of the summary candidate sentence is smaller than the summary character limit, the joined clause is lengthened so that the longest clause not exceeding the summary character limit is joined to the title.

4. The automatic summarization method for a Chinese digital document as described in claim 1, wherein the step of sorting the summary candidate sentences comprises: converting the edit distance into a similarity; if the number of sentences in the original digital document exceeds a specific number, multiplying the similarity of each sentence after that specific number by a specific coefficient to obtain the final similarity of those sentences; dividing the character count after combination by the specified summary character count to convert it into a character-count ratio between 0 and 1; and using a preset rule to determine the final ordering of the summary candidate sentences.

5. The automatic summarization method for a Chinese digital document as described in claim 4, wherein the preset rule comprises: comparing a first summary candidate sentence having the highest similarity with a second summary candidate sentence having the highest character-count ratio; if the first summary candidate sentence is the second summary candidate sentence, outputting the first summary candidate sentence and deleting it from the summary candidate sentences; if the similarity of the first summary candidate sentence is greater than a first specific multiple of the similarity of the second summary candidate sentence, and the character-count ratio of the first summary candidate sentence is greater than a second specific multiple of the character-count ratio of the second summary candidate sentence, outputting the first summary candidate sentence, and otherwise outputting the second summary candidate sentence, and deleting the output sentence from the summary candidate sentences; and repeating the above steps until no summary candidate sentences remain.

6. The automatic summarization method for a Chinese digital document as described in claim 4, wherein the manner of determining the final ordering of the summary candidate sentences includes using the preset rule and using machine learning to find the best rule.

7. The automatic summarization method for a Chinese digital document as described in claim 1, wherein if the title is not given in advance, a plurality of key words are extracted from the digital document and sorted by importance, and the most important key words are combined in the order in which they appear in the text to generate the title.
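The clause adjustment recited in claim 3 can be illustrated with a simplified sketch. Several things here are assumptions for the sake of the example and are not stated in the source: the function name `join_title`, the set of punctuation marks used to split clauses, the separator placed between the title and the joined clauses, and the simplification that the joined text is always a run of trailing clauses of the candidate sentence (the edit-distance search for the initial joining point is omitted).

```python
import re


def join_title(title, sentence, limit, sep="，"):
    """Join the title with the longest run of trailing clauses of the
    candidate sentence that keeps the result within `limit` characters.

    Moving the joining point toward the start of the sentence lengthens
    the joined text; moving it toward the end shortens it.
    """
    # Split the candidate sentence into clauses at punctuation marks.
    clauses = [c for c in re.split("[，,；;。]", sentence) if c]
    best = title  # fall back to the title alone if nothing fits
    for start in range(len(clauses) - 1, -1, -1):
        candidate = title + sep + sep.join(clauses[start:])
        if len(candidate) > limit:
            break  # any earlier joining point would only be longer
        best = candidate
    return best


if __name__ == "__main__":
    print(join_title("T", "aa,bbb,cc", limit=8, sep=","))
```

Because each earlier joining point strictly lengthens the result, the loop can stop at the first candidate that exceeds the limit, which yields the longest joined text that still fits.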