TW420774B

TW420774B - Method and apparatus for automatically correcting documents in Chinese language

Info

Publication number: TW420774B
Application number: TW86119972A
Authority: TW
Inventors: Jiun-Jiu Guo
Original assignee: Matsushita Electric Ind Co Ltd
Priority date: 1997-03-28
Filing date: 1997-12-30
Publication date: 2001-02-01
Also published as: CN1195142A; JPH10269204A

Abstract

Provided are a method and an apparatus for automatically detecting and correcting wrong characters and/or missing characters in documents written in Chinese. A character-to-reading conversion section 200 converts an input original document into strings of phonetic symbols. A candidate word detection section 300 divides the strings into syllables and, using the syllables as search keys, detects possible candidate words and related information. A similar candidate word detection section 400 detects possible candidate words and related information, using as search keys the strings of phonetic symbols in which similar bits have been masked by mask means. An optimal candidate character string determination section 500 links the individual candidate words into a directed network, keyed by the start and end positions of each candidate word in the original document, and extracts the optimal route by dynamic programming, maximizing the sum of a frequency-of-use weight, a word length weight, an original-document similarity weight, and a semantic similarity weight. A matching section 600 matches the character string on the optimal route against the original document to detect and mark the characters that differ.
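As a rough illustration of the route search described in the abstract, the following Python sketch links candidate words by their start and end positions in the original text and picks the highest-scoring chain by dynamic programming. The `Candidate` tuple, the `best_route` function, and all scores are hypothetical and are not taken from the patent; each score simply stands in for the sum of the four weights.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    start: int    # index of the first original character covered
    end: int      # index one past the last original character covered
    word: str     # candidate word proposed for that span
    score: float  # hypothetical stand-in for the sum of the four weights


def best_route(n_chars: int, candidates: list[Candidate]) -> tuple[float, list[str]]:
    """Return the best total score and the words along the best route.

    Assumes 0 <= start < end <= n_chars for every candidate.
    """
    by_start: dict[int, list[Candidate]] = {}
    for cand in candidates:
        by_start.setdefault(cand.start, []).append(cand)

    # best[i] is the best score obtainable from position i to the end.
    best = [float("-inf")] * (n_chars + 1)
    choice: list[Optional[Candidate]] = [None] * (n_chars + 1)
    best[n_chars] = 0.0
    for pos in range(n_chars - 1, -1, -1):  # work from the back to the front
        for cand in by_start.get(pos, []):
            total = cand.score + best[cand.end]
            if total > best[pos]:
                best[pos], choice[pos] = total, cand

    if n_chars > 0 and choice[0] is None:
        raise ValueError("no chain of candidates covers the whole text")

    words, pos = [], 0
    while pos < n_chars:
        cand = choice[pos]
        words.append(cand.word)
        pos = cand.end
    return best[0], words


original = "家廷"
candidates = [
    Candidate(0, 1, "家", 1.0),
    Candidate(1, 2, "廷", 1.0),
    Candidate(0, 2, "家庭", 3.5),
]
score, words = best_route(len(original), candidates)
corrected = "".join(words)
# Matching step: positions where the best route differs from the original.
marked = [i for i, (a, b) in enumerate(zip(original, corrected)) if a != b]
```

With these made-up scores the two-character word 家庭 outranks the chain of single characters, so the second character of the original is marked as differing.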

Description

Printed by the Employees' Consumer Cooperative of the Central Bureau of Standards, Ministry of Economic Affairs

[Industrial Field of Application]

The present invention relates to a method and an apparatus for automatically correcting Chinese documents, in which a computer automatically detects and corrects wrong characters and dropped characters in a Chinese document.

[Prior Art]

The causes of wrong characters in handwritten Chinese documents have long fallen into roughly the following categories:

(1) Homophones with a different meaning or form. For example, the 得 in 幼苗長得像筆 is easily miswritten as 的, and the 極 in 清澈極了 can carelessly be miswritten as 急.

(2) Stroke errors. For example, in the 帽 of 帽子, the component 冒 is easily miswritten as 昌, or the component 目 is written as 曰. Characters with many strokes, such as 龜 and 鬱, are also often written with many strokes missing.

(3) Similar shapes. For example, in the 宰 of 宰相, a careless writer easily omits the dot on top of the 宀 radical or writes 辛 as 幸. Likewise, the 鬥 in the 鬧 of 吵鬧 is often miswritten as 門, and the radical of 貓 is often written as the 犬 radical.

(4) Dropped characters. Writing too quickly or carelessly often causes characters to be dropped; for example, 辛辛苦苦 is miswritten as 辛苦苦.

(5) Substituted characters, where one character is required but another is used by mistake. For example, 家庭 is often miswritten as 家廷, and 亭亭玉立 is often miswritten as 婷婷玉立.

Over the past decade or so, with the progress and spread of computers, a great many Chinese input methods have appeared. By encoding scheme they fall broadly into two groups: general-keyboard methods and dedicated input devices. General-keyboard methods include (1) phonetic symbols, (2) character roots, (3) combined shape and sound, (4) character codes, and (5) radicals or stroke counts. Dedicated input devices include (1) dedicated large keyboards and (2) optical character recognition devices.

Chinese input methods relieve users of the labor of writing Chinese characters by hand, and Chinese document files entered by computer avoid some of the traditional mistakes, such as stroke errors. Other errors, however, remain unavoidable. In general, the causes of wrong characters in Chinese document files fall into the following categories:

(1) Phonetic symbols or character-root combinations entered incorrectly. The phonetic symbols of a Chinese character are generally divided into four parts: initial, final, medial, and tone, as in the following examples.

Initials: ㄅ (b), ㄆ (p), ㄇ (m), ...
Medials: ㄧ (i), ㄨ (u), ㄩ (yu, iu)
Finals: ㄚ (a), ㄛ (o), ㄜ (e), ...
Tones: ˉ (first tone), ˊ (second tone), ˇ (third tone), ˋ (fourth tone), ˙ (neutral tone)

For example, the phonetic symbols of 形 are ㄒㄧㄥˊ, and the phonetic symbols of 字 are ㄗˋ.

The pairs of sounds that Chinese speakers easily confuse include the following:

Initials: ㄕ and ㄙ, or ㄑ and ㄒ, and so on
Medials: ㄧ and ㄩ
Finals: ㄥ and ㄣ, or ㄢ and ㄤ, and so on
Tones: easily confused, especially by foreigners

For example, 興趣 (ㄒㄧㄥˋ ㄑㄩˋ) is easily misread as (ㄒㄧㄥˋ ㄑㄧˋ) and is therefore often entered as 性器, and 學生 (ㄒㄩㄝˊ ㄕㄥ) and 寫生 (ㄒㄧㄝˇ ㄕㄥ) are often used in place of each other. With shape-based input, similar or wrongly combined character roots can likewise cause input errors. For example, the root codes of 曰 and 日, or of 受 and 愛, are extremely similar.

(2) Homophones. These errors arise from selecting the wrong homophonous character or word; for example, 同音異義字 is often entered as 同音意義字 or 同音異議字.

(3) Errors in the reference dictionary. Whatever the input method, a reference dictionary must be used to perform the conversion, so an error in the dictionary naturally produces an erroneous input result. For example, if 形影不離 is registered in the reference dictionary as 行影不離, the conversion result is always wrong whenever the reading of that phrase is entered.

(4) Input operation errors. Documents are generally prepared with document editing software, but editing functions such as insert or delete, used carelessly, often leave the document with one character too many (a redundant character) or one too few (a dropped character).

Because the number of wrong characters in a Chinese document file seriously affects the quality of the document, effectively detecting the errors in Chinese document files and correcting them automatically has become an important issue.

A conventional method and apparatus for automatically correcting wrong Chinese characters is described, for example, in Republic of China Patent Gazette Publication No. 260772. The system block diagram of its embodiment is shown in Fig. Π: 100 is an input device that inputs the Chinese document to be processed; 110 is a Chinese document file that stores the Chinese document to be processed; 120 is a comprehensive similar-character substitution device that, with reference to a comprehensive similar-character set, replaces each character of the input Chinese document with similar characters so that a plurality of candidate strings can be assembled; 130 is the comprehensive similar-character set, which stores characters similar to each Chinese character in shape, pronunciation, meaning, or input code, for example as shown below (S: similar in shape, P: similar in pronunciation, M: similar in meaning, I: similar in input code):

人: 入 (S)
力: 厲 (P), 勵 (P), 刀 (S), 刃 (S)
己: 已 (S), 巳 (S), 乙 (S)
干: 甘 (P), 乾 (P), 千 (S)
弋: 戈 (S)
冶: 治 (S)


140 is a language model scoring device that scores each candidate string and finds the candidate string with the highest score; 150 comprises (a) a language model statistics database, which records the occurrence frequency of each language unit and the frequency with which language units occur in succession, and which may also include a Chinese knowledge base recording the frequency of each word, and (b) a scoring device that rates a string according to the language units it contains and the language model statistics database, deducting points for characters that differ from the original document file; 160 is a highest-score candidate string search device that searches for the highest-scoring candidate string by dynamic programming; 170 is a wrong-character judging device that compares the highest-scoring candidate string character by character with the text of the document file and marks the differing characters as wrong characters; 180 is a marked result output that writes the marked text string to a marked document file; and 190 is the marked document file that stores the marked text string.

The conventional detection and correction method operates as follows. The input device 100 reads a Chinese document to be processed from the Chinese document file 110 and divides it into a plurality of processing units according to the positions of the punctuation marks. Processing then passes to the comprehensive similar-character substitution device 120, which looks up each processing unit in the comprehensive similar-character set 130, retrieves all characters similar in shape, pronunciation, meaning, or input code, and assembles them into a plurality of candidate strings. Processing then passes to the language model scoring device 140, where the scoring device 150 scores each candidate string with a statistical language model, deducting points for characters that differ from the original document. The highest-score candidate string search device 160 then searches for the highest-scoring candidate string by dynamic programming. Next, the wrong-character judging device 170 compares the highest-scoring candidate string character by character with the input original document file and marks the differing characters as wrong characters. Finally, the marked result output 180 writes the marked result to the marked document file 190.

[Problems to Be Solved by the Invention]

The conventional technique described above has the following problems:

(1) Any character not included in the comprehensive similar-character set can be neither detected nor corrected properly, so a great deal of labor and resources must be spent building and maintaining this knowledge base.

(2) The language model scoring device considers only the occurrence frequency of each candidate unit and the frequency with which candidate units occur in succession. It makes no use of semantic information, so the detection rate and the correction rate cannot be improved further.

(3) Missing characters, redundant characters, and transposed adjacent characters in Chinese documents cannot be handled effectively.

[Means for Solving the Problems]

To solve the problems above, the invention of claim 1 is a method for automatically correcting Chinese documents by computer, characterized by comprising the following steps:

Step 1: constructing in computer memory a polyphone dictionary that stores character-sequence-table readings, tables of phonetic symbols corresponding to the character sequence tables, and candidate words corresponding to the tables of phonetic symbols together with their phonetic symbols; constructing in memory a character-reading dictionary that stores characters with their default readings and other possible phonetic symbols; and constructing in memory a reading-word dictionary that stores phonetic symbols with all of their corresponding homophonous characters and words and the associated frequency-of-use weights and semantic codes;

Step 2: character-to-reading conversion means that, with reference to the polyphone dictionary and the character-reading dictionary, converts the original document string input from the input device into a string of phonetic symbols;

Step 3: candidate word selection means that cuts the string of phonetic symbols obtained by the character-to-reading conversion means into syllables and, using each syllable as a search key, retrieves the possible candidate words and their related information from the reading-word dictionary;

Step 4: similar candidate word selection means that, for consecutive single-character candidate syllables, uses as search keys the strings of phonetic symbols whose similar bits have been masked by masking means, and retrieves the possible candidate words and their related information from the reading-word dictionary;

Step 5: optimal candidate string determination means that links the candidate words into a directed network, indexed by the start and end character positions of each candidate word in the original document string, calculates the similarity weight and the word length weight of each candidate word, and then, with the sum of the frequency weight, the word length weight, and the original-document similarity weight as the evaluation function, finds the path with the best evaluation score by dynamic programming;

Step 6: reference comparison means that compares the character string on the optimal path with the original document string and detects and marks the characters that differ.

The invention of claim 2 is an apparatus for automatically correcting Chinese documents by computer, characterized by comprising:

a polyphone dictionary that stores character-sequence-table readings, tables of phonetic symbols corresponding to the character sequence tables, and candidate words corresponding to the tables of phonetic symbols together with their phonetic symbols;

a character-reading dictionary that stores characters with their default readings and other possible phonetic symbols;

a reading-word dictionary that stores phonetic symbols with all of their corresponding homophonous characters and words and the associated frequency-of-use weights and semantic codes;

a character-to-reading conversion section that, with reference to the polyphone dictionary and the character-reading dictionary, converts the original document string input from the input device into a string of phonetic symbols;

a candidate word selection section that cuts the string of phonetic symbols obtained by the character-to-reading conversion into syllables and, using each syllable as a search key, retrieves the possible candidate words and their related information from the reading-word dictionary;

a similar candidate word selection section that, for consecutive single-character candidate syllables, uses as search keys the strings of phonetic symbols whose similar bits have been masked by masking means, and retrieves the possible candidate words and their related information from the reading-word dictionary;

an optimal candidate string determination section that links the candidate words into a directed network, indexed by the start and end character positions of each candidate word in the original document string, calculates the similarity weight and the word length weight of each candidate word, and then, with the sum of the frequency weight, the word length weight, and the original-document similarity weight as the evaluation function, finds the path with the best evaluation score by dynamic programming; and

a reference comparison section that compares the character string on the optimal path with the original document string and detects and marks the characters that differ.

The invention of claim 3 is the apparatus for automatically correcting Chinese documents according to claim 2, characterized in that a semantic pair learning dictionary is additionally constructed in memory, storing combinations of a following word's semantic code and its preceding word's semantic code acquired by learning means, and in that the optimal candidate string determination section links the candidate words into a directed network, indexed by the start and end character positions of each candidate word in the original document string, calculates the similarity weight and the word length weight of each candidate word, and then, with reference to the semantic pair learning dictionary and with the sum of the frequency weight, the word length weight, the original-document similarity weight, and the semantic similarity weight as the evaluation function, finds the path with the best evaluation score by dynamic programming.

[Operation]

With the automatic document correction apparatus of the invention configured as above, the character-to-reading conversion means receives the original document string from the input section and converts it into a string of phonetic symbols with reference to the polyphone dictionary and the character-reading dictionary. The candidate word selection means then retrieves all possible candidate words from the reading-word dictionary. Next, the similar candidate word selection means applies similar-sound masking to consecutive single-character syllable candidates and retrieves the similar candidate words from the dictionary. The optimal candidate string determination means then finds the string on the optimal path with reference to the semantic pair learning dictionary, and the reference comparison means marks the wrong characters in the original document string and outputs the result.

[Embodiment]

In the above, "semantics" means the meaning of a morpheme itself (also called its semantic code). This embodiment expresses it with the semantic classification of the thesaurus published by Kadokawa Shoten of Japan (1985). That classification represents the class of a morpheme as a four-digit hexadecimal code: the leftmost digit is the major class, the second digit the middle class, the third digit the minor class, and the rightmost digit the fine class. The thesaurus divides all morphemes into ten major classes (nature, properties, change, action, feeling, person, disposition, society, arts and learning, and articles), and each major class is divided into ten middle classes. In this embodiment the four-digit code is prefixed with s, as in the following examples:

s0 (belongs to the nature class)
s02 (belongs to the weather class of nature)
s028 (belongs to the wind class of weather)
s028a (belongs to the wind-strength class of wind)

With such a hierarchical classification code, as shown for example in Fig. 13, a meaning code at a higher level covers a wider range of meaning, and one at a lower level covers a narrower range. Meaning codes can therefore be applied at whatever level the actual need requires, without enumerating every code, which saves memory. Moreover, because the meaning codes are represented as numbers, they can be processed with mathematical operations, such as set logic operations, so that more useful information can be derived from them. For a detailed description of the meaning codes, see the specification of Republic of China Patent Gazette Publication No. 161238, "Machine Translation Device".

Chinese has about 1,300 distinct readings in all, so two bytes suffice to represent every reading. There are η initials (consonants), 3 medials, 14 finals (vowels), and 5 tones. The two-byte structure of a Chinese reading is shown in Fig. 1: byte 1 holds the initial (from bit 2 upward) and the medial (bits 0 to 1), and byte 2 holds the tone (bits 4 to 6) and the final (bits 0 to 3). Thus, for example, masking the medial field of byte 1 and applying a set operation finds the characters that have the same initial, final, and tone. So that the masking means can handle the similar sounds of each field, the members of each group of similar sounds within a field are assigned bit patterns at a distance of 1 from one another, as in the example of Fig. 2. For a detailed description of this compressed code for Chinese readings and of the similar-bit assignment, see the specification of Republic of China Patent Gazette Publication No. 089477, "Chinese Character Conversion Device (1)".

In addition, to handle redundant characters, missing characters, and transposed adjacent characters caused by editing errors, the masking means of this embodiment can mask a whole character as well as the bits described above. For example, whether the search key is "ㄊㄧㄥˊ … ㄌㄧˋ" or a variant in which a character is replaced by * (where * denotes a masked character, that is, any character may match), the word 亭亭玉立 can be retrieved from the reading-word dictionary.

Furthermore, as Republic of China Patent Gazette Publication No. 089476, "Chinese Character Conversion Device (2)", shows, word length is an important evaluation criterion when a string of phonetic symbols is converted into a string of characters. This embodiment therefore includes a word length weight as one term of the evaluation function, calculated as follows. For example, when the candidate word is 大家, its word length weight is (2 - 1) * 2 = 2.

Word length weight = (number of characters in the candidate word - 1) * 2

Furthermore, to make use of the character information in the original document and so find the optimal path effectively, this embodiment includes an original-document similarity weight as one term of the evaluation function, calculated as follows.

Original-document similarity weight = (number of characters that match between the candidate word and the corresponding text of the original document) / (number of characters in the candidate word)
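The two weights can be sketched in Python as follows. The word length formula is taken directly from the text. How the matching characters are counted is not spelled out here, so this sketch assumes the length of the longest common subsequence between the candidate word and the corresponding original text; that choice, and all function names, are assumptions made for illustration.

```python
def word_length_weight(candidate: str) -> int:
    """(number of characters in the candidate word - 1) * 2."""
    return (len(candidate) - 1) * 2


def matching_characters(candidate: str, original: str) -> int:
    """Count matching characters as the longest-common-subsequence length.

    This way of counting is an assumption, not something the text states.
    """
    prev = [0] * (len(original) + 1)
    for c in candidate:
        cur = [0]
        for j, o in enumerate(original, start=1):
            if c == o:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def original_similarity_weight(candidate: str, original: str) -> float:
    """Matching characters divided by the candidate's length (candidate non-empty)."""
    return matching_characters(candidate, original) / len(candidate)


print(word_length_weight("大家"))                  # 2
print(original_similarity_weight("家庭", "家廷"))   # 0.5
```

A longest-common-subsequence count is used rather than a position-by-position comparison so that a candidate longer than the original span, as happens when a character was dropped, can still earn credit for every character it shares with the original.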

For example, when the text of the original document corresponding to the candidate is "亭玉立" and the candidate word is "亭亭玉立", the original-document similarity weight of this candidate is 3/4 (0.75).

Furthermore, this embodiment also introduces semantic information about adjacent words, as shown in Fig. 12. This information is learned automatically from the semantic codes of preceding and following words in a large tagged corpus, or it may be learned from collections of tagged documents for a specific domain. Because the semantic codes are defined hierarchically, the semantic similarity between adjacent words can be computed with a set-intersection operation. For example, the intersection of the semantic codes [7140] and [714a] is [714]; since only three code positions match, the semantic similarity is defined as 3/4. Likewise, the similarity is 1 when the codes are identical, 1/2 when two positions match, 1/4 when one position matches, and 0 when nothing matches.

Fig. 3 is a system block diagram of the automatic document correction method and apparatus in one embodiment of the present invention. The dictionaries and the buffer in Fig. 3 are as follows. 250 is a polyphone dictionary section that stores character-order tables as index keys with their corresponding phonetic-symbol tables, and those phonetic symbols as index keys with their corresponding candidate words and phonetic symbols; its schematic structure is shown in Fig. 9. 260 is a character-reading dictionary section that stores character symbols as index keys together with their default reading and other possible readings; its schematic structure is shown in Fig. 10. 450 is a phonetic-word dictionary section that stores phonetic symbols as index keys with their corresponding homophones and related information such as usage-frequency weights (long-term learning) and semantic codes; its schematic structure is shown in Fig. 11. 550 is a semantic-pair learning dictionary that stores, as learned from a large tagged corpus, the semantic code of the following word as the index key together with the semantic codes of the words it adjoins; its schematic structure is shown in Fig. 12. 350 is a buffer that temporarily stores intermediate processing data.
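The similarity rule for hierarchical semantic codes described above (3/4 for [7140] and [714a]) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, codes are assumed to have four positions, and the "set intersection" is read as the length of the common leading part of the two codes.

```python
def semantic_similarity(code_a: str, code_b: str, code_length: int = 4) -> float:
    """Fraction of leading code positions shared by two hierarchical codes.

    Assumption: a mismatch at a higher level of the hierarchy makes the
    lower levels irrelevant, so counting stops at the first difference.
    """
    matched = 0
    for a, b in zip(code_a[:code_length], code_b[:code_length]):
        if a != b:
            break
        matched += 1
    return matched / code_length
```

With this reading, identical codes score 1, codes sharing two leading positions score 1/2, and codes that differ in the first position score 0, in line with the values listed in the text.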

The processing sections in Fig. 3 are as follows. 100 is an input section that receives the original document string from an input device such as a disk drive or a keyboard. 200 is a character-to-reading conversion section that converts the input original document string into a phonetic-symbol sequence by referring to the polyphone dictionary 250 and the character-reading dictionary 260. 300 is a candidate word selection section that refers to the phonetic-word dictionary section 450, cuts candidate syllables out of the phonetic-symbol sequence, and retrieves all candidate words and their related information. 400 is a similar candidate word selection section that, for consecutive single-syllable candidates, applies the similar-sound or syllable masking means and refers to the phonetic-word dictionary section 450 to retrieve similar candidate words and their related information. 500 is a best candidate string determination section that builds a directed network of candidate words indexed by the start and end positions of each candidate on the original document string, then, referring to the semantic-pair learning dictionary 550, uses the sum of the usage-frequency weight, word-length weight, original-document similarity weight, and semantic similarity weight as the evaluation function and finds the best path by dynamic programming from back to front. 600 is a reference comparison section that takes the string on the best path, compares it by comparison means with the original document string stored in the buffer 350, and marks the differing characters on the original document string. 700 is an output section that outputs the best string and the marked original document string.

The processing flow of the character-to-reading conversion section 200 is shown in Fig. 4. In step s201, the original document string is input from the input section 100 and stored in the buffer 350. In step s202, syllables are cut out of the original document string with reference to the character-reading dictionary section 260. In step s203, the readings of non-polyphonic character syllables are converted with reference to the character-reading dictionary 260. In step s204, the phonetic symbols of polyphonic character syllables are generated with reference to the polyphone dictionary 250. In step s205, the phonetic symbols are corrected using grammar. For example, the converted reading of "媽媽" is "ㄇㄚ ㄇㄚ", but grammatically the second syllable should be read in the neutral tone, so this step corrects the reading to "ㄇㄚ ㄇㄚ˙". Processing ends after step s205.

The processing flow of the candidate word selection section 300 is shown in Fig. 5. In step s301, the converted phonetic-symbol sequence is input from the character-to-reading conversion section 200, and possible syllables are cut out with reference to the phonetic-word dictionary section 450. In step s302, the reading of each cut-out syllable is used as a retrieval element to fetch, from the phonetic-word dictionary 450, all candidate characters or words together with their usage-frequency weights and semantic codes. In step s303, the candidate characters or words and their related information are stored in the buffer 350, and processing ends.

The processing flow of the similar candidate word selection section 400 is shown in Fig. 6. In step s401, the phonetic-symbol sequence and all candidate words with their related information are input from the candidate word selection section 300. Step s402 determines whether any unprocessed run of consecutive single-character syllable candidates remains. If so, step s403 applies the similar-sound masking means to the readings of those consecutive single-character syllable candidates and retrieves similar candidate words and their related information from the phonetic-word dictionary section 450; step s404 then computes the start and end positions of each retrieved similar candidate word with reference to the phonetic-symbol sequence, and processing returns to step s402. If step s402 finds none, processing ends.

The processing flow of the best candidate string determination section 500 is shown in Fig. 7. In step s501, the similar candidate words and their related information are input from the similar candidate word selection section 400. In step s502, the candidate words and their related information are fetched from the buffer 350. In step s503, a directed network of candidate words is built using the start and end positions of each candidate word as index elements. In step s504, the original document string is fetched from the buffer 350, and the original-document similarity weight and word-length weight are computed using the start and end positions of each candidate word as index elements. In step s505, with the sum of the usage-frequency weight, word-length weight, original-document similarity weight, and semantic similarity weight as the evaluation function, the best path is found by dynamic programming from back to front. In step s506, the candidate words on the best path are extracted and output, and processing ends.

The processing flow of the reference comparison section 600 is shown in Fig. 8. In step s601, the string A on the best path is input from the best candidate string determination section 500. In step s602, the original document string B is fetched from the buffer 350. In step s603, strings A and B are compared by the comparison means, and the wrong characters or words in the original document are marked. In step s604, the marked original document string and the best-path string are output to the output section 700, and processing ends.

Taking automatic correction of a Chinese document as an example, the original document string "多語資料庫系統" is input from the input section 100. The character-to-reading conversion section 200 then refers to the character-reading dictionary 260 and the polyphone dictionary 250, converts the original string into the phonetic-symbol sequence "ㄉㄨㄛ ㄩˇ ㄗ ㄌㄧㄠˋ ㄎㄨˋ ㄒㄧˋ ㄊㄨㄥˇ", and stores the original document string in the buffer 350. Next, the candidate word selection section cuts out the possible syllables and, using them as search keys, retrieves all possible character or word candidates and their related information from the phonetic-word dictionary 450, as shown in (3) of Fig. 14. Processing then enters the similar candidate word selection section 400. Because the syllables "ㄉㄨㄛ ㄩˇ" yield only single-character candidates and no word candidates, the syllable compression and masking means described above are applied to these two syllables, and the similar candidate words and related information shown in (4) of Fig. 14 are found with reference to the phonetic-word dictionary 450.

Processing then enters the best candidate string determination section 500. First, using the start and end positions of each candidate character or word in the original document string as index elements, all candidates are linked into a directed network, as shown in (5) of Fig. 15. Then, referring to the semantic-pair learning dictionary, and with the sum of the usage-frequency weight, word-length weight, original-document similarity weight, and semantic similarity weight as the evaluation function, the best path shown in (5) of Fig. 15 is found by dynamic programming from back to front. Processing then enters the reference comparison section 600, which fetches the original document string from the buffer 350, compares the string on the best path with the original document string by the comparison means, and marks the differences with marking symbols, as shown in (6) of the drawings; one symbol marks a missing character, and 【】 marks a wrong character. Finally, the output section 700 outputs the best-path string and the marked string.

The present invention is not limited to this embodiment and may be practiced in any form that does not depart from its gist. For example, the phonetic symbols in the dictionaries may be represented directly by the two-byte compressed code, or the character-reading dictionary may be merged with the polyphone dictionary; such arrangements are variations of this embodiment.
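The best-path search carried out by section 500 (steps s503 to s505) can be illustrated with the following Python sketch. It is a simplified illustration, not the patent's implementation: the `Candidate` class, the `best_path` function, and the optional `pair_score` callback (standing in for the semantic similarity weight between adjacent words) are hypothetical names, each candidate's `score` is assumed to already hold the sum of its usage-frequency, word-length, and original-document similarity weights, and every candidate is assumed to cover at least one original character.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    start: int    # index of the first original character covered
    end: int      # index one past the last original character covered
    word: str
    score: float  # assumed: frequency + word-length + similarity weights


def best_path(
    candidates: List[Candidate],
    text_length: int,
    pair_score: Callable[[Candidate, Candidate], float] = lambda a, b: 0.0,
) -> List[Candidate]:
    """Return the highest-scoring chain of candidates covering [0, text_length)."""
    # Index the directed network by start position: an edge runs from a
    # candidate to every candidate that starts where it ends.
    by_start: Dict[int, List[Candidate]] = {}
    for cand in candidates:
        by_start.setdefault(cand.start, []).append(cand)

    # best[c] = (best total score from c to the end of the text, successor of c)
    best: Dict[Candidate, Tuple[float, Optional[Candidate]]] = {}
    # Work from the back of the text to the front.
    for cand in sorted(candidates, key=lambda c: c.start, reverse=True):
        if cand.end == text_length:
            best[cand] = (cand.score, None)
            continue
        options = [
            (cand.score + pair_score(cand, nxt) + best[nxt][0], nxt)
            for nxt in by_start.get(cand.end, [])
            if nxt in best  # skip successors that cannot reach the end
        ]
        if options:
            best[cand] = max(options, key=lambda item: item[0])

    starts = [c for c in by_start.get(0, []) if c in best]
    if not starts:
        return []
    node: Optional[Candidate] = max(starts, key=lambda c: best[c][0])
    path: List[Candidate] = []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return path
```

Because candidates are visited in order of decreasing start position, every successor of a candidate has already been scored when the candidate itself is evaluated, which is what allows a single back-to-front pass.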
[Effects of the Invention]

As described above, the present invention solves the problems of the conventional examples. Its effects are as follows:

(1) Error detection and correction can be performed effectively on Chinese documents. Text of ten-thousand-character scale was selected from an elementary-school textbook corpus as experimental data, errors were introduced into it arbitrarily by hand, and their positions were recorded. After processing by the automatic document correction apparatus of the invention, the error-detection rate and the correction rate reached about 87%.

(2) No language model or special knowledge base is required, so large amounts of manpower and resources need not be spent on collecting and maintaining a knowledge base.

(3) It can serve as post-processing for a Chinese input method or a character recognition device; for example, it can effectively solve the word-grabbing (搶詞) problem.

For these and other reasons, the present invention is highly practical.

[Brief Description of the Drawings]

Fig. 1 shows the structure of the 2-byte Chinese reading code.
Fig. 2 illustrates bit patterns in which the elements within each group of similar sounds in each field are assigned codes at a distance of 1.
Fig. 3 is a block diagram of one embodiment of the present invention.
Fig. 4 is a processing flowchart of the character-to-reading conversion section in the embodiment.
Fig. 5 is a processing flowchart of the candidate word selection section in the embodiment.
Fig. 6 is a processing flowchart of the similar candidate word selection section in the embodiment.
Fig. 7 is a processing flowchart of the best candidate string determination section in the embodiment.
Fig. 8 is a processing flowchart of the reference comparison section in the embodiment.
Fig. 9 is a schematic structural diagram of the polyphone dictionary in the embodiment.
Fig. 10 is a schematic structural diagram of the character-reading dictionary in the embodiment.
Fig. 11 is a schematic structural diagram of the phonetic-word dictionary in the embodiment.
Fig. 12 is a schematic structural diagram of the semantic-pair learning dictionary in the embodiment.
Fig. 13 is a conceptual diagram of the hierarchical semantic classification in the embodiment.
Fig. 14 is a schematic diagram of a processing example in the embodiment.
Fig. 15 is a continuation of Fig. 14.
Fig. 16 is a continuation of Fig. 14.
Fig. 17 is a system block diagram of a conventional example.

[Reference Numerals]

100 ... Input section
200 ... Character-to-reading conversion section
250 ... Polyphone dictionary section
260 ... Character-reading dictionary section
300 ... Candidate word selection section
350 ... Buffer
400 ... Similar candidate word selection section
450 ... Phonetic-word dictionary section
500 ... Best candidate string determination section
550 ... Semantic-pair learning dictionary
600 ... Reference comparison section
700 ... Output section
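The comparison performed by the reference comparison section 600 (steps s601 to s604) can be approximated with Python's standard `difflib` module, as in the sketch below. This is an assumption-laden illustration rather than the patent's comparison means: the function name is hypothetical, 【】 is used for wrong characters as in the description, and `MISSING_MARK` is a placeholder chosen here because the source's own symbol for a missing character is not legible.

```python
from difflib import SequenceMatcher

MISSING_MARK = "＿"  # hypothetical placeholder for a dropped character


def mark_differences(original: str, best: str) -> str:
    """Return the original string with wrong and missing characters marked."""
    marked = []
    matcher = SequenceMatcher(None, original, best, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            marked.append(original[i1:i2])
        elif tag == "insert":
            # The best path contains characters the original lacks,
            # so the original dropped something at this position.
            marked.append(MISSING_MARK)
        else:
            # "replace" or "delete": original characters that the
            # best path disagrees with are bracketed as wrong.
            marked.append("【" + original[i1:i2] + "】")
    return "".join(marked)
```

For instance, comparing the original "家廷" with the best-path string "家庭" brackets the second character, while comparing "亭玉立" with "亭亭玉立" places the missing-character placeholder where a character was dropped.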

Claims

1. An automatic correction method for Chinese documents, by which a computer automatically corrects an electronic Chinese document, comprising the following steps:

Step 1: constructing and storing in computer memory a polyphone dictionary containing character-order tables with their corresponding phonetic-symbol tables, and those phonetic-symbol tables with their corresponding candidate words and phonetic symbols; constructing and storing in the memory a character-reading dictionary containing character symbols with their default reading and other possible readings; and constructing and storing in the memory a phonetic-word dictionary containing phonetic symbols with all of their corresponding homophones and the related usage-frequency weights and semantic codes;

Step 2: a character-to-reading conversion means that converts the original document string input from an input device into a phonetic-symbol sequence by referring to the polyphone dictionary and the character-reading dictionary;

Step 3: a candidate word selection means that cuts syllables out of the phonetic-symbol sequence obtained by the character-to-reading conversion means and, using each syllable as a retrieval element, refers to the phonetic-word dictionary to retrieve possible candidate words and related information;

Step 4: a similar candidate word selection means that, for consecutive single-character candidate syllables, uses as a retrieval element the phonetic-symbol sequence whose similar bits have been masked by masking means, and refers to the phonetic-word dictionary to retrieve possible candidate words and related information;

Step 5: a best candidate string determination means that links the candidate words into a directed network using the start and end positions of each candidate word in the original document string as index elements, computes the original-document similarity weight and word-length weight of each candidate word by calculation means, and then, with the sum of the usage-frequency weight, word-length weight, and original-document similarity weight as the evaluation function, finds the path with the best evaluation score by dynamic programming;

Step 6: a reference comparison means that compares the character string on the best path with the original document string by comparison means, and detects and marks the differing characters.

2. An automatic correction apparatus for Chinese documents, which corrects documents automatically by means of computer technology, comprising:

a polyphone dictionary storing character-order tables with their corresponding phonetic-symbol tables, and those phonetic-symbol tables with their corresponding candidate words and phonetic symbols;

a character-reading dictionary storing character symbols with their default reading and other possible readings;

a phonetic-word dictionary storing phonetic symbols with all of their corresponding homophones and the related usage-frequency weights and semantic codes;

a character-to-reading conversion section that converts the original document string input from an input device into a phonetic-symbol sequence by referring to the polyphone dictionary and the character-reading dictionary;

a candidate word selection section that cuts syllables out of the phonetic-symbol sequence obtained by the character-to-reading conversion and, using each syllable as a retrieval element, refers to the phonetic-word dictionary to retrieve possible candidate words and related information;

a similar candidate word selection section that, for consecutive single-character candidate syllables, uses as a retrieval element the phonetic-symbol sequence whose similar bits have been masked by masking means, and retrieves possible candidate words and related information;

a best candidate string determination section that links the candidate words into a directed network using the start and end positions of each candidate word in the original document string as index elements, computes the original-document similarity weight and word-length weight of each candidate word by calculation means, and then, with the sum of the usage-frequency weight, word-length weight, and original-document similarity weight as the evaluation function, finds the path with the best evaluation score by dynamic programming; and

a reference comparison section that compares the character string on the best path with the original document string by comparison means, and detects and marks the differing characters.

3. The automatic correction apparatus for Chinese documents according to claim 2, further comprising a semantic-pair learning dictionary, constructed and stored in the memory, that holds combinations of following-word semantic codes and preceding-word semantic codes obtained by learning means, wherein the best candidate string determination section links the candidate words into a directed network using the start and end positions of each candidate word in the original document string as index elements, computes the original-document similarity weight and word-length weight of each candidate word by calculation means, and then, referring to the semantic-pair learning dictionary, with the sum of the usage-frequency weight, word-length weight, original-document similarity weight, and semantic similarity weight as the evaluation function, finds the path with the best evaluation score by dynamic programming.