TW200521712A - System, method and machine-readable storage medium for automated sentence annotation - Google Patents


Info

Publication number
TW200521712A
TW200521712A TW92135509A TW92135509A TW200521712A TW 200521712 A TW200521712 A TW 200521712A TW 92135509 A TW92135509 A TW 92135509A TW 92135509 A TW92135509 A TW 92135509A TW 200521712 A TW200521712 A TW 200521712A
Authority
TW
Taiwan
Prior art keywords
sentence
words
conversion
word
mentioned
Prior art date
Application number
TW92135509A
Other languages
Chinese (zh)
Other versions
TWI225994B (en
Inventor
Wen-Chih Chen
Lu-Ping Chang
Wen-Tai Hsieh
Shih-Chun Chou
Original Assignee
Inst Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inst Information Industry
Priority to TW92135509A (patent TWI225994B)
Application granted
Publication of TWI225994B
Publication of TW200521712A

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A system and a method for automated sentence annotation comprise a sentence annotation module that receives a sentence and, according to a finite state machine containing a plurality of ordered states with a conversion word between each two adjacent states, detects the conversion words in the sentence in order. If the sentence contains all of the conversion words in the same order, the words between each two conversion words in the sentence are annotated with the corresponding state, and structured information containing the states and the words between the conversion words is generated.

Description

V. Description of the Invention

Technical Field of the Invention

This invention relates to a system and method for automatic document annotation, and in particular to a system and method for automatic sentence annotation using a finite state machine.

Prior Art

Natural language exhibits both synonymy (many words with one meaning) and polysemy (one word with many meanings), and this creates bottlenecks for traditional information retrieval. Many automatic document annotation techniques have therefore been proposed. They attach shared semantic tags to a document in order to improve precision, recall, and cross-domain scalability.

Traditional automatic document annotation techniques fall into three types: statistics-based, machine learning, and heuristic rule. Statistics-based techniques analyze word statistics over a large corpus (thesaurus) to extract information, but their precision depends on the size of the corpus. Machine learning techniques propose a mathematical model that lets the machine learn to recognize information automatically; they need a certain amount of training time, and the training sometimes fails to converge, which makes precision unstable. Heuristic rule techniques are simple, easy to understand, and closer to the way people reason, but they need extensive manual fine-tuning, without which precision is hard to improve further.

Traditional automatic document annotation techniques can moderately alleviate synonymy and polysemy. However, most of them annotate whole documents rather than sentences, so the gain in retrieval precision stays at the document level. In addition, sentences sometimes contain heterophonic miswritten words (wrong words that sound different from the intended word), homophonic miswritten words (wrong words that sound the same as the intended word), and wrongly combined words, all of which lower the precision of sentence annotation.

A system and method are therefore needed that annotate the sentences in a document in order to improve retrieval precision, and that keep their annotation precision when a sentence contains heterophonic miswritten words, homophonic miswritten words, or wrongly combined words.

Summary of the Invention

In view of this, an object of the invention is to provide a system and method for automatic sentence annotation that improve the precision of information retrieval and that do not lose annotation precision when a sentence contains heterophonic miswritten words, homophonic miswritten words, or wrongly combined words.

The automatic sentence annotation system of the invention comprises a storage device, a central processing unit, a memory, a display device, and an input device, connected together by a bus. The storage device may be a database system, a file system, or any other device capable of storing data. It contains a domain thesaurus that records, for a given word, multiple synonyms, heterophonic miswritten words, and homophonic miswritten words.

The software module architecture consists of program instruction code and contains a sentence annotation module, a synonym processing module, a homophonic miswritten-word processing module, a heterophonic miswritten-word processing module, and a misordered-word processing module. The central processing unit loads these modules from the memory and, according to the program instructions and the data entered by the user through the input device, performs automatic sentence annotation. Preferably, the result is shown on the display device.

The finite state machine contains several states, including a specific end state that signifies that semantic annotation of the whole sentence is finished. In the finite state machine, moving from one state to another requires a conversion word.

The synonym processing module takes a word as input, searches the domain thesaurus, and outputs the synonyms associated with that word; synonyms are different words that have the same meaning. The homophonic miswritten-word processing module takes a word as input, searches the domain thesaurus, and outputs the homophonic miswritten words associated with that word. The heterophonic miswritten-word processing module searches the domain thesaurus and outputs the heterophonic miswritten words associated with that word. Homophonic and heterophonic miswritten words represent wrong words that may appear in a sentence because a single character was entered incorrectly. The misordered-word processing module takes a word as input, rearranges the characters it contains, and generates and outputs misordered words. Misordered words represent wrong words that may appear in a sentence because the order of the characters was reversed.

The sentence annotation module takes the finite state machine as input and feeds each of its conversion words in turn to the synonym processing module, the homophonic miswritten-word processing module, the heterophonic miswritten-word processing module, and the misordered-word processing module. This yields, for every conversion word, its synonyms, homophonic miswritten words, heterophonic miswritten words, and misordered words, which together form a synonym set and a miswritten-word set. The module then takes a sentence as input, detects in order the conversion words, synonyms, homophonic miswritten words, heterophonic miswritten words, or misordered words present in the sentence, annotates the words between two conversion words with the corresponding state, and stores the result as an Extensible Markup Language (XML) message, a database record, or a file record.

Embodiments

FIG. 1 is a system architecture diagram of the automatic sentence annotation system according to an embodiment of the invention.

The automatic sentence annotation system 10 according to an embodiment of the invention includes a storage device 11, a central processing unit 12, a memory 13, a display device 14, and an input device 15, connected together by a bus 16. Together, the storage device 11, central processing unit 12, memory 13, display device 14, and input device 15 can form a mainframe, a personal computer, a workstation, a notebook computer, or other computer equipment.

The storage device 11 may be a database system, a file system, or any other device capable of storing data. It contains a domain thesaurus that records, for a given word, multiple synonyms, heterophonic miswritten words, and homophonic miswritten words. For example, for the word "行經" ("to pass through") the domain thesaurus contains two synonyms, "行至" and "途經"; two homophonic miswritten words, "型經" and "行莖"; and one heterophonic miswritten word, "襲經".

FIG. 2 is a software module architecture diagram of the automatic sentence annotation system according to an embodiment of the invention. The software module architecture resides in the memory 13 and contains, in the form of program instruction code, a sentence annotation module 131, a synonym processing module 132, a homophonic miswritten-word processing module 133, a heterophonic miswritten-word processing module 134, and a misordered-word processing module 135. The central processing unit 12 loads modules 131 through 135 from the memory 13 and, according to the program instructions and the data entered by the user through the input device 15, performs automatic sentence annotation. Preferably, the result is shown on the display device 14.

FIG. 3 is a schematic diagram of an example finite state machine according to an embodiment of the invention.

The finite state machine 3 contains five states: person name (人名) 321, incident time (案發時間) 322, incident vehicle (案發車型) 323, incident location (案發地點) 324, and end (結束) 331. The end state 331 is a special state that signifies that semantic annotation of the whole sentence is finished. In the finite state machine, moving from one state to another requires a conversion word. In this embodiment the conversion words are "受處分人" ("the person subject to the disposition") 311, "於" ("at") 312, "駕駛車號" ("driving vehicle number") 313, "行經" ("passing through") 314, and "經警攔查" ("stopped and checked by police") 315. The finite state machine 3 can be written as:

{受處分人} + 人名 + {於} + 案發時間 + {駕駛車號} + 案發車型 + {行經} + 案發地點 + {經警攔查} + 結束

The synonym processing module 132 takes a word as input, searches the domain thesaurus, and outputs the synonyms associated with that word; synonyms are different words that have the same meaning. The homophonic miswritten-word processing module 133 takes a word as input, searches the domain thesaurus, and outputs the homophonic miswritten words associated with that word, and the heterophonic miswritten-word processing module 134 searches the domain thesaurus and outputs the heterophonic miswritten words associated with that word. Homophonic and heterophonic miswritten words represent wrong words that may appear in a sentence because a single character was entered incorrectly. The misordered-word processing module 135 takes a word as input, rearranges the characters it contains, and generates and outputs misordered words, which represent wrong words that may appear in a sentence because the order of the characters was reversed.

The sentence annotation module 131 takes the finite state machine 3 as input and feeds its conversion words 311, 312, 313, 314, and 315 in turn to the synonym processing module 132, the homophonic miswritten-word processing module 133, the heterophonic miswritten-word processing module 134, and the misordered-word processing module 135. This yields the synonyms, homophonic miswritten words, heterophonic miswritten words, and misordered words of every conversion word, which form a synonym set and a miswritten-word set.
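
The construction of the two sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FSM` list, the in-memory `THESAURUS` dictionary, and the function names are hypothetical and do not come from the patent, and generating every permutation of a word's characters is one possible reading of "rearranges the characters" (it grows factorially with word length, so a real system would likely restrict it).

```python
from itertools import permutations

# Finite state machine 3 as an ordered list of
# (conversion word, state entered after that word). Names are illustrative.
FSM = [
    ("受處分人", "人名"),
    ("於", "案發時間"),
    ("駕駛車號", "案發車型"),
    ("行經", "案發地點"),
    ("經警攔查", "結束"),
]

# Hypothetical stand-in for the domain thesaurus held in the storage device,
# filled only with the "行經" example given in the description.
THESAURUS = {
    "行經": {
        "synonyms": ["行至", "途經"],
        "homophonic": ["型經", "行莖"],
        "heterophonic": ["襲經"],
    },
}


def lookup(word, kind):
    """Return the thesaurus entries of the given kind, or an empty list."""
    return list(THESAURUS.get(word, {}).get(kind, []))


def misordered(word):
    """Every rearrangement of the characters of `word` except `word` itself."""
    return sorted({"".join(p) for p in permutations(word)} - {word})


def build_variant_sets(fsm):
    """Collect a synonym set and a miswritten-word set per conversion word."""
    synonym_set, miswritten_set = {}, {}
    for word, _state in fsm:
        synonym_set[word] = lookup(word, "synonyms")
        miswritten_set[word] = (
            lookup(word, "homophonic")
            + lookup(word, "heterophonic")
            + misordered(word)
        )
    return synonym_set, miswritten_set
```

With these definitions, `misordered("駕駛車號")` contains both "駛駕車號" and "車號駕駛", and a single-character conversion word such as "於" has no misordered forms.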

FIG. 4 is a schematic diagram of an example synonym set and miswritten-word set according to an embodiment of the invention. The synonyms, homophonic miswritten words, heterophonic miswritten words, and misordered words of conversion words 311, 312, 313, 314, and 315 are listed in rows 471, 472, 473, 474, and 475. Note that a conversion word does not necessarily have associated synonyms, homophonic miswritten words, heterophonic miswritten words, or misordered words; the associated words can be set by the user.

The sentence annotation module 131 receives the drunk-driving judgment text shown below (rendered in English, with the conversion words and their variants kept in Chinese):

The person subject to the disposition (受處分人), Wu A-Dai (吳阿呆), at (於) about 0:33 a.m. on October 13 of the 90th year of the Republic (民國九十年十月十三日凌晨零時三十三分許), driving vehicle number (駛駕車號) LFC——六八號 passenger car (小客車), passed through (型經) the entrance of Lane 53, Section 1, Chang'an East Road, Taipei City (臺北市長安東路一段五三巷巷口), where a police stop found (經警攔檢發現) that the person smelled of alcohol; a breath alcohol tester then measured a breath alcohol concentration of 0.5… milligrams per liter.

Next, the sentence is annotated according to the conversion words of the finite state machine 3 and their associated synonyms and miswritten words, marking the states person name 321, incident time 322, incident vehicle 323, and incident location 324 in the sentence. The procedure of the sentence annotation module is as follows.

The judgment text begins with the conversion word "受處分人" 311, so the module searches the text after "受處分人" for the conversion word "於" 312. Having detected "於", it searches the text after "於" for the conversion word "駕駛車號" 313. The word "駕駛車號" does not occur in the sentence, and conversion word 313 has no synonyms, so the module checks whether the text after "於" contains one of the associated miswritten words "駛駕車號" or "車號駕駛". It finds that the text contains "駛駕車號". After "駛駕車號" it looks for the conversion words "行經" 314 and "經警攔查發現" 315. In the remaining text it finds "型經", a homophonic miswritten word of conversion word 314, and "經警攔檢發現", a synonym of conversion word 315. Once conversion word 315 has been matched, the end state 331 is entered and annotation stops.

Finally, the sentence annotation module 131 annotates the words between each two conversion words with the corresponding state and stores the result as an XML message, a database record, or a file record. During annotation, techniques such as stop-word filtering can also be used to remove unwanted punctuation or words. Preferably, the result is described in Extensible Markup Language (XML) as follows:

<判決文>
  <人名>吳阿呆</人名>
  <案發時間>民國九十年十月十三日凌晨零時三十三分許</案發時間>
  <案發車型>LFC——六八號小客車</案發車型>
  <案發地點>臺北市長安東路一段五三巷巷口</案發地點>
</判決文>
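
The walk-through above can be reproduced with a short Python sketch. Everything in it is an assumption made for illustration: the `FSM` layout, the `annotate` function, the punctuation trimming that stands in for stop-word filtering, and in particular `SENTENCE`, which is a reconstruction of the example sentence assembled from the fields of the XML result and cut off after the last conversion word, not the original wording. Conversion word 315 is written as in the walk-through.

```python
import xml.etree.ElementTree as ET

# Each entry lists the accepted forms of one conversion word (the word itself,
# then its synonyms, then its miswritten words) and the state it opens.
# None stands for the end state 331.
FSM = [
    (["受處分人"], "人名"),
    (["於"], "案發時間"),
    (["駕駛車號", "駛駕車號", "車號駕駛"], "案發車型"),
    (["行經", "行至", "途經", "型經", "行莖", "襲經"], "案發地點"),
    (["經警攔查發現", "經警攔檢發現"], None),
]

# Characters trimmed from tagged spans: a crude stand-in for stop-word filtering.
PUNCTUATION = ",。、 "

# Assumed reconstruction of the example sentence (hypothetical wording).
SENTENCE = (
    "受處分人吳阿呆於民國九十年十月十三日凌晨零時三十三分許,"
    "駛駕車號LFC——六八號小客車,型經臺北市長安東路一段五三巷巷口,"
    "經警攔檢發現受處分人身上帶有酒味"
)


def annotate(sentence, fsm, root_tag="判決文"):
    """Tag the text between consecutive conversion words.

    Returns an XML element, or None when the end state is never reached.
    """
    root = ET.Element(root_tag)
    cursor, open_state = 0, None
    for forms, next_state in fsm:
        # Take the first listed form that occurs at or after the cursor.
        hit = next(
            ((sentence.find(f, cursor), f) for f in forms
             if sentence.find(f, cursor) != -1),
            None,
        )
        if hit is None:
            return None
        pos, form = hit
        if open_state is not None:
            span = sentence[cursor:pos].strip(PUNCTUATION)
            ET.SubElement(root, open_state).text = span
        cursor, open_state = pos + len(form), next_state
    return root


print(ET.tostring(annotate(SENTENCE, FSM), encoding="unicode"))
```

For the reconstructed sentence, the element returned by `annotate` carries the same four fields as the XML result above; a sentence that never reaches the last conversion word yields `None` instead of a partial record.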

於此必須注意的是若被標記的句子無法進入結束狀態It must be noted here that if the marked sentence cannot enter the end state

FIG. 5 is a flowchart of the automatic sentence annotation method according to an embodiment of the invention. The method consists of program code and can be loaded and executed by a central processing unit.

First, in step S511, a finite state machine is received. As shown in FIG. 3, the finite state machine 3 contains five states: person name 321, incident time 322, incident vehicle 323, incident location 324, and end 331. The end state 331 is a special state that signifies that semantic annotation of the whole sentence is finished. In the finite state machine, moving from one state to another requires a conversion word.

In steps S521 and S522, the synonyms and miswritten words corresponding to the conversion words are obtained. The miswritten words include homophonic miswritten words, heterophonic miswritten words, and misordered words; the result is as shown in FIG. 4.

In step S531, a sentence is received. Then, in step S541, a conversion word is detected in the sentence. Detection may match the conversion word itself, a synonym of the conversion word, or a miswritten word of the conversion word. Preferably, this step matches the conversion word itself first, then its synonyms, and finally its miswritten words.

In step S542, it is determined whether the sentence has ended. If so, the whole method ends; otherwise step S543 is performed. In step S543, it is determined whether the end state has been entered. If so, step S551 is performed; otherwise the method returns to step S541 and continues with the next conversion word.

In step S551, the text between each two conversion words is annotated with the appropriate state, such as person name 321, incident time 322, incident vehicle 323, or incident location 324, and a structured message, a database record, or a file record is generated. Preferably, the structured message is an XML message.
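
The control flow of steps S511 through S551 can be sketched as follows, with the step numbers carried in comments. This is a hedged illustration, not the patented implementation: the dictionary layout of `FSM`, the `detect` and `run` functions, the returned record format, and the test sentence are all assumptions. It differs from a plain left-to-right scan mainly in that it reports which kind of form matched, which makes the preferred matching order of step S541 visible.

```python
END = "結束"
PUNCTUATION = ",。、 "

# S511, S521, S522: the state machine together with the variants of each
# conversion word (hypothetical layout; contents follow the description).
FSM = [
    {"word": "受處分人", "synonyms": [], "miswritten": [], "next": "人名"},
    {"word": "於", "synonyms": [], "miswritten": [], "next": "案發時間"},
    {"word": "駕駛車號", "synonyms": [],
     "miswritten": ["駛駕車號", "車號駕駛"], "next": "案發車型"},
    {"word": "行經", "synonyms": ["行至", "途經"],
     "miswritten": ["型經", "行莖", "襲經"], "next": "案發地點"},
    {"word": "經警攔查發現", "synonyms": ["經警攔檢發現"],
     "miswritten": [], "next": END},
]


def detect(sentence, start, entry):
    """S541: try the conversion word itself, then synonyms, then miswritten words."""
    for kind, forms in (
        ("word", [entry["word"]]),
        ("synonym", entry["synonyms"]),
        ("miswritten", entry["miswritten"]),
    ):
        hits = [(sentence.find(f, start), f) for f in forms]
        hits = [(pos, f) for pos, f in hits if pos != -1]
        if hits:
            pos, form = min(hits)  # earliest match within this priority level
            return pos, pos + len(form), kind
    return None


def run(fsm, sentence):  # S531: the sentence is received as an argument
    matches, cursor = [], 0
    for entry in fsm:
        hit = detect(sentence, cursor, entry)  # S541
        if hit is None:  # S542: sentence exhausted before the end state
            return None
        matches.append((entry["next"], hit))
        cursor = hit[1]
        if entry["next"] == END:  # S543: end state entered
            break
    else:
        return None  # the state machine has no end state
    # S551: tag the text between consecutive conversion words.
    record = {}
    for (state, (_, end, _)), (_, (next_start, _, _)) in zip(matches, matches[1:]):
        record[state] = sentence[end:next_start].strip(PUNCTUATION)
    kinds = [kind for _, (_, _, kind) in matches]
    return record, kinds
```

Calling `run(FSM, sentence)` on a sentence that contains "駛駕車號", "型經", and "經警攔檢發現" returns the tagged fields together with the list of match kinds, so a caller can see that the third and fourth conversion words were matched through miswritten words and the fifth through a synonym.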

Furthermore, the invention provides a computer-readable storage medium that stores a computer program. The computer program implements the automatic sentence annotation method and performs the steps described above.

FIG. 6 is a schematic diagram of a computer-readable storage medium for the automatic sentence annotation method according to an embodiment of the invention. The storage medium 60 stores a computer program 620 that implements the automatic sentence annotation method. The program contains six logics: logic 621 for receiving a finite state machine, logic 622 for obtaining the synonyms associated with the conversion words, logic 623 for obtaining the miswritten words associated with the conversion words, logic 624 for receiving a sentence, logic 625 for detecting the conversion words in the sentence, and logic 626 for generating structured data.

The automatic sentence annotation system and method provided by the invention can therefore improve the precision of information retrieval, and their annotation precision does not drop when a sentence contains heterophonic miswritten words, homophonic miswritten words, or wrongly combined words.

Although embodiments of the invention are disclosed above, they are not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the appended claims.

Brief Description of the Drawings

To make the above objects, features, and advantages of the invention easier to understand, embodiments are described in detail below with reference to the accompanying drawings:

FIG. 1 is a system architecture diagram of the automatic sentence annotation system according to an embodiment of the invention;
FIG. 2 is a software module architecture diagram of the automatic sentence annotation system according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an example finite state machine according to an embodiment of the invention;
FIG. 4 is a schematic diagram of an example synonym set and miswritten-word set according to an embodiment of the invention;
FIG. 5 is a flowchart of the automatic sentence annotation method according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a computer-readable storage medium for the automatic sentence annotation method according to an embodiment of the invention.

Explanation of Symbols

10: automatic sentence annotation system
11: storage device
12: central processing unit
13: memory
14: display device
15: input device
16: bus
131: sentence annotation module
132: synonym processing module
133: homophonic miswritten-word processing module
134: heterophonic miswritten-word processing module
135: misordered-word processing module
311, 312, ..., 315: conversion words
321: person name state
322: incident time state
323: incident vehicle state
324: incident location state
331: end state
41: conversion word field
42: synonym field
43: miswritten-word field
44: homophonic miswritten-word field
45: heterophonic miswritten-word field
46: misordered-word field
471, 472, ..., 475: synonym and miswritten-word sets of the conversion words
S511, S521, ..., S551: operation steps
620: automatic sentence annotation computer program
621: logic for receiving a finite state machine
622: logic for obtaining the synonyms associated with the conversion words
623: logic for obtaining the miswritten words associated with the conversion words
624: logic for receiving a sentence
625: logic for detecting the conversion words in the sentence
626: logic for generating structured data

Claims (1)

200521712 六、申請專利範圍 1 · 一種句子自動標記系統,包括: 一句子標記模組,用以接收一句子,依據一包含複數 具順序性狀態之有限狀態機,其中上述兩狀態之間包含一 轉換詞1依序偵測上述句子中相應之上述轉換詞,若上述 句子中含有所有具相同順序之上述轉換詞,則將上述句子 中之相應兩上述轉換詞間之詞標記成相應之上述狀態以及 產生包含上述狀態以及上述句子中之兩上述轉換詞間之詞 之一結構化資料。 2 ·如申請專利範圍第1項所述之句子自動標記系統, 其中上述結構化資料為一資料庫紀錄、一檔案紀錄或一 擴充式標§己語言(X紅)訊息。 3·如申請專利範圍第1項所述之句子自動標記系統, 詞。 &lt; 锝換3之同義 4 ·如申凊專利範圍第3項所述之句子自動標纪么 於句子標記模組中,檢索相應於上述轉換詞之上3述、&amp;統、’ 詞,依序偵測上述句子中相應之上述轉換詞以及;L同義 述轉換詞之上述同義詞,若上述句子中含有所 2應於上 序之上述轉換同以及相應上述轉換詞之上述 〔相同順 者,則將上述句子中之相應兩上述轉換詞間^ =中之一 應之上述狀態。 H 記成相 5.如申請專利範圍第1項所述之句子自動栌 更包含一儲存裝置,用以儲存複數相應上述糸統, 詞。 符狹词之錯別200521712 6. Scope of patent application1. An automatic sentence marking system, including: a sentence marking module for receiving a sentence, based on a finite state machine containing a plurality of sequential states, wherein a transition is included between the two states Word 1 sequentially detects the corresponding conversion words in the above sentence. If the above sentence contains all the conversion words in the same order, the words between the corresponding two conversion words in the sentence are marked as the corresponding status and Generate structured data containing one of the above-mentioned states and one of the words between the two above-mentioned converted words in the above-mentioned sentences. 2 · The sentence automatic tagging system as described in item 1 of the scope of patent application, wherein the structured data is a database record, a file record, or an extended mark (X red) message. 3. Automatic sentence marking system as described in item 1 of the scope of patent application, word. 
&lt; Synonym 4 of 锝 change 3 · As in the third paragraph of the scope of the patent application, the sentence is automatically marked in the sentence mark module, and the words corresponding to the above mentioned conversion words &amp; Detect the corresponding conversion word in the above sentence and the above-mentioned synonym of the conversion word in the same sentence in order, if the above sentence contains the above conversion word that should be in the previous order and the above-mentioned [same sequence, Then one of the two corresponding conversion words in the above sentence ^ = should be in the above state. H is recorded as phase 5. The sentence described in item 1 of the scope of patent application automatically includes a storage device for storing a plurality of words corresponding to the above system. Mistakes in Runes 0213 -A40162TWF(N1);B9250TW;SNOWBALL.p t d 第16頁 200521712 申請專利範圍 6 ·如申請專利範圍弟5項所述之句子自動標記系统, 其中上述錯別詞為一同音別詞、一異音別詞或一錯誤排列 詞。 、 7 ·如申請專利範圍第5項所述之句子自動標記系統, 於句子標記模組中,檢索上述相應上述轉換詞之錯別詞, 依序偵測上述句子中相應之上述轉換詞以及相應於上述轉 換詞之上述錯別詞’若上述句子中含有所有具相同順序之 上述轉換詞以及相應上述轉換詞之上述錯別詞中之一者, 則將上述句子中之相應兩上述轉換詞間之詞標記成上述狀 態。 8 ·如申請專利範圍第4項所述之句子自動標記系統, 上述儲存裝置中,更儲存複數相應上述轉換詞之錯別詞。 9.如申請專利範圍第8項所述之句子自動標記系統, 於句子標記模組中,檢索相應於上述轉換詞之上述同義詞 以及上述錯別詞’依序偵測上述句子中相應之上述轉換 詞、相應於上述轉換詞之上述同義詞以及相應於上述轉換 詞之上述錯別詞,若上述句子中含有所有具相同順序之上 述轉換詞、相應上述轉換詞之上述同義詞以及相應上述轉 換詞之上述錯別詞中之一者,則將上述句子中之相應兩上 述轉換詞間之詞標記成上述狀態。 I 0 ·如申請專利範圍第9項所述之句子自動標記系統, 其中上述錯別詞為一同音別詞、/異音別詞或一錯誤排列 詞。 II · 一種句子自動標記方法,被—具有一中央處理器0213 -A40162TWF (N1); B9250TW; SNOWBALL.ptd page 16 200521712 Patent application scope 6 · Sentence automatic tagging system as described in item 5 of the patent application scope, where the above misspellings are homonyms and a disyllabic words Or a misaligned word. 
7) According to the automatic sentence marking system described in Item 5 of the scope of patent application, in the sentence marking module, search for the above misspellings corresponding to the above conversion words, and sequentially detect the corresponding above conversion words in the above sentences and the corresponding In the above misspellings of the above conversion word, if the above sentence contains one of all the above conversion words in the same order and the above error words corresponding to the above conversion word, then the corresponding two above conversion words in the above sentence are separated. The word is marked as above. 8 · According to the automatic sentence marking system described in item 4 of the scope of patent application, the storage device further stores a plurality of misspellings corresponding to the above conversion words. 9. According to the automatic sentence marking system described in item 8 of the scope of patent application, in the sentence marking module, the above-mentioned synonyms corresponding to the above-mentioned conversion words and the above-mentioned misspellings' are sequentially detected in the above sentences. Words, the above synonyms corresponding to the above conversion words, and the above typos corresponding to the above conversion words, if the sentence contains all the above conversion words in the same order, the above synonyms corresponding to the above conversion words, and the above corresponding to the above conversion words. One of the misspellings marks the word between the corresponding two converted words in the above sentence as the above state. I 0 · The sentence automatic tagging system as described in item 9 of the scope of the patent application, wherein the misspelling is a homonym, a / unison, or a wrong arrangement. 
11. An automatic sentence annotation method, executed by an electronic device having a central processing unit, the method comprising the following steps:
receiving a sentence;
detecting, in order, the corresponding conversion words in the sentence according to a finite state machine comprising a plurality of ordered states, wherein a conversion word lies between each two of the states; and
if the sentence contains all of the conversion words in the same order, annotating the words between each two corresponding conversion words in the sentence with the corresponding state, and generating structured data comprising the states and the words between the two conversion words in the sentence.
12. The automatic sentence annotation method of claim 11, wherein the structured data is a database record, a file record, or an Extensible Markup Language (XML) message.
13. The automatic sentence annotation method of claim 11, further comprising the following steps:
retrieving a synonym corresponding to the conversion words;
detecting, in order, the corresponding conversion words and the synonyms corresponding to the conversion words in the sentence; and
if the sentence contains, for every conversion word and in the same order, either the conversion word or one of its corresponding synonyms, annotating the words between each two corresponding conversion words in the sentence with the corresponding state.
14. The automatic sentence annotation method of claim 11, further comprising the following steps:
retrieving a misspelling corresponding to the conversion words;
detecting, in order, the corresponding conversion words and the misspellings corresponding to the conversion words in the sentence; and
if the sentence contains, for every conversion word and in the same order, either the conversion word or one of its corresponding misspellings, annotating the words between each two corresponding conversion words in the sentence with the corresponding state.
15. The automatic sentence annotation method of claim 14, wherein the misspelling is a homophonic misspelling, a non-homophonic misspelling, or a wrongly ordered word.
16. The automatic sentence annotation method of claim 13, further comprising the following steps:
retrieving the synonyms and the misspellings corresponding to the conversion words;
detecting, in order, the corresponding conversion words, the synonyms corresponding to the conversion words, and the misspellings corresponding to the conversion words in the sentence; and
if the sentence contains, for every conversion word and in the same order, the conversion word, one of its corresponding synonyms, or one of its corresponding misspellings, annotating the words between each two corresponding conversion words in the sentence with the corresponding state.
17. A computer-readable storage medium for storing a computer program, the computer program being loaded into a computer system and causing the computer system to execute the method of any one of claims 11 to 16.
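The method claims above can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names (annotate, to_xml), the whitespace tokenization, the example sentence, the state names, and the XML element names are all assumptions made for illustration; a Chinese sentence would first need word segmentation. The sketch assumes n+1 ordered conversion words bounding n states, accepts a synonym or misspelling in place of a conversion word, and emits the structured data as an XML message.

```python
import xml.etree.ElementTree as ET


def annotate(sentence, conversion_words, states, variants=None):
    """Annotate the words between consecutive conversion words.

    conversion_words: ordered list [c0, ..., cn] (hypothetical layout).
    states: ordered list of n state names; states[i] labels the words
        between conversion_words[i] and conversion_words[i + 1].
    variants: optional dict mapping a conversion word to its accepted
        synonyms and misspellings.
    Returns a dict {state: words}, or None when the sentence does not
    contain every conversion word (or a variant) in the given order.
    """
    if len(states) != len(conversion_words) - 1:
        raise ValueError("expected one state per pair of adjacent conversion words")
    variants = variants or {}
    tokens = sentence.split()  # assumption: whitespace-delimited words

    positions = []
    start = 0
    for word in conversion_words:
        accepted = {word} | set(variants.get(word, ()))
        # Take the earliest occurrence at or after `start`; this greedy
        # scan succeeds whenever an in-order match exists.
        index = next(
            (i for i in range(start, len(tokens)) if tokens[i] in accepted),
            None,
        )
        if index is None:
            return None
        positions.append(index)
        start = index + 1

    return {
        state: " ".join(tokens[positions[i] + 1 : positions[i + 1]])
        for i, state in enumerate(states)
    }


def to_xml(annotation):
    """Serialize the annotation as an XML message (element names are made up)."""
    root = ET.Element("sentence")
    for state, words in annotation.items():
        ET.SubElement(root, "state", name=state).text = words
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    result = annotate(
        "form Taipei to Kaohsiung on Friday",
        ["from", "to", "on"],
        ["origin", "destination"],
        variants={"from": {"form"}},  # "form" treated as a misspelling of "from"
    )
    print(result)
    print(to_xml(result))
```

In this sketch a sentence whose conversion words appear out of order yields None, which corresponds to the condition in claim 11 that all conversion words must be present in the same order before any annotation is produced. Words after the last conversion word are left unannotated, since the claims only speak of words between two conversion words.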
TW92135509A 2003-12-16 2003-12-16 System, method and machine-readable storage medium for automated sentence annotation TWI225994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW92135509A TWI225994B (en) 2003-12-16 2003-12-16 System, method and machine-readable storage medium for automated sentence annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW92135509A TWI225994B (en) 2003-12-16 2003-12-16 System, method and machine-readable storage medium for automated sentence annotation

Publications (2)

Publication Number Publication Date
TWI225994B TWI225994B (en) 2005-01-01
TW200521712A TW200521712A (en) 2005-07-01

Family

ID=35613502

Family Applications (1)

Application Number Title Priority Date Filing Date
TW92135509A TWI225994B (en) 2003-12-16 2003-12-16 System, method and machine-readable storage medium for automated sentence annotation

Country Status (1)

Country Link
TW (1) TWI225994B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI396983B (en) * 2010-04-14 2013-05-21 Inst Information Industry Named entity marking apparatus, named entity marking method, and computer program product thereof

Also Published As

Publication number Publication date
TWI225994B (en) 2005-01-01

Similar Documents

Publication Publication Date Title
Petersen et al. A machine learning approach to reading level assessment
CN101388012B (en) Phonetic check system and method with easy confusion tone recognition
US5845306A (en) Context based system for accessing dictionary entries
US5521816A (en) Word inflection correction system
US5477448A (en) System for correcting improper determiners
US5537317A (en) System for correcting grammer based parts on speech probability
CN108519974A (en) English composition automatic detection of syntax error and analysis method
WO2021146831A1 (en) Entity recognition method and apparatus, dictionary creation method, device, and medium
CN106601253B (en) Examination & verification proofreading method and system are read aloud in the broadcast of intelligent robot word
WO2022134355A1 (en) Keyword prompt-based search method and apparatus, and electronic device and storage medium
CN109213998A (en) Chinese wrongly written character detection method and system
CN103034625A (en) System and method for detecting and correcting mismatched Chinese character
CN111859858A (en) Method and device for extracting relationship from text
Huang et al. Words without boundaries: Computational approaches to Chinese word segmentation
Bontcheva et al. Using human language technology for automatic annotation and indexing of digital library content
TW200521712A (en) System, method and machine-readable storage medium for automated sentence annotation
CN116306487A (en) Intelligent detection system and method for academic treatises of higher institutions
Bloodgood et al. Data cleaning for xml electronic dictionaries via statistical anomaly detection
Xu Exploration of English Composition Diagnosis System Based on Rule Matching.
Mondal et al. Natural language query to NoSQL generation using query-response model
Parveen et al. Clause Boundary Identification using Classifier and Clause Markers in Urdu Language
CN117113964B (en) Composition plagiarism detection method
US20100169768A1 (en) Spell Checker That Teaches Rules of Spelling
Rodrigues et al. Detecting structural irregularity in electronic dictionaries using language modeling
Round et al. Automated parsing of interlinear glossed text from page images of grammatical descriptions

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees