TW548600B - Method and system for identifying attributes of new words in non-segmented text


Info

Publication number: TW548600B
Application number: TW90124532A
Authority: TW (Taiwan)
Other languages: Chinese (zh)
Inventor: Andi Wu
Original assignee: Microsoft Corp
Application filed by: Microsoft Corp
Prior art keywords: character, word, probability, speech, string


Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a method and apparatus for segmenting text by identifying new or rare words in the text. Under the present invention, a sub-string of single characters in the text is identified. For each character in the sub-string, an independent word probability is calculated that indicates the probability that each single character represents a single-character word. The probabilities for all of the characters in the sub-string are combined to form a total probability. If the total probability is below a threshold, the characters in the sub-string are considered to form a single multi-character word. In a further embodiment, the system determines parts of speech for multi-character words that are not found in the dictionary.

Description


The present invention relates to computer-based methods of segmenting text. More particularly, the invention relates to segmenting text that contains new words in a language.

Word segmentation is the process of identifying the individual words that make up a linguistic expression, such as a text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, performing natural language parsing and understanding, and searching a collection of data for specific words or phrases.

Segmenting English text is relatively simple because spaces and punctuation marks generally delimit the individual words in the text. In unsegmented text such as Japanese or Chinese, however, word boundaries are implicit rather than explicit. That is, unsegmented text usually contains no spaces or punctuation between words. These languages therefore cannot be segmented in the same way as English.

Most prior art systems segment text with a simple word breaker. These word breakers typically group characters into possible segments and then search for each segment in a dictionary. If a segment is found in the dictionary, it is kept as a possible segmentation of the text.

Although this technique works for words in the dictionary, it is much less effective for new words and rarely used words, which often cannot be found in the dictionary. Dictionary-based techniques typically identify such a group of characters as a series of single-character words instead of recognizing that the group forms a single "rare" word.

Some systems augment dictionary-based segmentation with statistics-based calculations to help identify these rare words. In this approach, the probability of a multi-character rare word is compared with a threshold. If the probability exceeds the threshold, the characters are identified as a word. Because rare words seldom occur in the language, however, their probabilities are almost always below the threshold. Thus, even these augmented systems often fail to identify rare words properly.


If syntactic analysis is to be performed on the text, the segmentation system must identify rare or new words in the text and must also identify the possible parts of speech of those words. A segmentation system is therefore needed that is better able to identify rare words and that can identify the parts of speech of rare or new words.

Summary of the Invention

Embodiments of the present invention provide a method and apparatus for segmenting text by identifying new or rare words in the text. Under the invention, a sub-string of single characters in the text is identified. For each character in the sub-string, an independent word probability is calculated, which indicates the probability that the character stands alone as a single-character word. The probabilities of all characters in the sub-string are combined to form a total probability. If the total probability is below a threshold, the characters in the sub-string are considered to form a single multi-character word.

In a further embodiment, the system determines the parts of speech of multi-character words that are not found in the dictionary. To do so, the system determines a probability for each character in the word. This probability indicates the likelihood that the character is found at its current position in a word that has the same number of characters and a particular part of speech. For example, for the two-character word "AB", the system determines a first probability that character "A" appears as the first character of a two-character noun, a second probability that "A" appears as the first character of a two-character verb, and a third probability that "A" appears as the first character of a two-character adjective.

The character probabilities are combined on a part-of-speech basis to form a separate total probability for each part of speech. Each total is then compared with a threshold. In the example above, the probability that "A" appears as the first character of a two-character noun is combined with the probability that "B" appears as the second character of a two-character noun to give the total probability that "AB" is a noun. Each part of speech whose total probability exceeds the threshold is added as a possible part of speech for the multi-character word.
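The part-of-speech assignment summarized above can be illustrated with a minimal sketch. The table layout, the tag names, the function name `possible_parts_of_speech`, and the choice of multiplication as the way to combine the character probabilities are all assumptions made for illustration; the text only says that the probabilities are "combined".

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lookup table (not from the patent):
# (character, part of speech, word length, position) -> probability.
PosTable = Dict[Tuple[str, str, int, int], float]


def possible_parts_of_speech(
    word: str,
    table: PosTable,
    threshold: float,
    tags: Sequence[str] = ("noun", "verb", "adjective"),
) -> List[str]:
    """Return every tag whose combined character probability exceeds the threshold."""
    length = len(word)
    accepted = []
    for tag in tags:
        total = 1.0
        for position, char in enumerate(word):
            # Assumption: combine by multiplying; unseen entries count as 0.
            total *= table.get((char, tag, length, position), 0.0)
        if total > threshold:
            accepted.append(tag)
    return accepted


# Made-up probabilities for the two-character word "AB".
example_table: PosTable = {
    ("A", "noun", 2, 0): 0.5,
    ("B", "noun", 2, 1): 0.4,
    ("A", "verb", 2, 0): 0.1,
    ("B", "verb", 2, 1): 0.2,
    ("A", "adjective", 2, 0): 0.3,
}
print(possible_parts_of_speech("AB", example_table, threshold=0.05))
```

With these invented numbers only the noun reading clears the threshold, so "AB" would be entered as a possible noun. A word may equally receive several tags, or none.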

Brief Description of the Drawings

FIG. 1 is a block diagram of an example general-purpose computer system suitable for implementing the invention.

FIG. 2 is a block diagram of a hand-held device in which the invention may be practiced.

FIG. 3 is a more detailed block diagram of the components of one embodiment of the invention.

FIG. 4 is a flow diagram of a method of segmenting text and identifying parts of speech under an embodiment of the invention.

Embodiments of the Invention

FIG. 1 illustrates an example of a suitable computing system environment 100 in which the invention may be implemented. Computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation on the scope of use or functionality of the invention. Nor should computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of the components illustrated in example operating environment 100.

The invention is operational with numerous other general-purpose or special-purpose computing environments or configurations. Examples of well-known systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Program modules generally include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.

With reference to FIG. 1, an example system for implementing the invention includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components, including the system memory, to the processing unit 120. The system bus 121 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, also known as the Mezzanine bus.

Computer 110 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computer 110 and include both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computer 110. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.

System memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory, such as read-only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to, and/or presently being operated on by, processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

Computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile storage media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disc drive 155 that reads from or writes to a removable, nonvolatile optical disc 156, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the example operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile discs, digital video tape, solid-state RAM, solid-state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface, such as interface 140, and magnetic disk drive 151 and optical disc drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media, discussed above and illustrated in FIG. 1, store computer-readable instructions, data structures, program modules, and other data for computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. These components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.

A user may enter commands and information into computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball, or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but they may be connected by other interface and bus structures, such as a parallel port, game port, or universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices, such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.

Computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer 180. Remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device, or other network node, and typically includes many or all of the elements described above relative to computer 110.

The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.

When used in a LAN networking environment, computer 110 is connected to LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, computer 110 typically includes a modem 172 or other means for establishing communications over WAN 173, such as the Internet. Modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160 or another appropriate mechanism. In a networked environment, program modules depicted relative to computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. The network connections shown are examples, and other means of establishing a communications link between the computers may be used.

FIG. 2 is a block diagram of a mobile device 200, which is an example computing environment. Mobile device 200 includes a microprocessor 202, memory 204, input/output (I/O) components 206, and a communication interface 208 for communicating with remote computers or other mobile devices. In one embodiment, these components are coupled for communication with one another over a suitable bus 210.

Memory 204 is implemented as nonvolatile electronic memory, such as random access memory (RAM) with a battery back-up module (not shown), so that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.

Memory 204 includes an operating system 212, application programs 214, and an object store 216.

During operation, operating system 212 is executed by processor 202 from memory 204. In one preferred embodiment, operating system 212 is a WINDOWS CE brand operating system commercially available from Microsoft Corporation. Operating system 212 is preferably designed for mobile devices and implements database features that can be used by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.

Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers, and broadcast tuners. Mobile device 200 can also be connected directly to a computer to exchange data. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.

Input/output components 206 include a variety of input devices, such as a touch-sensitive screen, buttons, rollers, and a microphone, as well as a variety of output devices, including an audio generator, a vibrating device, and a display. The devices listed above are examples and need not all be present on mobile device 200. In addition, other input/output devices may be attached to or found within mobile device 200 within the scope of the present invention.

Embodiments of the present invention provide a method and apparatus for segmenting text by identifying rare and new multi-character words. Other embodiments provide a method and system for identifying the parts of speech of non-dictionary words. FIG. 3 is a block diagram of various components of one embodiment of the invention. FIG. 4 is a flow diagram of a method under an embodiment of the invention that uses the components of FIG. 3.

In step 400 of FIG. 4, word breaker 302 of FIG. 3 identifies combinations of adjacent characters in input text 300 that appear in a small lexicon 304 (also referred to as a dictionary). Lexicon 304 is "small" because only a limited amount of grammatical information is stored for each word. It need not contain a small number of words. In fact, in some embodiments small lexicon 304 contains a large number of words.

In one embodiment of the invention, word breaker 302 searches for words in small lexicon 304 using a data structure known as a trie. In a trie, words are not listed sequentially but are instead represented by chains of states. Each state represents an individual character and contains one or more child states, where each child state holds a character that follows the current state's character in at least one word of small lexicon 304. Each state also indicates whether its character appears as the final character of a word, the word being formed by the chain of states leading to that character.

Using the trie data structure, the possible words in a character string such as "ABCD" can be determined in parallel. For example, the system may begin at the state associated with character "A". If that state indicates that character "A" alone is a word in small lexicon 304, "A" is identified as a possible segment of the string. The system then checks whether there is a child state for character "B" extending from the "A" state. If there is a "B" child state, that state is checked to see whether character "B" is the final character of any word. If so, the string "AB" is identified as a possible segment. The system then checks whether there is a child state for character "C" extending from the "B" state. If there is no "C" child state extending from the "B" state, the system stops following the current chain and begins searching a new chain that starts with character "B". The process of starting a new chain is repeated for each character of the input string, so that every character is examined as the possible start of a chain.

Once the words stored in small lexicon 304 have been identified at step 400, the method of FIG. 4 continues at step 402, where word breaker 302 applies proper name rules 306 to identify words that may not be stored in the small lexicon but that represent proper names, such as the names of people or places.

The words identified through small lexicon 304 and proper name rules 306 are placed in a word lattice, which is provided to a lexical lookup 310 that accesses a large lexicon 312. Large lexicon 312 contains more lexical information than small lexicon 304. In fact, in many embodiments small lexicon 304 is built from large lexicon 312 and is periodically updated by reference to it.

Using large lexicon 312, lexical lookup 310 expands, at step 406 of FIG. 4, the amount of lexical information stored in the word lattice for each word. This additional information includes such items as the origin of the word, whether the word can be used as a proper noun, and other lexical and grammatical details of the word.

The word lattice and its expanded lexical information are passed from lexical lookup 310 to derivational morphology 314. At step 408 of FIG. 4, derivational morphology 314 combines adjacent word segments in the lattice to form larger multi-segment words. For example, derivational morphology component 314 can append suffix character strings to, insert infix character strings into, and prepend prefix character strings to other segments to form larger words. In some embodiments, some or all of these derivational morphology rules are applied by word breaker 302 at step 402 rather than by morphology component 314 at step 408. An advantage of applying them in morphology component 314, however, is that more information from the large lexicon is available as input to the derivational morphology rules. In addition, derivational morphology component 314 can combine segments in order to identify other named entities, such as the names of people, institutions, and geographic locations, as well as other proper names and other units such as dates and times.

The larger words constructed by derivational morphology 314 are added to the word lattice together with lexical information for the larger words. In most embodiments, the larger words constructed by derivational morphology 314 do not replace the smaller segments but are instead placed in the lattice alongside them.

The expanded word lattice produced by derivational morphology 314 is provided to a new word identifier 320, which identifies new words in the text at step 410 of FIG. 4.
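The trie lookup performed by the word breaker at step 400 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class and function names are hypothetical, and the chains are followed one start position at a time, whereas the text describes the possible words as being determined in parallel.

```python
class TrieState:
    """One state of the trie: child states keyed by character, plus an end-of-word flag."""

    def __init__(self):
        self.children = {}
        self.is_word_end = False


def build_trie(words):
    """Build a trie in which each chain of states spells one lexicon word."""
    root = TrieState()
    for word in words:
        state = root
        for char in word:
            state = state.children.setdefault(char, TrieState())
        state.is_word_end = True
    return root


def find_segments(text, root):
    """Return (start index, word) for every lexicon word found in the text."""
    segments = []
    for start in range(len(text)):  # every character may begin a chain
        state = root
        for end in range(start, len(text)):
            state = state.children.get(text[end])
            if state is None:  # no child state: abandon this chain
                break
            if state.is_word_end:  # this character ends a lexicon word
                segments.append((start, text[start:end + 1]))
    return segments


# Toy lexicon, invented for illustration.
lexicon = build_trie(["A", "AB", "BC", "BCD"])
print(find_segments("ABCD", lexicon))
```

For the toy lexicon, the chain starting at "A" yields "A" and "AB" and then stops because "AB" has no "C" child. The chain starting at "B" yields "BC" and "BCD". All four segments would be placed in the word lattice as alternatives.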


To identify new words, new word identifier 320 first searches the word lattice for strings of single characters that are not part of any multi-character word in the lattice. For each string it finds, new word identifier 320 determines the likelihood that all of the characters in the string are single-character words. In one embodiment, this is done by averaging or summing the independent word probabilities of the individual characters in the string, where an independent word probability is the likelihood that an individual character is a single-character word in a section of text.

In one embodiment, the independent word probability of a character c is:

$$ IWP(c) = \frac{N(\mathrm{Word}(c))}{N(c)} \qquad \text{(Equation 1)} $$

where N(Word(c)) is the number of times character c occurs as a single-character word in the data, N(c) is the number of times character c occurs in the data (either as a single-character word or as part of a multi-character word), and IWP(c) is the independent word probability of the character.

The counts for each character are obtained from parsed data by examining the parse results and counting the number of times each character occurs and the number of times each character occurs as an independent word.

In one embodiment, the parsed data used to form the independent word probabilities need not be large enough to contain every character, as long as the data contain the characters that commonly appear as single-character words.

In one embodiment, the independent word probabilities of all characters that appear in the data are calculated before the new word identifier is used. These probabilities are then stored in an independent word probability storage 322, where new word identifier 320 accesses them during new word identification.

When new word identifier 320 finds a string of single characters that are not part of a multi-character word, it retrieves the independent word probabilities of those characters from storage 322. It then sums or averages the independent word probabilities of the characters in the string to form a total independent word probability for the string.

The total independent word probability of the string is compared with a threshold probability to decide whether the characters are more likely to form a single word or a string of single-character words. If the total independent word probability is below the threshold, the characters in the string are considered to form a single word, and this new word is added to the word lattice. If the total independent word probability is above the threshold, the string is considered to be a string of single-character words.

In one embodiment, different thresholds are used for strings with different numbers of characters. For example, one threshold is used for two-character strings and a different threshold is used for four-character strings.
In some embodiments, the character string is limited to two, three, or four characters. For a single-character string with more than two characters, the present invention is implemented For example, determine the total independent word probability of each possible group of the character. For example, if the character string is n ABC ”, the word recognizer 3 2 0 generates the strings π AB”, ', BC ··, and n ABC1, respectively. Probability of total independent words. All strings with total probabilities less than the threshold are added to the character grid as new words. Therefore, if the total probabilities of "BCn & " ABC " are less than the threshold and the total independence of " ΑΒΠ If the word probability is above this threshold, then "BC, f and" ABC "will consider adding characters as possible. The word "&AB;" will not be added to the character grid. In the embodiment of Figs. 3 and 4, before the new word is added to the character grid, step 4 1 2 of Fig. 4 is used to determine the part of it. Speech. Know that a single new word can represent many parts of speech. So part of the speech recognizer 3 2 4 is to recognize all the parts of the speech that the new word may represent. Part of the speech recognizer 3 2 4 uses-part of each character in the word The probability of speech determines the part of speech that a new word may represent. The partial speech probability of a character indicates the possibility of the character in a particular part of the speech, where the length of the character and the position of the character in the character are fixed. For example, some speech probability may refer to a character. " A "probability of the second character in a three-character noun. -15- This paper size applies to Chinese National Standard (CNS) A4 (210 X 297 mm) A7
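The independent-word-probability test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patent's implementation: the function names, the toy corpus, the use of averaging rather than summing, the 0.5 thresholds, and the `default_iwp` value for characters missing from the data are all hypothetical.

```python
from collections import Counter


def independent_word_probabilities(segmented_corpus):
    """Estimate IWP(c) = N(Word(c)) / N(c) (Equation 1).

    segmented_corpus: iterable of sentences, each a list of words,
    standing in for the parsed data described in the text.
    """
    as_single_word = Counter()  # N(Word(c)): c occurs as a one-character word
    total = Counter()           # N(c): every occurrence of c
    for sentence in segmented_corpus:
        for word in sentence:
            for ch in word:
                total[ch] += 1
            if len(word) == 1:
                as_single_word[word] += 1
    return {ch: as_single_word[ch] / total[ch] for ch in total}


def candidate_new_words(single_chars, iwp, thresholds, default_iwp=1.0):
    """Return every 2- to 4-character sub-string whose average IWP is
    below the threshold for its length.

    default_iwp is an assumption: characters missing from the data are
    treated as always being single-character words.
    """
    found = []
    n = len(single_chars)
    for length in range(2, min(4, n) + 1):
        for start in range(n - length + 1):
            sub = single_chars[start:start + length]
            average = sum(iwp.get(ch, default_iwp) for ch in sub) / length
            if average < thresholds[length]:
                found.append(sub)
    return found


# Toy data: "A" is always a word on its own, "C" never is.
corpus = [["A", "BC", "D"], ["A", "B"], ["BC", "A"]]
iwp = independent_word_probabilities(corpus)
new_words = candidate_new_words("ABC", iwp, {2: 0.5, 3: 0.5, 4: 0.5})
print(new_words)  # ['BC', 'ABC']
```

With this toy data the outcome mirrors the "ABC" example in the text: "BC" and "ABC" fall below the threshold and become candidate new words, while "AB" does not.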

A part-of-speech probability can be written $P_{cat,loc,len}$, where $P$ stands for probability, $cat$ is an abbreviation for the part of speech or its category, $loc$ is the position of the character in the word, and $len$ is the length of the word in characters. For example, the probability of a character being the second character of a four-character verb is written Pv24, the probability of a character being the first character of a two-character noun is written Pn12, and the probability of a character being the third character of a four-character adjective is written Pa34.

One embodiment of the invention limits the parts of speech to noun (n), verb (v), and adjective (a), and limits the word length to two to four characters, in order to limit the number of probabilities that must be calculated for each character. The length attribute of a probability can therefore be 2, 3, or 4, and the position attribute can be 1, 2, 3, or 4. With these limits, each character has at most 27 part-of-speech probabilities (three parts of speech times 2 + 3 + 4 = 9 combinations of length and position).

In one embodiment, each part-of-speech probability is determined by counting the words in a dictionary of the language that have the specified length and part of speech and that have the character at the specified position. Thus, to determine Pn23 for the character "A", an embodiment of the invention counts the three-character nouns in the dictionary that have "A" as their second character. This count is then divided by the number of words in which the character appears to give the part-of-speech probability. This is expressed by the following equation:

$$P_{cat,loc,len}(c) = \frac{N(cat,loc,len(c))}{N(c)} \qquad \text{(Equation 2)}$$

where $P_{cat,loc,len}(c)$ is the part-of-speech probability of character $c$, $N(cat,loc,len(c))$ is the number of words of the specified part of speech and length in which the character appears at the specified position, and $N(c)$ is the total number of words in which the character appears.

In one embodiment, the counts used to determine the part-of-speech probabilities are taken only over the head words of the dictionary. One embodiment uses a dictionary of 85,135 head words.

In one embodiment, the part-of-speech probabilities of every character in the language are calculated before a section of text is received for processing. FIG. 3 shows these probabilities stored in part-of-speech (POS) probability storage 326.

When new word identifier 320 identifies a new word, part-of-speech identifier 324 retrieves the appropriate part-of-speech probabilities for each character in the word from POS probability storage 326. For example, for the new word "AB", part-of-speech identifier 324 retrieves the probabilities Pn12, Pv12, and Pa12 for the character "A" and the probabilities Pn22, Pv22, and Pa22 for the character "B". The probabilities for the same part of speech are then summed or averaged to form a total probability for the word. Thus Pn12 of "A" and Pn22 of "B" are combined to give the total probability that "AB" is a noun. Similarly, Pv12 of "A" and Pv22 of "B" are combined to give the total probability that "AB" is a verb.

Each total probability is then compared with a threshold to determine which part or parts of speech the word may represent. The total probability that the word is a noun is compared with the threshold, as are the total probability that the word is a verb and the total probability that the word is an adjective. Although a single threshold is used here, other embodiments use different thresholds for different parts of speech.

Each part of speech whose total probability exceeds the threshold is treated as a possible part of speech for the new word. When part-of-speech identifier 324 adds the word to the word lattice, it assigns to the word a part of speech whose total probability exceeds the threshold. In one embodiment, if more than one part of speech exceeds the threshold for a word, the word is inserted into the lattice once for each such part of speech. Thus, if the total probability that the word is a noun exceeds the threshold and the total probability that the word is a verb exceeds the threshold, the word is added to the lattice twice, once as a noun and once as a verb. In other embodiments, the word is added to the lattice only once and is assigned the part of speech with the highest total probability.
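The part-of-speech procedure described above (Equation 2 followed by the threshold test) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the four-entry toy dictionary, the use of averaging, and the 0.4 threshold are hypothetical, and the sketch assumes that N(c) counts each (word, part of speech) dictionary entry containing the character once. The noun fallback follows the embodiment in which a word with no part of speech above the threshold is marked as a noun.

```python
from collections import defaultdict


def pos_probabilities(dictionary):
    """Compute P[(cat, loc, len)][c] = N(cat, loc, len(c)) / N(c) (Equation 2).

    dictionary: iterable of (word, cat) head-word entries, with cat one
    of "n", "v", "a". Positions (loc) are 1-based, as in Pn12 or Pv24.
    """
    counts = defaultdict(lambda: defaultdict(int))
    entries_with_char = defaultdict(int)  # N(c): entries that contain c
    for word, cat in dictionary:
        for ch in set(word):
            entries_with_char[ch] += 1
        for loc, ch in enumerate(word, start=1):
            counts[(cat, loc, len(word))][ch] += 1
    return {
        key: {ch: n / entries_with_char[ch] for ch, n in chars.items()}
        for key, chars in counts.items()
    }


def possible_parts_of_speech(word, probs, threshold, cats=("n", "v", "a")):
    """Average the per-character probabilities for each part of speech and
    keep those above the threshold; fall back to noun if none qualifies."""
    length = len(word)
    result = []
    for cat in cats:
        total = sum(
            probs.get((cat, loc, length), {}).get(ch, 0.0)
            for loc, ch in enumerate(word, start=1)
        ) / length
        if total > threshold:
            result.append(cat)
    return result or ["n"]


# Toy dictionary of two-character head words (hypothetical data).
toy_dictionary = [("AB", "n"), ("AC", "n"), ("DB", "v"), ("AE", "v")]
probs = pos_probabilities(toy_dictionary)

print(possible_parts_of_speech("DC", probs, 0.4))  # ['n', 'v']
print(possible_parts_of_speech("DE", probs, 0.4))  # ['v']
print(possible_parts_of_speech("CA", probs, 0.4))  # ['n'] (noun fallback)
```

Returning a list of tags corresponds to the embodiment that inserts the word into the lattice once per qualifying part of speech; the single-tag embodiment would instead keep only the tag with the highest total.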

In one embodiment, if none of the total part-of-speech probabilities exceeds the threshold, part-of-speech identifier 324 inserts the word into the word lattice and marks it as a noun.

Although the method of determining the part of speech of a new word is combined here with the method of identifying new words, these two aspects of the invention need not be practiced together. The method of identifying new words can be used without the method of identifying their parts of speech, and the method of identifying the part of speech of a new word can be used with any technique for identifying new words.

The word lattice produced by part-of-speech identifier 324 is provided to syntactic parser 316. At step 414 of FIG. 4, a syntactic analysis is performed using the augmented word lattice. In one embodiment, the parse is produced by a bottom-up chart parser, which builds progressively larger phrases from smaller words and phrases. To build larger phrases, syntactic parser 316 applies grammar rules that examine the lexical representations of words and phrases to determine how they can combine into larger words or phrases. One embodiment uses a binary grammar, which describes how two adjacent words or phrases combine.

The syntactic analysis performed by syntactic parser 316 considers all of the segments in the augmented word lattice. The parser is constrained to combine only segments that represent adjacent characters of the original input text, and a final analysis must span the entire input text. As a result, no valid parse can involve two overlapping segments or a set of segments that does not represent the whole input text.

During parsing, syntactic parser 316 determines whether each new word identified by new word identifier 320 forms part of a valid parse, given the part of speech assigned by part-of-speech identifier 324. A new word that does not form part of a valid parse is not recognized as a new word of the text. In some embodiments of the invention, about half of the words identified by new word identifier 320 do not form part of a valid parse and are therefore discarded during parsing. The remaining words identified by new word identifier 320 do form part of a valid parse, and in many cases they are important to forming a valid parse of the text. In one test, about 21% of the sentences parsed by syntactic parser 316 could not be parsed without the new words provided by new word identifier 320.

In some embodiments, syntactic parser 316 produces several valid syntactic parses, each representing a different valid segmentation of the input text. In one embodiment, these valid parses are all passed to logical form generator 318, which identifies the semantic relations within each parse. These semantic relations can then be used to select, from among the valid parses, the one most likely to be correct for the input string. This semantic identification is shown at step 416 of FIG. 4.

Although the present invention has been described with reference to particular embodiments, those skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
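The coverage constraint described above, that a final analysis must span the whole input using only adjacent, non-overlapping lattice segments, can be illustrated with a short Python sketch. It is not a chart parser and applies no grammar rules; it only enumerates the segment sequences that tile the input. The function name and the `(start, end, label)` segment representation are hypothetical, and every segment is assumed to have `end > start`.

```python
def full_segmentations(text_length, segments):
    """List every sequence of lattice segments that tiles [0, text_length).

    segments: list of (start, end, label) tuples with end exclusive.
    Each segment in a sequence starts exactly where the previous one
    ends, so overlapping segments never appear in the same result.
    """
    by_start = {}
    for seg in segments:
        by_start.setdefault(seg[0], []).append(seg)

    def extend(pos, path):
        if pos == text_length:
            yield list(path)
            return
        for seg in by_start.get(pos, []):
            path.append(seg)
            yield from extend(seg[1], path)
            path.pop()

    return list(extend(0, []))


# Lattice for the input "ABC": three single characters plus the
# candidate new words "BC" and "ABC".
lattice = [
    (0, 1, "A"), (1, 2, "B"), (2, 3, "C"),
    (1, 3, "BC"), (0, 3, "ABC"),
]
for path in full_segmentations(3, lattice):
    print([label for _, _, label in path])
# ['A', 'B', 'C']
# ['A', 'BC']
# ['ABC']
```

In the system described in the text, the parser's grammar rules would further decide which of these candidate segmentations, if any, yields a valid parse.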

Claims (1)

Patent Application No. 090124532, replacement copy of the Chinese claims (May, ROC year 92)

1. A method for segmenting a string of characters of a non-segmented language, the method comprising:
identifying a string of single characters in an input character string;
determining an independent word probability for each character, the probability indicating the likelihood that the character is a single-character word;
combining the independent word probabilities of the characters in the string of single characters to determine a total independent word probability for the string of single characters; and
if the total independent word probability is less than a threshold, representing the string of single characters as a single word.

2. The method of claim 1, wherein identifying a string of single characters in the input character string comprises performing a lexical lookup to group characters of the input character string into multi-character words, and searching the input character string for strings of single characters that are not included in a multi-character word.

3. The method of claim 2, wherein the step of identifying a string of single characters in the input character string further comprises performing derivational word formation to identify other multi-character words before searching the input character string for strings of single characters that are not included in a multi-character word.

4. The method of claim 3, wherein the step of identifying a string of single characters in the input character string further comprises performing proper-name identification to identify other multi-character words before searching the input character string for strings of single characters that are not included in a multi-character word.

5. The method of claim 1, wherein determining the independent word probability of a character comprises counting the number of times the character occurs in data as a single-character word and dividing that count by the number of times the character occurs in the data.

6. The method of claim 1, wherein combining the independent word probabilities comprises averaging the independent word probabilities.

7. The method of claim 1, wherein combining the independent word probabilities comprises summing the independent word probabilities.

8. The method of claim 1, further comprising assigning a part of speech to the single word that represents the string of single characters.

9. The method of claim 8, wherein assigning a part of speech comprises:
determining a part-of-speech probability for each character in the single word, the probability indicating the likelihood that the character occurs in a word of a particular part of speech, given the length of the single word and the position of the character in the single word;
combining the part-of-speech probabilities of all of the characters of the single word to determine a total part-of-speech probability; and
if the total part-of-speech probability exceeds a threshold, assigning the part of speech associated with the total part-of-speech probability to the single word.

10. The method of claim 9, wherein determining the part-of-speech probability of a character comprises:
counting the words in a dictionary that represent the particular part of speech, have the same length as the single word, and contain the character at the same position as in the single word, to provide a part-of-speech count; and
dividing the part-of-speech count by the number of words in the dictionary that contain the character, to form the part-of-speech probability of the character.

11. The method of claim 9, wherein determining a part-of-speech probability comprises retrieving a previously determined part-of-speech probability from memory.

12. The method of claim 9, further comprising designating the single word as a noun if none of the total part-of-speech probabilities for the parts of speech exceeds the threshold.

13. The method of claim 1, further comprising:
identifying a sub-string of single characters within the string of single characters;
combining the independent word probabilities of the characters in the sub-string to determine a total independent word probability for the sub-string; and
if the total independent word probability is less than a threshold, representing the sub-string of single characters as a single word.

14. A method of identifying a part of speech for a word, the method comprising:
determining a character probability for each character in the word, the probability indicating the likelihood that the character occurs in a word of a particular part of speech;
combining the character probabilities of all of the characters in the word to form a total part-of-speech probability; and
if the total part-of-speech probability exceeds a threshold, marking the word as representing the part of speech.

15. The method of claim 14, wherein determining a character probability comprises:
counting all of the words in a dictionary that represent the particular part of speech, have the same length as the word, and have the character at the same position as in the word, to produce a part-of-speech word count; and
dividing the part-of-speech word count by the number of words in the dictionary that have the character, to form the character probability.

16. The method of claim 14, wherein the steps of determining character probabilities, combining the character probabilities, and marking the word as representing a part of speech are repeated for multiple parts of speech.

17. A system for identifying words in a string of characters of a non-segmented language, the system comprising:
a word breaker that uses a set of lexical records to identify words in the character string; and
a new word identifier that identifies new words in the character string by determining whether characters in the character string that the word breaker did not place in a word form a single word.

18. The system of claim 17, further comprising a part-of-speech identifier that identifies a part of speech for each new word identified by the new word identifier.

19. The system of claim 18, wherein the part-of-speech identifier uses a set of character probabilities to identify the part of speech of a new word, each character probability being associated with a single character in the new word and indicating the likelihood that the associated character occurs, at the same position as in the new word, in a word of a particular part of speech that has the same length as the new word.

20. The system of claim 19, wherein the part-of-speech identifier combines the character probabilities of all of the characters in the new word to produce a total part-of-speech probability.

21. The system of claim 20, wherein the part-of-speech identifier compares the total part-of-speech probability with a threshold to determine whether the part of speech associated with the total part-of-speech probability should be assigned to the new word.

22. The system of claim 17, further comprising a syntactic parser that produces a syntactic parse of the character string from the words identified by the word breaker and the new word identifier.

23. The system of claim 22, wherein the syntactic parser excludes a new word from the syntactic parse.

24. The system of claim 17, wherein the probability that a character forms a single-character word is determined by dividing a count of the times the character forms a single-character word in data by the number of words in the data that contain the character.

25. A computer-readable medium having computer-executable instructions for performing steps comprising:
identifying a string of single characters in an input character string;
determining an independent word probability for each character, the probability indicating the likelihood that the character occurs as a single-character word;
combining the independent word probabilities of the characters in the string of single characters to determine a total independent word probability for the string of single characters; and
if the total independent word probability is less than a threshold, representing the string of single characters as a single word.
TW90124532A 2000-10-04 2001-10-04 Method and system for identifying attributes of new words in non-segmented text TW548600B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US23752200P 2000-10-04 2000-10-04

Publications (1)

Publication Number Publication Date
TW548600B true TW548600B (en) 2003-08-21

Family

ID=22894078

Family Applications (1)

Application Number Title Priority Date Filing Date
TW90124532A TW548600B (en) 2000-10-04 2001-10-04 Method and system for identifying attributes of new words in non-segmented text

Country Status (2)

Country Link
CN (1) CN1193304C (en)
TW (1) TW548600B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI426399B (en) * 2005-11-23 2014-02-11 Dun & Bradstreet Corp Method and apparatus of searching and matching input data to stored data

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100347723C (en) * 2005-07-15 2007-11-07 清华大学 Off-line hand writing Chinese character segmentation method with compromised geomotric cast and sematic discrimination cost
CN101271450B (en) * 2007-03-19 2010-09-29 株式会社东芝 Method and device for cutting language model
CN100478961C (en) * 2007-09-17 2009-04-15 中国科学院计算技术研究所 New word of short-text discovering method and system
CN100489863C (en) * 2007-09-27 2009-05-20 中国科学院计算技术研究所 New word discovering method and system thereof
CN101882226B (en) * 2010-06-24 2013-07-24 汉王科技股份有限公司 Method and device for improving language discrimination among characters
CN107203542A (en) * 2016-03-17 2017-09-26 阿里巴巴集团控股有限公司 Phrase extracting method and device
CN109858010B (en) * 2018-11-26 2023-01-24 平安科技(深圳)有限公司 Method and device for recognizing new words in field, computer equipment and storage medium
CN109858011B (en) * 2018-11-30 2022-08-19 平安科技(深圳)有限公司 Standard word bank word segmentation method, device, equipment and computer readable storage medium
CN112101308B (en) * 2020-11-11 2021-02-09 北京云测信息技术有限公司 Method and device for combining text boxes based on language model and electronic equipment

Also Published As

Publication number Publication date
CN1193304C (en) 2005-03-16
CN1369877A (en) 2002-09-18

Similar Documents

Publication Publication Date Title
US7069207B2 (en) Linguistically intelligent text compression
US7421386B2 (en) Full-form lexicon with tagged data and methods of constructing and using the same
US5634084A (en) Abbreviation and acronym/initialism expansion procedures for a text to speech reader
EP1585030B1 (en) Automatic Capitalization Through User Modeling
US7490034B2 (en) Lexicon with sectionalized data and method of using the same
JP2008108274A (en) Computer program for parsing text within corpus and recording medium therefor
US20020123877A1 (en) Method and apparatus for performing machine translation using a unified language model and translation model
JP2005251206A (en) Word collection method and system for use in word segmentation
CA2533328A1 (en) Grammatically correct contraction spelling suggestions for french
TW548600B (en) Method and system for identifying attributes of new words in non-segmented text
JP2006065387A (en) Text sentence search device, method, and program
US6968308B1 (en) Method for segmenting non-segmented text using syntactic parse
JP6787755B2 (en) Document search device
Šantić et al. Automatic diacritics restoration in Croatian texts
JP4007413B2 (en) Natural language processing system, natural language processing method, and computer program
JP4024137B2 (en) Quantity expression search device
JP4155970B2 (en) Information processing apparatus, synonym database generation method, and synonym database generation program
JP2004287781A (en) Importance calculation device
JP7022789B2 (en) Document search device, document search method and computer program
JP3972697B2 (en) Natural language processing system, natural language processing method, and computer program
KR101450795B1 (en) Apparatus and method for anaphora resolution
JP2009009583A (en) Method for segmenting non-segmented text using syntactic parse
Theeramunkong et al. Towards automatic grammar acquisition from a bracketed corpus
US7599829B2 (en) Phonetic searching using partial characters
JP2012027729A (en) Search device, search method, and program

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees