CN1168029C - Method for isolation of Chinese words from connected Chinese text - Google Patents

Method for isolation of Chinese words from connected Chinese text

Info

Publication number
CN1168029C
Authority
CN
China
Prior art keywords
words
word
text
chinese
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB991231104A
Other languages
English (en)
Chinese (zh)
Other versions
CN1254891A (zh)
Inventor
Antonio Zamora
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN1254891A
Application granted
Publication of CN1168029C
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)
CNB991231104A 1993-03-03 1994-02-18 Method for isolation of Chinese words from connected Chinese text Expired - Fee Related CN1168029C (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/025,464 US5448474A (en) 1993-03-03 1993-03-03 Method for isolation of Chinese words from connected Chinese text
US08/025,464 1993-03-03

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN94101382A Division CN1095576C (zh) 1993-03-03 1994-02-18 Method for identifying words from input text using a data structure

Publications (2)

Publication Number Publication Date
CN1254891A (zh) 2000-05-31
CN1168029C (zh) 2004-09-22

Family

ID=21826213

Family Applications (2)

Application Number Title Priority Date Filing Date
CNB991231104A Expired - Fee Related CN1168029C (zh) 1993-03-03 1994-02-18 Method for isolation of Chinese words from connected Chinese text
CN94101382A Expired - Fee Related CN1095576C (zh) 1993-03-03 1994-02-18 Method for identifying words from input text using a data structure

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN94101382A Expired - Fee Related CN1095576C (zh) 1993-03-03 1994-02-18 Method for identifying words from input text using a data structure

Country Status (5)

Country Link
US (1) US5448474A
JP (1) JP2741835B2
KR (1) KR0122518B1
CN (2) CN1168029C
TW (1) TW261677B

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997040452A1 (en) * 1996-04-23 1997-10-30 Language Engineering Corporation Automated natural language translation
US6760695B1 (en) 1992-08-31 2004-07-06 Logovista Corporation Automated natural language processing
US6278967B1 (en) 1992-08-31 2001-08-21 Logovista Corporation Automated system for generating natural language translations that are domain-specific, grammar rule-based, and/or based on part-of-speech analysis
JPH07182465A (ja) * 1993-12-22 1995-07-21 Hitachi Ltd Character recognition method
US5806021A (en) * 1995-10-30 1998-09-08 International Business Machines Corporation Automatic segmentation of continuous text using statistical approaches
US6470306B1 (en) 1996-04-23 2002-10-22 Logovista Corporation Automated translation of annotated text based on the determination of locations for inserting annotation tokens and linked ending, end-of-sentence or language tokens
CN1114165C (zh) * 1998-02-13 2003-07-09 微软公司 Word segmentation method in Chinese text
US6640006B2 (en) 1998-02-13 2003-10-28 Microsoft Corporation Word segmentation in chinese text
US6175834B1 (en) 1998-06-24 2001-01-16 Microsoft Corporation Consistency checker for documents containing japanese text
US6694055B2 (en) * 1998-07-15 2004-02-17 Microsoft Corporation Proper name identification in chinese
JP2000132560A (ja) 1998-10-23 2000-05-12 Matsushita Electric Ind Co Ltd Chinese teletext processing method and apparatus
CN1143232C (zh) 1998-11-30 2004-03-24 皇家菲利浦电子有限公司 Automatic segmentation of text
US7099876B1 (en) 1998-12-15 2006-08-29 International Business Machines Corporation Method, system and computer program product for storing transliteration and/or phonetic spelling information in a text string class
US6389386B1 (en) 1998-12-15 2002-05-14 International Business Machines Corporation Method, system and computer program product for sorting text strings
US6460015B1 (en) 1998-12-15 2002-10-01 International Business Machines Corporation Method, system and computer program product for automatic character transliteration in a text string object
US6496844B1 (en) 1998-12-15 2002-12-17 International Business Machines Corporation Method, system and computer program product for providing a user interface with alternative display language choices
US6185524B1 (en) 1998-12-31 2001-02-06 Lernout & Hauspie Speech Products N.V. Method and apparatus for automatic identification of word boundaries in continuous text and computation of word boundary scores
US6731802B1 (en) 2000-01-14 2004-05-04 Microsoft Corporation Lattice and method for identifying and normalizing orthographic variations in Japanese text
US6968308B1 (en) 1999-11-17 2005-11-22 Microsoft Corporation Method for segmenting non-segmented text using syntactic parse
US6678409B1 (en) * 2000-01-14 2004-01-13 Microsoft Corporation Parameterized word segmentation of unsegmented text
US6513003B1 (en) 2000-02-03 2003-01-28 Fair Disclosure Financial Network, Inc. System and method for integrated delivery of media and synchronized transcription
JP4048169B2 (ja) * 2001-06-11 2008-02-13 博 石倉 System for supporting text input by automatic generation of spaces
US20050060150A1 (en) * 2003-09-15 2005-03-17 Microsoft Corporation Unsupervised training for overlapping ambiguity resolution in word segmentation
US20070214189A1 (en) * 2006-03-10 2007-09-13 Motorola, Inc. System and method for consistency checking in documents
US8539349B1 (en) 2006-10-31 2013-09-17 Hewlett-Packard Development Company, L.P. Methods and systems for splitting a chinese character sequence into word segments
US8428932B2 (en) * 2006-12-13 2013-04-23 Nathan S. Ross Connected text data stream comprising coordinate logic to identify and validate segmented words in the connected text
KR101638442B1 (ko) * 2009-11-24 2016-07-12 한국전자통신연구원 Method and apparatus for Chinese phrase segmentation
US9767095B2 (en) 2010-05-21 2017-09-19 Western Standard Publishing Company, Inc. Apparatus, system, and method for computer aided translation
JP5372110B2 (ja) * 2011-10-28 2013-12-18 シャープ株式会社 Information output device, information output method, and computer program
IL224482B (en) 2013-01-29 2018-08-30 Verint Systems Ltd System and method for keyword spotting using representative dictionary
CN103679165B (zh) * 2013-12-31 2017-02-08 北京百度网讯科技有限公司 OCR character recognition method and system
JP6476618B2 (ja) * 2014-07-07 2019-03-06 富士通株式会社 Decompression method, decompression program, and decompression device
IL242218B (en) 2015-10-22 2020-11-30 Verint Systems Ltd A system and method for maintaining a dynamic dictionary
IL242219B (en) 2015-10-22 2020-11-30 Verint Systems Ltd System and method for keyword searching using both static and dynamic dictionaries
CN107168952B (zh) * 2017-05-15 2021-06-04 北京百度网讯科技有限公司 Artificial intelligence-based information generation method and apparatus

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4327421A (en) * 1976-05-13 1982-04-27 Transtech International Corporation Chinese printing system
US4679951A (en) * 1979-11-06 1987-07-14 Cornell Research Foundation, Inc. Electronic keyboard system and method for reproducing selected symbolic language characters
US4365235A (en) * 1980-12-31 1982-12-21 International Business Machines Corporation Chinese/Kanji on-line recognition system
US4484305A (en) * 1981-12-14 1984-11-20 Paul Ho Phonetic multilingual word processor
JPH0724055B2 (ja) * 1984-07-31 1995-03-15 株式会社日立製作所 Word segmentation processing method
JPS61105671A (ja) * 1984-10-29 1986-05-23 Hitachi Ltd Natural language processing device
US4742516A (en) * 1985-01-14 1988-05-03 Sumitomo Electric Industries, Ltd. Method for transmitting voice information
KR880001588Y1 (ko) * 1985-02-18 1988-05-04 최영수 Word memorization tool
JPS61255468A (ja) * 1985-05-08 1986-11-13 Toshiba Corp Machine translation processing device
JPS6231467A (ja) * 1985-08-01 1987-02-10 Toshiba Corp Document creation device
US4669901A (en) * 1985-09-03 1987-06-02 Feng I Ming Keyboard device for inputting oriental characters by touch
GB8629908D0 (en) * 1986-12-15 1987-01-28 Kemano Ltd Words & characters computer input device
JPS63284676A (ja) * 1987-05-16 1988-11-21 Ricoh Co Ltd Character string processing device
US5079702A (en) * 1990-03-15 1992-01-07 Paul Ho Phonetic multi-lingual word processor
JPH04299767A (ja) * 1991-03-28 1992-10-22 Ricoh Co Ltd Morphological analysis device
US5161245A (en) * 1991-05-01 1992-11-03 Apple Computer, Inc. Pattern recognition system having inter-pattern spacing correction

Also Published As

Publication number Publication date
JPH06325076A (ja) 1994-11-25
JP2741835B2 (ja) 1998-04-22
CN1095576C (zh) 2002-12-04
US5448474A (en) 1995-09-05
KR0122518B1 (ko) 1997-11-20
CN1254891A (zh) 2000-05-31
TW261677B 1995-11-01
KR940022314A (ko) 1994-10-20
CN1100542A (zh) 1995-03-22

Similar Documents

Publication Publication Date Title
CN1168029C (zh) Method for isolation of Chinese words from connected Chinese text
US7197449B2 (en) Method for extracting name entities and jargon terms using a suffix tree data structure
US5706365A (en) System and method for portable document indexing using n-gram word decomposition
Angell et al. Automatic spelling correction using a trigram similarity measure
US6671856B1 (en) Method, system, and program for determining boundaries in a string using a dictionary
JP2726568B2 (ja) Character recognition method and apparatus
US10095755B2 (en) Fast identification of complex strings in a data stream
JPH0877173A (ja) Character string correction system and method
US20030065652A1 (en) Method and apparatus for indexing and searching data
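JP2001034623A (ja) Information retrieval method and information retrieval apparatus
JPS63254559A (ja) Spelling assistance method for compound words
CN101432686A (zh) Efficient storage and retrieval of word lists and other text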
US5560037A (en) Compact hyphenation point data
JP2693914B2 (ja) Retrieval system
KR101694179B1 (ko) Method and apparatus for generating an index based on vowel removal
JPS63244259A (ja) Keyword extraction device
EP0878766A2 (en) Method for converting formatted documents to ordered word lists
JPH056398A (ja) Document registration device and document retrieval device
CN1280757C (zh) Method and system for automatically searching for keywords in documents
Theeramunkong et al. Pattern-based features vs. statistical-based features in decision trees for word segmentation
CN1288580C (zh) Method and system for vocabulary acquisition and word boundary identification
JP3831392B2 (ja) Language knowledge acquisition program
Marukawa et al. A High Speed Word Matching Algorithm for Handwritten Chinese Character Recognition.
Rani et al. Post-processing methodology for word level Telugu character recognition systems using Unicode Approximation Models
KR20020003701A (ko) Method for automatically extracting keywords from digital documents

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20040922