JP2002082945A - Tokenizer for a natural language processing system - Google Patents
Tokenizer for a natural language processing system
Info
- Publication number: JP2002082945A (application JP2001219846A)
- Authority: JP (Japan)
- Prior art keywords: token, punctuation, segmentation, segmenting, tokens
- Prior art date
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/226—Validation
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Character Discrimination (AREA)
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US21957900P | 2000-07-20 | 2000-07-20 | |
| US09/822976 | 2001-03-30 | | |
| US09/822,976 US7092871B2 (en) | 2000-07-20 | 2001-03-30 | Tokenizer for a natural language processing system |
| US60/219579 | 2001-03-30 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| JP2002082945A (ja) | 2002-03-22 |
| JP2002082945A5 | 2008-09-04 |
Family
ID=26914031
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2001219846A (Withdrawn), published as JP2002082945A (ja) | 2000-07-20 | 2001-07-19 | Tokenizer for a natural language processing system |
Country Status (5)
| Country | Link |
|---|---|
| US (2) | US7092871B2 |
| EP (1) | EP1178408B1 |
| JP (1) | JP2002082945A |
| AT (1) | ATE421729T1 |
| DE (1) | DE60137477D1 |
Families Citing this family (44)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9076448B2 (en) * | 1999-11-12 | 2015-07-07 | Nuance Communications, Inc. | Distributed real time speech recognition system |
| US7050977B1 (en) * | 1999-11-12 | 2006-05-23 | Phoenix Solutions, Inc. | Speech-enabled server for internet website and method |
| US7725307B2 (en) | 1999-11-12 | 2010-05-25 | Phoenix Solutions, Inc. | Query engine for processing voice based queries including semantic decoding |
| US7392185B2 (en) | 1999-11-12 | 2008-06-24 | Phoenix Solutions, Inc. | Speech based learning/training system using semantic decoding |
| JP2002268665A (ja) * | 2001-03-13 | 2002-09-20 | Oki Electric Ind Co Ltd | Text-to-speech synthesis device |
| US7493253B1 (en) | 2002-07-12 | 2009-02-17 | Language And Computing, Inc. | Conceptual world representation natural language understanding system and method |
| US20050256715A1 (en) * | 2002-10-08 | 2005-11-17 | Yoshiyuki Okimoto | Language model generation and accumulation device, speech recognition device, language model creation method, and speech recognition method |
| JP2004198872A (ja) * | 2002-12-20 | 2004-07-15 | Sony Electronics Inc | Terminal device and server |
| RU2348071C2 (ru) * | 2003-12-30 | 2009-02-27 | Гугл Инк. | Methods and systems for text segmentation |
| US8051096B1 (en) * | 2004-09-30 | 2011-11-01 | Google Inc. | Methods and systems for augmenting a token lexicon |
| US20060242593A1 (en) * | 2005-04-26 | 2006-10-26 | Sharp Laboratories Of America, Inc. | Printer emoticon detector & converter |
| US7930354B2 (en) * | 2005-12-21 | 2011-04-19 | Research In Motion Limited | System and method for reviewing attachment content on a mobile device |
| US8595304B2 (en) | 2005-12-21 | 2013-11-26 | Blackberry Limited | System and method for reviewing attachment content on a mobile device |
| US7958164B2 (en) * | 2006-02-16 | 2011-06-07 | Microsoft Corporation | Visual design of annotated regular expression |
| US7860881B2 (en) | 2006-03-09 | 2010-12-28 | Microsoft Corporation | Data parsing with annotated patterns |
| US7987168B2 (en) * | 2006-04-08 | 2011-07-26 | James Walter Haddock | Method for managing information |
| US20070294217A1 (en) * | 2006-06-14 | 2007-12-20 | Nec Laboratories America, Inc. | Safety guarantee of continuous join queries over punctuated data streams |
| US7823138B2 (en) * | 2006-11-14 | 2010-10-26 | Microsoft Corporation | Distributed testing for computing features |
| US8875013B2 (en) * | 2008-03-25 | 2014-10-28 | International Business Machines Corporation | Multi-pass validation of extensible markup language (XML) documents |
| US8521516B2 (en) * | 2008-03-26 | 2013-08-27 | Google Inc. | Linguistic key normalization |
| US8301437B2 (en) * | 2008-07-24 | 2012-10-30 | Yahoo! Inc. | Tokenization platform |
| US20140372119A1 (en) * | 2008-09-26 | 2014-12-18 | Google, Inc. | Compounded Text Segmentation |
| US8428933B1 (en) | 2009-12-17 | 2013-04-23 | Shopzilla, Inc. | Usage based query response |
| US8775160B1 (en) | 2009-12-17 | 2014-07-08 | Shopzilla, Inc. | Usage based query response |
| CN102479191B (zh) | 2010-11-22 | 2014-03-26 | 阿里巴巴集团控股有限公司 | Method and device for providing multi-granularity word segmentation results |
| US9208134B2 (en) * | 2012-01-10 | 2015-12-08 | King Abdulaziz City For Science And Technology | Methods and systems for tokenizing multilingual textual documents |
| US9141606B2 (en) * | 2012-03-29 | 2015-09-22 | Lionbridge Technologies, Inc. | Methods and systems for multi-engine machine translation |
| CN103425691B (zh) | 2012-05-22 | 2016-12-14 | 阿里巴巴集团控股有限公司 | A search method and system |
| US9002852B2 (en) | 2012-11-15 | 2015-04-07 | Adobe Systems Incorporated | Mining semi-structured social media |
| US9626414B2 (en) * | 2014-04-14 | 2017-04-18 | International Business Machines Corporation | Automatic log record segmentation |
| US10388270B2 (en) * | 2014-11-05 | 2019-08-20 | At&T Intellectual Property I, L.P. | System and method for text normalization using atomic tokens |
| US10409909B2 (en) * | 2014-12-12 | 2019-09-10 | Omni Ai, Inc. | Lexical analyzer for a neuro-linguistic behavior recognition system |
| US10318591B2 (en) | 2015-06-02 | 2019-06-11 | International Business Machines Corporation | Ingesting documents using multiple ingestion pipelines |
| US10013404B2 (en) * | 2015-12-03 | 2018-07-03 | International Business Machines Corporation | Targeted story summarization using natural language processing |
| US10248738B2 (en) | 2015-12-03 | 2019-04-02 | International Business Machines Corporation | Structuring narrative blocks in a logical sequence |
| US10013450B2 (en) | 2015-12-03 | 2018-07-03 | International Business Machines Corporation | Using knowledge graphs to identify potential inconsistencies in works of authorship |
| CN106874256A (zh) * | 2015-12-11 | 2017-06-20 | 北京国双科技有限公司 | Method and device for recognizing domain named entities |
| US10963641B2 (en) * | 2017-06-16 | 2021-03-30 | Microsoft Technology Licensing, Llc | Multi-lingual tokenization of documents and associated queries |
| US11003854B2 (en) * | 2018-10-30 | 2021-05-11 | International Business Machines Corporation | Adjusting an operation of a system based on a modified lexical analysis model for a document |
| US11176329B2 (en) | 2020-02-18 | 2021-11-16 | Bank Of America Corporation | Source code compiler using natural language input |
| US11250128B2 (en) | 2020-02-18 | 2022-02-15 | Bank Of America Corporation | System and method for detecting source code anomalies |
| US11501071B2 (en) * | 2020-07-08 | 2022-11-15 | International Business Machines Corporation | Word and image relationships in combined vector space |
| US12373641B2 (en) | 2021-02-27 | 2025-07-29 | Walmart Apollo, Llc | Methods and apparatus for natural language understanding in conversational systems using machine learning processes |
| US11960842B2 (en) | 2021-02-27 | 2024-04-16 | Walmart Apollo, Llc | Methods and apparatus for natural language understanding in conversational systems using machine learning processes |
Family Cites Families (18)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5225981A (en) * | 1986-10-03 | 1993-07-06 | Ricoh Company, Ltd. | Language analyzer for morphemically and syntactically analyzing natural languages by using block analysis and composite morphemes |
| US4777617A (en) * | 1987-03-12 | 1988-10-11 | International Business Machines Corporation | Method for verifying spelling of compound words |
| US5487147A (en) * | 1991-09-05 | 1996-01-23 | International Business Machines Corporation | Generation of error messages and error recovery for an LL(1) parser |
| US5634084A (en) * | 1995-01-20 | 1997-05-27 | Centigram Communications Corporation | Abbreviation and acronym/initialism expansion procedures for a text to speech reader |
| US5828991A (en) * | 1995-06-30 | 1998-10-27 | The Research Foundation Of The State University Of New York | Sentence reconstruction using word ambiguity resolution |
| US5794177A (en) * | 1995-07-19 | 1998-08-11 | Inso Corporation | Method and apparatus for morphological analysis and generation of natural language text |
| US5806021A (en) | 1995-10-30 | 1998-09-08 | International Business Machines Corporation | Automatic segmentation of continuous text using statistical approaches |
| US5870700A (en) * | 1996-04-01 | 1999-02-09 | Dts Software, Inc. | Brazilian Portuguese grammar checker |
| US6016467A (en) * | 1997-05-27 | 2000-01-18 | Digital Equipment Corporation | Method and apparatus for program development using a grammar-sensitive editor |
| US5963742A (en) * | 1997-09-08 | 1999-10-05 | Lucent Technologies, Inc. | Using speculative parsing to process complex input data |
| GB9806085D0 (en) * | 1998-03-23 | 1998-05-20 | Xerox Corp | Text summarisation using light syntactic parsing |
| US6401060B1 (en) * | 1998-06-25 | 2002-06-04 | Microsoft Corporation | Method for typographical detection and replacement in Japanese text |
| WO2000011576A1 (en) | 1998-08-24 | 2000-03-02 | Virtual Research Associates, Inc. | Natural language sentence parser |
| KR100749289B1 (ko) * | 1998-11-30 | 2007-08-14 | 코닌클리케 필립스 일렉트로닉스 엔.브이. | Method and system for automatic segmentation of text |
| US6523172B1 (en) * | 1998-12-17 | 2003-02-18 | Evolutionary Technologies International, Inc. | Parser translator system and method |
| US6269189B1 (en) * | 1998-12-29 | 2001-07-31 | Xerox Corporation | Finding selected character strings in text and providing information relating to the selected character strings |
| US6185524B1 (en) * | 1998-12-31 | 2001-02-06 | Lernout & Hauspie Speech Products N.V. | Method and apparatus for automatic identification of word boundaries in continuous text and computation of word boundary scores |
| US6442524B1 (en) * | 1999-01-29 | 2002-08-27 | Sony Corporation | Analyzing inflectional morphology in a spoken language translation system |
2001
- 2001-03-30: US, application US09/822,976, patent US7092871B2 (not active, Expired - Fee Related)
- 2001-07-19: DE, application DE60137477T, patent DE60137477D1 (not active, Expired - Lifetime)
- 2001-07-19: EP, application EP01117480A, patent EP1178408B1 (not active, Expired - Lifetime)
- 2001-07-19: JP, application JP2001219846A, publication JP2002082945A (not active, Withdrawn)
- 2001-07-19: AT, application AT01117480T, patent ATE421729T1 (not active, IP Right Cessation)

2005
- 2005-07-15: US, application US11/182,477, patent US7269547B2 (not active, Expired - Fee Related)
Also Published As
| Publication number | Publication date |
|---|---|
| EP1178408A3 (en) | 2002-05-29 |
| US20030023425A1 (en) | 2003-01-30 |
| DE60137477D1 (de) | 2009-03-12 |
| US20050251381A1 (en) | 2005-11-10 |
| EP1178408A2 (en) | 2002-02-06 |
| ATE421729T1 (de) | 2009-02-15 |
| US7092871B2 (en) | 2006-08-15 |
| EP1178408B1 (en) | 2009-01-21 |
| US7269547B2 (en) | 2007-09-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| JP2002082945A (ja) | | Tokenizer for a natural language processing system |
| US5890103A (en) | | Method and apparatus for improved tokenization of natural language text |
| US6246976B1 (en) | | Apparatus, method and storage medium for identifying a combination of a language and its character code system |
| US7523102B2 (en) | | Content search in complex language, such as Japanese |
| JP5538820B2 (ja) | | Program for automatic extraction of transfer mappings from bilingual corpora |
| US6363373B1 (en) | | Method and apparatus for concept searching using a Boolean or keyword search engine |
| US6167370A (en) | | Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures |
| JP5113750B2 (ja) | | Definition extraction |
| JP4263371B2 (ja) | | System and method for parsing documents |
| CN1618064A (zh) | | Translation method, translated sentence input method, recording medium, program, and computer device |
| JP2003108184A (ja) | | Method and system for applying input mode bias |
| KR20020063118A (ko) | | Linguistically intelligent text compression method and processing apparatus therefor |
| US20040186706A1 (en) | | Translation system, dictionary updating server, translation method, and program and recording medium for use therein |
| US7328404B2 (en) | | Method for predicting the readings of Japanese ideographs |
| US7684975B2 (en) | | Morphological analyzer, natural language processor, morphological analysis method and program |
| EP1290574B1 (en) | | System and method for matching a textual input to a lexical knowledge base and for utilizing results of that match |
| CN100422987C (zh) | | Method and system for intelligent information processing in a network |
| JP2000148754A (ja) | | Multilingual system, multilingual processing method, and medium storing a multilingual processing program |
| JP3691773B2 (ja) | | Text analysis method and text analysis apparatus capable of using the method |
| KR20060043583A (ko) | | Method and system for compressing a log of language data |
| JP2943791B2 (ja) | | Language identification device, language identification method, and recording medium storing a language identification program |
| US20050102278A1 (en) | | Expanded search keywords |
| EP1605371A1 (en) | | Content search in complex language, such as Japanese |
| JP3267168B2 (ja) | | Natural language conversion system |
| JP2004362007A (ja) | | Document registration device, document retrieval device, program, and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | A521 | Request for written amendment filed | Free format text: JAPANESE INTERMEDIATE CODE: A523; Effective date: 20080717 |
| | A621 | Written request for application examination | Free format text: JAPANESE INTERMEDIATE CODE: A621; Effective date: 20080717 |
| | A761 | Written withdrawal of application | Free format text: JAPANESE INTERMEDIATE CODE: A761; Effective date: 20090706 |