JP4638599B2 - Method for identifying the language and character set of data representing text - Google Patents

Method for identifying the language and character set of data representing text

Info

Publication number
JP4638599B2
Authority
JP
Japan
Prior art keywords
language
function
character
character set
byte
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2000524742A
Other languages
English (en)
Japanese (ja)
Other versions
JP2001526425A (ja)
JP2001526425A5
Inventor
ロバート デイビッド パウェル
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Publication of JP2001526425A
Publication of JP2001526425A5
Application granted
Publication of JP4638599B2
Anticipated expiration
Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
JP2000524742A 1997-12-11 1998-12-04 Method for identifying the language and character set of data representing text Expired - Fee Related JP4638599B2 (ja)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US08/987,565 US6157905A (en) 1997-12-11 1997-12-11 Identifying language and character set of data representing text
US08/987,565 1997-12-11
PCT/US1998/025814 WO1999030252A1 (en) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Publications (3)

Publication Number Publication Date
JP2001526425A (ja) 2001-12-18
JP2001526425A5 2006-01-26
JP4638599B2 (ja) 2011-02-23

Family

ID=25533370

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2000524742A Expired - Fee Related JP4638599B2 (ja) 1997-12-11 1998-12-04 Method for identifying the language and character set of data representing text

Country Status (6)

Country Link
US (1) US6157905A
EP (2) EP1038239B1
JP (1) JP4638599B2
AT (1) ATE289434T1
DE (2) DE69838763T2
WO (1) WO1999030252A1

Families Citing this family (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539118B1 (en) * 1998-12-31 2003-03-25 International Business Machines Corporation System and method for evaluating character sets of a message containing a plurality of character sets
US7039637B2 (en) 1998-12-31 2006-05-02 International Business Machines Corporation System and method for evaluating characters in an inputted search string against a character table bank comprising a predetermined number of columns that correspond to a plurality of pre-determined candidate character sets in order to provide enhanced full text search
US7031002B1 (en) 1998-12-31 2006-04-18 International Business Machines Corporation System and method for using character set matching to enhance print quality
US6760887B1 (en) 1998-12-31 2004-07-06 International Business Machines Corporation System and method for highlighting of multifont documents
US7103532B1 (en) 1998-12-31 2006-09-05 International Business Machines Corp. System and method for evaluating character in a message
US6718519B1 (en) 1998-12-31 2004-04-06 International Business Machines Corporation System and method for outputting character sets in best available fonts
US6813747B1 (en) 1998-12-31 2004-11-02 International Business Machines Corporation System and method for output of multipart documents
US6658151B2 (en) * 1999-04-08 2003-12-02 Ricoh Co., Ltd. Extracting information from symbolically compressed document images
US7191114B1 (en) 1999-08-27 2007-03-13 International Business Machines Corporation System and method for evaluating character sets to determine a best match encoding a message
US7155672B1 (en) * 2000-05-23 2006-12-26 Spyglass, Inc. Method and system for dynamic font subsetting
US6668085B1 (en) * 2000-08-01 2003-12-23 Xerox Corporation Character matching process for text converted from images
TW561360B (en) * 2000-08-22 2003-11-11 Ibm Method and system for case conversion
GB2366940B (en) * 2000-09-06 2004-08-11 Ericsson Telefon Ab L M Text language detection
JP2002189627A (ja) * 2000-12-21 2002-07-05 Tsubasa System Co Ltd Method for converting document data
US7900143B2 (en) * 2000-12-27 2011-03-01 Intel Corporation Large character set browser
JP2002268665A (ja) * 2001-03-13 2002-09-20 Oki Electric Ind Co Ltd Text-to-speech synthesis device
US20040205675A1 (en) * 2002-01-11 2004-10-14 Thangaraj Veerappan System and method for determining a document language and refining the character set encoding based on the document language
US7020338B1 (en) 2002-04-08 2006-03-28 The United States Of America As Represented By The National Security Agency Method of identifying script of line of text
US20040078191A1 (en) * 2002-10-22 2004-04-22 Nokia Corporation Scalable neural network-based language identification from written text
FR2848688A1 (fr) * 2002-12-17 2004-06-18 France Telecom Identification of the language of a text
US20060015630A1 (en) 2003-11-12 2006-01-19 The Trustees Of Columbia University In The City Of New York Apparatus method and medium for identifying files using n-gram distribution of data
US7865355B2 (en) * 2004-07-30 2011-01-04 Sap Aktiengesellschaft Fast text character set recognition
US7305385B1 (en) * 2004-09-10 2007-12-04 Aol Llc N-gram based text searching
US7612897B2 (en) * 2004-09-24 2009-11-03 Seiko Epson Corporation Method of managing the printing of characters and a printing device employing method
US7729900B2 (en) * 2004-09-29 2010-06-01 Microsoft Corporation Method and computer-readable medium for consistent configuration of language support across operating system and application programs
US9122655B2 (en) * 2004-11-15 2015-09-01 International Business Machines Corporation Pre-translation testing of bi-directional language display
US8027832B2 (en) * 2005-02-11 2011-09-27 Microsoft Corporation Efficient language identification
JP4314204B2 (ja) * 2005-03-11 2009-08-12 株式会社東芝 Document management method, system, and program
US7774293B2 (en) * 2005-03-17 2010-08-10 University Of Maryland System and methods for assessing risk using hybrid causal logic
EP1746516A1 (en) * 2005-07-20 2007-01-24 Microsoft Corporation Character generator
US7689531B1 (en) * 2005-09-28 2010-03-30 Trend Micro Incorporated Automatic charset detection using support vector machines with charset grouping
US7711673B1 (en) * 2005-09-28 2010-05-04 Trend Micro Incorporated Automatic charset detection using SIM algorithm with charset grouping
GB0524354D0 (en) * 2005-11-30 2006-01-04 Ibm Method, system and computer program product for composing a reply to a text message received in a messaging application
KR100814641B1 (ko) * 2006-10-23 2008-03-18 성균관대학교산학협력단 User-driven voice service system and service method thereof
US20080243477A1 (en) * 2007-03-30 2008-10-02 Rulespace Llc Multi-staged language classification
US9141607B1 (en) * 2007-05-30 2015-09-22 Google Inc. Determining optical character recognition parameters
US8315482B2 (en) * 2007-06-26 2012-11-20 Microsoft Corporation Integrated platform for user input of digital ink
US8233726B1 (en) * 2007-11-27 2012-07-31 Google Inc. Image-domain script and language identification
US8266514B2 (en) * 2008-06-26 2012-09-11 Microsoft Corporation Map service
US8107671B2 (en) 2008-06-26 2012-01-31 Microsoft Corporation Script detection service
US8073680B2 (en) * 2008-06-26 2011-12-06 Microsoft Corporation Language detection service
US8019596B2 (en) * 2008-06-26 2011-09-13 Microsoft Corporation Linguistic service platform
US8224641B2 (en) * 2008-11-19 2012-07-17 Stratify, Inc. Language identification for documents containing multiple languages
US8224642B2 (en) * 2008-11-20 2012-07-17 Stratify, Inc. Automated identification of documents as not belonging to any language
US8326602B2 (en) * 2009-06-05 2012-12-04 Google Inc. Detecting writing systems and languages
US8468011B1 (en) * 2009-06-05 2013-06-18 Google Inc. Detecting writing systems and languages
US9454514B2 (en) * 2009-09-02 2016-09-27 Red Hat, Inc. Local language numeral conversion in numeric computing
US20110087962A1 (en) * 2009-10-14 2011-04-14 Qualcomm Incorporated Method and apparatus for the automatic predictive selection of input methods for web browsers
US8560466B2 (en) 2010-02-26 2013-10-15 Trend Micro Incorporated Method and arrangement for automatic charset detection
CN102479187B (zh) * 2010-11-23 2016-09-14 盛乐信息技术(上海)有限公司 GBK character query system based on parity check and implementation method thereof
US9535895B2 (en) * 2011-03-17 2017-01-03 Amazon Technologies, Inc. n-Gram-based language prediction
GB2489512A (en) * 2011-03-31 2012-10-03 Clearswift Ltd Classifying data using fingerprint of character encoding
EP2734956A4 (en) * 2011-07-20 2014-12-31 Tata Consultancy Services Ltd METHOD AND SYSTEM FOR DIFFERENTIATION OF TEXT INFORMATION INTEGRATED IN VIDEO CONTENT INTERNET INFORMATION
US8769404B2 (en) * 2012-01-03 2014-07-01 International Business Machines Corporation Rule-based locale definition generation for a new or customized locale support
US20140129928A1 (en) * 2012-11-06 2014-05-08 Psyentific Mind Inc. Method and system for representing capitalization of letters while preserving their category similarity to lowercase letters
US9195644B2 (en) 2012-12-18 2015-11-24 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Short phrase language identification
US9372850B1 (en) * 2012-12-19 2016-06-21 Amazon Technologies, Inc. Machined book detection
US10650103B2 (en) 2013-02-08 2020-05-12 Mz Ip Holdings, Llc Systems and methods for incentivizing user feedback for translation processing
US9231898B2 (en) 2013-02-08 2016-01-05 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9031829B2 (en) 2013-02-08 2015-05-12 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US8996352B2 (en) 2013-02-08 2015-03-31 Machine Zone, Inc. Systems and methods for correcting translations in multi-user multi-lingual communications
US9600473B2 (en) 2013-02-08 2017-03-21 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9298703B2 (en) 2013-02-08 2016-03-29 Machine Zone, Inc. Systems and methods for incentivizing user feedback for translation processing
US9244894B1 (en) * 2013-09-16 2016-01-26 Arria Data2Text Limited Method and apparatus for interactive reports
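TWI508561B (zh) 2013-11-27 2015-11-11 Wistron Corp Device for generating an electronic program guide and method for generating an electronic program guide
JP6300512B2 (ja) * 2013-12-19 2018-03-28 株式会社ソリトンシステムズ Determination device, determination method, and program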
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US10162811B2 (en) * 2014-10-17 2018-12-25 Mz Ip Holdings, Llc Systems and methods for language detection
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US9477652B2 (en) 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
US9734142B2 (en) * 2015-09-22 2017-08-15 Facebook, Inc. Universal translation
JP6655331B2 (ja) * 2015-09-24 2020-02-26 Dynabook株式会社 Electronic device and method
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US9734143B2 (en) 2015-12-17 2017-08-15 Facebook, Inc. Multi-media context language processing
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US9747283B2 (en) 2015-12-28 2017-08-29 Facebook, Inc. Predicting future translations
US10002125B2 (en) 2015-12-28 2018-06-19 Facebook, Inc. Language model personalization
US10765956B2 (en) 2016-01-07 2020-09-08 Machine Zone Inc. Named entity recognition on chat data
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10120860B2 (en) * 2016-12-21 2018-11-06 Intel Corporation Methods and apparatus to identify a count of n-grams appearing in a corpus
US10180935B2 (en) 2016-12-30 2019-01-15 Facebook, Inc. Identifying multiple languages in a content item
US10769387B2 (en) 2017-09-21 2020-09-08 Mz Ip Holdings, Llc System and method for translating chat messages
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
EP3662467B1 (en) * 2018-10-11 2021-07-07 Google LLC Speech generation using crosslingual phoneme mapping
JP6781905B1 (ja) * 2019-07-26 2020-11-11 株式会社Fronteo Information processing device, natural language processing system, control method, and control program

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5261009A (en) * 1985-10-15 1993-11-09 Palantir Corporation Means for resolving ambiguities in text based upon character context
US5062143A (en) * 1990-02-23 1991-10-29 Harris Corporation Trigram-based method of language identification
US5592667A (en) * 1991-05-29 1997-01-07 Triada, Ltd. Method of storing compressed data for accelerated interrogation
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
GB9220404D0 (en) * 1992-08-20 1992-11-11 Nat Security Agency Method of identifying, retrieving and sorting documents
US5608622A (en) * 1992-09-11 1997-03-04 Lucent Technologies Inc. System for analyzing translations
US5428707A (en) * 1992-11-13 1995-06-27 Dragon Systems, Inc. Apparatus and methods for training speech recognition systems and their users and otherwise improving speech recognition performance
US5510981A (en) * 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
US5548507A (en) * 1994-03-14 1996-08-20 International Business Machines Corporation Language identification process using coded language words
SE513456C2 (sv) * 1994-05-10 2000-09-18 Telia Ab Method and device for speech-to-text conversion
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction
US6070140A (en) * 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5761687A (en) * 1995-10-04 1998-06-02 Apple Computer, Inc. Character-based correction arrangement with correction propagation
JPH11514764A (ja) * 1995-10-31 1999-12-14 エス.エム. ハーツ,フレデリック System for customized electronic identification of desirable objects
US5982933A (en) * 1996-01-12 1999-11-09 Canon Kabushiki Kaisha Information processing method, information processing apparatus, and storage medium
EP0849723A3 (en) * 1996-12-20 1998-12-30 ATR Interpreting Telecommunications Research Laboratories Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US6073098A (en) * 1997-11-21 2000-06-06 At&T Corporation Method and apparatus for generating deterministic approximate weighted finite-state automata

Also Published As

Publication number Publication date
DE69829074T2 (de) 2005-06-30
DE69829074D1 (de) 2005-03-24
EP1038239B1 (en) 2005-02-16
JP2001526425A (ja) 2001-12-18
EP1038239A1 (en) 2000-09-27
EP1498827A3 (en) 2005-02-16
WO1999030252A1 (en) 1999-06-17
EP1498827A2 (en) 2005-01-19
DE69838763D1 (de) 2008-01-03
EP1498827B1 (en) 2007-11-21
ATE289434T1 (de) 2005-03-15
US6157905A (en) 2000-12-05
DE69838763T2 (de) 2008-10-30

Similar Documents

Publication Publication Date Title
JP4638599B2 (ja) Method for identifying the language and character set of data representing text
US8255376B2 (en) Augmenting queries with synonyms from synonyms map
US8762358B2 (en) Query language determination using query terms and interface language
US7475063B2 (en) Augmenting queries with synonyms selected using language statistics
JP3121568B2 (ja) Method and system for identifying language
US6415250B1 (en) System and method for identifying language using morphologically-based techniques
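JP5173141B2 (ja) Efficient language identification
KR100330801B1 (ko) Language identification apparatus and language identification method
JP4301515B2 (ja) Text display method, information processing device, information processing system, and program
JP2000194696A (ja) Method for automatically identifying the underlying language of a sample text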
US7835903B2 (en) Simplifying query terms with transliteration
CN104077346A (zh) Document creation support device, method, and program
EP2016486A2 (en) Processing of query terms
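JP7040155B2 (ja) Information processing device, information processing method, and program
JP7326637B2 (ja) Chunking execution system, chunking execution method, and program
JP2943791B2 (ja) Language identification device, language identification method, and recording medium storing a language identification program
JP5412137B2 (ja) Machine learning device and method
JP3471381B2 (ja) Character string processing method
JP6210194B2 (ja) Document analysis system, method, and program
CN114461797A (zh) Capacitor classification and identification method, device, storage medium, and electronic equipment
CN114461798A (zh) Component classification and identification method, device, storage medium, and electronic equipment
JPS61184682A (ja) Kana-kanji conversion device
JPH04180160A (ja) Phrase segmentation device
JPH07334515A (ja) Information retrieval method and device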

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20051125

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20051125

RD04 Notification of resignation of power of attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7424

Effective date: 20051125

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20070525

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20070824

A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 20070914

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20071212

RD13 Notification of appointment of power of sub attorney

Free format text: JAPANESE INTERMEDIATE CODE: A7433

Effective date: 20071213

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A821

Effective date: 20071213

A911 Transfer to examiner for re-examination before appeal (zenchi)

Free format text: JAPANESE INTERMEDIATE CODE: A911

Effective date: 20080124

A912 Re-examination (zenchi) completed and case transferred to appeal board

Free format text: JAPANESE INTERMEDIATE CODE: A912

Effective date: 20080208

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20101126

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20131203

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Ref document number: 4638599

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

S111 Request for change of ownership or part of ownership

Free format text: JAPANESE INTERMEDIATE CODE: R313113

R350 Written notification of registration of transfer

Free format text: JAPANESE INTERMEDIATE CODE: R350

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

LAPS Cancellation because of no payment of annual fees