ATE289434T1 - Identifying language and character set of data representing text - Google Patents

Identifying language and character set of data representing text

Info

Publication number
ATE289434T1
Authority
AT
Austria
Prior art keywords
data values
language
series
facility
character set
Prior art date
Application number
AT98962916T
Other languages
German (de)
English (en)
Inventor
Robert David Powell
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1997-12-11
Filing date
1998-12-04
Publication date
2005-03-15
Application filed by Microsoft Corp
Application granted
Publication of ATE289434T1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
AT98962916T 1997-12-11 1998-12-04 Identifying language and character set of data representing text ATE289434T1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US08/987,565 US6157905A (en) 1997-12-11 1997-12-11 Identifying language and character set of data representing text
PCT/US1998/025814 WO1999030252A1 (en) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Publications (1)

Publication Number Publication Date
ATE289434T1 (de) 2005-03-15

Family

ID=25533370

Family Applications (1)

Application Number Title Priority Date Filing Date
AT98962916T ATE289434T1 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Country Status (6)

Country Link
US (1) US6157905A (en)
EP (2) EP1038239B1 (en)
JP (1) JP4638599B2 (en)
AT (1) ATE289434T1 (en)
DE (2) DE69838763T2 (en)
WO (1) WO1999030252A1 (en)

Families Citing this family (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539118B1 (en) * 1998-12-31 2003-03-25 International Business Machines Corporation System and method for evaluating character sets of a message containing a plurality of character sets
US7039637B2 (en) 1998-12-31 2006-05-02 International Business Machines Corporation System and method for evaluating characters in an inputted search string against a character table bank comprising a predetermined number of columns that correspond to a plurality of pre-determined candidate character sets in order to provide enhanced full text search
US7031002B1 (en) 1998-12-31 2006-04-18 International Business Machines Corporation System and method for using character set matching to enhance print quality
US6760887B1 (en) 1998-12-31 2004-07-06 International Business Machines Corporation System and method for highlighting of multifont documents
US7103532B1 (en) 1998-12-31 2006-09-05 International Business Machines Corp. System and method for evaluating character in a message
US6718519B1 (en) 1998-12-31 2004-04-06 International Business Machines Corporation System and method for outputting character sets in best available fonts
US6813747B1 (en) 1998-12-31 2004-11-02 International Business Machines Corporation System and method for output of multipart documents
US6658151B2 (en) * 1999-04-08 2003-12-02 Ricoh Co., Ltd. Extracting information from symbolically compressed document images
US7191114B1 (en) 1999-08-27 2007-03-13 International Business Machines Corporation System and method for evaluating character sets to determine a best match encoding a message
US7155672B1 (en) * 2000-05-23 2006-12-26 Spyglass, Inc. Method and system for dynamic font subsetting
US6668085B1 (en) * 2000-08-01 2003-12-23 Xerox Corporation Character matching process for text converted from images
TW561360B (en) * 2000-08-22 2003-11-11 Ibm Method and system for case conversion
GB2366940B (en) * 2000-09-06 2004-08-11 Ericsson Telefon Ab L M Text language detection
JP2002189627A (ja) * 2000-12-21 2002-07-05 Tsubasa System Co Ltd Method for converting document data
US7900143B2 (en) * 2000-12-27 2011-03-01 Intel Corporation Large character set browser
JP2002268665A (ja) * 2001-03-13 2002-09-20 Oki Electric Ind Co Ltd Text-to-speech synthesis device
US20040205675A1 (en) * 2002-01-11 2004-10-14 Thangaraj Veerappan System and method for determining a document language and refining the character set encoding based on the document language
US7020338B1 (en) 2002-04-08 2006-03-28 The United States Of America As Represented By The National Security Agency Method of identifying script of line of text
US20040078191A1 (en) * 2002-10-22 2004-04-22 Nokia Corporation Scalable neural network-based language identification from written text
FR2848688A1 (fr) * 2002-12-17 2004-06-18 France Telecom Language identification of a text
US20060015630A1 (en) 2003-11-12 2006-01-19 The Trustees Of Columbia University In The City Of New York Apparatus method and medium for identifying files using n-gram distribution of data
US7865355B2 (en) * 2004-07-30 2011-01-04 Sap Aktiengesellschaft Fast text character set recognition
US7305385B1 (en) * 2004-09-10 2007-12-04 Aol Llc N-gram based text searching
US7612897B2 (en) * 2004-09-24 2009-11-03 Seiko Epson Corporation Method of managing the printing of characters and a printing device employing method
US7729900B2 (en) * 2004-09-29 2010-06-01 Microsoft Corporation Method and computer-readable medium for consistent configuration of language support across operating system and application programs
US9122655B2 (en) * 2004-11-15 2015-09-01 International Business Machines Corporation Pre-translation testing of bi-directional language display
US8027832B2 (en) * 2005-02-11 2011-09-27 Microsoft Corporation Efficient language identification
JP4314204B2 (ja) * 2005-03-11 2009-08-12 株式会社東芝 Document management method, system, and program
US7774293B2 (en) * 2005-03-17 2010-08-10 University Of Maryland System and methods for assessing risk using hybrid causal logic
EP1746516A1 (en) * 2005-07-20 2007-01-24 Microsoft Corporation Character generator
US7689531B1 (en) * 2005-09-28 2010-03-30 Trend Micro Incorporated Automatic charset detection using support vector machines with charset grouping
US7711673B1 (en) * 2005-09-28 2010-05-04 Trend Micro Incorporated Automatic charset detection using SIM algorithm with charset grouping
GB0524354D0 (en) * 2005-11-30 2006-01-04 Ibm Method, system and computer program product for composing a reply to a text message received in a messaging application
KR100814641B1 (ko) * 2006-10-23 2008-03-18 성균관대학교산학협력단 User-driven voice service system and service method thereof
US20080243477A1 (en) * 2007-03-30 2008-10-02 Rulespace Llc Multi-staged language classification
US9141607B1 (en) * 2007-05-30 2015-09-22 Google Inc. Determining optical character recognition parameters
US8315482B2 (en) * 2007-06-26 2012-11-20 Microsoft Corporation Integrated platform for user input of digital ink
US8233726B1 (en) * 2007-11-27 2012-07-31 Google Inc. Image-domain script and language identification
US8266514B2 (en) * 2008-06-26 2012-09-11 Microsoft Corporation Map service
US8107671B2 (en) 2008-06-26 2012-01-31 Microsoft Corporation Script detection service
US8073680B2 (en) * 2008-06-26 2011-12-06 Microsoft Corporation Language detection service
US8019596B2 (en) * 2008-06-26 2011-09-13 Microsoft Corporation Linguistic service platform
US8224641B2 (en) * 2008-11-19 2012-07-17 Stratify, Inc. Language identification for documents containing multiple languages
US8224642B2 (en) * 2008-11-20 2012-07-17 Stratify, Inc. Automated identification of documents as not belonging to any language
US8326602B2 (en) * 2009-06-05 2012-12-04 Google Inc. Detecting writing systems and languages
US8468011B1 (en) * 2009-06-05 2013-06-18 Google Inc. Detecting writing systems and languages
US9454514B2 (en) * 2009-09-02 2016-09-27 Red Hat, Inc. Local language numeral conversion in numeric computing
US20110087962A1 (en) * 2009-10-14 2011-04-14 Qualcomm Incorporated Method and apparatus for the automatic predictive selection of input methods for web browsers
US8560466B2 (en) 2010-02-26 2013-10-15 Trend Micro Incorporated Method and arrangement for automatic charset detection
CN102479187B (zh) * 2010-11-23 2016-09-14 盛乐信息技术(上海)有限公司 Parity-check-based GBK character query system and implementation method thereof
US9535895B2 (en) * 2011-03-17 2017-01-03 Amazon Technologies, Inc. n-Gram-based language prediction
GB2489512A (en) * 2011-03-31 2012-10-03 Clearswift Ltd Classifying data using fingerprint of character encoding
EP2734956A4 (en) * 2011-07-20 2014-12-31 Tata Consultancy Services Ltd METHOD AND SYSTEM FOR DIFFERENTIATION OF TEXT INFORMATION INTEGRATED IN VIDEO CONTENT INTERNET INFORMATION
US8769404B2 (en) * 2012-01-03 2014-07-01 International Business Machines Corporation Rule-based locale definition generation for a new or customized locale support
US20140129928A1 (en) * 2012-11-06 2014-05-08 Psyentific Mind Inc. Method and system for representing capitalization of letters while preserving their category similarity to lowercase letters
US9195644B2 (en) 2012-12-18 2015-11-24 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Short phrase language identification
US9372850B1 (en) * 2012-12-19 2016-06-21 Amazon Technologies, Inc. Machined book detection
US10650103B2 (en) 2013-02-08 2020-05-12 Mz Ip Holdings, Llc Systems and methods for incentivizing user feedback for translation processing
US9231898B2 (en) 2013-02-08 2016-01-05 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9031829B2 (en) 2013-02-08 2015-05-12 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US8996352B2 (en) 2013-02-08 2015-03-31 Machine Zone, Inc. Systems and methods for correcting translations in multi-user multi-lingual communications
US9600473B2 (en) 2013-02-08 2017-03-21 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9298703B2 (en) 2013-02-08 2016-03-29 Machine Zone, Inc. Systems and methods for incentivizing user feedback for translation processing
US9244894B1 (en) * 2013-09-16 2016-01-26 Arria Data2Text Limited Method and apparatus for interactive reports
TWI508561B (zh) 2013-11-27 2015-11-11 Wistron Corp Device for generating an electronic program guide and method for generating an electronic program guide
JP6300512B2 (ja) * 2013-12-19 2018-03-28 株式会社ソリトンシステムズ Determination device, determination method, and program
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US10162811B2 (en) * 2014-10-17 2018-12-25 Mz Ip Holdings, Llc Systems and methods for language detection
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US9477652B2 (en) 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
US9734142B2 (en) * 2015-09-22 2017-08-15 Facebook, Inc. Universal translation
JP6655331B2 (ja) * 2015-09-24 2020-02-26 Dynabook株式会社 Electronic device and method
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US9734143B2 (en) 2015-12-17 2017-08-15 Facebook, Inc. Multi-media context language processing
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US9747283B2 (en) 2015-12-28 2017-08-29 Facebook, Inc. Predicting future translations
US10002125B2 (en) 2015-12-28 2018-06-19 Facebook, Inc. Language model personalization
US10765956B2 (en) 2016-01-07 2020-09-08 Machine Zone Inc. Named entity recognition on chat data
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10120860B2 (en) * 2016-12-21 2018-11-06 Intel Corporation Methods and apparatus to identify a count of n-grams appearing in a corpus
US10180935B2 (en) 2016-12-30 2019-01-15 Facebook, Inc. Identifying multiple languages in a content item
US10769387B2 (en) 2017-09-21 2020-09-08 Mz Ip Holdings, Llc System and method for translating chat messages
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
EP3662467B1 (en) * 2018-10-11 2021-07-07 Google LLC Speech generation using crosslingual phoneme mapping
JP6781905B1 (ja) * 2019-07-26 2020-11-11 株式会社Fronteo Information processing device, natural language processing system, control method, and control program

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5261009A (en) * 1985-10-15 1993-11-09 Palantir Corporation Means for resolving ambiguities in text based upon character context
US5062143A (en) * 1990-02-23 1991-10-29 Harris Corporation Trigram-based method of language identification
US5592667A (en) * 1991-05-29 1997-01-07 Triada, Ltd. Method of storing compressed data for accelerated interrogation
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
GB9220404D0 (en) * 1992-08-20 1992-11-11 Nat Security Agency Method of identifying,retrieving and sorting documents
US5608622A (en) * 1992-09-11 1997-03-04 Lucent Technologies Inc. System for analyzing translations
US5428707A (en) * 1992-11-13 1995-06-27 Dragon Systems, Inc. Apparatus and methods for training speech recognition systems and their users and otherwise improving speech recognition performance
US5510981A (en) * 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
US5548507A (en) * 1994-03-14 1996-08-20 International Business Machines Corporation Language identification process using coded language words
SE513456C2 (sv) * 1994-05-10 2000-09-18 Telia Ab Method and device for speech-to-text conversion
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction
US6070140A (en) * 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5761687A (en) * 1995-10-04 1998-06-02 Apple Computer, Inc. Character-based correction arrangement with correction propagation
JPH11514764A (ja) * 1995-10-31 1999-12-14 エス.エム. ハーツ,フレデリック System for customized electronic identification of desirable objects
US5982933A (en) * 1996-01-12 1999-11-09 Canon Kabushiki Kaisha Information processing method, information processing apparatus, and storage medium
EP0849723A3 (en) * 1996-12-20 1998-12-30 ATR Interpreting Telecommunications Research Laboratories Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US6073098A (en) * 1997-11-21 2000-06-06 At&T Corporation Method and apparatus for generating deterministic approximate weighted finite-state automata

Also Published As

Publication number Publication date
DE69829074T2 (de) 2005-06-30
DE69829074D1 (de) 2005-03-24
EP1038239B1 (en) 2005-02-16
JP2001526425A (ja) 2001-12-18
EP1038239A1 (en) 2000-09-27
EP1498827A3 (en) 2005-02-16
JP4638599B2 (ja) 2011-02-23
WO1999030252A1 (en) 1999-06-17
EP1498827A2 (en) 2005-01-19
DE69838763D1 (de) 2008-01-03
EP1498827B1 (en) 2007-11-21
US6157905A (en) 2000-12-05
DE69838763T2 (de) 2008-10-30

Similar Documents

Publication Publication Date Title
ATE289434T1 (de) Identifying language and character set of data representing text
WO2004051555A3 (en) Method and apparatus for improved information transactions
FI20010644A7 (fi) Determining the language of a character sequence
EP1107169A3 (en) Method and apparatus for performing document structure analysis
WO2001096980A3 (en) Method and system for text analysis
Coch Evaluating and comparing three text-production techniques
WO2003014966A3 (en) An apparatus and method for extracting information from a formatted document
BR0106463A (pt) Determination of the text font in an image
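JP2004355386A (ja) Question conversation relay method and device in a question answering system, question conversation relay program, and recording medium recording the question conversation relay program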
DE60018117D1 (de) Methodology for identifying plants and other organisms having characteristics that deviate from the normal sample
JP2885489B2 (ja) Document content retrieval device
CN116992824A (zh) Method and system for converting LaTeX formulas into natural language
CN112434138A (zh) Method and system for extracting testimony contradictions based on key comparison
PL360312A1 (en) Method for capturing a complete data set of forms provided with graphic characters
JPH09259148A5 (en)
Mäkinen Testing a stylometric tool in the study of Middle English documentary texts
Sankar Rath et al. Heuristics for identification of bibliographic elements from title pages
Politis Competence of human capital for digital printing
Blomskog Essays on the functioning of the Swedish labour market
Torres Adult children of alcoholics in the consecrated life and the response of religious communities to their ACOA members
Nilsson Equality of opportunity, heterogeneity and poverty
JP2001142876A (ja) Kanji search system
Barteld Dealing with Spelling Variation in Non-Standard Texts
Williams et al. Percutaneous penetration studies for risk assessment
Woods et al. WRITING IN EARLY MESOPOTAMIA PROJECT

Legal Events

Date Code Title Description
RER Ceased as to paragraph 5 lit. 3 law introducing patent treaties