DE69829074D1 - Identifying language and character set of data representing text - Google Patents

Identifying language and character set of data representing text

Info

Publication number
DE69829074D1
Authority
DE
Germany
Prior art keywords
data values
language
series
facility
text
Prior art date
Legal status
Expired - Lifetime
Application number
DE69829074T
Other languages
English (en)
Other versions
DE69829074T2 (de)
Inventor
David Powell
Current Assignee
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date
1997-12-11
Filing date
1998-12-04
Publication date
2005-03-24
Application filed by Microsoft Corp
Publication of DE69829074D1
Application granted
Publication of DE69829074T2
Anticipated expiration
Expired - Lifetime

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Controls And Circuits For Display Device (AREA)
DE69829074T 1997-12-11 1998-12-04 Identifying language and character set of data representing text Expired - Lifetime DE69829074T2 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US987565 1997-12-11
US08/987,565 US6157905A (en) 1997-12-11 1997-12-11 Identifying language and character set of data representing text
PCT/US1998/025814 WO1999030252A1 (en) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Publications (2)

Publication Number Publication Date
DE69829074D1 (de) 2005-03-24
DE69829074T2 DE69829074T2 (de) 2005-06-30

Family

ID=25533370

Family Applications (2)

Application Number Title Priority Date Filing Date
DE69838763T Expired - Lifetime DE69838763T2 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text
DE69829074T Expired - Lifetime DE69829074T2 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Family Applications Before (1)

Application Number Title Priority Date Filing Date
DE69838763T Expired - Lifetime DE69838763T2 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Country Status (6)

Country Link
US (1) US6157905A (de)
EP (2) EP1498827B1 (de)
JP (1) JP4638599B2 (de)
AT (1) ATE289434T1 (de)
DE (2) DE69838763T2 (de)
WO (1) WO1999030252A1 (de)

Families Citing this family (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539118B1 (en) * 1998-12-31 2003-03-25 International Business Machines Corporation System and method for evaluating character sets of a message containing a plurality of character sets
US7103532B1 (en) 1998-12-31 2006-09-05 International Business Machines Corp. System and method for evaluating character in a message
US6813747B1 (en) 1998-12-31 2004-11-02 International Business Machines Corporation System and method for output of multipart documents
US7039637B2 (en) 1998-12-31 2006-05-02 International Business Machines Corporation System and method for evaluating characters in an inputted search string against a character table bank comprising a predetermined number of columns that correspond to a plurality of pre-determined candidate character sets in order to provide enhanced full text search
US6718519B1 (en) 1998-12-31 2004-04-06 International Business Machines Corporation System and method for outputting character sets in best available fonts
US7031002B1 (en) 1998-12-31 2006-04-18 International Business Machines Corporation System and method for using character set matching to enhance print quality
US6760887B1 (en) 1998-12-31 2004-07-06 International Business Machines Corporation System and method for highlighting of multifont documents
US6658151B2 (en) * 1999-04-08 2003-12-02 Ricoh Co., Ltd. Extracting information from symbolically compressed document images
US7191114B1 (en) 1999-08-27 2007-03-13 International Business Machines Corporation System and method for evaluating character sets to determine a best match encoding a message
US7155672B1 (en) * 2000-05-23 2006-12-26 Spyglass, Inc. Method and system for dynamic font subsetting
US6668085B1 (en) * 2000-08-01 2003-12-23 Xerox Corporation Character matching process for text converted from images
TW561360B (en) * 2000-08-22 2003-11-11 Ibm Method and system for case conversion
GB2366940B (en) * 2000-09-06 2004-08-11 Ericsson Telefon Ab L M Text language detection
JP2002189627A (ja) * 2000-12-21 2002-07-05 Tsubasa System Co Ltd Document data conversion method
US7900143B2 (en) * 2000-12-27 2011-03-01 Intel Corporation Large character set browser
JP2002268665A (ja) * 2001-03-13 2002-09-20 Oki Electric Ind Co Ltd Text-to-speech synthesis device
US20040205675A1 (en) * 2002-01-11 2004-10-14 Thangaraj Veerappan System and method for determining a document language and refining the character set encoding based on the document language
US7020338B1 (en) 2002-04-08 2006-03-28 The United States Of America As Represented By The National Security Agency Method of identifying script of line of text
US20040078191A1 (en) * 2002-10-22 2004-04-22 Nokia Corporation Scalable neural network-based language identification from written text
FR2848688A1 (fr) * 2002-12-17 2004-06-18 France Telecom Language identification of a text
ES2423491T3 (es) 2003-11-12 2013-09-20 The Trustees Of Columbia University In The City Of New York Apparatus, method and medium for detecting a payload anomaly using the n-gram distribution of normal data
US7865355B2 (en) * 2004-07-30 2011-01-04 Sap Aktiengesellschaft Fast text character set recognition
US7305385B1 (en) * 2004-09-10 2007-12-04 Aol Llc N-gram based text searching
US7612897B2 (en) * 2004-09-24 2009-11-03 Seiko Epson Corporation Method of managing the printing of characters and a printing device employing method
US7729900B2 (en) * 2004-09-29 2010-06-01 Microsoft Corporation Method and computer-readable medium for consistent configuration of language support across operating system and application programs
US9122655B2 (en) * 2004-11-15 2015-09-01 International Business Machines Corporation Pre-translation testing of bi-directional language display
US8027832B2 (en) * 2005-02-11 2011-09-27 Microsoft Corporation Efficient language identification
JP4314204B2 (ja) * 2005-03-11 2009-08-12 株式会社東芝 Document management method, system and program
US7774293B2 (en) * 2005-03-17 2010-08-10 University Of Maryland System and methods for assessing risk using hybrid causal logic
EP1746516A1 (de) * 2005-07-20 2007-01-24 Microsoft Corporation Character generator
US7711673B1 (en) * 2005-09-28 2010-05-04 Trend Micro Incorporated Automatic charset detection using SIM algorithm with charset grouping
US7689531B1 (en) * 2005-09-28 2010-03-30 Trend Micro Incorporated Automatic charset detection using support vector machines with charset grouping
GB0524354D0 (en) * 2005-11-30 2006-01-04 Ibm Method, system and computer program product for composing a reply to a text message received in a messaging application
KR100814641B1 (ko) * 2006-10-23 2008-03-18 성균관대학교산학협력단 User-driven voice service system and service method thereof
US20080243477A1 (en) * 2007-03-30 2008-10-02 Rulespace Llc Multi-staged language classification
US9141607B1 (en) * 2007-05-30 2015-09-22 Google Inc. Determining optical character recognition parameters
US8315482B2 (en) * 2007-06-26 2012-11-20 Microsoft Corporation Integrated platform for user input of digital ink
US8233726B1 (en) * 2007-11-27 2012-07-31 Googe Inc. Image-domain script and language identification
US8019596B2 (en) * 2008-06-26 2011-09-13 Microsoft Corporation Linguistic service platform
US8107671B2 (en) 2008-06-26 2012-01-31 Microsoft Corporation Script detection service
US8266514B2 (en) * 2008-06-26 2012-09-11 Microsoft Corporation Map service
US8073680B2 (en) * 2008-06-26 2011-12-06 Microsoft Corporation Language detection service
US8224641B2 (en) 2008-11-19 2012-07-17 Stratify, Inc. Language identification for documents containing multiple languages
US8224642B2 (en) * 2008-11-20 2012-07-17 Stratify, Inc. Automated identification of documents as not belonging to any language
US8468011B1 (en) * 2009-06-05 2013-06-18 Google Inc. Detecting writing systems and languages
US8326602B2 (en) * 2009-06-05 2012-12-04 Google Inc. Detecting writing systems and languages
US9454514B2 (en) * 2009-09-02 2016-09-27 Red Hat, Inc. Local language numeral conversion in numeric computing
US20110087962A1 (en) * 2009-10-14 2011-04-14 Qualcomm Incorporated Method and apparatus for the automatic predictive selection of input methods for web browsers
US8560466B2 (en) 2010-02-26 2013-10-15 Trend Micro Incorporated Method and arrangement for automatic charset detection
CN102479187B (zh) * 2010-11-23 2016-09-14 盛乐信息技术(上海)有限公司 GBK character query system based on parity check and implementation method thereof
US9535895B2 (en) * 2011-03-17 2017-01-03 Amazon Technologies, Inc. n-Gram-based language prediction
GB2489512A (en) * 2011-03-31 2012-10-03 Clearswift Ltd Classifying data using fingerprint of character encoding
WO2013054348A2 (en) * 2011-07-20 2013-04-18 Tata Consultancy Services Limited A method and system for differentiating textual information embedded in streaming news video
US8769404B2 (en) * 2012-01-03 2014-07-01 International Business Machines Corporation Rule-based locale definition generation for a new or customized locale support
US20140129928A1 (en) * 2012-11-06 2014-05-08 Psyentific Mind Inc. Method and system for representing capitalization of letters while preserving their category similarity to lowercase letters
US9195644B2 (en) 2012-12-18 2015-11-24 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Short phrase language identification
US9372850B1 (en) * 2012-12-19 2016-06-21 Amazon Technologies, Inc. Machined book detection
US9600473B2 (en) 2013-02-08 2017-03-21 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US10650103B2 (en) 2013-02-08 2020-05-12 Mz Ip Holdings, Llc Systems and methods for incentivizing user feedback for translation processing
US9231898B2 (en) 2013-02-08 2016-01-05 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9298703B2 (en) 2013-02-08 2016-03-29 Machine Zone, Inc. Systems and methods for incentivizing user feedback for translation processing
US8996352B2 (en) 2013-02-08 2015-03-31 Machine Zone, Inc. Systems and methods for correcting translations in multi-user multi-lingual communications
US9031829B2 (en) 2013-02-08 2015-05-12 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9244894B1 (en) * 2013-09-16 2016-01-26 Arria Data2Text Limited Method and apparatus for interactive reports
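TWI508561B (zh) 2013-11-27 2015-11-11 Wistron Corp Electronic program guide generating device and electronic program guide generating method
JP6300512B2 (ja) * 2013-12-19 2018-03-28 株式会社ソリトンシステムズ Determination device, determination method, and program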
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US10162811B2 (en) * 2014-10-17 2018-12-25 Mz Ip Holdings, Llc Systems and methods for language detection
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US9477652B2 (en) 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
US9734142B2 (en) * 2015-09-22 2017-08-15 Facebook, Inc. Universal translation
JP6655331B2 (ja) * 2015-09-24 2020-02-26 Dynabook株式会社 Electronic device and method
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US9734143B2 (en) 2015-12-17 2017-08-15 Facebook, Inc. Multi-media context language processing
US9747283B2 (en) 2015-12-28 2017-08-29 Facebook, Inc. Predicting future translations
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US10002125B2 (en) 2015-12-28 2018-06-19 Facebook, Inc. Language model personalization
US10765956B2 (en) 2016-01-07 2020-09-08 Machine Zone Inc. Named entity recognition on chat data
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10120860B2 (en) * 2016-12-21 2018-11-06 Intel Corporation Methods and apparatus to identify a count of n-grams appearing in a corpus
US10180935B2 (en) 2016-12-30 2019-01-15 Facebook, Inc. Identifying multiple languages in a content item
US10769387B2 (en) 2017-09-21 2020-09-08 Mz Ip Holdings, Llc System and method for translating chat messages
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
EP3662467B1 (de) * 2018-10-11 2021-07-07 Google LLC Speech generation using cross-lingual phoneme mapping
JP6781905B1 (ja) * 2019-07-26 2020-11-11 株式会社Fronteo Information processing device, natural language processing system, control method, and control program

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5261009A (en) * 1985-10-15 1993-11-09 Palantir Corporation Means for resolving ambiguities in text passed upon character context
US5062143A (en) * 1990-02-23 1991-10-29 Harris Corporation Trigram-based method of language identification
US5592667A (en) * 1991-05-29 1997-01-07 Triada, Ltd. Method of storing compressed data for accelerated interrogation
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
GB9220404D0 (en) * 1992-08-20 1992-11-11 Nat Security Agency Method of identifying, retrieving and sorting documents
US5608622A (en) * 1992-09-11 1997-03-04 Lucent Technologies Inc. System for analyzing translations
US5428707A (en) * 1992-11-13 1995-06-27 Dragon Systems, Inc. Apparatus and methods for training speech recognition systems and their users and otherwise improving speech recognition performance
US5510981A (en) * 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
US5548507A (en) * 1994-03-14 1996-08-20 International Business Machines Corporation Language identification process using coded language words
SE513456C2 (sv) * 1994-05-10 2000-09-18 Telia Ab Method and device for speech-to-text conversion
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction
US6070140A (en) * 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5761687A (en) * 1995-10-04 1998-06-02 Apple Computer, Inc. Character-based correction arrangement with correction propagation
AU7674996A (en) * 1995-10-31 1997-05-22 Herz, Frederick S.M. System for customized electronic identification of desirable objects
US5982933A (en) * 1996-01-12 1999-11-09 Canon Kabushiki Kaisha Information processing method, information processing apparatus, and storage medium
EP0849723A3 (de) * 1996-12-20 1998-12-30 ATR Interpreting Telecommunications Research Laboratories Speech recognition apparatus with means for eliminating erroneous candidates
US6073098A (en) * 1997-11-21 2000-06-06 At&T Corporation Method and apparatus for generating deterministic approximate weighted finite-state automata

Also Published As

Publication number Publication date
JP4638599B2 (ja) 2011-02-23
DE69838763T2 (de) 2008-10-30
EP1498827B1 (de) 2007-11-21
DE69829074T2 (de) 2005-06-30
WO1999030252A1 (en) 1999-06-17
EP1038239A1 (de) 2000-09-27
EP1038239B1 (de) 2005-02-16
EP1498827A2 (de) 2005-01-19
ATE289434T1 (de) 2005-03-15
JP2001526425A (ja) 2001-12-18
EP1498827A3 (de) 2005-02-16
DE69838763D1 (de) 2008-01-03
US6157905A (en) 2000-12-05

Similar Documents

Publication Publication Date Title
DE69838763D1 (de) Identifying language and character set of data representing text
Evans et al. An evaluation of syntactic simplification rules for people with autism
WO2004051555A3 (en) Method and apparatus for improved information transactions
Nation Using small corpora to investigate learner needs
WO2001096980A3 (en) Method and system for text analysis
RU2004135023A (ru) Text input method
Coch Evaluating and comparing three text-production techniques
KR940009861A (ko) Variable replacement device
de Kok et al. PP Attachment: where do we stand?
Pandey A Roadmap for Scripts of the Landa Family
JP2017201483A (ja) Table motif extraction device, classifier learning device, table type classification device, method, and program
Sankar Rath et al. Heuristics for identification of bibliographic elements from title pages
Hamed Automatic diacritization as prerequisite towards the automatic generation of Arabic lexical recognition tests
Le Nguyen et al. A new sentence reduction based on decision tree model
Nuopponen Concept systems and analysis of special language texts. Towards a terminological text analysis
Eberendu et al. Readability Analysis of Nigeria National Daily Newspapers
Eriksson et al. The internationalisation of service companies: two case studies in the staffing industry
Barteld Dealing with Spelling Variation in Non-Standard Texts
JPWO2002095614A1 (ja) Language and character code system identification processing method
Wright et al. Self-Assessment of Ethical Decision Making Predispositions
Battistella Why we curse: A neuro-psycho-social theory of speech
Lindholm et al. Competitor intelligence process: how companies plan, collect, analyse, communicate and utilise competitor intelligence
Hurskainen Enriching text with tone marks: An application to Kinyarwanda language
Conyne On expanding horizons
JP2001142876A (ja) Kanji search system

Legal Events

Date Code Title Description
8364 No opposition during term of opposition