DE69829074T2 - Identifying language and character set of data representing text - Google Patents


Info

Publication number
DE69829074T2
Authority
DE
Germany
Prior art keywords
language
data values
string
values
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
DE69829074T
Other languages
German (de)
English (en)
Other versions
DE69829074D1 (de)
Inventor
David Robert POWELL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp
Publication of DE69829074D1
Application granted
Publication of DE69829074T2
Anticipated expiration
Current legal status: Expired - Lifetime

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/137 Hierarchical processing, e.g. outlines
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
DE69829074T 1997-12-11 1998-12-04 Identifying language and character set of data representing text Expired - Lifetime DE69829074T2 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US08/987,565 US6157905A (en) 1997-12-11 1997-12-11 Identifying language and character set of data representing text
US987565 1997-12-11
PCT/US1998/025814 WO1999030252A1 (en) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Publications (2)

Publication Number Publication Date
DE69829074D1 DE69829074D1 (de) 2005-03-24
DE69829074T2 DE69829074T2 (de) 2005-06-30

Family

ID=25533370

Family Applications (2)

Application Number Title Priority Date Filing Date
DE69829074T Expired - Lifetime DE69829074T2 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text
DE69838763T Expired - Lifetime DE69838763T2 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Family Applications After (1)

Application Number Title Priority Date Filing Date
DE69838763T Expired - Lifetime DE69838763T2 (de) 1997-12-11 1998-12-04 Identifying language and character set of data representing text

Country Status (6)

Country Link
US (1) US6157905A
EP (2) EP1038239B1
JP (1) JP4638599B2
AT (1) ATE289434T1
DE (2) DE69829074T2
WO (1) WO1999030252A1

Families Citing this family (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6718519B1 (en) 1998-12-31 2004-04-06 International Business Machines Corporation System and method for outputting character sets in best available fonts
US7103532B1 (en) 1998-12-31 2006-09-05 International Business Machines Corp. System and method for evaluating character in a message
US7031002B1 (en) 1998-12-31 2006-04-18 International Business Machines Corporation System and method for using character set matching to enhance print quality
US6539118B1 (en) * 1998-12-31 2003-03-25 International Business Machines Corporation System and method for evaluating character sets of a message containing a plurality of character sets
US6813747B1 (en) 1998-12-31 2004-11-02 International Business Machines Corporation System and method for output of multipart documents
US7039637B2 (en) 1998-12-31 2006-05-02 International Business Machines Corporation System and method for evaluating characters in an inputted search string against a character table bank comprising a predetermined number of columns that correspond to a plurality of pre-determined candidate character sets in order to provide enhanced full text search
US6760887B1 (en) 1998-12-31 2004-07-06 International Business Machines Corporation System and method for highlighting of multifont documents
US6658151B2 (en) * 1999-04-08 2003-12-02 Ricoh Co., Ltd. Extracting information from symbolically compressed document images
US7191114B1 (en) 1999-08-27 2007-03-13 International Business Machines Corporation System and method for evaluating character sets to determine a best match encoding a message
US7155672B1 (en) * 2000-05-23 2006-12-26 Spyglass, Inc. Method and system for dynamic font subsetting
US6668085B1 (en) * 2000-08-01 2003-12-23 Xerox Corporation Character matching process for text converted from images
TW561360B (en) * 2000-08-22 2003-11-11 Ibm Method and system for case conversion
GB2366940B (en) * 2000-09-06 2004-08-11 Ericsson Telefon Ab L M Text language detection
JP2002189627A (ja) * 2000-12-21 2002-07-05 Tsubasa System Co Ltd Document data conversion method
US7900143B2 (en) * 2000-12-27 2011-03-01 Intel Corporation Large character set browser
JP2002268665A (ja) * 2001-03-13 2002-09-20 Oki Electric Ind Co Ltd Text-to-speech synthesis device
US20040205675A1 (en) * 2002-01-11 2004-10-14 Thangaraj Veerappan System and method for determining a document language and refining the character set encoding based on the document language
US7020338B1 (en) 2002-04-08 2006-03-28 The United States Of America As Represented By The National Security Agency Method of identifying script of line of text
US20040078191A1 (en) * 2002-10-22 2004-04-22 Nokia Corporation Scalable neural network-based language identification from written text
FR2848688A1 (fr) * 2002-12-17 2004-06-18 France Telecom Language identification of a text
JP4662944B2 (ja) 2003-11-12 2011-03-30 The Trustees of Columbia University in the City of New York Apparatus, method, and medium for detecting payload anomalies using the n-gram distribution of normal data
US7865355B2 (en) * 2004-07-30 2011-01-04 Sap Aktiengesellschaft Fast text character set recognition
US7305385B1 (en) * 2004-09-10 2007-12-04 Aol Llc N-gram based text searching
US7612897B2 (en) * 2004-09-24 2009-11-03 Seiko Epson Corporation Method of managing the printing of characters and a printing device employing method
US7729900B2 (en) * 2004-09-29 2010-06-01 Microsoft Corporation Method and computer-readable medium for consistent configuration of language support across operating system and application programs
US9122655B2 (en) * 2004-11-15 2015-09-01 International Business Machines Corporation Pre-translation testing of bi-directional language display
US8027832B2 (en) * 2005-02-11 2011-09-27 Microsoft Corporation Efficient language identification
JP4314204B2 (ja) * 2005-03-11 2009-08-12 Toshiba Corp Document management method, system, and program
US7774293B2 (en) * 2005-03-17 2010-08-10 University Of Maryland System and methods for assessing risk using hybrid causal logic
EP1746516A1 (en) * 2005-07-20 2007-01-24 Microsoft Corporation Character generator
US7689531B1 (en) * 2005-09-28 2010-03-30 Trend Micro Incorporated Automatic charset detection using support vector machines with charset grouping
US7711673B1 (en) * 2005-09-28 2010-05-04 Trend Micro Incorporated Automatic charset detection using SIM algorithm with charset grouping
GB0524354D0 (en) * 2005-11-30 2006-01-04 Ibm Method, system and computer program product for composing a reply to a text message received in a messaging application
KR100814641B1 (ko) * 2006-10-23 2008-03-18 Sungkyunkwan University Foundation for Corporate Collaboration User-driven voice service system and service method thereof
US20080243477A1 (en) * 2007-03-30 2008-10-02 Rulespace Llc Multi-staged language classification
US9141607B1 (en) * 2007-05-30 2015-09-22 Google Inc. Determining optical character recognition parameters
US8315482B2 (en) * 2007-06-26 2012-11-20 Microsoft Corporation Integrated platform for user input of digital ink
US8233726B1 (en) * 2007-11-27 2012-07-31 Google Inc. Image-domain script and language identification
US8107671B2 (en) 2008-06-26 2012-01-31 Microsoft Corporation Script detection service
US8073680B2 (en) * 2008-06-26 2011-12-06 Microsoft Corporation Language detection service
US8266514B2 (en) 2008-06-26 2012-09-11 Microsoft Corporation Map service
US8019596B2 (en) * 2008-06-26 2011-09-13 Microsoft Corporation Linguistic service platform
US8224641B2 (en) 2008-11-19 2012-07-17 Stratify, Inc. Language identification for documents containing multiple languages
US8224642B2 (en) * 2008-11-20 2012-07-17 Stratify, Inc. Automated identification of documents as not belonging to any language
US8468011B1 (en) * 2009-06-05 2013-06-18 Google Inc. Detecting writing systems and languages
US8326602B2 (en) * 2009-06-05 2012-12-04 Google Inc. Detecting writing systems and languages
US9454514B2 (en) * 2009-09-02 2016-09-27 Red Hat, Inc. Local language numeral conversion in numeric computing
US20110087962A1 (en) * 2009-10-14 2011-04-14 Qualcomm Incorporated Method and apparatus for the automatic predictive selection of input methods for web browsers
US8560466B2 (en) 2010-02-26 2013-10-15 Trend Micro Incorporated Method and arrangement for automatic charset detection
CN102479187B (zh) * 2010-11-23 2016-09-14 Shengle Information Technology (Shanghai) Co., Ltd. Parity-check-based GBK character query system and implementation method thereof
US9535895B2 (en) * 2011-03-17 2017-01-03 Amazon Technologies, Inc. n-Gram-based language prediction
GB2489512A (en) * 2011-03-31 2012-10-03 Clearswift Ltd Classifying data using fingerprint of character encoding
US20140163969A1 (en) * 2011-07-20 2014-06-12 Tata Consultancy Services Limited Method and system for differentiating textual information embedded in streaming news video
US8769404B2 (en) * 2012-01-03 2014-07-01 International Business Machines Corporation Rule-based locale definition generation for a new or customized locale support
US20140129928A1 (en) * 2012-11-06 2014-05-08 Psyentific Mind Inc. Method and system for representing capitalization of letters while preserving their category similarity to lowercase letters
US9195644B2 (en) 2012-12-18 2015-11-24 Lenovo Enterprise Solutions (Singapore) Pte. Ltd. Short phrase language identification
US9372850B1 (en) * 2012-12-19 2016-06-21 Amazon Technologies, Inc. Machined book detection
US9231898B2 (en) 2013-02-08 2016-01-05 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9600473B2 (en) 2013-02-08 2017-03-21 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9031829B2 (en) 2013-02-08 2015-05-12 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US10650103B2 (en) 2013-02-08 2020-05-12 Mz Ip Holdings, Llc Systems and methods for incentivizing user feedback for translation processing
US9298703B2 (en) 2013-02-08 2016-03-29 Machine Zone, Inc. Systems and methods for incentivizing user feedback for translation processing
US8996352B2 (en) 2013-02-08 2015-03-31 Machine Zone, Inc. Systems and methods for correcting translations in multi-user multi-lingual communications
US9244894B1 (en) * 2013-09-16 2016-01-26 Arria Data2Text Limited Method and apparatus for interactive reports
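TWI508561B (zh) 2013-11-27 2015-11-11 Wistron Corp Electronic program guide generating device and electronic program guide generating method
JP6300512B2 (ja) * 2013-12-19 2018-03-28 Soliton Systems K.K. Determination device, determination method, and program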
US9740687B2 (en) 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities
US10162811B2 (en) * 2014-10-17 2018-12-25 Mz Ip Holdings, Llc Systems and methods for language detection
US9864744B2 (en) 2014-12-03 2018-01-09 Facebook, Inc. Mining multi-lingual data
US10067936B2 (en) 2014-12-30 2018-09-04 Facebook, Inc. Machine translation output reranking
US9830404B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Analyzing language dependency structures
US9830386B2 (en) 2014-12-30 2017-11-28 Facebook, Inc. Determining trending topics in social media
US9477652B2 (en) 2015-02-13 2016-10-25 Facebook, Inc. Machine learning dialect identification
US9734142B2 (en) * 2015-09-22 2017-08-15 Facebook, Inc. Universal translation
JP6655331B2 (ja) * 2015-09-24 2020-02-26 Dynabook Inc. Electronic device and method
US10133738B2 (en) 2015-12-14 2018-11-20 Facebook, Inc. Translation confidence scores
US9734143B2 (en) 2015-12-17 2017-08-15 Facebook, Inc. Multi-media context language processing
US10002125B2 (en) 2015-12-28 2018-06-19 Facebook, Inc. Language model personalization
US9747283B2 (en) 2015-12-28 2017-08-29 Facebook, Inc. Predicting future translations
US9805029B2 (en) 2015-12-28 2017-10-31 Facebook, Inc. Predicting future translations
US10765956B2 (en) 2016-01-07 2020-09-08 Machine Zone Inc. Named entity recognition on chat data
US10902215B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10902221B1 (en) 2016-06-30 2021-01-26 Facebook, Inc. Social hash for language models
US10120860B2 (en) * 2016-12-21 2018-11-06 Intel Corporation Methods and apparatus to identify a count of n-grams appearing in a corpus
US10180935B2 (en) 2016-12-30 2019-01-15 Facebook, Inc. Identifying multiple languages in a content item
WO2019060353A1 (en) 2017-09-21 2019-03-28 Mz Ip Holdings, Llc SYSTEM AND METHOD FOR TRANSLATION OF KEYBOARD MESSAGES
US10380249B2 (en) 2017-10-02 2019-08-13 Facebook, Inc. Predicting future trending topics
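CN112334974B (zh) * 2018-10-11 2024-07-05 Google LLC Speech generation using cross-lingual phoneme mapping
JP6781905B1 (ja) * 2019-07-26 2020-11-11 FRONTEO, Inc. Information processing device, natural language processing system, control method, and control program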

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5261009A (en) * 1985-10-15 1993-11-09 Palantir Corporation Means for resolving ambiguities in text based upon character context
US5062143A (en) * 1990-02-23 1991-10-29 Harris Corporation Trigram-based method of language identification
US5592667A (en) * 1991-05-29 1997-01-07 Triada, Ltd. Method of storing compressed data for accelerated interrogation
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
GB9220404D0 (en) * 1992-08-20 1992-11-11 Nat Security Agency Method of identifying, retrieving and sorting documents
US5608622A (en) * 1992-09-11 1997-03-04 Lucent Technologies Inc. System for analyzing translations
US5428707A (en) * 1992-11-13 1995-06-27 Dragon Systems, Inc. Apparatus and methods for training speech recognition systems and their users and otherwise improving speech recognition performance
US5510981A (en) * 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
US5548507A (en) * 1994-03-14 1996-08-20 International Business Machines Corporation Language identification process using coded language words
SE513456C2 (sv) * 1994-05-10 2000-09-18 Telia Ab Method and device for speech-to-text conversion
US5594809A (en) * 1995-04-28 1997-01-14 Xerox Corporation Automatic training of character templates using a text line image, a text line transcription and a line image source model
US5883986A (en) * 1995-06-02 1999-03-16 Xerox Corporation Method and system for automatic transcription correction
US6070140A (en) * 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
US5774588A (en) * 1995-06-07 1998-06-30 United Parcel Service Of America, Inc. Method and system for comparing strings with entries of a lexicon
US5761687A (en) * 1995-10-04 1998-06-02 Apple Computer, Inc. Character-based correction arrangement with correction propagation
AU7674996A (en) * 1995-10-31 1997-05-22 Herz, Frederick S.M. System for customized electronic identification of desirable objects
US5982933A (en) * 1996-01-12 1999-11-09 Canon Kabushiki Kaisha Information processing method, information processing apparatus, and storage medium
EP0849723A3 (en) * 1996-12-20 1998-12-30 ATR Interpreting Telecommunications Research Laboratories Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US6073098A (en) * 1997-11-21 2000-06-06 At&T Corporation Method and apparatus for generating deterministic approximate weighted finite-state automata

Also Published As

Publication number Publication date
EP1038239B1 (en) 2005-02-16
WO1999030252A1 (en) 1999-06-17
DE69838763D1 (de) 2008-01-03
DE69838763T2 (de) 2008-10-30
EP1498827B1 (en) 2007-11-21
JP2001526425A (ja) 2001-12-18
DE69829074D1 (de) 2005-03-24
ATE289434T1 (de) 2005-03-15
EP1038239A1 (en) 2000-09-27
EP1498827A2 (en) 2005-01-19
EP1498827A3 (en) 2005-02-16
JP4638599B2 (ja) 2011-02-23
US6157905A (en) 2000-12-05

Similar Documents

Publication Publication Date Title
DE69829074T2 (de) Identifying language and character set of data representing text
DE60029845T2 (de) System for identifying the relationships between constituents in information-retrieval-type tasks
DE3853894T2 (de) Paradigm-based morphological text analysis for natural languages.
DE69722971T2 (de) Automatic language identification system for multilingual optical character recognition
DE69428590T2 (de) Handwriting recognition based on combined lexicon and character string probability
US7359851B2 (en) Method of identifying the language of a textual passage using short word and/or n-gram comparisons
DE69710459T2 (de) Identification of words in Japanese text by a computer system
DE69513369T2 (de) Method and device for summarizing static processes into a rule-based, grammatically defined natural language
DE69229204T2 (de) Iterative method for searching for phrases and information retrieval system using it
DE69330633T2 (de) Method and apparatus for comparing semantic patterns for text retrieval
EP1665132B1 (de) Method and system for capturing data from multiple machine-readable documents
DE69229537T2 (de) Method and device for document processing
DE69820343T2 (de) Linguistic search system
DE69432575T2 (de) Document recognition system with improved document recognition effectiveness
DE69424350T2 (de) Context-sensitive method for finding information about a word in an electronic dictionary
DE3788488T2 (de) Language translation system.
DE10343228A1 (de) Methods and systems for organizing electronic documents
DE102004003878A1 (de) System and method for identifying a special word usage in a document
DE4232507A1 (de) Method for identifying, retrieving, and sorting documents
DE60118399T2 (de) System and method for automatic preparation and searching of scanned documents
DE10308550A1 (de) System and method for automatic data checking and correction
DE112018005272T5 (de) Searching multilingual documents based on extraction of the document structure
DE102005051617B4 (de) Automatic computer-based similarity calculation system for quantifying the similarity of text expressions
KR20170140808A (ko) System and method for asymmetric formatting of word spaces according to the uncertainty between words
DE102013224331A1 (de) System and method for providing predictive queries

Legal Events

Date Code Title Description
8364 No opposition during term of opposition