WO2002027539A1 - Method for normalizing case - Google Patents

Method for normalizing case

Info

Publication number
WO2002027539A1
Authority
WO
WIPO (PCT)
Prior art keywords
input word
word type
assigned
case
group
Prior art date
Application number
PCT/SE2001/002069
Other languages
English (en)
Inventor
Eva Ejerhed
Original Assignee
Hapax Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hapax Ltd
Priority to EP01970463A (EP1325429B1)
Priority to AU2001290464A (AU2001290464A1)
Priority to DE60136478T (DE60136478D1)
Publication of WO2002027539A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G06F40/129: Handling non-Latin characters, e.g. kana-to-kanji conversion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Definitions

  • the invention is based on the recognition that local information, such as the occurrence and location of upper case letters in word types, together with global information, such as the occurrence of word types that only differ with respect to the case of one or more letters, can be used to determine whether the distinction of case of the letter is significant or not.
  • a method for automatically distinguishing significant from insignificant distinctions of upper and lower case in a number of input word types by means of a computer is provided.
  • an input word type is assigned to one of a number of disjoint local groups based on the case, and position, of the letters that make up the word type.
  • said input word type is reassigned to one of a number of disjoint global groups, based on which local groups the case variants of the input word type are assigned to.
  • cases are normalized for said input word type in accordance with predetermined rules associated with the global group said input word type is assigned to.
  • a large number of word types that have been identified in a very large text database are input to a computer.
  • the word types are input as they appear in the text database, i.e. the cases of the letters of the word types are maintained.
  • two word tokens in the text database that are identical except for the case of one or more letters will be input as two different word types, whereas two word tokens in the text database that are identical also in terms of the case of the letters will be input as one word type.
  • the method, which is performed fully automatically by means of a computer, then makes use of both local information and global information regarding cases of the word types.
  • the predetermined rules also include rules that detect when no normalization is to be done, which happens when the cases of letters in the word types are considered to be significant. In this way, the cases are preserved for those input word types that do not have any case variants, and for those input word types that have case variants for which the case difference is considered to be significant, whereas the cases are normalized for input word types for which the case difference is considered to be insignificant.
  • An advantage of this method is that the number of word types that should be stored, for example in a database, is decreased. At the same time, the information conveyed by the case is preserved when the case is considered to be significant. Thus, the size of the database is decreased, which lowers the cost of the database and speeds up lookup in the database.
  • each input word type is associated with a frequency that indicates the number of occurrences of the input word type in the natural language text.
  • the case variants of an input word type are then normalized in accordance with predetermined rules associated with (a) the global group that the input word type is assigned to, and (b) the frequency of the case variants of the input word type.
  • the additional information regarding the number of times each word type has occurred in the natural language text is used in the determination of whether and how an input word type should be normalized.
  • information regarding the frequency of each case variant of a word type may indicate that the default normalization associated with the global group of the case variants should not be applied.
  • each input word type is associated with a sentence position that indicates whether the input word type occurred in a sentence internal position and/or in a sentence initial position in the natural language text.
  • the case variants of an input word type are then normalized in accordance with predetermined rules referring to the global group of the input word type and to the sentence positions of the case variants of said input word type.
  • information regarding each specific group of case variants can be weighed in when determining whether and how an input word type should be normalized. For example, information regarding the sentence position of each case variant of a word type may indicate that the default normalization associated with the global group of the case variants should not be applied.
  • each word type that begins with an alphabetic character is assigned to one of four disjoint local groups in step 110.
  • a word type is assigned to a local group on the basis of the case of the initial letter of the word type and the case of non-initial letters of the word type. More specifically, in step 115A, each word type that has an upper case initial letter and no lower case non-initial letters is assigned to a first local group (LG1).
  • each word type that has an upper case initial letter and at least one lower case non-initial letter is assigned to a second local group (LG2).
  • After the identification of the local information, i.e. the information that can be found by just considering each word type in its local contexts of occurrence, each word type is reassigned to one of four disjoint global groups in step 120. A word type is reassigned to a global group on the basis of the local groups to which the case variants of the word type are assigned.
  • the identification of case variants, i.e. word types that are equal to each other except for the case of one or more letters, can be done in several different ways that are obvious to a person skilled in the art. When all case variants have been found for a common word type, the local groups to which the case variants are assigned are identified.
  • the word type is assigned to the third global group (GG3) in step 125C. If at least one case variant of a word type is assigned to the first local group, and at least one case variant of the word type is assigned to the second local group, and at least one case variant of the word type is assigned to the third local group, then the word type is assigned to the fourth global group (GG4) in step 125D.
  • global information, i.e. information that can be found by analyzing the occurrence of a word type and case variants of the word type in an entire text database, is identified.
  • the two word types "UNESCO" and "Unesco" have been input to the method. These word types are case variants of a common word type.
  • the case variant "UNESCO" is assigned to the first local group and the case variant "Unesco" is assigned to the second local group.
  • the two case variants are associated with their respective frequencies and the frequency of the case variant "UNESCO" is larger than the frequency of the case variant "Unesco".
  • although the case variants are assigned to the first global group, for which the default normal form is the case variant assigned to the second local group, they will be normalized to the case variant that is assigned to the first local group instead, i.e. the case variant "UNESCO". This is because the respective frequencies of the case variants override the predetermined rules associated with the global group.
  • such case variants are acronyms for which the case variant assigned to the first local group is considered to be the normal form.
  • the case of each input word type is normalized in accordance with predetermined rules associated with (1) the global group of the input word type and (2) the sentence position of each case variant of the input word type. More specifically, the cases of an input word type are normalized according to the same rules as in the embodiment described with reference to figure 1 with two exceptions. If an input word type is assigned to the third or fourth global group, the normalization will not be performed if the case variant assigned to the second local group is associated with a sentence position indicating that the input word type occurred in a sentence internal position in the natural language text.
  • the sentence positions of the case variants indicate that the predetermined rules associated with the global group should not be used.
  • the difference of cases between the case variants conveys information that should be preserved. More specifically, the case variant "Bill" could be both a name and an ordinary noun. If, on the other hand, the two word types "Car" and "car" have been input to the method, and the sentence position information about the case variant "Car" indicates that this case variant only occurs in a sentence initial position, while the sentence position information about the case variant "car" indicates that this case variant only occurs in a sentence internal position, then the rules of the embodiment described with reference to figure 1 would be used and the two case variants are normalized to the case variant that is assigned to the third local group, i.e. the case variant "car".
  • the embodiments described above can be implemented in a computer program comprising computer-executable instructions for performing the steps.
  • the computer program can then be stored on any computer readable media and the embodiments may then be performed by means of a general purpose computer accessing this media.
  • the embodiments can also be implemented directly in hardware, such as one or more computer processors that are arranged to perform the steps.
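
The grouping and normalization steps described in the excerpts above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `local_group`, `GLOBAL_GROUPS` and `normalize`, the `(frequency, positions)` data layout, and the example counts are hypothetical. The excerpts define LG1, LG2 and the GG4 combination, and their examples imply GG1 (UNESCO/Unesco) and GG3 (Car/car); the definitions of LG3, LG4 and GG2, and the default normal forms for GG2 and GG4, are assumptions made here only so that the sketch is complete.

```python
from collections import defaultdict


def local_group(word: str) -> str:
    """Assign a word type to a local group from the case of its letters."""
    initial, rest = word[0], word[1:]
    if initial.isupper():
        # From the excerpts: LG1 has no lower-case non-initial letters,
        # LG2 has at least one lower-case non-initial letter.
        return "LG2" if any(c.islower() for c in rest) else "LG1"
    # ASSUMPTION: LG3 = no upper-case non-initial letters, LG4 = the rest.
    return "LG4" if any(c.isupper() for c in rest) else "LG3"


# Which combination of local groups maps to which global group.
GLOBAL_GROUPS = {
    frozenset({"LG1", "LG2"}): "GG1",         # implied by UNESCO / Unesco
    frozenset({"LG1", "LG3"}): "GG2",         # ASSUMPTION, not in the excerpts
    frozenset({"LG2", "LG3"}): "GG3",         # implied by Car / car
    frozenset({"LG1", "LG2", "LG3"}): "GG4",  # stated in the excerpts
}


def normalize(types: dict) -> dict:
    """Map each word type to its normal form.

    `types` maps a word type to (frequency, positions), where positions
    is a subset of {"initial", "internal"} (hypothetical layout).
    """
    variants = defaultdict(list)
    for word in types:
        variants[word.lower()].append(word)  # case variants share a key

    result = {}
    for words in variants.values():
        by_group = {}
        # ASSUMPTION: the most frequent variant represents its local group.
        for w in sorted(words, key=lambda w: types[w][0]):
            by_group[local_group(w)] = w
        gg = GLOBAL_GROUPS.get(frozenset(by_group))

        target = None  # None means: case is significant, keep as-is
        if gg == "GG1":
            target = by_group["LG2"]  # default normal form for GG1
            if types[by_group["LG1"]][0] > types[target][0]:
                target = by_group["LG1"]  # frequency overrides the default
        elif gg == "GG2":
            target = by_group["LG3"]  # ASSUMPTION
        elif gg in ("GG3", "GG4"):
            # No normalization if the LG2 variant occurs sentence-internally.
            if "internal" not in types[by_group["LG2"]][1]:
                target = by_group["LG3"]  # GG4 default is an ASSUMPTION
        for w in words:
            result[w] = w if target is None else target
    return result


example = {
    "UNESCO": (40, {"internal"}), "Unesco": (5, {"internal"}),
    "Car": (12, {"initial"}), "car": (90, {"internal"}),
    "Bill": (30, {"initial", "internal"}), "bill": (25, {"internal"}),
}
print(normalize(example))
```

With these hypothetical counts, "Unesco" maps to "UNESCO" (frequency override), "Car" maps to "car" (it only occurs sentence-initially), and "Bill" and "bill" are both preserved because "Bill" also occurs sentence-internally. Word types without case variants fall through unchanged, matching the excerpt's statement that case is preserved for them.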

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)
  • Electrical Discharge Machining, Electrochemical Machining, And Combined Machining (AREA)
  • Separation Using Semi-Permeable Membranes (AREA)
  • Organic Low-Molecular-Weight Compounds And Preparation Thereof (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method for automatically distinguishing significant from insignificant distinctions of upper and lower case in a number of input word types by means of a computer. According to the method, an input word type is assigned to one of a number of disjoint local groups based on the case, and position, of the letters that make up the input word type. In addition, the input word type is assigned to one of a number of disjoint global groups based on the local groups to which the case variants of the input word type are assigned. Finally, the cases of the input word type are normalized in accordance with predetermined rules associated with the global group to which the input word type is assigned.
PCT/SE2001/002069 2000-09-26 2001-09-26 Method for normalizing case WO2002027539A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP01970463A EP1325429B1 (fr) 2000-09-26 2001-09-26 Method for normalizing case
AU2001290464A AU2001290464A1 (en) 2000-09-26 2001-09-26 Method for normalizing case
DE60136478T DE60136478D1 (de) 2000-09-26 2001-09-26 Chstaben.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SE0003433A SE524595C2 (sv) 2000-09-26 2000-09-26 Method and computer program for normalizing case
SE0003433-0 2000-09-26

Publications (1)

Publication Number Publication Date
WO2002027539A1 (fr) 2002-04-04

Family

ID=20281160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2001/002069 WO2002027539A1 (fr) 2000-09-26 2001-09-26 Method for normalizing case

Country Status (8)

Country Link
US (1) US6385630B1 (fr)
EP (1) EP1325429B1 (fr)
AT (1) ATE413651T1 (fr)
AU (1) AU2001290464A1 (fr)
DE (1) DE60136478D1 (fr)
ES (1) ES2316474T3 (fr)
SE (1) SE524595C2 (fr)
WO (1) WO2002027539A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6739719B2 (en) 2002-06-13 2004-05-25 Essilor International Compagnie Generale D'optique Lens blank convenient for masking unpleasant odor and/or delivering a pleasant odor upon edging and/or surfacing, and perfume delivering lens

Families Citing this family (45)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050108630A1 (en) * 2003-11-19 2005-05-19 Wasson Mark D. Extraction of facts from text
US20050216256A1 (en) * 2004-03-29 2005-09-29 Mitra Imaging Inc. Configurable formatting system and method
US8225231B2 (en) 2005-08-30 2012-07-17 Microsoft Corporation Aggregation of PC settings
US8521516B2 (en) * 2008-03-26 2013-08-27 Google Inc. Linguistic key normalization
US20100087169A1 (en) * 2008-10-02 2010-04-08 Microsoft Corporation Threading together messages with multiple common participants
US8385952B2 (en) 2008-10-23 2013-02-26 Microsoft Corporation Mobile communications device user interface
US20100107100A1 (en) 2008-10-23 2010-04-29 Schneekloth Jason S Mobile Device Style Abstraction
US8411046B2 (en) 2008-10-23 2013-04-02 Microsoft Corporation Column organization of content
JP5412096B2 (ja) * 2008-12-03 2014-02-12 Yamabiko Corporation Power unit structure of portable chainsaw
US8355698B2 (en) 2009-03-30 2013-01-15 Microsoft Corporation Unlock screen
US8175653B2 (en) 2009-03-30 2012-05-08 Microsoft Corporation Chromeless user interface
US8238876B2 (en) 2009-03-30 2012-08-07 Microsoft Corporation Notifications
US8836648B2 (en) 2009-05-27 2014-09-16 Microsoft Corporation Touch pull-in gesture
US20120159395A1 (en) 2010-12-20 2012-06-21 Microsoft Corporation Application-launching interface for multiple modes
US20120159383A1 (en) 2010-12-20 2012-06-21 Microsoft Corporation Customization of an immersive environment
US8689123B2 (en) 2010-12-23 2014-04-01 Microsoft Corporation Application reporting in an application-selectable user interface
US8612874B2 (en) 2010-12-23 2013-12-17 Microsoft Corporation Presenting an application change through a tile
US9423951B2 (en) 2010-12-31 2016-08-23 Microsoft Technology Licensing, Llc Content-based snap point
US9383917B2 (en) 2011-03-28 2016-07-05 Microsoft Technology Licensing, Llc Predictive tiling
US9104307B2 (en) 2011-05-27 2015-08-11 Microsoft Technology Licensing, Llc Multi-application environment
US9158445B2 (en) 2011-05-27 2015-10-13 Microsoft Technology Licensing, Llc Managing an immersive interface in a multi-application immersive environment
US8893033B2 (en) 2011-05-27 2014-11-18 Microsoft Corporation Application notifications
US20120304132A1 (en) 2011-05-27 2012-11-29 Chaitanya Dev Sareen Switching back to a previously-interacted-with application
US9104440B2 (en) 2011-05-27 2015-08-11 Microsoft Technology Licensing, Llc Multi-application environment
US9658766B2 (en) 2011-05-27 2017-05-23 Microsoft Technology Licensing, Llc Edge gesture
US8687023B2 (en) 2011-08-02 2014-04-01 Microsoft Corporation Cross-slide gesture to select and rearrange
US20130057587A1 (en) 2011-09-01 2013-03-07 Microsoft Corporation Arranging tiles
US8922575B2 (en) 2011-09-09 2014-12-30 Microsoft Corporation Tile cache
US10353566B2 (en) 2011-09-09 2019-07-16 Microsoft Technology Licensing, Llc Semantic zoom animations
US9557909B2 (en) 2011-09-09 2017-01-31 Microsoft Technology Licensing, Llc Semantic zoom linguistic helpers
US8933952B2 (en) 2011-09-10 2015-01-13 Microsoft Corporation Pre-rendering new content for an application-selectable user interface
US9146670B2 (en) 2011-09-10 2015-09-29 Microsoft Technology Licensing, Llc Progressively indicating new content in an application-selectable user interface
US9244802B2 (en) 2011-09-10 2016-01-26 Microsoft Technology Licensing, Llc Resource user interface
US9223472B2 (en) 2011-12-22 2015-12-29 Microsoft Technology Licensing, Llc Closing applications
US9128605B2 (en) 2012-02-16 2015-09-08 Microsoft Technology Licensing, Llc Thumbnail-image selection of applications
US20140129928A1 (en) * 2012-11-06 2014-05-08 Psyentific Mind Inc. Method and system for representing capitalization of letters while preserving their category similarity to lowercase letters
US9450952B2 (en) 2013-05-29 2016-09-20 Microsoft Technology Licensing, Llc Live tiles without application-code execution
EP3126969A4 (fr) 2014-04-04 2017-04-12 Microsoft Technology Licensing, LLC Expandable application representation
EP3129847A4 (fr) 2014-04-10 2017-04-19 Microsoft Technology Licensing, LLC Slider cover for computing device
WO2015154273A1 (fr) 2014-04-10 2015-10-15 Microsoft Technology Licensing, Llc Collapsible shell cover for computing device
US10592080B2 (en) 2014-07-31 2020-03-17 Microsoft Technology Licensing, Llc Assisted presentation of application windows
US10254942B2 (en) 2014-07-31 2019-04-09 Microsoft Technology Licensing, Llc Adaptive sizing and positioning of application windows
US10678412B2 (en) 2014-07-31 2020-06-09 Microsoft Technology Licensing, Llc Dynamic joint dividers for application windows
US10642365B2 (en) 2014-09-09 2020-05-05 Microsoft Technology Licensing, Llc Parametric inertia and APIs
WO2016065568A1 (fr) 2014-10-30 2016-05-06 Microsoft Technology Licensing, Llc Multi-configuration input device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4830521A (en) * 1986-11-10 1989-05-16 Brother Kogyo Kabushiki Kaisha Electronic typewriter with a spelling check function and proper noun recognition
US4864501A (en) * 1987-10-07 1989-09-05 Houghton Mifflin Company Word annotation system
US5485372A (en) * 1994-06-01 1996-01-16 Mitsubishi Electric Research Laboratories, Inc. System for underlying spelling recovery
US5819265A (en) * 1996-07-12 1998-10-06 International Business Machines Corporation Processing names in a text

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5008818A (en) * 1989-04-24 1991-04-16 Alexander K. Bocast Method and apparatus for reconstructing a token from a token fragment
US5404514A (en) * 1989-12-26 1995-04-04 Kageneck; Karl-Erbo G. Method of indexing and retrieval of electronically-stored documents
US5995922A (en) * 1996-05-02 1999-11-30 Microsoft Corporation Identifying information related to an input word in an electronic dictionary

Also Published As

Publication number Publication date
DE60136478D1 (de) 2008-12-18
ES2316474T3 (es) 2009-04-16
SE524595C2 (sv) 2004-08-31
SE0003433L (sv) 2002-03-27
SE0003433D0 (sv) 2000-09-26
ATE413651T1 (de) 2008-11-15
AU2001290464A1 (en) 2002-04-08
US6385630B1 (en) 2002-05-07
EP1325429A1 (fr) 2003-07-09
EP1325429B1 (fr) 2008-11-05

Similar Documents

Publication Publication Date Title
EP1325429B1 (fr) Method for normalizing case
US8621344B1 (en) Method of spell-checking search queries
US9552349B2 (en) Methods and apparatus for performing spelling corrections using one or more variant hash tables
US9886478B2 (en) Aviation field service report natural language processing
US6173252B1 (en) Apparatus and methods for Chinese error check by means of dynamic programming and weighted classes
US20060224379A1 (en) Method of finding answers to questions
Mikheev A knowledge-free method for capitalized word disambiguation
US20070136243A1 (en) System and method for data indexing and retrieval
JPS62229368A (ja) Document processing apparatus
FI120755B (fi) Processing of records to find matching pairs in a reference data set
CN114153962A (zh) Data matching method and apparatus, and electronic device
CN109830272B (zh) Data standardization method and apparatus, computer device, and storage medium
KR100788440B1 (ko) Copy detection system based on plagiarism patterns
EP3425549B1 (fr) System and method for determining text containing confidential data
Bokinsky et al. Application of natural language processing techniques to marine V-22 maintenance data for populating a CBM-oriented database
EP1575172A2 (fr) Compression of linguistic protocols
US8296651B2 (en) Selecting terms for a glossary in a document processing system
US7343280B2 (en) Processing noisy data and determining word similarity
US20020129066A1 (en) Computer implemented method for reformatting logically complex clauses in an electronic text-based document
US20160078072A1 (en) Term variant discernment system and method therefor
Quasthoff Tools for automatic lexicon maintenance: acquisition, error correction, and the generation of missing values.
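CN115906817A (zh) Keyword matching method and apparatus for cross-language environments, and electronic device
CN115617564A (zh) Processing method and apparatus for kernel exceptions, electronic device, and storage medium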
Comeau et al. Non‐word identification or spell checking without a dictionary
US9208145B2 (en) Computer-implemented systems and methods for non-monotonic recognition of phrasal terms

Legal Events

Date Code Title Description
AK Designated states. Kind code of ref document: A1. Designated state(s): AE AG AL AM AT AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ CZ DE DE DK DK DM DZ EC EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW
AL Designated countries for regional patents. Kind code of ref document: A1. Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
WWE WIPO information: entry into national phase. Ref document number: 2001970463. Country of ref document: EP
WWP WIPO information: published in national office. Ref document number: 2001970463. Country of ref document: EP
REG Reference to national code. Ref country code: DE. Ref legal event code: 8642
NENP Non-entry into the national phase. Ref country code: JP