WO2004006124A3 - Text-representation, text-matching and text-classification code, system and method - Google Patents

Info

Publication number
WO2004006124A3
Authority
WO
WIPO (PCT)
Prior art keywords
text
texts
term
determined
target document
Prior art date
Application number
PCT/US2003/021243
Other languages
English (en)
Other versions
WO2004006124A2 (fr)
Inventor
Peter J Dehlinger
Shao Chin
Original Assignee
Word Data Corp
Priority date
Filing date
Publication date
Priority claimed from PCT/US2002/021200 (WO2004006134A1)
Priority claimed from US10/261,970 (US20040006459A1)
Priority claimed from US10/261,971 (US7181451B2)
Priority claimed from US10/262,192 (US20040006547A1)
Priority claimed from US10/374,877 (US7016895B2)
Priority claimed from US10/438,486 (US7003516B2)
Application filed by Word Data Corp
Priority to AU2003256456A (AU2003256456A1)
Publication of WO2004006124A2
Publication of WO2004006124A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri

Abstract

Disclosed are computer-readable code, a system and a method for representing, retrieving, and/or classifying a target document in the form of a digitally encoded natural-language text. For each of a plurality of terms (non-generic words and/or word groups) characterizing the target document, a selectivity value is determined, calculated as the frequency of occurrence of that term in a library of texts in one field relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively. The document is represented as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for that term. For each of a plurality of sample texts having associated classification identifiers, a match score is then determined, related to the number of descriptive terms present in or derived from that text that match those in the target text. From the selected matched texts and their associated classification identifiers, a classification of the target document may be determined.
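The steps named in the abstract (selectivity value, term vector, match score, classification) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of the highest rate among the other libraries as the denominator, the `floor` guard against division by zero, the score-weighted vote over the top matches, the toy libraries and the class labels are all assumptions made for illustration.

```python
from collections import Counter

def term_rate(term, library):
    """Fraction of texts in a library (a list of term sets) containing the term."""
    return sum(term in text for text in library) / len(library)

def selectivity(term, field_library, other_libraries, floor=1e-3):
    """Occurrence rate in the field library relative to the highest rate
    in any other library. `floor` is an assumed guard against a zero rate."""
    own = term_rate(term, field_library)
    other = max(term_rate(term, lib) for lib in other_libraries)
    return own / max(other, floor)

def represent(terms, field_library, other_libraries):
    """Vector of terms: term -> coefficient. The coefficient here is the
    selectivity value itself (one possible function of it)."""
    return {t: selectivity(t, field_library, other_libraries) for t in terms}

def match_score(target_vector, sample_terms):
    """Sum of the coefficients of target terms that also occur in the sample."""
    return sum(c for t, c in target_vector.items() if t in sample_terms)

def classify(target_vector, samples, top_n=2):
    """Score-weighted vote over the classification identifiers of the
    top_n best-matching samples; samples are (term_set, class_id) pairs."""
    ranked = sorted(samples, key=lambda s: match_score(target_vector, s[0]),
                    reverse=True)[:top_n]
    votes = Counter()
    for terms, class_id in ranked:
        votes[class_id] += match_score(target_vector, terms)
    return votes.most_common(1)[0][0]

# Hypothetical toy libraries: each text is reduced to a set of terms.
surgery = [{"catheter", "balloon", "device"}, {"catheter", "stent"},
           {"device", "stent"}, {"balloon", "catheter"}]
computing = [{"device", "memory"}, {"processor", "memory"},
             {"device", "network"}, {"cache", "processor"}]

# Target document characterized by two terms.
vector = represent({"catheter", "device"}, surgery, [computing])

# Hypothetical sample texts with illustrative classification identifiers.
samples = [({"catheter", "balloon"}, "A61M"),
           ({"device", "memory"}, "G06F"),
           ({"catheter", "stent", "device"}, "A61M")]
predicted = classify(vector, samples)
```

In this toy data, "device" is equally common in both libraries and so receives a coefficient of 1.0, while "catheter" appears only in the surgery library and dominates the match scores, which is the effect the selectivity weighting is meant to produce.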
PCT/US2003/021243 2002-07-03 2003-07-02 Text-representation, text-matching and text-classification code, system and method WO2004006124A2 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003256456A AU2003256456A1 (en) 2002-07-03 2003-07-02 Text-representation, text-matching and text-classification code, system and method

Applications Claiming Priority (14)

Application Number Priority Date Filing Date Title
USPCT/US02/21200 2002-07-03
PCT/US2002/021200 WO2004006134A1 (fr) 2002-07-03 2002-07-03 Text-processing code, system and method
US39420402P 2002-07-05 2002-07-05
US60/394,204 2002-07-05
US10/261,971 2002-09-30
US10/262,192 2002-09-30
US10/261,970 US20040006459A1 (en) 2002-07-05 2002-09-30 Text-searching system and method
US10/261,970 2002-09-30
US10/261,971 US7181451B2 (en) 2002-07-03 2002-09-30 Processing input text to generate the selectivity value of a word or word group in a library of texts in a field is related to the frequency of occurrence of that word or word group in library
US10/262,192 US20040006547A1 (en) 2002-07-03 2002-09-30 Text-processing database
US10/374,877 US7016895B2 (en) 2002-07-05 2003-02-25 Text-classification system and method
US10/374,877 2003-02-25
US10/438,486 2003-05-15
US10/438,486 US7003516B2 (en) 2002-07-03 2003-05-15 Text representation and method

Publications (2)

Publication Number Publication Date
WO2004006124A2 WO2004006124A2 (fr) 2004-01-15
WO2004006124A3 WO2004006124A3 (fr) 2004-05-06

Family

ID=30119518

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/021243 WO2004006124A2 (fr) 2002-07-03 2003-07-02 Text-representation, text-matching and text-classification code, system and method

Country Status (2)

Country Link
AU (1) AU2003256456A1 (fr)
WO (1) WO2004006124A2 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
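CN100367216C (zh) * 2005-06-20 2008-02-06 南京大学 Fast event matching method in an event model
CN105701120B (zh) 2014-11-28 2019-05-03 华为技术有限公司 Method and apparatus for determining semantic matching degree
CN110532551A (zh) * 2019-08-15 2019-12-03 苏州朗动网络科技有限公司 Method, device and storage medium for automatic extraction of text keywords
CN111695353B (zh) * 2020-06-12 2023-07-04 百度在线网络技术(北京)有限公司 Method, apparatus, device and storage medium for identifying time-sensitive text
CN111881257B (zh) * 2020-07-24 2022-06-03 广州大学 Automatic matching method, system and storage medium based on topic words and sentence gist
CN116383390B (zh) * 2023-06-05 2023-08-08 南京数策信息科技有限公司 Unstructured data storage method and cloud platform for business management information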

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999010819A1 (fr) * 1997-08-26 1999-03-04 Siemens Aktiengesellschaft Method and system for computer-assisted determination of the relevance of an electronic document for a predeterminable search profile
EP1049030A1 (fr) * 1999-04-28 2000-11-02 SER Systeme AG Produkte und Anwendungen der Datenverarbeitung Classification method and apparatus
EP1168202A2 (fr) * 2000-06-28 2002-01-02 Matsushita Electric Industrial Co., Ltd. Apparatus for retrieving similar documents and apparatus for extracting relevant keywords

Also Published As

Publication number Publication date
AU2003256456A1 (en) 2004-01-23
AU2003256456A8 (en) 2004-01-23
WO2004006124A2 (fr) 2004-01-15

Similar Documents

Publication Publication Date Title
CA2410881A1 (fr) Computer-based system and method for finding rules of law in text
Wu et al. Domain-specific keyphrase extraction
JP4861078B2 (ja) Index creation program, index creation apparatus and index creation method
WO2004086192A3 (fr) Systems and methods for interactive search query refinement
WO2003040875A3 (fr) Systems, methods and software for classifying documents
CA2373568A1 (fr) Method of searching for similar documents, system for performing the search, and program for processing the search
WO2000067150A3 (fr) Classification method and apparatus
WO2001082114A3 (fr) System responding to an information need
AU2003245016A1 (en) System and method for automated mapping of keywords and key phrases to documents
WO2004013772A3 (fr) System and method for indexing non-textual data
WO2005010691A3 (fr) Disambiguation of search phrases using interpretation clusters
EP1154358A3 (fr) Automatic text classification system
WO2006049996A3 (fr) Link-based spam detection
WO2001067388A3 (fr) Management of properties for hyperlinked video
WO2004084099A3 (fr) Corpus clustering, confidence refinement, and ranking for geographic text search and information extraction
WO2004057497A3 (fr) Reordered search of media fingerprints
EP1072982A3 (fr) Method and system for similar word extraction and document retrieval
WO2002050662A3 (fr) Apparatus and method of program classification based on syntax of transcript information
CN102915299A (zh) Word segmentation method and apparatus
EP1221693A3 (fr) Prosody reference matching for text-to-speech systems
Nomoto NEAL: A neurally enhanced approach to linking citation and reference
CA2348420A1 (fr) Data display method and apparatus for text analysis
WO2004006124A3 (fr) Text-representation, text-matching and text-classification code, system and method
CN106570162A (zh) Artificial intelligence-based rumor recognition method and apparatus
WO2003027895A3 (fr) Japanese virtual dictionary

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase

Ref country code: JP

WWW WIPO information: withdrawn in national office

Country of ref document: JP

121 EP: the EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry in European phase