GB2338807A - Extraction server for unstructured documents - Google Patents

Extraction server for unstructured documents

Info

Publication number
GB2338807A
Authority
GB
United Kingdom
Prior art keywords
electronic document
engine
semantic network
word groups
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB9923074A
Other versions
GB9923074D0 (en)
Inventor
Prabhat K Andleigh
Nagaraju Pappu
Vasudeva V Kalidindi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Infodream Corp
Original Assignee
Infodream Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infodream Corp
Publication of GB9923074D0
Publication of GB2338807A
Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A system for analyzing and extracting words and word groups from an electronic document (104) and for storing the extracted words and word groups into predefined fields or tables in a target database (110) comprises a content analysis and semantic network engine (216) for analyzing and extracting words and word groups from the electronic document and a heuristics engine (212) coupled to the content analysis and semantic network engine (216), for applying a set of heuristics to the words and word groups in the electronic document. The content analysis and semantic network engine (216) further comprises a thesaurus (400) for linking together terms (402) and concepts (404) and for defining relationships between and among the terms (402) and concepts (404), a semantic network (220) coupled to the thesaurus (400), for organizing the terms (402) and concepts (404) in the thesaurus (400), meta-concepts (502), and categories (504) in a hierarchical structure, and section processors (218) for analyzing a section in the electronic document (104) and applying a set of heuristics to each section in the electronic document (104). The system further comprises a document pre-processor (210) for performing an initial analysis on the electronic document (104), a morphological analysis engine (214) coupled to the heuristics engine (212) for performing a morphological analysis and tagging of words and word groups in the electronic document (104), and a database interface (222) for providing an interface between the content analysis and semantic network engine (216) and the target database (110).
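
The data flow described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the thesaurus entries, the concept-to-field mapping, the section headings, the single heuristic, and every function name below are hypothetical, and a plain dictionary stands in for the target database (110). It only shows the order of the stages: pre-process the document into sections, tokenize the words, look terms up in a thesaurus, resolve concepts to categories, and store matches in predefined fields.

```python
import re

# Hypothetical thesaurus: links surface terms to concepts.
THESAURUS = {
    "c++": "programming_language",
    "java": "programming_language",
    "oracle": "database_product",
    "stanford": "university",
}

# Hypothetical semantic network, flattened to one level:
# each concept belongs to a category that names a target field.
SEMANTIC_NETWORK = {
    "programming_language": "skills",
    "database_product": "skills",
    "university": "education",
}

# Assumed section headings recognized by the pre-processor.
SECTION_HEADINGS = {"skills", "education"}


def preprocess(text):
    """Pre-processor stand-in: split raw text into (heading, body) sections."""
    sections, heading, body = [], "preamble", []
    for line in text.splitlines():
        stripped = line.strip()
        key = stripped.rstrip(":").lower()
        if key in SECTION_HEADINGS:
            sections.append((heading, " ".join(body)))
            heading, body = key, []
        elif stripped:
            body.append(stripped)
    sections.append((heading, " ".join(body)))
    return [(h, b) for h, b in sections if b]


def tokenize(body):
    """Very rough stand-in for morphological analysis: lowercase tokens."""
    return re.findall(r"[a-z][a-z0-9+#]*", body.lower())


def process_section(heading, body):
    """Section processor stand-in: thesaurus lookup plus one heuristic."""
    rows = []
    for token in tokenize(body):
        concept = THESAURUS.get(token)
        if concept is None:
            continue
        target = SEMANTIC_NETWORK[concept]
        # Assumed heuristic: keep a term only when it appears in the
        # section whose heading matches its target field.
        if target == heading:
            rows.append((target, token, concept))
    return rows


def extract(text):
    """Run all stages and return extracted terms grouped by target field."""
    table = {}
    for heading, body in preprocess(text):
        for target, token, concept in process_section(heading, body):
            table.setdefault(target, []).append((token, concept))
    return table


if __name__ == "__main__":
    sample = "Jane Doe\nSkills:\nC++, Java and Oracle\nEducation:\nStanford, studied Java compilers\n"
    print(extract(sample))
```

In this sketch the word "Java" is stored under the skills field when it appears in the skills section but is ignored in the education section, which is the kind of section-specific decision the abstract assigns to the section processors and the heuristics engine.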
GB9923074A 1997-12-29 1998-12-28 Extraction server for unstructured documents Withdrawn GB2338807A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US6892097P 1997-12-29 1997-12-29
PCT/US1998/027664 WO1999034307A1 (en) 1997-12-29 1998-12-28 Extraction server for unstructured documents

Publications (2)

Publication Number Publication Date
GB9923074D0 GB9923074D0 (en) 1999-12-01
GB2338807A GB2338807A (en) 1999-12-29

Family

ID=22085559

Family Applications (1)

Application Number Title Priority Date Filing Date
GB9923074A Withdrawn GB2338807A (en) 1997-12-29 1998-12-28 Extraction server for unstructured documents

Country Status (3)

Country Link
AU (1) AU1948299A (en)
GB (1) GB2338807A (en)
WO (1) WO1999034307A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2428114A (en) * 2005-07-08 2007-01-17 William Alan Hollingsworth Data Format Conversion System

Families Citing this family (20)

Publication number Priority date Publication date Assignee Title
WO1998055937A1 (en) * 1997-06-04 1998-12-10 Sharp Gary L Database structure and management
JP2000113064A (en) * 1998-10-09 2000-04-21 Fuji Xerox Co Ltd Optimum acting person selection support system
NL1015151C2 (en) * 2000-05-10 2001-12-10 Collexis B V Device and method for cataloging textual information.
AU2002368316A1 (en) 2002-10-24 2004-06-07 Agency For Science, Technology And Research Method and system for discovering knowledge from text documents
US7039625B2 (en) * 2002-11-22 2006-05-02 International Business Machines Corporation International information search and delivery system providing search results personalized to a particular natural language
US6961733B2 (en) * 2003-03-10 2005-11-01 Unisys Corporation System and method for storing and accessing data in an interlocking trees datastore
CA2518797A1 (en) * 2003-03-10 2004-09-23 Unisys Corporation System and method for storing and accessing data in an interlocking trees datastore
US7580947B2 (en) * 2003-03-27 2009-08-25 Hewlett-Packard Development Company, L.P. Data representation for improved link analysis
US7593909B2 (en) 2003-03-27 2009-09-22 Hewlett-Packard Development Company, L.P. Knowledge representation using reflective links for link analysis applications
US7716241B1 (en) 2004-10-27 2010-05-11 Unisys Corporation Storing the repository origin of data inputs within a knowledge store
US7908240B1 (en) 2004-10-28 2011-03-15 Unisys Corporation Facilitated use of column and field data for field record universe in a knowledge store
US7676477B1 (en) 2005-10-24 2010-03-09 Unisys Corporation Utilities for deriving values and information from within an interlocking trees data store
US7734571B2 (en) 2006-03-20 2010-06-08 Unisys Corporation Method for processing sensor data within a particle stream by a KStore
US7689571B1 (en) 2006-03-24 2010-03-30 Unisys Corporation Optimizing the size of an interlocking tree datastore structure for KStore
CN103207872A (en) * 2012-01-17 2013-07-17 深圳市快播科技有限公司 Real-time indexing method and server
US20150169676A1 (en) * 2013-12-18 2015-06-18 International Business Machines Corporation Generating a Table of Contents for Unformatted Text
WO2017017678A1 (en) * 2015-07-27 2017-02-02 Opisoft Care Ltd. System and method for phrase search within document section
CN107844497A (en) * 2016-09-20 2018-03-27 天脉聚源(北京)科技有限公司 A kind of method and system of database retrieval
US11783127B2 (en) 2019-08-07 2023-10-10 Zinatt Technologies, Inc. Data entry feature for information tracking system
US11829701B1 (en) * 2022-06-30 2023-11-28 Accenture Global Solutions Limited Heuristics-based processing of electronic document contents

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
JP2943447B2 (en) * 1991-01-30 1999-08-30 三菱電機株式会社 Text information extraction device, text similarity matching device, text search system, text information extraction method, text similarity matching method, and question analysis device

Non-Patent Citations (3)

Title
HAMMER J et al: "Extracting semistructured information from the Web", Proc. of the workshop on management of semi-structured data, Tucson, AZ, USA, 16.05.97, pp. 18-25, XP002099172, 1997, Murray Hill, NJ, USA, AT & T Labs-Research, USA *

Cited By (2)

Publication number Priority date Publication date Assignee Title
GB2428114A (en) * 2005-07-08 2007-01-17 William Alan Hollingsworth Data Format Conversion System
US10528806B2 (en) 2005-07-08 2020-01-07 Cynsight, Llc Data format conversion

Also Published As

Publication number Publication date
WO1999034307A1 (en) 1999-07-08
GB9923074D0 (en) 1999-12-01
AU1948299A (en) 1999-07-19

Similar Documents

Publication Publication Date Title
GB2338807A (en) Extraction server for unstructured documents
Grefenstette et al. What is a word, what is a sentence?: problems of Tokenisation
Habert et al. Towards tokenization evaluation.
US20010014852A1 (en) Document semantic analysis/selection with knowledge creativity capability
WO1997008604A3 (en) Multilingual document retrieval system and method using semantic vector matching
WO1998037478A3 (en) Group action processing between users
Thompson et al. Name searching and information retrieval
Borin et al. Naming the past: Named entity and animacy recognition in 19th century Swedish literature
EP1331574A1 (en) Named entity interface for multiple client application programs
Hazman et al. Ontology learning from domain specific web documents
Llidó et al. Extracting temporal references to assign document event-time periods
McDonald Neuroanatomical labeling with biocytin: a review
WO2000026839A8 (en) Advanced model for automatic extraction of skill and knowledge information from an electronic document
Chowdhury Template mining for information extraction from digital documents
Lawson et al. Automatic extraction of citations from the text of English-language patents-an example of template mining
Krulwich et al. ContactFinder: Extracting indications of expertise and answering questions with referrals
Ciravegna et al. Flexible text classification for financial applications: the FACILE system
Vokhmintsev et al. The knowledge on the basis of fact analysis in business intelligence
Nguyen et al. Vn-kim ie: Automatic extraction of vietnamese named-entities on the web
Rosner et al. Maltilex: A computational lexicon for maltese
Sardinha et al. Corpus linguistics
Bruder et al. GETESS: Constructing a linguistic search index for an Internet search engine
Levow Corpus-based techniques for word sense disambiguation
Meng et al. Word segmentation based on database semantics in NChiql
Crowder et al. Using Statistical Properties of Text to Create Metadata.

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)