AU2004235636A1 - A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources - Google Patents

A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources

Info

Publication number
AU2004235636A1
AU2004235636A1
Authority
AU
Australia
Prior art keywords
entity
span
document
entities
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2004235636A
Inventor
Jonathan Baxter
Kristie Seymore
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Panscient Inc
Original Assignee
Panscient Pty Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Panscient Pty Ltd filed Critical Panscient Pty Ltd
Priority to AU2004235636A priority Critical patent/AU2004235636A1/en
Priority to EP05111255A priority patent/EP1669896A3/en
Priority to US11/291,740 priority patent/US20060123000A1/en
Publication of AU2004235636A1 publication Critical patent/AU2004235636A1/en
Assigned to PANSCIENT INC reassignment PANSCIENT INC Request for Assignment Assignors: PANSCIENT PTY LTD
Abandoned legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

03/12/2004 15:56 MADDERNS 0262832734 Regulation 3.2

AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION FOR A STANDARD PATENT
ORIGINAL

Name of Applicant: Panscient Pty Ltd
Actual Inventors: Jonathan Baxter, Kristie Seymore
Address for Service: MADDERNS, 1st Floor, 64 Hindmarsh Square, Adelaide, South Australia, Australia
Invention title: A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources

The following statement is a full description of this invention, including the best method of performing it known to us.
COMS ID No: SBMI-01024869 Received by IP Australia: Time (H:m) 17:47 Date 2004-12-03

A Machine Learning System for Extracting Structured Records from Web Pages and Other Text Sources

Field of the Invention

The present invention relates to a machine learning system for extracting structured records from documents in a corpus. In one particular form the present invention relates to a system for extracting structured records from a web site.
Background of the Invention

As the web continues to expand at an exponential rate, the primary mechanism for finding web pages of interest is through the use of search engines such as Google. Search engines of this type use sophisticated ranking technology to determine lists of web pages that attempt to match a given query. However, there are many queries that are not usefully answered by just a list of web pages. For example a query such as "Give me all the online biographies of IT managers in Adelaide", or "Give me all the open Sydney-based sales positions listed on corporate websites", or even alternatively "What are the obituaries posted on newspaper sites in the last week for people with surname Baxter" all relate to further structured information that may be found in a number of web pages from the same or different sites.
Accordingly, to answer such a query a search engine must extract more than just the words in a web page; it must also extract higher-level semantic information such as people names, jobtitles and locations from a given web page and then further process this higher-level information into structured records. These records would then be queried as if one were simply querying a database, with the results being returned as lists of structured records rather than web pages.
There have been a number of attempts to provide this type of searching functionality.
However, existing systems for extracting structured records from unstructured sources all suffer from the problem that they are painstakingly hand-tuned to their specific search domain. Thus in the example queries outlined above, which relate to different domains or areas of interest such as employment, corporate information or even obituaries, the extraction systems must be customised according to the expected query.

Clearly, this has a number of disadvantages as extraction systems of this type must each be developed and tuned separately depending on the expected query type. Where a query may relate to a number of different search domains or areas of interest the performance of existing extraction systems will be severely reduced.
It is an object of the present invention to provide a method that is capable of extracting a structured record from a document relevant to a given query type that is substantially independent of the domain of interest of that query.
It is a further object of the present invention to provide a method that is capable of extracting a structured record from a document that employs machine learning methods.
Summary of the Invention

In a first aspect, the present invention accordingly provides a method for extracting a structured record from a document, said structured record associated with a query of said document, said method comprising the steps of:

identifying and extracting a span of text in said document, said span identified according to a criteria associated with said query; and

forming a structured record according to said extracted span.
The top down approach employed by the present invention addresses a number of disadvantages of the prior art in that information obtained from a higher level of extraction may be employed in refining the extraction at lower levels, such as identifying a relevant span of text and then forming a structured record from this span. Many prior art approaches attempt to use natural language processing (NLP) techniques which, in direct contrast to the present invention, identify words and entities within a document and then try to associate these words and entities with each other to form structured information. The top down approach of the present invention also makes it directly applicable to a machine learning approach which automates the extraction process.
Preferably, said method further comprises the steps of: identifying at least one entity in said span, said entity identified according to said criteria; and modifying said structured record to include information related to said at least one entity.
Preferably, said method further comprises the steps of: identifying at least one sub-entity in at least one entity, said at least one entity identified according to said criteria; and modifying said structured record to include information related to said at least one sub-entity.
Often for a given extraction problem it is appropriate to further classify and extract information from a span into further entities or sub-entities and then include this information in the structured record being formed.

A method for forming a structured record from a document, said structured record associated with a query of said document, said method comprising the steps of: identifying and extracting a span of text in said document, said span identified according to a criteria associated with said query; and forming said structured record according to said extracted span.
In a second aspect, the present invention accordingly provides a method for classifying text based elements according to a characteristic, said method comprising the steps of: identifying said text based elements in a training document; forming a feature vector corresponding to each text based element; forming a sequence of said feature vectors corresponding to said text based elements located in said training document; labeling each text based element according to said characteristic thereby forming a sequence of labels corresponding to said sequence of feature vectors; and training a predictive algorithm based on said sequence of labels and said sequence of said feature vectors, said algorithm trained to generate new label sequences from a new sequence of feature vectors thereby classifying text based elements corresponding to said new sequence of feature vectors.
Brief Description of the Figures

A preferred embodiment of the present invention will be discussed with reference to the accompanying drawings wherein:

FIGURE 1 is a screenshot of an obituary web page;
FIGURE 2 is a screenshot of an executive biography web page;
FIGURE 3 is a screenshot of a job openings web page;
FIGURE 4 is a screenshot of a single obituary web page;
FIGURE 5 is a flowchart of a method for extracting records from a document according to a preferred embodiment of the present invention;
FIGURE 6 is a screenshot of a span labeling tool as employed in a preferred embodiment of the present invention;
FIGURE 7 is a screenshot of an entity labeling tool as employed in a preferred embodiment of the present invention;
FIGURE 8 is a flowchart of the document labeling method according to a preferred embodiment of the present invention;
FIGURE 9 is a flowchart of the span labeling method according to a preferred embodiment of the present invention;
FIGURE 10 is a flowchart of the entity labeling method according to a preferred embodiment of the present invention;
FIGURE 11 is a flowchart of the sub-entity labeling process according to a preferred embodiment of the present invention;
FIGURE 12 is a flowchart of the association labeling method according to a preferred embodiment of the present invention;
FIGURE 13 is a flowchart of the normalization labeling method according to a preferred embodiment of the present invention;
FIGURE 14 is a flowchart of the entity/association/normalization classification labeling method according to a preferred embodiment of the present invention;
FIGURE 15 is a flowchart illustrating the steps involved in training a span extractor to extract spans from labeled documents according to a preferred embodiment of the present invention;
FIGURE 16 is a flowchart illustrating the steps involved in running a trained span extractor according to a preferred embodiment of the present invention;
FIGURE 17 is a flowchart illustrating the steps involved in training an entity extractor to extract entities from labeled documents according to a preferred embodiment of the present invention;
FIGURE 18 is a flowchart illustrating the steps involved in running a trained entity extractor according to a preferred embodiment of the present invention;
FIGURE 19 is a flowchart illustrating the steps involved in training a sub-entity extractor to extract sub-entities from labeled documents according to a preferred embodiment of the present invention;
FIGURE 20 is a flowchart illustrating the steps involved in running a trained sub-entity extractor according to a preferred embodiment of the present invention;
FIGURE 21 is a flowchart illustrating the steps involved in training an associator to associate entities from labeled documents according to a preferred embodiment of the present invention;
FIGURE 22 is a flowchart illustrating the steps involved in running a trained associator according to a preferred embodiment of the present invention;
FIGURE 23 is a flowchart illustrating the steps involved in training an associator from labeled documents according to a preferred embodiment of the present invention;
FIGURE 24 is an example search application according to a preferred embodiment of the present invention over corporate biographical data extracted from the Australian web. Summary hits from a query on "patent attorney" are shown;
FIGURE 25 is the full extracted record from the first hit in Figure 24; and
FIGURE 26 depicts the cached page from which the record in Figure 25 was extracted.
In the following description, like reference characters designate like or corresponding parts or steps throughout the several views of the drawings.

Detailed Description of the Invention

The present invention is concerned with the extraction of structured records from documents in a corpus. Each one of these documents may include one or more "spans" of interest. Referring to Figure 1, there is shown a web page from an online newspaper that contains several obituaries (the first is highlighted). In this case the corpus is the collection of all web pages on the newspaper site; the documents of interest are the obituary pages, and each obituary represents a distinct "span" that is to be extracted into its own structured record. In this case the structured record might include the full obituary text, deceased name, age at death, date of birth and other fields such as next-of-kin.

Referring now to Figure 2, there is shown a web page in which the spans of interest are executive biographies. The corpus in this case is the collection of all web pages on the company's website; the documents of interest are the executive biography pages, and the biographical records might include person name, current job title, former job titles, education history, etc.
Referring to Figure 3, there is shown a web page in which the spans of interest are open job positions. As for biographies, the corpus is the collection of all web pages on the company's website; the documents of interest are the job pages, and the job records might include title, full or part-time, location, contact information, description, etc.
These examples all show multiple spans in each document, but there may also be only one span of interest on a given web page, such as shown in Figure 4.
Clearly, as would be apparent to those skilled in the art, the corpus of documents could be further generalised to include all web pages located on servers originating from a given country domain name or alternatively all web pages that have been updated in the last year.
In this preferred embodiment the application of the present invention is directed to the extraction of structured executive biographical records from corporate web sites. However, as would also be apparent to those skilled in the art, the method of extracting structured records according to the present invention is equally applicable to generating structured records from any text based source.
Accordingly, the goal of the extraction process is to process the web pages in a corporate web site, locate the biographical pages such as the one shown in Figure 2 and to then generate structured records containing the biographical information of each executive. As an illustrative example the structured record could be generated in XML format as follows:

<bio>
  <person>
    <fullname>Mr Roger Campbell Corbett</fullname>
    <title>Mr</title>
    <first_name>Roger</first_name>
    <middle_name>Campbell</middle_name>
    <last_name>Corbett</last_name>
  </person>
  <work_history>
    <jobtitle>Chief Executive Officer</jobtitle>
    <current>true</current>
  </work_history>
  <work_history>
    <jobtitle>Group Managing Director</jobtitle>
    <current>true</current>
  </work_history>
  <work_history>
    <jobtitle>Chief Operating Officer</jobtitle>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Managing Director Retail</jobtitle>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Managing Director</jobtitle>
    <organization>BIG W</organization>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Director of Operations</jobtitle>
    <organization>David Jones (Australia) Pty Ltd</organization>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Director</jobtitle>
    <organization>David Jones (Australia) Pty Ltd</organization>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Merchandising and Stores Director</jobtitle>
    <organization>Grace Bros</organization>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Director</jobtitle>
    <organization>Grace Bros</organization>
    <current>false</current>
  </work_history>
  <work_history>
    <jobtitle>Executive Director</jobtitle>
    <current>true</current>
  </work_history>
  <work_history>
    <jobtitle>Chairman</jobtitle>
    <group>Strategy Committee</group>
    <current>true</current>
  </work_history>
  <bio_text>
    CEO and Group Managing Director
    Mr Corbett was appointed Chief Executive Officer and Group Managing
    Director in January 1999, having been Chief Operating Officer since
    July 1998, Managing Director Retail since July 1997 and Managing
    Director BIG W since May 1990.
    He has had more than 40 years experience in retail and was previously
    Director of Operations and a Director of David Jones (Australia) Pty Ltd
    as well as Merchandising and Stores Director and a Director of Grace Bros.
    He was appointed an Executive Director in 1990.
    He is Chairman of the Strategy Committee.
    Age
  </bio_text>
</bio>

The structured records may then be stored in a database and indexed for search.
Referring now to Figure 5, there is shown a flowchart of the method for extracting a structured record from a document according to the present invention. This process is summarized as follows:

1. Candidate pages are generated by a directed crawl from the home page or collection of pages from the corporate web site;

2. Each candidate page is classified 110 according to whether it is a page of interest or not;

3. Pages that are positively classified 120 are processed 130 to identify the spans (contiguous biographies) of interest;

4. Spans are further processed 150 to identify entities of interest, such as people and organization names, jobtitles, degrees;

5. Extracted entities may be further processed 165 to identify sub-entities, for example people names broken down into title, first, middle, last, suffix;

6. Extracted entities may be further associated 170 into related groups, for example jobtitles associated with the correct organization;

7. Extracted entities may also be normalized 175, for example multiple variants of the same person name may be combined together;

8. Extracted entities, normalized entities, and associated groups of entities may be further classified 180: for example jobtitle/organization pairs categorized into current or former;

9. All the extracted information is formed into a structured record 190;

10. The structured record is stored in a database 210 and indexed for searching 200.
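By way of illustration only, the staged classify-then-extract flow of the steps above can be sketched as follows. The function names and the toy classifier/extractor stand-ins are hypothetical and are not part of the specification; in the preferred embodiment each stage would be a trained classifier or extractor.

```python
# Illustrative sketch of the classify -> span -> entity stages of the flowchart.
# All functions are hypothetical stand-ins for the trained components.

def extract_records(pages, classify_page, extract_spans, extract_entities):
    """Run steps 2-4 of the process over candidate pages (step 1)."""
    records = []
    for page in pages:
        if not classify_page(page):            # step 2: page of interest?
            continue
        for span in extract_spans(page):       # step 3: spans of interest
            entities = extract_entities(span)  # step 4: names, jobtitles, ...
            records.append({"span": span, "entities": entities})
    return records                             # toward step 9: structured records

# Toy stand-ins for the trained classifier and extractors:
pages = ["Jonathan Baxter. CEO.", "Contact us"]
recs = extract_records(
    pages,
    classify_page=lambda p: "CEO" in p,              # crude page classifier
    extract_spans=lambda p: [p],                     # whole page as one span
    extract_entities=lambda s: [w.strip(".") for w in s.split()
                                if w[0].isupper()],  # crude entity finder
)
# recs -> one record for the first page, entities ['Jonathan', 'Baxter', 'CEO']
```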
Each step in the process, from classification 110 (step 2) through to normalization 175 (step 7), can be performed using hand-coded rules or, in this preferred embodiment, with the use of classifiers and extractors trained using machine learning algorithms.
Machine learning algorithms take as input human-labeled examples of the data to be extracted and output a classifier or extractor that automatically identifies the data of interest. Their principal advantage is that they require less explicit domain knowledge.
Machine learning algorithms essentially infer domain knowledge from the labeled examples. In contrast, the use of purely hand-coded rules requires an engineer or scientist to explicitly identify and hand-code prior domain knowledge, thereby adding to the expense and development time of extraction tools based on these methods.
In this preferred embodiment, hand-coded rules are used as input to machine learning algorithms. In this manner, the algorithms obtain the benefit of the domain knowledge contained in the rules but can also use the labeled data to find the appropriate weighting to assign to these rules.
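A minimal sketch of this idea follows: each hand-coded rule fires as a binary feature, and the learning algorithm (rather than the engineer) assigns the feature its weight. The rule names and the weight values shown are purely illustrative, not taken from the specification.

```python
# Illustrative only: hand-coded rules exposed as features whose weights
# are learned from labeled data rather than set by hand.

def rule_features(text):
    """Each rule contributes a binary feature when it fires."""
    feats = set()
    if any(w in text for w in ("CEO", "Director", "Officer")):
        feats.add("rule-jobtitle-keyword")   # hand-coded keyword rule
    if text[:1].isupper():
        feats.add("rule-leading-capital")    # hand-coded capitalization rule
    return feats

def score(feats, weights):
    """Linear score: the learned weighting of the fired rules."""
    return sum(weights.get(f, 0.0) for f in feats)

# Hypothetical weights as a learning algorithm might assign them:
weights = {"rule-jobtitle-keyword": 1.2, "rule-leading-capital": 0.3}
s = score(rule_features("Chief Executive Officer"), weights)  # 1.5
```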
As is known in the art, the application of machine learning algorithms requires hand-labeling example data of interest, extracting features from the labeled data, and then training classifiers and extractors based on these features and labels. It is typically an iterative process, in which analysis of the trained extractors and classifiers is used to improve the labeled data and feature extraction process. In some cases many iterations may be required before adequate performance from the trained classifiers and extractors is achieved.
Two of the primary determinants of trained classifier and extractor performance are the number of independent labeled training examples and the extent to which spurious or irrelevant features can be pruned from the training data. Labeled examples that are selected from within the same web site are typically not independent. For example, documents from the same site may share similar structure, or biographies from the same site may use common idioms peculiar to the site.
Most machine learning algorithms can deal with "weighted" training examples in which the significance of each example is reflected by an assigned number between 0 and 1. Thus, in order to generate accurate statistics and to ensure good generalization of the machine learning algorithms to novel sites, labeled training examples can be weighted so that each site is equally significant from the perspective of the machine learning algorithm (each site has the same weight regardless of the number of examples it contains).
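This site-based example weighting can be sketched as follows; the helper function is a hypothetical illustration, not the embodiment's actual implementation. Each example is down-weighted by the number of examples from its site, so every site contributes the same total weight.

```python
from collections import Counter

def site_weights(examples):
    """Weight each labeled example so every site contributes equally.

    `examples` is a list of (site, features, label) tuples; an example from
    a site with n examples gets weight 1/n, so each site's weights sum to 1.
    """
    counts = Counter(site for site, _, _ in examples)
    return [1.0 / counts[site] for site, _, _ in examples]

examples = [
    ("siteA", {"fox"}, 1),
    ("siteA", {"dog"}, 1),
    ("siteB", {"cat"}, 0),
]
weights = site_weights(examples)  # [0.5, 0.5, 1.0]: siteA split over 2 examples
```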
Techniques for pruning features usually rely on statistics computed from the labeled training data. For example, features that occur on too few training examples can be pruned. In a similar fashion, the labeled training examples can be weighted so that each site's examples contribute the same amount to the statistics upon which pruning is based. This leads, for example, to pruning based upon the number of sites that have an example containing a particular feature, rather than the number of examples themselves. This "site-based weighting" approach yields substantially better performance from trained classifiers and extractors than uniform weighting schemes.
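The site-count pruning just described can be sketched as follows (a hypothetical illustration; the function name and threshold are not from the specification): a feature survives only if it appears on examples from a minimum number of distinct sites.

```python
# Illustrative only: prune features by the number of distinct sites on which
# they occur, rather than by the raw number of examples.

def prune_features(examples, min_sites=2):
    """Keep only features observed on at least `min_sites` distinct sites."""
    sites_per_feature = {}
    for site, features in examples:
        for f in features:
            sites_per_feature.setdefault(f, set()).add(site)
    return {f for f, sites in sites_per_feature.items()
            if len(sites) >= min_sites}

examples = [
    ("siteA", {"jobtitle", "quirk"}),
    ("siteA", {"quirk"}),           # repeats on the same site do not help
    ("siteB", {"jobtitle"}),
]
kept = prune_features(examples)     # 'quirk' is pruned: one site only
```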
Referring now to Figures 6 and 7 there are shown screenshots of a graphical tool used to label both spans of interest within example web pages and entities of interest within the spans of interest, with a view to training a classifier to extract biographical data from a corporate web site according to a preferred embodiment of the present invention.
This process of labeling is used at multiple stages throughout the extraction method to train the relevant classifier to classify for the relevant characteristic depending on which step of the extraction method is being performed. The flowcharts of Figures 8-12 describe the steps involved in labeling the various data of interest according to the particular stage of the extraction process.
Referring now to Figure 8, there is shown a flowchart illustrating the process for initially labeling documents of interest from the unlabeled corpus of documents 300.
Documents are retrieved 310 from the unlabeled corpus 300 and human-labeled 320 according to the characteristic of interest (for example "biographical page" or "non-biographical page"). The labels assigned to the documents are then stored 330.
Referring now to Figure 9, there is shown the next step in the labeling process wherein the spans of interest within the previously labeled web pages of interest are labeled.
COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MADDERNS 0262832734 NO.909 0020 0 0 0 Positively labeled documents 340 (those labeled as biographical pages in the biography
C)
extraction application) are retrieved from the labeled document store 330, tokenized O 345 into their constituent tokens (words, numbers, punctuation) and the spans of interest within the documents are labeled or "marked up" 350 (see Figure 6) by a human.
C 5 The locations of the token boundaries of each span in each document are then stored C 360 SReferring now to Figure 10, the next step in the labeling process is to label the enitties of interest within each previously labeled span of interest. Positively labeled documents 340 and the locations of their spans 370 are retrieved from the labeled document store 330 and the labeled span store 360 respectively, and the entities of interest within each span are labeled or "marked up" 380 (see Figure 7) by a human. The locations of the boundaries of each entity within each span, and the category (label) of each entity (name, jobtitle, organization, etc) are then further stored 390.
Depending upon the application, there may be one or more labeling steps involved after entity labeling. For example, whole names labeled as entities in the previous step may need to be broken down into their constituent parts (for example title, first, middle/maiden/nick, last, suffix), different types of entities may need to be associated together (for example jobtitles with their corresponding organization name), or distinct references to the same entity may need to be "normalized" together (for example references to the same person in a biography, as "Jonathan Baxter", "Jonathan", "Dr Baxter" etc). Entities, normalized entities, or associated entities may also require further classification such as jobtitles/organizations being classified into either former or current.
Referring now to Figure 11, positively labeled documents, the locations of their spans, and the locations of the entities within the spans 400 are retrieved from the labeled document store 330, the labeled span store 360, and the labeled entities store 390. The sub-entities of interest within each entity are labeled or "marked up" 410 by a human. The locations of the boundaries of each sub-entity within each entity, and the sub-entity category (label) are stored 420.
Association labeling involves grouping multiple labeled entities of different types together, for example jobtitle with organization, or degree with school.
Referring now to Figure 12, positively labeled documents, the locations of their spans, and the locations of the entities within the spans 430 are retrieved from the labeled document store 330, the labeled span store 360, and the labeled entities store 390. The associated entities of interest within each span are labeled or "marked up" 440 by a human. The associated entities and their type (label) are stored 450.
Normalization labeling is similar to association labeling in that it involves grouping multiple labeled entities together; however, unlike association labeling, it involves grouping entities of the same type together, for example grouping "Jonathan Baxter" with "Dr. Baxter" and "Jonathan" within the same biography.
Referring now to Figure 13, positively labeled documents, the locations of their spans, and the locations of the entities within the spans 430 are retrieved from the labeled document store 330, the labeled span store 360, and the labeled entities store 390. The normalized entities of interest within each span are labeled or "marked up" 460 by a human. The normalized entities are stored 470.
Entities, normalized entities, or associated entities may also require further classification, such as jobtitles/organizations being classified into either former or current.
Referring now to Figure 14, positively labeled documents, the locations of their spans, On 5 the locations of the entities within the spans, and the normalized and associated entities
VO
lt with the span 480 are retrieved from the labeled document store 330, the labeled span 1 store 360, the labeled entities store 390, the labeled associations store 450 and the o labeled normalization store 470. The entities/associated entities/normalized entities of interest within each span are classified 490 by a human. The classifications are stored to 500.
Refening once again to Figure 5, document classification step 110 according to a preferred embodiment of the present invention requires classification of text documents into preassigned categories such as "biographical page" versus "non-biographical page".
The first step in the machine classification procedure is to extract features from the stored labeled documents 330 (as shown in Figure 8). Standard features include the words in the document, word frequency indicators (for example, binned counts or weights based on other formulae including tf-idf), words on incoming links, distinct features for words in various document fields including document title, headings (for example HTML h1, h2, etc tags), emphasized words, capitalization, indicators of word membership in various lists, such as first-names, last-names, locations, organization names, and also frequency indicators for the lists.
As an illustrative example, consider the HTML document:

<html>
<head>
<title>Fox Jumping</title>
</head>
<body>
<h1>What the fox did</h1>
The <b>quick</b> brown fox jumped over the <b>lazy</b> dog.
</body>
</html>

Assuming a prespecified list of animal names, the feature vector for this document would then be:

f = [brown, did, dog, fox, jumped, jumping, lazy, over, quick, the, what, frequency-3-fox, leadcap-fox, leadcap-jumping, leadcap-the, leadcap-what, title-fox, title-jumping, heading-what, heading-the, heading-fox, heading-did, emphasis-lazy, emphasis-quick, list-animal-fox, list-animal-dog]

In this manner, features are extracted from all documents within the labeled training corpus 330 (as shown in Figure 8), or from a statistical sample thereof. The extracted features and associated labels are stored in a training index. Once these features are extracted, many existing methods for training document classifiers may be applied, including decision trees, and various forms of linear classifier, including maximum entropy. Linear classifiers, which classify a document according to a score computed from a linear combination of its features, are in many instances the easiest to interpret, because the significance of each feature may easily be inferred from its associated weight, and accordingly in this preferred embodiment the document classification step 110 (as shown in Figure 5) is implemented using a linear classifier trained from the document data labeled according to the process of Figure 8.
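The feature extraction illustrated above can be sketched with Python's standard html.parser module. This toy version reproduces the word, capitalization, title/heading/emphasis and list-membership features (the frequency-binned features are omitted for brevity); it is an illustration only, not the embodiment's actual implementation.

```python
from html.parser import HTMLParser

ANIMALS = {"fox", "dog"}  # the prespecified animal-name list from the example

class FeatureExtractor(HTMLParser):
    """Toy feature extractor for the classification features described above."""
    def __init__(self):
        super().__init__()
        self.stack, self.features = [], set()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)                 # track which field we are in
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
    def handle_data(self, data):
        for word in data.split():
            w = word.strip(".,!?").lower()
            if not w:
                continue
            self.features.add(w)                        # plain word feature
            if word[0].isupper():
                self.features.add("leadcap-" + w)       # capitalization
            if "title" in self.stack:
                self.features.add("title-" + w)         # document title field
            if "h1" in self.stack:
                self.features.add("heading-" + w)       # heading field
            if "b" in self.stack:
                self.features.add("emphasis-" + w)      # emphasized word
            if w in ANIMALS:
                self.features.add("list-animal-" + w)   # list membership

doc = ("<html><head><title>Fox Jumping</title></head><body>"
       "<h1>What the fox did</h1>"
       "The <b>quick</b> brown fox jumped over the <b>lazy</b> dog."
       "</body></html>")
fe = FeatureExtractor()
fe.feed(doc)
# fe.features now contains e.g. 'title-fox', 'heading-what',
# 'emphasis-lazy', 'leadcap-jumping', 'list-animal-dog'
```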
Referring back again to Figure 5, the step of span extraction 130 requires the automatic extraction of spans of interest from classified positive documents. With reference to Figures 2 and 6, the text of each individual biography is automatically identified and segmented from the surrounding text.
Referring now to Figure 15, there is shown a flowchart illustrating this segmentation process.
1. Positively labeled Documents 340 from the labeled document corpus 330 are tokenized 345 into their constituent tokens.
2. Text documents can be automatically split into "natural" contiguous regions.
In the simplest case a document with no markup can be split on sentence and paragraph boundaries. A document that is "marked up" (such as an HTML document) can be broken into contiguous text-node regions. For example, the HTML document:

<p> <b>Jonathan Baxter</b>
<p> CEO
<p> Jonathan co-founded Panscient Technologies in 2002...
<p> <b>Kristie Seymore</b>
<p> COO

would naturally split into 5 "text nodes": [Jonathan Baxter], [CEO], [Jonathan co-founded Panscient Technologies in 2002...], [Kristie Seymore], [COO]. These regions are "natural" in the sense that their text refers to a particular named entity or is related in some other fashion. In the above example, the first text-node contains the subject of the first biography "Jonathan Baxter", the second contains his jobtitle "CEO", while the third contains the first paragraph of Jonathan's biography. The next text-node contains the subject of the second biography ("Kristie Seymore"), the following text-node is her jobtitle, and so on.
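The splitting of a marked-up document into text nodes might be sketched as follows, again with Python's standard html.parser. Treating every block-level tag as a node boundary is a simplifying assumption; the patent only requires that documents be broken into contiguous text-node regions.

```python
from html.parser import HTMLParser

class TextNodeSplitter(HTMLParser):
    """Accumulates character data and starts a new text node at each
    block-level tag boundary (a simplified notion of "text node")."""
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "br", "td"}

    def __init__(self):
        super().__init__()
        self.nodes = []
        self.current = []

    def _flush(self):
        text = " ".join(self.current).strip()
        if text:
            self.nodes.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self.current.append(data.strip())

    def close(self):
        super().close()
        self._flush()          # emit the final node

def split_text_nodes(html):
    splitter = TextNodeSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.nodes
```

Fed the biography fragment above, this yields the five text nodes [Jonathan Baxter], [CEO], and so on.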
It is an important point to note in this example that it is highly unusual for there to be no boundary between unrelated pieces of text. In particular, it would almost never be the case that a single text node contained more than one biography, or obituary, or job, etc.
The tokenized documents in the labeled training corpus are automatically split 710 into their natural contiguous text regions by this method. These regions are generically referred to as "text nodes", regardless of their method of construction.
3. Each segmented text-node is processed 720 to generate a vector of features. Such features would usually include indicators for each word in the text-node, frequency information, membership of text-node words in various lists such as first name, last name, jobtitle and so on. Any feature of the text node that could help distinguish the boundaries between biographies and can be automatically computed should be considered. For example, the feature vector f corresponding to the text-node "Jonathan Baxter" might look like:

f = [jonathan, baxter, list_firstname, list_lastname, list_firstname_precedes_list_lastname, first_occurrence_of_lastname]

Here "list_firstname" indicates that the text-node contains a first name, "list_lastname" indicates the same for last name, "list_firstname_precedes_list_lastname" indicates that the text-node contains a first name directly preceding a last name, and "first_occurrence_of_lastname" indicates that the text node is the first in the document in which the last name "baxter" occurred.
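A minimal sketch of this text-node feature extraction follows; the function name and the exact set of features computed are assumptions, and a real extractor would emit many more feature types.

```python
def text_node_features(text_node, first_names, last_names, seen_last_names):
    """Derive a feature vector (list of feature names) for one text node.
    `seen_last_names` tracks last names already observed earlier in the
    document, so the first-occurrence feature fires only once per name."""
    tokens = [t.lower() for t in text_node.split()]
    features = list(tokens)                       # one indicator per word
    for i, tok in enumerate(tokens):
        if tok in first_names:
            features.append("list_firstname")
            if i + 1 < len(tokens) and tokens[i + 1] in last_names:
                features.append("list_firstname_precedes_list_lastname")
        if tok in last_names:
            features.append("list_lastname")
            if tok not in seen_last_names:
                features.append("first_occurrence_of_lastname")
                seen_last_names.add(tok)
    return features
```

Run over the text nodes of a document in order, the same `seen_last_names` set is shared across calls so that only the first "Jonathan Baxter" node receives the first-occurrence feature.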
4. The feature vectors from the text-nodes in a single document are concatenated 730 to form a feature vector sequence for that document: f = [f1, ..., fn], where n is the number of text-nodes in the document.
5. The span labels 360 assigned by the span labeling process (as shown in Figure 9) can be used to induce 740 a labeling of the feature vector sequence [f1, ..., fn] by assigning the "bio_span" label to the feature-vectors of those text-nodes that fall within a biographical span, and assigning "other" to the remaining feature vectors (in fact, the "other" label does not need to be explicitly assigned; the absence of a label can be interpreted as the "other" label). Here we are relying on the assumption that breaks between biographies do not occur within text-nodes.

This generates a sequence of labels l = [l1, ..., ln] for each document in 1-1 correspondence with the document's text-node feature vector sequence f = [f1, ..., fn], where li = "bio_span" or li = "other".
In order to distinguish a single long biography from two biographies that run together (with no intervening text-node), additional labels must be assigned 750 to distinguish boundary text-nodes (in both cases the label sequence will be a continuous sequence of "bio_span", hence it is not possible, based on the assigned labels, to determine where the boundary between biographies occurs).
One technique is to assign a special "bio_span_start" label to the first text-node in a biography. In cases where the data exhibits particularly uniform structure, one could further categorize the text-nodes and label them as such. For example, if all biographies followed the pattern [name, jobtitle, text] (which they don't), then one could further label the text nodes in the biography as [bio_name, bio_jobtitle, bio_text].
6. The feature vector sequences and their corresponding label sequences for each positively labeled document 340 in the labeled document corpus 330 are then used 760 as training data for standard Markov model algorithms, such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM) and Conditional Random Fields (CRF). Any other algorithm for predicting label sequences from feature-vector sequences could also be used, including hand-tuning of rule-based procedures.
The output 770 of all these algorithms is a model that generates an estimate of the most likely label sequence [l1, ..., ln] when presented with a sequence of feature vectors [f1, ..., fn].

Referring now to Figure 16, once the span extraction model has been trained, it can be applied to the positively classified documents generated at step 120 in Figure 5 by applying steps 345 (tokenize), 710 (split into text nodes), 720 (extract text-node features) and 730 (concatenate text-node features) of Figure 15, and then applying the model to the feature-vector sequence so obtained to generate 800 the most likely label sequence [l1, ..., ln]. The label sequence output by the trained model is used to collate contiguous text nodes into individual biographies by identifying 810 specific patterns of labels. The correct pattern will depend on the labels assigned to the training data on which the model was trained. For example, if the first text-node in each training biography was labeled "bio_span_start", then individual biographies within the label sequence output by the trained model will be identified by the label "bio_span_start" assigned to a text-node followed by zero or more text-nodes with the "bio_span" label. A biography is then formed from all tokens from the first token in the "bio_span_start" text-node to the last token in the last "bio_span" text-node. The locations of all such biographical "spans" within a document are then output 820.
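The collation of a label sequence into biography spans described above can be sketched as follows. This is a simplified illustration: the label names follow the "bio_span_start"/"bio_span" convention of the example, and the function name is an assumption.

```python
def collate_spans(labels):
    """Group consecutive text-node indices into biography spans.
    A span starts at a 'bio_span_start' label and continues through the
    following 'bio_span' labels; returns (start, end) pairs, inclusive."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "bio_span_start":
            if start is not None:            # a new start closes the previous span
                spans.append((start, i - 1))
            start = i
        elif label == "bio_span":
            continue                         # current span continues
        else:                                # "other" closes any open span
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:                    # span running to the end of the document
        spans.append((start, len(labels) - 1))
    return spans
```

Note that two biographies running together with no intervening "other" node are still separated, because the second "bio_span_start" closes the first span.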
Referring back again to Figure 5, entity extraction step 140 requires the extraction of entities of interest from the spans identified at step 130. As shown in Figure 7, each individual entity must be automatically identified and segmented from the text of the surrounding span. Once again, a machine learning-based method is employed by the present invention to train an extractor for performing entity extraction, although other direct (not-trained) methods may also be applicable. The training data used by the machine learning algorithms consists of one example for each labeled span from the positively labeled training documents.
Referring now to Figure 17, there is shown a flowchart illustrating this process:

1. Positively labeled Documents 340 from the labeled document corpus 330 are tokenized 345 into their constituent tokens. The boundaries of each labeled span within each document are read from the labeled span store 360 and used to segment 910 the tokens of each document into subsequences, one subsequence for each labeled span.
2. A feature vector is extracted 920 from each token in the span. Features extracted from tokens can include word features, capitalization features, list membership, markup indicators (such as emphasis), location indicators (such as "this is the first occurrence of this first-name on the page/span", or "this token is the first, second, third, etc. from the start of the span", or "this token is within 4 tokens of the start/end of the span", etc.), frequency of tokens within the span or document, etc. Any feature of a token that will help distinguish entities from the surrounding text and can be automatically computed should be considered.
COMS ID No: SSMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 03'2'204 15:56 MADDERNS 4 0262632734 NO.909 030 A useful addditioval step can be to "shWf' deiived (naon-ward) features, so that features from surrounding tokens are applied to the curret tokoen.
As a simple example of this shift process, consider the following portion of a tokenized biographical span:

<b>Jonathan Baxter</b> Jonathan Baxter is the CEO of Panscient Technologies.
Assuming that "Jonathan" is present in a first-name list and that the first occurrence of "Jonathan" in the span portion is also the first occurrence of "Jonathan" within the surrounding document, the feature-vector for the first "Jonathan" token would be:

f = [jonathan, leadcap_jonathan, list_first_name, first_in_document_list_first_name, first_in_span_list_first_name, location_span_1, html_emphasis, post_1_first_in_document_list_last_name, post_1_first_in_span_list_last_name, post_1_html_emphasis]
0' 3. The feature vectors from the tokens in a single span are concatenated 930 to form a feature vector sequence for that span: [ff, f f where n is the number of tokens in the span.
4. The entity labels 390 assigned by the entity labeling process (as shown in Figure 10) induces 940 a labeling of the feature vector sequence [ft, Jn] by assigning the appropriate entity label to the feature-vectors corresponding to tokens in that entity, and assigning "other" to the remaining feature vectors (as noted previously, the "other" label does not need to be explicitly assigned the absence of a label can be interpreted as the "other" label).
This generates a sequence of labels 1 [11, for each span in 1-1 correspondence with the featue vector sequence f [fl, f, over tokens in the span. The label assigned to each token will depend upon the entity containing the token. For example, assuming that job titles, person names, and organization names are labeled as distinct entities during the entity labeling process of Figure 10, the label sequence for the example of item 2 above would be: 1 [name, name, name, name, other, other, 27 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 i ii 03/1 2/2004A 15:56 MAlflrDkErR r-mn J~ro23o T-O4 NO.909 P032 0 0 o jobtitle, other, organization,
C)
organization, other] O corresponding to the token sequence \D [Jonathan, Baxter, Jonathan, Baxter, 0 5 is, the, CEO, of, Panscient, Technologies, en C-i In order to distinguish a single long entity from two entities that run together 0 (with no intervening token, such as the adjacent occurences of "Jonathan Baxter' above), additional labels must be assigned 950 to distinguish boundary tokens within entities. As with span extraction, one technique is to assign a special 'start" label to the first token in an entity, eg "nameestart" or "organization-start".
End tokens can also be qualified in the same way: "name_end" or "organization_end". Assuming the use of qualifying start labels, the label sequence set out above would become:

l = [name_start, name, name_start, name, other, other, jobtitle_start, other, organization_start, organization, other]

The feature vector sequences and their corresponding label sequences for each labeled span in a positively labeled document 340 in the labeled document corpus 330 are then used 960 as training data for standard Markov model algorithms, such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM) and Conditional Random Fields (CRF).
The output 970 of all these algorithms is a trained model that generates an estimate of the most likely label sequence [l1, ..., ln] when presented with a sequence of feature vectors [f1, ..., fn] corresponding to a token sequence from a segmented span.
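The sequence models named above (HMM, MEMM, CRF) share the same decoding step at application time: computing the most likely label sequence for a feature-vector sequence, typically with the Viterbi algorithm. The sketch below assumes the trained model has already been reduced to per-position label scores and label-transition scores; real models compute these from feature weights, and the function and argument names are illustrative.

```python
import math

def viterbi(node_scores, trans_scores, labels):
    """node_scores[i][l]: log-score of label l at position i.
    trans_scores[(a, b)]: log-score of transitioning from label a to b.
    Returns the highest-scoring label sequence."""
    n = len(node_scores)
    # best[i][l] = (best score of a sequence ending in l at i, predecessor label)
    best = [{l: (node_scores[0].get(l, -math.inf), None) for l in labels}]
    for i in range(1, n):
        col = {}
        for l in labels:
            # choose the best predecessor for label l
            prev, score = max(
                ((p, best[i - 1][p][0] + trans_scores.get((p, l), -math.inf))
                 for p in labels),
                key=lambda x: x[1])
            col[l] = (score + node_scores[i].get(l, -math.inf), prev)
        best.append(col)
    # backtrack from the best final label
    last = max(labels, key=lambda l: best[-1][l][0])
    path = [last]
    for i in range(n - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]
```

The dynamic program runs in O(n |labels|^2) time, which is why decoding remains practical even for long spans.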
Referring now to Figure 18, once the entity extraction model has been trained, it can be applied to generate entities from each extracted span as follows:

1. Take the span boundaries output 130 by the span extractor (item 820 in Figure 16) and the document token sequence 345 generated from the positively labeled documents (item 120 in Figure 5) and generate 900 the token subsequence for each span.
2. Generate 920 a feature-vector for each token in the span token subsequence with the same feature extraction process used to generate the training sequences (item 920 in Figure 17), and concatenate 930 the feature vectors to form a feature-vector sequence.
3. Apply 1000 the trained entity extraction model (item 970 in Figure 17) to the feature-vector sequence to generate the most likely label sequence [l1, l2, ..., ln].

4. The label sequence output by the trained model is used to collate contiguous tokens into individual entities by identifying 1010 specific patterns of labels. The correct pattern will depend on the labels assigned to the training data on which the model was trained. For example, if the first token in each training entity was labeled "name_start" (or "organization_start", or "jobtitle_start", etc.), then individual names (organizations, jobtitles, etc.) within the label sequence output by the trained model will consist of the token with the "name_start" label followed by all tokens with the "name" label. The locations of all such entities within a document and their category (name, organization, jobtitle, etc.) are output 1020.
In a similar manner, the sub-entity extraction step 165 as shown in Figure 5 requires the automatic extraction of sub-entities of interest from the entities identified at step 150. Not all entities will necessarily require sub-entity extraction; the prototypical example is extraction of name parts (for example title, first, middle/maiden/nick, last, suffix) from full-name entities. Again, a machine learning-based method is employed in a preferred embodiment of the present invention to train an extractor for performing sub-entity extraction, although other direct (not trained) methods are also applicable.
The training data used by the machine learning algorithms consists of one example for each labeled entity from the positively labeled training documents. The training procedure is similar to that used to extract entities from within spans; with some simplification, it may be described as the same process with "span" replaced by "entity" and "entity" replaced by "sub-entity".
Referring now to Figure 19, there is shown a flowchart illustrating the steps involved in training a sub-entity extractor. The main deviation points from the entity extractor training as illustrated in Figure 17 are:

1. there is one training example per labeled entity 1110, rather than one training example per labeled span (item 910 in Figure 17);

2. feature extraction 1120 for the tokens within each entity will not include some of the features extracted (item 920 in Figure 17) for entities within spans that only make sense at the span level, such as offset from the start of the span, and will include additional features that only make sense at the entity level, such as offset from the start of the entity.
IND en l Similarly, the procedure for applying the trained sub-entity extractor to extract subo entities as illustrated in Figure 5 at step 165 parallels that of applying the trained entity C extractor at step 150, and is shown in Figure 20. The main deviation points from applying an entity extractor are: 1. the model operates over feature-vector sequences 1130 constructed from the tokens in each entity, not the tokens from the entire span; 2. feature extration 1120 for the tokens within each entity will be the same as that used when geneating the training features for subentity extraction; 3. the output of the process 1220 are sub-entity boundaries and their categories within each entity; Thus these methods can be used broadly to classify and extract text based elements of a document such as a span, entity or sub-entity by separating a document into region corresponding to the text based elements, forming feature vectors corresponding to each text based element and subsequently a feature vector sequence corresponding to the document. This feature vector sequence can be associated with a label sequence 31 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MADDERNS 4 0262832734 N. 909 9036 0 0 o and in combination these two sequences may be used to train predictive algorithms which may then be applied accordingly to other documents.
en% Refening once again to Figure 5, entity association step 170 requires the automatic association of entities identified at step 150. In the biography example, job titles need ND to be associated with the conesponding organization.
C
Using the example of "Mr Roger Campbell Corbett" whose biographical details are listed in the web page illustrated in Figure 2, at the end of the entity extraction step 150 the system will have extracted his jobtitles: Chief Executive Officer, Group Managing Director, Chief Operating Officer, Managing Director Retail, Managing Director, etc, and the organizations mentionedinthebiography:Big W, David Jones (Australia) Pty Ltd, Grace Bros. Several of the jobtitiles are not associated with any of the organizations mentioned in the biography (for example Chief Executive officer) and in some cases there is more than one jobtitle associated with the same organization (for example he was previously "Merchandising and Stores Director" and "Director" of Grace Bros). According to a preferred embodiment of the present invention an automated method of associating extracted jobtitles with their corresponding organizaton is provided.
A machine learning-based method is employed by the present invention to train entity associators, although other direct (not trained) methods are also applicable. A distinct associator is trained for each different type of association (egjobtitle/organization association). In this case, the training data used by the machine learning algorithms 32 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MADDERNS 4 0262832734 N0.909 9037 0 0 Sconsists of one example for each pair of labeled entities (of the appropriate types) from each labeled span (item 360 in Figure 9).
Referring now to Figure 21: O I. Positively labeled Documents 340 from the labeled document corpus 330 are to- Skenized 345 into their constituent tokens. The token boundaries of each labeled S span within each document are read from the labeled span store 360, and the CN locations of the entities to be associated are read from the labeled entity store 390. Each entity pair of the appropriate type within the same span generates a distinct training example 1310. For example, in the case of "Mr Roger Campbell Corbett" above, each of the jobtitles and each of the oganizations from his biographical span will form a distinct training pair: N M training pairs in total if there are N jobtiles and M organizations.
2. A feature vector is extracted 1320 from each entity pair. Features xtracted from pairs of entities can include the words within the entities, the words between the entities, the number of tokens between the entities, the existence of another entity between the entities, indication that the two entities are the closest of any pa,. etc. Any feature of an entity pair that will help distinguish associated entities from non-associated entities and can be automatically computed should be considered.
3. The positive associations for the current span are read from the labeled associations store 450 (generated by the association labeling process (as shown in Figure 12) and the "positive" label ("associated") is assigned 1330 to the fea- 33 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MADDERNS 0262832734 NO.909 D038 0 0 ture vectors of the corresponding entity pairs. An association pairs that are not )positively labeled are assigned the "not-associated" or "other label.
4. The feature vectors for each entity pair and their corresponding labels are then used 1340 as training data to train a classifier to distinguish associated from Mn non-associated pairs. Any classifier training algorithm will do, including handbuilding rule-based algorithms although automated methods usually perform better.
O
c- 10 The output 1350 of all these algorithms is a trained classifier that assigns either the "associated" or "not-associated" label to a feature vector from an entity pair.
Referring now to Figure 22, once the associator has been trained, it can be applied to classify entity pairs within each extracted span as follows: 1. Take the extracted span boundaries 130 output by the span extractor (item 820 in Figure 16), the extracted entities and their labels 150 output by the entity extractor (item 1020 in Figure 18), and the document token sequence 345 and generate 1300 the entity pairs for the association task (eg all jobtitle/organization z0 pairs). One method for speeding up the association process is to generate only those pairs that pass some test, such as only those pairs within a certain token distance (in most association tasks, if the entities are too far apart they are very unlikely to be associated).
2. Generate 1320 the feature-vector for each candidate entity pair using the same feature extraction process used to generate the training feature vectors (item 1320 at Figure 21).
34 COMS ID No: SBMI-01024889 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2304 15:56 MADDERNS 4 0262832734 NO.909 D039 0 3. Apply 1400 the trained associator (item 1350 at Figure 21) to the feature-vector 4. Output 1410 the positively classified associations.
0 Referring once again to Figure 5, entity normalization step 175 requires the automatic Mn normalization of entities identified at step 150- Normalization is taken to mean the Sidentification of equivalent entities. For example, after successful entity extraction Sfrom the following (trncated) biography: 0 <b>Dr Jonathan Baxter</b> Jonathan is the CEO of Panscient Technologies.
the system should have identified "Dr Jonathan Baxter" and "Jonathan" as separate names- We wish to identify the fact that the two names refer to the same person. This is a special case of association in which the entities being associated shared the same label ("name" in this case), hence the entire association procedure described above applies. Feature extraction for normalization may be facilitated by performing subentity extraction first. For example, if the "Jonathan" token in each entity above had already been identified as a first name (by the name sub-entity extractor) then a natural feature of the entity pair would be "shares.fistname".
Referring once again to Figure 5, classification of "Entities/Associated Entities/Normalized Entities" at step 180requires the automatic classification of entities, associated entities, and nonnalized entities identified at steps 150, 170 and 175 respectively.For example, COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 i i i 03/12/2004 15:56 MADDERNS 0262832734 N0.909 9040 0 0 an associated jobtitle/organization pair from a biography may need to be classified as either a curren orformer job. Or if more than one person is mentioned in the biog- C raphy, each normalized person may need to be classified as to whether they are the
O
subject of the biography or not.
M These three classification tasks may be grouped together because they all possess a C similar structure. Accordingly, association classification is focused on as nonnalizag tion and entity classification are straightforward generalizations of the same approach.
A machine learning based approach is the preferred method for training association classifiers, although other direct (not-trained) methods are also applicable. In this case, the training data used by the machine learning algorithms consists of one example for each labeled association (of the appropriate type) (item 500 at Figure 14).
Referring now to Figure 23: 1. Positively labeled Documents 340 from the labeled document corpus 330 are tokenized 345 into their constituent tokens. The token boundaries of each labeled span within each document are read from the labeled span store 360, the identities of the associated entities of the appropriate type are read from the association store 450, and the locations of the entities in each association are read from the labeled entity store 390. Each associated entity pair of the appropriate type generates a distinct training example 1510.
2. A feature vector is extracted 1520 from each associated entity pair. Features extracted from pairs of entities can include the words within the entities, the 36 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MADDERNS 0262832734 NO.909 D041 0 0 Swords between the entities, the words surrounding the entities, the location of the first entity within its containing span, etc. Any feature of an associated pair g of entities that will help distinguish it from its differently-classified brethen and can be automatically computed should be considered (for example, features ,0 5 that help to distinguish former jobtitles from current jobtitles include a past- I tense word (was, served, previously, etc) immediately or nearly immediately Spreceding the first entity in the association: "he previously served as Chairman Sof Telstm".
3. The labels for each association are read from the classified associations store 500 (generated by the labeling process of Figur 14) and assigned 1530 to the feature vectors of the corresponding associations.
4. The feature vectors for each association and their corresponding labels are then used 1540 as training data to train a classifier to distinguish associations of different categories. Any classifier training algorithm wil do, including hand-building rule-based algorithms although automated methods usually perform better.
The output 1550 of all these algorithms is a trained classifier that assigns the appropriate label to the feature vector of an association.
Once the association classifier has been trained, it is straightforward to apply it to classify associations within each extracted span: Take the associations output by the associator (item 170 in Figure 5 and item 1410 in Figure 22), and the document token sequence 345 and generate the feature vectors for each association using the same feature extraction process used to generate the training feature vectors (1520, Figure 23).
Apply the trained association classifier to the feature-vectors and output the positively 37 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MA~DDERS 4 026232734 h N0.909 P042 0 classified associations.
C)
r Once all etraction steps have bee performed on a document, the extracted spans, entities, associations and classification are assembled 190 into a structured recordsuch as the XML document referred to above. This is a relatively straightfoard process of N populating the fields in a template. Referring to Figure 5, the extracted records are then e stored 210 in a database and indexed 220 for search, so that records may be retrieved S by querying on different extracted fields such as name, job title, etc.
An example application of a preferred embodiment of the present invention to extraction of biographies from corporate web sites is shown in Figures 24, 25, and 26.
Figure 24 shows summary hits from the query "patent attorey" over the extracted biographical data. Figure 25 shows the full record of the first hit, and Figure 26 shows the cached page from which the biographical information was automatically extracted.
The steps taken by the system to extract, store ad index such records is essentitally herarchical in nature, with the first step being identification of the documents of interest within a web site, then identification of spans (contiguous text) of interest within each document of interest followed by identification of the entities of interest (names, jobtitiles, degrees, etc) within each span, then the subentities within the entities (if appropriate), classification and association of entities into groups, construction of a full record from the extracted data and then storage and index of the extracted records.
This top down approach addresses a number of disadvantages in prior art systems in that the biography span extractor can exploit the fact that it is operating over a known biography page, so it can employ features such as "this is the first time this name 38 COMS ID No: SBMI-01024869 Received by IP Australia: Time 17:47 Date 2004-12-03 03/12/2004 15:56 MADDERNS 0262832734 NO.909 DQ43 0 0 has occured in this page" which is much more relevant to extracting spans related to biographies. Based on the knowledge that a span relates to a biography the extractor Mn can then moe reliably extract entites from an already segmented biography as it is known that the biography relates to a single person thereby allowing for more relevant IN 5 features to be chosen to aid the extraction process.
n Although a preferred embodiment of the present invention has been described in the 0 foregoing detailed description, it will be understood that the invention is not limited to CN the embodiment disclosed, but is capable of numerous rearrangements, modifications and substitutions without departing from the scope of the invention as set forth and defined by the following claims.
"Comprises/comprising" when used in this specification is taken to specify the presence of stated features, integers, steps or components but does not preclude the peseoce or addition of one or more other features, integers, steps, components or groups thereof.

Claims (3)

  1. 2. A method for extracting a structured record from a document as claimed in claim 1, wherein said method further comprises the steps of: identifying at least one entity in said span, said entity identified according to said criteria; and modifying said structured record to include information related to said at least one entity.
  2. 3. A method for extracting a structured record from a document as claimed in claim 2, wherein said method further comprises the steps of: identifying at least one sub-entity in at least one entity, said at least one entity identified according to said criteria; and modifying said structured record to include information related to said at least one sub-entity.
  3. 4. A method for forming a structured record from a document, said structured record associated with a query of said document, said method comprising the steps of: identifying and extracting a span of text in said document, said span identified according to a criteria associated with said query; and forming said structured record according to said extracted span.

5. A method for classifying text based elements according to a characteristic, said method comprising the steps of: identifying said text based elements in a training document; forming a feature vector corresponding to each text based element; forming a sequence of said feature vectors corresponding to said text based elements located in said training document; labeling each text based element according to said characteristic thereby forming a sequence of labels corresponding to said sequence of feature vectors; and training a predictive algorithm based on said sequence of labels and said sequence of said feature vectors, said algorithm trained to generate new label sequences from a new sequence of feature vectors thereby classifying text based elements corresponding to said new sequence of feature vectors.

Dated this 3rd day of December 2004. PANSCIENT PTY LTD By its Patent Attorneys MADDERNS
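The training scheme of claim 5 (feature-vector sequences paired with label sequences, used to train a predictor that labels new sequences) can be illustrated with a deliberately minimal stand-in. The claim implies a structured sequence model; the per-vector majority-vote learner below is an assumption for illustration only, with feature vectors represented as hashable tuples.

```python
from collections import Counter, defaultdict

def train(feature_seq, label_seq):
    """Learn a label for each distinct feature vector from a training
    document's parallel feature-vector and label sequences."""
    counts = defaultdict(Counter)
    for fv, label in zip(feature_seq, label_seq):
        counts[fv][label] += 1
    # Keep the most frequent label observed for each feature vector.
    return {fv: c.most_common(1)[0][0] for fv, c in counts.items()}

def predict(model, feature_seq, default="O"):
    # Generate a new label sequence from a new sequence of feature vectors,
    # falling back to a default label for unseen vectors.
    return [model.get(fv, default) for fv in feature_seq]
```

A production system would replace this with a model that conditions on neighbouring labels (for example a hidden Markov model or conditional random field), since the claim's value lies in labelling whole sequences rather than isolated elements.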
AU2004235636A 2004-12-03 2004-12-03 A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources Abandoned AU2004235636A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU2004235636A AU2004235636A1 (en) 2004-12-03 2004-12-03 A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources
EP05111255A EP1669896A3 (en) 2004-12-03 2005-11-24 A machine learning system for extracting structured records from web pages and other text sources
US11/291,740 US20060123000A1 (en) 2004-12-03 2005-12-02 Machine learning system for extracting structured records from web pages and other text sources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2004235636A AU2004235636A1 (en) 2004-12-03 2004-12-03 A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources

Publications (1)

Publication Number Publication Date
AU2004235636A1 true AU2004235636A1 (en) 2006-06-22

Family

ID=36616623

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2004235636A Abandoned AU2004235636A1 (en) 2004-12-03 2004-12-03 A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources

Country Status (1)

Country Link
AU (1) AU2004235636A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840324A (en) * 2019-01-09 2019-06-04 武汉大学 A semantically enhanced topic model and topic evolution analysis method
CN109840324B (en) * 2019-01-09 2023-03-24 武汉大学 Semantic enhancement topic model construction method and topic evolution analysis method

Similar Documents

Publication Publication Date Title
US8255386B1 (en) Selection of documents to place in search index
CA2513851C (en) Phrase-based generation of document descriptions
JP4944405B2 (en) Phrase-based indexing method in information retrieval system
US20050060304A1 (en) Navigational learning in a structured transaction processing system
JP5175005B2 (en) Phrase-based search method in information search system
Joaquin et al. Content-based collaborative information filtering: Actively learning to classify and recommend documents
US6453315B1 (en) Meaning-based information organization and retrieval
US7257530B2 (en) Method and system of knowledge based search engine using text mining
US6751606B1 (en) System for enhancing a query interface
JP4976666B2 (en) Phrase identification method in information retrieval system
WO2008120030A1 (en) Latent metonymical analysis and indexing [lmai]
US20030221163A1 (en) Using web structure for classifying and describing web pages
Kulkarni Contextual data representation using prime number route mapping method and ontology
US20040249808A1 (en) Query expansion using query logs
US20050102251A1 (en) Method of document searching
CN105528411B (en) Apparel interactive electronic technical manual full-text search device and method
JP2002297651A (en) Method and system for information retrieval, and program
JP2008529138A (en) Information retrieval system based on multiple indexes
AU2006255181A1 (en) Relationship networks
JP2000090103A (en) Information retrieval device and computer-readable recording medium recorded with information retrieving program
CN106815265A (en) The searching method and device of judgement document
KR100913733B1 (en) Method for Providing Search Result Using Template
BE1012981A3 (en) Method and system for the weather find of documents from an electronic database.
EP1315096B1 (en) Method and apparatus for retrieving relevant information
AU2004235636A1 (en) A Machine Learning System For Extracting Structured Records From Web Pages And Other Text Sources

Legal Events

Date Code Title Description
MK1 Application lapsed section 142(2)(a) - no request for examination in relevant period