CA2487606A1 - Learning and using generalized string patterns for information extraction - Google Patents
- Publication number
- CA2487606A1
- Authority
- CA
- Canada
- Prior art keywords
- patterns
- generalized
- generalized extraction
- elements
- strings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99936—Pattern matching access
Abstract
The present invention relates to extracting information from an information source. During extraction, strings in the information source are accessed. These strings in the information source are matched with generalized extraction patterns that include words and wildcards. The wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
Description
LEARNING AND USING GENERALIZED STRING
PATTERNS FOR INFORMATION EXTRACTION
BACKGROUND OF THE INVENTION
The present invention relates to information extraction. In particular, the present invention relates to systems and methods for performing information extraction.
Many databases, web pages and documents exist that contain a large amount of information.
With such a large amount of existing information, many methods have been used in order to gather relevant information pertaining to a particular subject. Information extraction refers to a technique for extracting useful information from these information sources. Generally, an information extraction system extracts information based on extraction patterns (or extraction rules).
Manually writing and developing reliable extraction patterns is difficult and time consuming.
As a result, many efforts have been made to automatically learn extraction patterns from annotated examples. In some automatic learning systems, natural language patterns are learned by syntactically parsing sentences and acquiring sentential or phrasal patterns from the parses.
Another approach discovers patterns using syntactic and semantic constraints. However, these approaches are generally costly to develop. Another approach uses consecutive surface string patterns for extracting information on particular pairs of information. These consecutive patterns only cover a small amount of the information to be extracted and thus do not provide sufficient generalization of a large amount of information for reliable extraction.
Many different methods have been devised to address the problems presented above. A system and method for accurately and efficiently learning patterns for use in information extraction would further address these and/or other problems to provide a more reliable, cost-effective information extraction system.
SUMMARY OF THE INVENTION
The present invention relates to extracting information from an information source. During extraction, strings in the information source are accessed. These strings in the information source are matched with generalized extraction patterns that include words and wildcards. The wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
Another aspect of the present invention is a computer-readable medium for extracting information from an information source. The medium includes a data structure that has a set of generalized extraction patterns including words and an indication of a position for at least one optional word. The medium also includes an extraction module that uses the set of the generalized extraction patterns to match strings in the information source with the generalized extraction patterns.
Yet another aspect of the present invention is a method of generating patterns for use in extracting information from an information source.
The method includes establishing a set of strings including at least two elements corresponding to a subject. A set of generalized extraction patterns is generated that corresponds to the set of strings. The generalized extraction patterns include at least two elements, words and an indication of a position of at least one optional word.
Another method of generating patterns for use in extracting information from an information source also relates to the present invention. The method establishes a set of strings including at least two elements corresponding to a subject and identifies consecutive patterns within the set of strings that include words and the at least two elements. A set of generalized extraction patterns is generated from the identified consecutive patterns. The generalized extraction patterns include the at least two elements, words and wildcards. The wildcards express a combination of the consecutive patterns.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of an exemplary computing system environment.
FIG. 2 is a flow diagram of information extraction.
FIG. 3 is a flow diagram for generating and ranking patterns for information extraction.
FIG. 4 is a method for generating and ranking generalized extraction patterns.
FIG. 5 is a method for generating generalized extraction patterns.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
The present invention relates to information extraction. Although herein described with reference to development of patterns for information extraction, the present invention may also be applied to other types of information processing. Prior to discussing the present invention in greater detail, one embodiment of an illustrative environment in which the present invention can be used will be discussed.
FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures as processor executable instructions, which can be written on any form of a computer readable medium.
With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available medium or media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like.
These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
FIG. 2 illustrates an extraction module 200 that extracts information from a database 202 and provides an output of extracted information 204. As will be discussed below, extraction module 200 operates based on extraction patterns learned from a training or test corpus. As appreciated by those skilled in the art, extraction module 200 may include the extraction patterns and/or access a data structure having the patterns to perform extraction.
The extraction patterns match strings in database 202 during extraction. In an exemplary embodiment of the present invention, the extraction patterns include words, elements and wildcards generated based on a training corpus. As used herein, strings include a sequence of words, and words can be of different languages including English, German, Chinese and Japanese. Elements are variables containing information related to a particular subject, and wildcards are indications that denote that words in a string can be skipped and/or a position for optional words during matching. Database 202 can be a variety of different information sources. For example, database 202 may be a collection of documents, news group articles, a collection of customer feedback data, and/or any other type of information, and stored on a local system or across a wide area network such as the Internet. The information can be in text or other form, including for example speech data that can be converted to text. The extracted information 204 can be excerpts from a plurality of documents related to a particular subject that may be reviewed or further processed in order to better analyze data in database 202.
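Under these definitions, a generalized extraction pattern can be pictured as a sequence of tokens. The sketch below is an illustrative assumption for exposition only (the Token class and its field names are invented here and are not the patent's data structure):

```python
from dataclasses import dataclass

# A token is a literal word, an element variable, or a wildcard that allows
# up to max_skip words to be skipped during matching.
@dataclass(frozen=True)
class Token:
    kind: str          # "word", "element", or "wildcard"
    value: str = ""    # the literal word or the element name
    max_skip: int = 0  # for wildcards: maximum number of words skipped

# A pattern with a company element, literal words, a wildcard that skips up
# to three words, and a product element.
pattern = [
    Token("element", "company"),
    Token("word", "today"), Token("word", "announced"), Token("word", "the"),
    Token("wildcard", max_skip=3),
    Token("word", "availability"), Token("word", "of"),
    Token("element", "product"),
]
```

A matcher walking this list against an input string would consume literal words exactly, bind element tokens to spans of text, and allow the wildcard to absorb up to `max_skip` words.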
Information extraction is concerned with extracting information related to a particular subject. Extracted information can include pairs, triplets, etc. of related elements pertaining to the subject. For example, when extracting product release information, the elements can include a company element and a product element. If the subject relates to books, the elements can include a book title and author information. Other related elements can include inventor and invention information, question and answer pairs, etc. In general, one or more of the elements associated with a subject can be referred to as an "anchor", which will typically signal that the information in a string is associated with a particular subject. For example, a product can be an anchor in a company/product pair related to product release information. One aspect of the present invention relates to generating patterns that include elements for extraction.
FIG. 3 illustrates a flow diagram of various modules for developing patterns to be used by extraction module 200. The modules include a pattern generation module 210 and a pattern ranking module 212. Pattern generation module 210 develops patterns based on a positive example corpus 214. The positive example corpus contains strings of text that include elements related to a subject of information to be extracted. Using the positive examples in corpus 214, consecutive patterns are generated by module 210.
Additionally, pattern generation module 210 can use wildcards to express combinations of patterns. As a result, the pattern(s) generated by module 210, which is indicated at 216, represents a combination that includes a generalized string.
Below are example training instances that form part of an exemplary corpus 214. The instances include company and product elements annotated with <company> and <product> tags, respectively. The positive training instances in corpus 214 are:
<company> Microsoft Corp. </company>
today announced the immediate availability of <product> Microsoft Internet Explorer Plus </product>, the eagerly awaited retail version of Internet Explorer 4.0.
<company> Microsoft Corp. </company>
today announced the availability of <product> Microsoft Visual J++ 6.0 Technology Preview 2 </product>, a beta release of the next version of the industry's most widely used development system for Java.
<company> Microsoft Corp. </company>
today announced the immediate, free availability of <product> Microsoft Visual InterDev 6.0 March pre-release </product>, a preview of the new version of the leading team-based Web development system for rapidly building data-driven Web applications.
Given the positive training instances, consecutive patterns can be identified that contain the elements related to the subject. For example, the following three patterns represent consecutive patterns generated from the instances above, where the variables <company> and <product> have replaced specific company and product information:
<company> today announced the immediate availability of <product>,
<company> today announced the availability of <product>,
<company> today announced the immediate, free availability of <product>.
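The step of collapsing each annotated span into its element variable to obtain such consecutive patterns can be sketched as follows (a minimal illustration of the preprocessing, assuming simple non-nested tags; the function name is invented here):

```python
import re

def to_consecutive_pattern(instance: str) -> str:
    # Replace each tagged span, e.g. "<company> Microsoft Corp. </company>",
    # with its element variable, e.g. "<company>".
    pattern = re.sub(r"<(company|product)>.*?</\1>", r"<\1>", instance)
    # Normalize whitespace so the surface pattern is a clean word sequence.
    return " ".join(pattern.split())

instance = ("<company> Microsoft Corp. </company> today announced the "
            "immediate availability of <product> Microsoft Internet "
            "Explorer Plus </product>, the eagerly awaited retail version.")
print(to_consecutive_pattern(instance))
```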
Given these consecutive patterns, a generalized extraction pattern expressing the elements of the consecutive patterns containing a wildcard can be developed by module 210 such as:
<company> today announced the {\w+3} availability of <product>.
Here, the wildcard {\w+3} denotes that up to three words can be skipped between "the" and "availability". The generalized extraction pattern above "covers" each of the consecutive patterns, that is, each consecutive pattern can be expressed in terms of the generalized extraction pattern. Using the generalized extraction pattern with the wildcard, the product information "Microsoft Office 60 Minute Intranet Kit Version 2.0" will be extracted from the following sentence, since the pattern allows skipping of the words "immediate worldwide" without the need for an additional consecutive pattern including the words "immediate worldwide":
<company> Microsoft Corporation </company> today announced the immediate worldwide availability of Microsoft Office 60 Minute Intranet Kit version 2.0, downloadable for free (connect-time charges may apply) from the Office intranet Web site located at http://www.mircosoft.com/office/intranet/.
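The skipping behavior can be illustrated with a regular expression, under the assumption that the {\w+3} wildcard is equivalent to "zero to three whitespace-separated words" (the translation to regex syntax below is illustrative, not the patent's implementation):

```python
import re

# "<company> today announced the {\w+3} availability of <product>"
# rendered as a regex; the quantified group plays the role of the wildcard.
pattern = re.compile(
    r"<company>(?P<company>.+?)</company>\s+today announced the\s+"
    r"(?:\w+,?\s+){0,3}"          # wildcard {\w+3}: skip up to 3 words
    r"availability of\s+(?P<product>.+?),"
)

text = ("<company> Microsoft Corporation </company> today announced the "
        "immediate worldwide availability of Microsoft Office 60 Minute "
        "Intranet Kit version 2.0, downloadable for free")

m = pattern.search(text)
print(m.group("product"))   # the words "immediate worldwide" were skipped
```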
Pattern generation module 210 provides an output of unranked patterns 216 generated from corpus 214 that include wildcards, such as described above, to pattern ranking module 212. Pattern ranking module 212 ranks the patterns received from pattern generation module 210 using a positive and negative example corpus 218. A negative example contains one element in a pair but does not contain a second element, for instance the anchor described above. For example, the sentence below is a negative example because it contains company information, but does not include a specific product and is not related to a product release:
<company> Microsoft Corp. </company>
today announced the availability of an expanded selection of Web-based training through its independent training providers.
The patterns obtained from pattern generation module 210 can be ranked by pattern ranking module 212 using a number of different methods. In one method, the precision of a particular pattern P can be calculated by dividing the number of correct instances extracted from corpus 218 by the number of instances extracted from corpus 218 using pattern P. A pattern with a higher precision value is ranked higher by pattern ranking module 212.
Additionally, a pattern may be removed if another pattern matches all of the positive instances that it can match. The pattern having the lower precision value can then be removed.
Ranked patterns 220 form the basis for extraction using extraction module 200. Positive and/or negative examples 222 can then be used to evaluate the performance of extraction module 200 in providing correct and useful extracted information 204. During extraction, patterns that rank higher can be used first to match strings in database 202. In one embodiment, matching is performed in a left-to-right order. For example, in the pattern "x \w+ y \w+", occurrences of x are matched and then any occurrences of y are matched.
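The precision computation described above can be sketched as follows. The candidate patterns and their extraction counts are made-up illustrations, not data from the patent:

```python
def precision(correct: int, extracted: int) -> float:
    # precision(P) = correct instances extracted / total instances extracted
    return correct / extracted if extracted else 0.0

# (pattern, correct extractions, total extractions) against corpus 218
candidates = [
    ("<company> today announced the {\\w+3} availability of <product>", 47, 50),
    ("<company> announced {\\w+5} <product>", 30, 60),
]

# Higher-precision patterns rank first and are applied first at extraction time.
ranked = sorted(candidates, key=lambda c: precision(c[1], c[2]), reverse=True)
for pat, correct, total in ranked:
    print(f"{precision(correct, total):.2f}  {pat}")
```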
FIG. 4 illustrates a method 250 for generating and ranking patterns to be used by extraction module 200. Method 250 is based on what is known as the Apriori Algorithm. The Apriori Algorithm is founded on the basis that subsets and associated supersets share similar attributes and a combination of subsets and supersets can be expressed to encompass characteristics of both the subsets and supersets. The following algorithm can be used to generate generalized extraction patterns, which will be described in more detail below with regard to method 250. In the algorithm provided below, S is a set of input strings (i.e. positive example corpus 214), P1 is the set of words in S, and p1 is an individual word in P1. Pi and P(i-1) are sets of patterns for the i-th iteration of the algorithm, and pi and p(i-1) represent patterns within the i-th set.
Learn Generalized Extraction Patterns with Constraints Algorithm 1. S - set of positive example ir_put strings, 2 . P, - set of words in S ;
3. for (~=z:~<_~:t++) 4 . P; =find-generalized-extraction-patterns ( P~,..~,, P~ ) ;
5 . f or each ( p = P; ) {
i6. if ( net satisfy-constraints(p) ) 7 . remove p f rom P; ;
30 8. if (p' s frequency is not larger than a threshold) 9 . remove p f rom P, ;
10. if (pdoes not contain <anchor>) 11 . remove p f rom P; ;
12.
13 . i f ( P; i s empty ) 14. Goto line 16;
i 15 .
16 . output P = U';=-PJ ;
Method 250 begins at step 25~, where a set of input strings is established. The set of input strings is t'~e nositivz examble -ornus 214 in FIG. 3.
The set of input strings includes patterns, in the case of a pair of elements, where both portions of a desired pair of information elements are included.
i5 After the set of input strings is established, generalized extraction patterns including wildcards OY~~ph ~T1 'I n Y- ~ I ~ N
.w w. ~... y.ww..v w ~y v T . vim tG iit~ wv. ~s.W rGiyuGu extraction pattern (which is also the sub-algorithm find-generalized-extraction-patterns() in the 20 algorithm above) is discussed in further detail below with regard to FIG. 5. The generalized extraction patterns include words and elements in addition to the w.ildcards that denote other words may appear within the pattern.
25 The generalized extraction patterns can then be evaluated to determine whether or not they represent reliable candidates for extraction. At step 255, patterns that do not satisfy constraints are removed. A number of different constraints can be 30 used to remove generalized extraction patterns ger_erated by pattern generation module 210. One constraint is referred to as a "boundary constraint"
wherein a wildcard ca_~.not immediately be bositioned before or after an anchor. This constraint helps eliminate patterns for which it is difficult to 5 determine where the anchor information begins and ends. For example, the following generalized extraction pattern would be removed:
<company> today announced the immediate availability {\w +3) <product>
10 the above generalized extract=or: pattsrn could inappropriately determine that the string "of Internet Explorer for no-charge download from the Internet" was a product for the following sentence:
Microsoft Corp. today announced the 15 immediate availability of Internet Explorer for no charge download from the Internet.
Another constraint is the "distant constraint". The distant constraint limits the number of words that can be ski pped by a wildcard to not be 20 larger than the largest number of words that are skipped based on the training data. For example, the following pattern that does not limit the amount of words to be skipped would rot be used:
<company> {\w +} today announced {\w 25 +1 deliver <product>.
The above pattern could incorrectly extract "entertrise and electronic-commerce solutions based on the Microsoft Windows NT Server operatir_g system and the 3ackOfiice family cf products" as broduct 30 information for the ser_~ence:
Microsoft Corp. and Policy Management Systems Corp. (PMSC) today announced a pla:~
in which the two companies will work together to deliver enterprise and 5 electronic-commerce solutions based on the Microsoft Windows NT Server operating system and the BackOffice family of products.
Another constraint, called the "island 10 constraint" Prohibits what is =e~crred to as an "isolated function word". Isolated function words are generally articles such as "the", 'a', and "an" that do not include specific content related to information to be extracted and are surrounded by 15 wildcards. The following pattern does not satisfy the island constraint:
<company> {\w +8} the {\w +13} of the <product> , the first The above pattern could incorrectly extract 20 "Microsoft Entertainment Pack for the Windows CE
operating system" as product information that is not related to a release for the following sentence:
Microsoft Corp, today provided attendees of the Consumer Electronics Show 25 in Las Vegas with a demonstration of the Microsoft Entertainment Pack for the Windows CE operating sys tem, t~:e first game product to be released for the Windows CE-based har_dheld PC platform.
At step 258, patterns that do not meet a frequency threshold are removed. As a result, patterns that are not commonly used are removed at this step. At step 260, patterns that do not contain 5 an anchor are removed. For example, a pattern not containing a product with an associated company name is not included as a pattern for information extraction. Given these patterns, the patterns are ranked at step 252. As discussed above, many 10 diffe-ent ranking methods can be used to r-.nk the patterns. If patterns rank too low, they can be removed.
FIG. 5 illustrates method 280 for generating generalized extraction patterns. The 15 algorithm below can be used to generate these patterns, and is a sub-algorithm for the algorithm described above. The same variables apply to the algorithm below.
find-generalized-extraction-pattern ( P;,-" , P, ) 20 I 1 . for each ( P,;~~ a r;,_" ) ~ 2 for each ( p, a P, ) f ' .
3 p; = p.;-t,pi .
4. if (p; exists in S) 5 put p, into P; ;
.
25 6. p'.=p;-_i;ii~r~n;pi;
~
7. if ( p'; exists in S ) ~ 8 put p'; into P; ; ;
.
10.
3 11 output P; ;
0 .
I
At step 282 of method 280, consecutive patterns are identified from the positive instances in positive example corpus 214. This step corresponds to lines 3 through 5 in the sub-algorithm above. The consecutive patterns include the elements related to the subject to be extracted, for example company and 5 product. In one method, patterns can be recursively generated given the input strings by combining subsets and supersets of the strings sharing similar attributes. After the consecutive patterns have been identified, method 280 proceeds to step 284 wherein 10 wildcard position~> and lengths ar~ idantifiad by combining the consecutive patterns and expressing generalized extraction patterns to cover the consecutive patterns. This step corresponds to lines 6 through 8 in the sub-algorithm above. Next, the 15 generalized extraction patterns with wildcards are output at step 286. The generalized extraction patterns are then further analyzed as explained above with respect to method 250 to remove and rank the patterns.
20 By implementing the present invention described above, generalized extraction patterns can be developed that represent combinations of patterns and provide a more reliable informatior_ extraction system. The generalized extraction patterns can 25 include posi~ions for optional words and/or_~ wildcards denoting that words can be skipped during matching that allow combinations of pasterns to be expressed.
Using the generalized patterns during extraction allows for matching o= various strings in order to 30 identify matching strings ir_ an infcrmaticn source.
i Although the present invention has been described with reference to partic;111ar e:,tbodiments, workers skilled in the ar~ will recognize that changes may be made in form and detail without 5 departing from the spirit and scope of the invention.
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available medium or media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like.
These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
FIG. 2 illustrates an extraction module 200 that extracts information from a database 202 and provides an output of extracted information 204. As will be discussed below, extraction module 200 operates based on extraction patterns learned from a training or test corpus. As appreciated by those skilled in the art, extraction module 200 may include the extraction patterns and/or access a data structure having the patterns to perform extraction.
The extraction patterns match strings in database 202 during extraction. In an exemplary embodiment of the present invention, the extraction patterns include words, elements and wildcards generated based on a training corpus. As used herein, strings include a sequence of words, and words can be of different languages including English, German, Chinese and Japanese. Elements are variables containing information related to a particular subject, and wildcards are indications that denote that words in a string can be skipped and/or a position for optional words during matching. Database 202 can be a variety of different information sources. For example, database 202 may be a collection of documents, news group articles, a collection of customer feedback data, and/or any other type of information stored on a local system or across a wide area network such as the Internet. The information can be in text or other form, including for example speech data that can be converted to text. The extracted information 204 can be excerpts from a plurality of documents related to a particular subject that may be reviewed or further processed in order to better analyze data in database 202.
Information extraction is concerned with extracting information related to a particular subject. Extracted information can include pairs, triplets, etc. of related elements pertaining to the subject. For example, when extracting product release information, the elements can include a company element and a product element. If the subject relates to books, the elements can include a book title and author information. Other related elements can include inventor and invention information, question and answer pairs, etc. In general, one or more of the elements associated with a subject can be referred to as an "anchor", which will typically signal that the information in a string is associated with a particular subject. For example, a product can be an anchor in a company/product pair related to product release information. One aspect of the present invention relates to generating patterns that include elements for extraction.
FIG. 3 illustrates a flow diagram of various modules for developing patterns to be used by extraction module 200. The modules include a pattern generation module 210 and a pattern ranking module 212. Pattern generation module 210 develops patterns based on a positive example corpus 214. The positive example corpus contains strings of text that include elements related to a subject of information to be extracted. Using the positive examples in corpus 214, consecutive patterns are generated by module 210.
Additionally, pattern generation module 210 can use wildcards to express combinations of patterns. As a result, the pattern(s) generated by module 210, which is indicated at 216, represents a combination that includes a generalized string.
Below are example training instances that form part of an exemplary corpus 214. The instances include company and product elements annotated with <company> and <product> tags, respectively. The positive training instances in corpus 214 are:

<company> Microsoft Corp. </company> today announced the immediate availability of <product> Microsoft Internet Explorer Plus </product>, the eagerly awaited retail version of Internet Explorer 4.0.
<company> Microsoft Corp. </company> today announced the availability of <product> Microsoft Visual J++ 6.0 Technology Preview 2 </product>, a beta release of the next version of the industry's most widely used development system for Java.
<company> Microsoft Corp. </company> today announced the immediate, free availability of <product> Microsoft Visual InterDev 6.0 March pre-release </product>, a preview of the new version of the leading team-based Web development system for rapidly building data-driven Web applications.
Given the positive training instances, consecutive patterns can be identified that contain the elements related to the subject. For example, the following three patterns represent consecutive patterns generated from the instances above, where the variables <company> and <product> have replaced specific company and product information:

<company> today announced the immediate availability of <product>,
<company> today announced the availability of <product>,
<company> today announced the immediate, free availability of <product>.
Given these consecutive patterns, a generalized extraction pattern expressing the elements of the consecutive patterns containing a wildcard can be developed by module 210, such as:

<company> today announced the {\w+3} availability of <product>.
Here, the wildcard {\w+3} denotes that up to three words can be skipped between "the" and "availability". The generalized extraction pattern above "covers" each of the consecutive patterns; that is, each consecutive pattern can be expressed in terms of the generalized extraction pattern. Using the generalized extraction pattern with the wildcard, the product information "Microsoft Office 60 Minute Intranet Kit Version 2.0" will be extracted from the following sentence, since the pattern allows skipping of the words "immediate worldwide" without the need for an additional consecutive pattern including the words "immediate worldwide":

<company> Microsoft Corporation </company> today announced the immediate worldwide availability of Microsoft Office 60 Minute Intranet Kit version 2.0, downloadable for free (connect-time charges may apply) from the Office intranet Web site located at http://www.mircosoft.com/office/intranet/.
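The skip-matching behavior just described can be sketched in code. The following Python fragment is an illustration only, not the patent's implementation; the helper name compile_pattern is invented here. It compiles a generalized extraction pattern into a regular expression in which a wildcard {\w+N} skips up to N words:

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a generalized extraction pattern into a compiled regex.
    Tokens are literal words, element slots such as <company>, or
    wildcards such as {\\w+3} (skip up to three words)."""
    tokens = pattern.split()
    pieces = []
    for i, tok in enumerate(tokens):
        last = i == len(tokens) - 1
        wc = re.fullmatch(r"\{\\w\+(\d+)\}", tok)
        if wc:
            # each skipped word consumes the word plus its trailing space
            pieces.append(r"(?:\S+\s+){0,%s}" % wc.group(1))
        elif tok.startswith("<") and tok.endswith(">"):
            name = tok[1:-1]
            # greedy at the end of the pattern, lazy elsewhere
            grp = r"(?P<%s>.+)" % name if last else r"(?P<%s>.+?)\s+" % name
            pieces.append(grp)
        else:
            pieces.append(re.escape(tok) if last else re.escape(tok) + r"\s+")
    return re.compile("".join(pieces))

pat = compile_pattern(
    r"<company> today announced the {\w+3} availability of <product>")
text = ("Microsoft Corporation today announced the immediate worldwide "
        "availability of Microsoft Office 60 Minute Intranet Kit version 2.0")
m = pat.search(text)
# the wildcard absorbs "immediate worldwide" on the way to "availability"
```

With this sketch, m.group("company") captures the company name and m.group("product") the product name from the sentence above.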
Pattern generation module 210 provides an output of unranked patterns 216 generated from corpus 214 that include wildcards to pattern ranking module 212, such as described above. Pattern ranking module 212 ranks the patterns received from pattern generation module 210 using a positive and negative example corpus 218. A negative example contains one element in a pair but does not contain a second element, for instance the anchor described above. For example, the sentence below is a negative example because it contains company information, but does not
<company> Microsoft Corp. </company>
today announced the availability of an expanded selection of Web-based training through its independent training providers.
The patterns obtained from pattern generation module 210 can be ranked by pattern ranking module 212 using a number of different methods. In one method, the precision of a particular pattern P can be calculated by dividing the number of correct instances extracted from corpus 218 by the total number of instances extracted from corpus 218 using pattern P. A pattern with a higher precision value is ranked higher by pattern ranking module 212.
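This ranking method can be sketched as follows. The fragment is illustrative only; the function names are invented here, and the per-pattern extractions are assumed to have been computed in advance:

```python
def precision(extracted, correct):
    """precision(P) = |correct extractions of P| / |all extractions of P|."""
    if not extracted:
        return 0.0
    hits = sum(1 for inst in extracted if inst in correct)
    return hits / len(extracted)

def rank_patterns(extractions_by_pattern, correct):
    """Order patterns by descending precision, as pattern ranking
    module 212 might. extractions_by_pattern maps each pattern to the
    list of instances it extracted from the evaluation corpus."""
    return sorted(extractions_by_pattern,
                  key=lambda p: precision(extractions_by_pattern[p], correct),
                  reverse=True)
```

For example, a pattern extracting four instances of which three are correct receives precision 0.75 and is ranked above a pattern with precision 0.5.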
Additionally, patterns may be removed if one pattern matches all the positive instances that another pattern can match. The pattern having the lower precision value can then be removed.
Ranked patterns 220 form the basis for extraction using extraction module 200. Positive and/or negative examples 222 can then be used to evaluate the performance of extraction module 200 in providing correct and useful extracted information 204. During extraction, patterns that rank higher can be used first to match strings in database 202. In one embodiment, matching is performed in a left-to-right order. For example, in the pattern "x \w+ y \w+", occurrences of x are matched and then any occurrences of y are matched.
FIG. 4 illustrates a method 250 for generating and ranking patterns to be used by extraction module 200. Method 250 is based on what is known as the Apriori Algorithm. The Apriori Algorithm is founded on the basis that subsets and associated supersets share similar attributes, and a combination of subsets and supersets can be expressed to encompass characteristics of both the subsets and supersets. The following algorithm can be used to generate generalized extraction patterns, which will be described in more detail below with regard to method 250. In the algorithm provided below, S is a set of input strings (i.e. positive example corpus 214), P1 is the set of words in S, and p1 is an individual word in P1. Pi and P(i-1) are sets of patterns for the ith iteration of the algorithm, and pi and p(i-1) represent patterns within the ith set.
Learn Generalized Extraction Patterns with Constraints Algorithm

    1.  S = set of positive example input strings;
    2.  P1 = set of words in S;
    3.  for (i = 2; i <= n; i++) {
    4.    Pi = find-generalized-extraction-patterns(P(i-1), P1);
    5.    for each (p ∈ Pi) {
    6.      if (not satisfy-constraints(p))
    7.        remove p from Pi;
    8.      if (p's frequency is not larger than a threshold)
    9.        remove p from Pi;
    10.     if (p does not contain <anchor>)
    11.       remove p from Pi;
    12.   }
    13.   if (Pi is empty)
    14.     goto line 16;
    15. }
    16. output P = ∪j Pj;
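The loop above can be paraphrased in Python as follows. This is a sketch under assumptions: the sub-routines named in the pseudocode are injected as functions, and iteration stops when a round yields no surviving patterns, mirroring lines 13 and 14:

```python
def learn_patterns(S, find_generalized, satisfy_constraints,
                   frequency, threshold, contains_anchor):
    """Sketch of the 'Learn Generalized Extraction Patterns with
    Constraints' algorithm. S is the positive example corpus; the
    remaining arguments stand in for the sub-routines named in the
    pseudocode above."""
    P1 = {w for s in S for w in s.split()}      # line 2: words in S
    P, prev = [], P1
    while True:
        Pi = find_generalized(prev, P1, S)      # line 4
        Pi = [p for p in Pi
              if satisfy_constraints(p)         # lines 6-7
              and frequency(p, S) > threshold   # lines 8-9
              and contains_anchor(p)]           # lines 10-11
        if not Pi:                              # lines 13-14: stop when empty
            break
        P.extend(Pi)
        prev = Pi                               # next round builds on Pi
    return P                                    # line 16: union of all rounds
```

Each round extends the surviving patterns of the previous round, so the constraints prune the search space exactly as the pseudocode's lines 6 through 11 do.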
Method 250 begins at step 252, where a set of input strings is established. The set of input strings is the positive example corpus 214 in FIG. 3. The set of input strings includes patterns, in the case of a pair of elements, where both portions of a desired pair of information elements are included.
After the set of input strings is established, generalized extraction patterns including wildcards are generated at step 254. Generating a generalized extraction pattern (which is also the sub-algorithm find-generalized-extraction-patterns() in the algorithm above) is discussed in further detail below with regard to FIG. 5. The generalized extraction patterns include words and elements in addition to the wildcards that denote other words may appear within the pattern.
The generalized extraction patterns can then be evaluated to determine whether or not they represent reliable candidates for extraction. At step 256, patterns that do not satisfy constraints are removed. A number of different constraints can be used to remove generalized extraction patterns generated by pattern generation module 210. One constraint is referred to as a "boundary constraint", wherein a wildcard cannot be positioned immediately before or after an anchor. This constraint helps eliminate patterns for which it is difficult to determine where the anchor information begins and ends. For example, the following generalized extraction pattern would be removed:

<company> today announced the immediate availability {\w+3} <product>

The above generalized extraction pattern could inappropriately determine that the string "of Internet Explorer for no-charge download from the Internet" was a product for the following sentence:

Microsoft Corp. today announced the immediate availability of Internet Explorer for no-charge download from the Internet.
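A minimal check for the boundary constraint might look as follows. This is a sketch, not the patent's implementation; the token conventions simply follow the examples in this document:

```python
import re

ELEMENT = re.compile(r"^<\w+>$")           # e.g. <company>, <product>
WILDCARD = re.compile(r"^\{\\w\+\d*\}$")   # e.g. {\w+3}

def violates_boundary_constraint(pattern: str) -> bool:
    """True when a wildcard is immediately before or after an element
    slot, which would make the slot's boundaries ambiguous."""
    tokens = pattern.split()
    return any(
        (ELEMENT.match(a) and WILDCARD.match(b))
        or (WILDCARD.match(a) and ELEMENT.match(b))
        for a, b in zip(tokens, tokens[1:]))
```

Applied to the rejected pattern above, the wildcard {\w+3} sits directly next to <product>, so the check fires; the earlier pattern with {\w+3} between "the" and "availability" passes.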
Another constraint is the "distant constraint". The distant constraint limits the number of words that can be skipped by a wildcard to not be larger than the largest number of words that are skipped based on the training data. For example, the following pattern, which does not limit the number of words to be skipped, would not be used:

<company> {\w+} today announced {\w+} deliver <product>.

The above pattern could incorrectly extract "enterprise and electronic-commerce solutions based on the Microsoft Windows NT Server operating system and the BackOffice family of products" as product information for the sentence:

Microsoft Corp. and Policy Management Systems Corp. (PMSC) today announced a plan in which the two companies will work together to deliver enterprise and electronic-commerce solutions based on the Microsoft Windows NT Server operating system and the BackOffice family of products.
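The distant constraint can likewise be sketched as a filter. This is illustrative only; max_skip is assumed to be the largest number of skipped words observed in the training data:

```python
import re

def satisfies_distant_constraint(pattern: str, max_skip: int) -> bool:
    """Reject patterns whose wildcards are unbounded ({\\w+}) or whose
    declared bound exceeds the largest skip seen in training."""
    for wc in re.finditer(r"\{\\w\+(\d*)\}", pattern):
        bound = wc.group(1)
        if bound == "" or int(bound) > max_skip:
            return False
    return True
```

The unbounded pattern rejected above fails this check, while a pattern whose wildcard bound stays within the training maximum passes.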
Another constraint, called the "island constraint", prohibits what is referred to as an "isolated function word". Isolated function words are generally articles such as "the", "a", and "an" that do not include specific content related to information to be extracted and are surrounded by wildcards. The following pattern does not satisfy the island constraint:

<company> {\w+8} the {\w+13} of the <product> , the first

The above pattern could incorrectly extract "Microsoft Entertainment Pack for the Windows CE operating system" as product information that is not related to a release for the following sentence:

Microsoft Corp. today provided attendees of the Consumer Electronics Show in Las Vegas with a demonstration of the Microsoft Entertainment Pack for the Windows CE operating system, the first game product to be released for the Windows CE-based handheld PC platform.
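The island constraint can be sketched the same way. The function-word list here is just the articles named above, as an assumption for illustration:

```python
import re

FUNCTION_WORDS = {"the", "a", "an"}        # articles, per the description
WILDCARD = re.compile(r"^\{\\w\+\d*\}$")   # e.g. {\w+8}

def violates_island_constraint(pattern: str) -> bool:
    """True when a function word is surrounded by wildcards on both
    sides, i.e. an 'isolated function word' carrying no usable content."""
    t = pattern.split()
    return any(
        t[i].lower() in FUNCTION_WORDS
        and WILDCARD.match(t[i - 1]) and WILDCARD.match(t[i + 1])
        for i in range(1, len(t) - 1))
```

In the rejected pattern above, "the" sits between {\w+8} and {\w+13}, so the check fires.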
At step 258, patterns that do not meet a frequency threshold are removed. As a result, patterns that are not commonly used are removed at this step. At step 260, patterns that do not contain an anchor are removed. For example, a pattern not containing a product with an associated company name is not included as a pattern for information extraction. Given these patterns, the patterns are ranked at step 262. As discussed above, many different ranking methods can be used to rank the patterns. If patterns rank too low, they can be removed.
FIG. 5 illustrates method 280 for generating generalized extraction patterns. The algorithm below can be used to generate these patterns, and is a sub-algorithm for the algorithm described above. The same variables apply to the algorithm below.
find-generalized-extraction-patterns(P(i-1), P1)

    1.  for each (p(i-1) ∈ P(i-1)) {
    2.    for each (p1 ∈ P1) {
    3.      pi = p(i-1) p1;
    4.      if (pi exists in S)
    5.        put pi into Pi;
    6.      p'i = p(i-1) {\w+n} p1;
    7.      if (p'i exists in S)
    8.        put p'i into Pi;
    9.    }
    10. }
    11. output Pi;
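A Python sketch of this sub-algorithm follows. It is illustrative only: "exists in S" is approximated by substring and regex containment over plain-word patterns, and the wildcard bound n is searched up to an assumed maximum; a real implementation would also have to match wildcards already inside p(i-1):

```python
import re

def find_generalized_extraction_patterns(P_prev, P1, S, max_skip=3):
    """Extend each pattern from the previous round by one word, either
    consecutively (lines 3-5) or across a wildcard (lines 6-8)."""
    Pi = []
    for p_prev in P_prev:
        for p1 in P1:
            consecutive = p_prev + " " + p1                    # line 3
            if any(consecutive in s for s in S):               # line 4
                Pi.append(consecutive)                         # line 5
            for n in range(1, max_skip + 1):                   # line 6
                rx = (re.escape(p_prev) + r"\s+(?:\S+\s+){1,%d}" % n
                      + re.escape(p1))
                if any(re.search(rx, s) for s in S):           # line 7
                    Pi.append(r"%s {\w+%d} %s" % (p_prev, n, p1))  # line 8
                    break
    return Pi                                                  # line 11
```

For a corpus containing "a b c" and the previous-round pattern "a", the sketch produces both the consecutive pattern "a b" and the generalized pattern "a {\w+1} c".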
At step 282 of method 280, consecutive patterns are identified from the positive instances in positive example corpus 214. This step corresponds to lines 3 through 5 in the sub-algorithm above. The consecutive patterns include the elements related to the subject to be extracted, for example company and product. In one method, patterns can be recursively generated given the input strings by combining subsets and supersets of the strings sharing similar attributes. After the consecutive patterns have been identified, method 280 proceeds to step 284, wherein wildcard positions and lengths are identified by combining the consecutive patterns and expressing generalized extraction patterns to cover the consecutive patterns. This step corresponds to lines 6 through 8 in the sub-algorithm above. Next, the generalized extraction patterns with wildcards are output at step 286. The generalized extraction patterns are then further analyzed as explained above with respect to method 250 to remove and rank the patterns.
By implementing the present invention described above, generalized extraction patterns can be developed that represent combinations of patterns and provide a more reliable information extraction system. The generalized extraction patterns can include positions for optional words and/or wildcards denoting that words can be skipped during matching, which allow combinations of patterns to be expressed. Using the generalized patterns during extraction allows for matching of various strings in order to identify matching strings in an information source.
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
Claims (24)
1. A computer-implemented method of extracting information from an information source, comprising:
accessing strings in the information source; and comparing the strings in the information source with generalized extraction patterns and identifying strings in the information source that match at least one generalized extraction pattern, the generalized extraction patterns including words and wildcards, wherein the wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
2. The computer-implemented method of claim 1 and further comprising extracting at least two elements from strings in the information source that have been identified to match, the at least two elements being based on at least two corresponding elements in a corresponding generalized extraction pattern.
3. The computer-implemented method of claim 2 wherein for at least one of the corresponding elements in each of the generalized extraction patterns, there is at least one word positioned between said at least one of the corresponding elements and the wildcards.
4. The computer-implemented method of claim 1 wherein the wildcards indicate the number of words that can be skipped.
5. A computer-readable medium for extracting information from an information source, comprising:
a data structure including a set of generalized extraction patterns including words and an indication of a position for at least one optional word; and an extraction module using the set of generalized extraction patterns to match strings in the information source with the generalized extraction patterns.
6. The computer-readable medium of claim 5 wherein the generalized extraction patterns further include at least two elements related to a subject.
7. The computer-readable medium of claim 6 wherein for the generalized extraction patterns there is at least one word positioned between at least one of the elements and the indication.
8. The computer-readable medium of claim 5 wherein the indication includes a number of words that can be skipped during information extraction.
9. A method of generating patterns for use in extracting information from an information source, comprising:
establishing a set of strings including at least two elements corresponding to a subject;
generating a set of generalized extraction patterns that correspond to the set of strings, the generalized extraction patterns including the at least two elements, words and an indication of a position for at least one optional word.
10. The method of claim 9 and further comprising removing patterns from the set of generalized extraction patterns that do not meet a frequency threshold in the set of strings.
11. The method of claim 9 and further comprising removing patterns from the set of generalized extraction patterns that contain the indication adjacent to one of the at least two elements in the generalized extraction pattern.
12. The method of claim 9 and further comprising removing patterns from the set of generalized extraction patterns where the number of words to be skipped by the indication is above a threshold.
13. The method of claim 9 and further comprising ranking the generalized extraction patterns in the set of generalized extraction patterns.
14. The method of claim 13 wherein the step of ranking further comprises calculating a precision score for each generalized extraction pattern.
15. The method of claim 13 and further comprising removing patterns from the set of generalized extraction patterns that do not meet a ranking threshold.
16. The method of claim 9 and further comprising determining a number of words that a particular indication will skip.
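The pruning and ranking steps of claims 10-16 can be sketched as follows. The pattern encoding (space-separated tokens, "*N" for a wildcard skipping up to N words, "<...>" for an element slot), the function names, and the thresholds are all illustrative assumptions, not the patented method:

```python
def prune(patterns, freq, freq_threshold=2, max_skip=2):
    """Drop candidate patterns per claims 10-12: low-frequency patterns,
    patterns whose wildcard skips too many words, and patterns whose
    wildcard sits adjacent to an element slot."""
    kept = []
    for pat in patterns:
        if freq.get(pat, 0) < freq_threshold:
            continue  # claim 10: below the frequency threshold
        toks = pat.split()
        bad = False
        for i, tok in enumerate(toks):
            if tok.startswith("*"):
                if int(tok[1:]) > max_skip:
                    bad = True  # claim 12: wildcard skips too many words
                neighbors = toks[max(i - 1, 0):i] + toks[i + 1:i + 2]
                if any(n.startswith("<") for n in neighbors):
                    bad = True  # claim 11: wildcard adjacent to an element
        if not bad:
            kept.append(pat)
    return kept

def rank(patterns, precision_score, rank_threshold=0.5):
    """Claims 13-15: order patterns by an (assumed) precision score and
    drop those below a ranking threshold."""
    scored = sorted(patterns, key=precision_score, reverse=True)
    return [p for p in scored if precision_score(p) >= rank_threshold]
```

For example, a candidate such as `"<NAME> *1 born"` would be discarded under claim 11 because its wildcard directly abuts the `<NAME>` element slot, while `"<NAME> was born *1 in <YEAR>"` survives.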
17. A method of generating patterns for use in extracting information from an information source, comprising:
establishing a set of strings including at least two elements corresponding to a subject;
identifying consecutive patterns within the set of strings that include words and the at least two elements; and generating a set of generalized extraction patterns from the consecutive patterns identified, the generalized extraction patterns including the at least two elements, words and wildcards, wherein the wildcards express a combination of the consecutive patterns.
18. The method of claim 17 and further comprising removing patterns from the set of generalized extraction patterns that do not meet a frequency threshold in the set of strings.
19. The method of claim 17 and further comprising removing patterns from the set of generalized extraction patterns that contain a wildcard adjacent to one of the at least two elements in the generalized extraction pattern.
20. The method of claim 17 and further comprising removing patterns from the set of generalized extraction patterns where the number of words to be skipped by a wildcard is above a threshold.
21. The method of claim 17 and further comprising ranking the generalized extraction patterns in the set of generalized extraction patterns.
22. The method of claim 21 wherein the step of ranking further comprises calculating a precision score for each generalized extraction pattern.
23. The method of claim 21 and further comprising removing patterns from the set of generalized extraction patterns that do not meet a ranking threshold.
24. The method of claim 17 and further comprising determining a number of words that a particular wildcard will skip.
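Claim 17's idea of wildcards expressing a combination of consecutive (gap-free) patterns can be sketched as merging two patterns that share a prefix and suffix into one generalized pattern whose wildcard spans the differing middle. The merge rule, token encoding, and function name below are assumptions for illustration only:

```python
def generalize(p1, p2):
    """Merge two consecutive patterns (token lists) into one generalized
    pattern with a wildcard "*N" covering the differing middle words,
    or return None if there is nothing to generalize."""
    a, b = (p1, p2) if len(p1) <= len(p2) else (p2, p1)
    # Longest common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    # Longest common suffix that does not overlap the prefix.
    j = 0
    while j < len(a) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    gap = len(b) - i - j  # words the wildcard must be able to skip
    if gap <= 0:
        return None
    return a[:i] + [f"*{gap}"] + a[len(a) - j:]

print(generalize(["born", "in"], ["born", "recently", "in"]))  # prints: ['born', '*1', 'in']
```

The two consecutive patterns "born in" and "born recently in" combine into the single generalized pattern "born *1 in", whose wildcard skips up to one word.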
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/733,541 | 2003-12-11 | ||
US10/733,541 US7299228B2 (en) | 2003-12-11 | 2003-12-11 | Learning and using generalized string patterns for information extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CA2487606A1 true CA2487606A1 (en) | 2005-06-11 |
Family
ID=34523068
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CA002487606A Abandoned CA2487606A1 (en) | 2003-12-11 | 2004-11-10 | Learning and using generalized string patterns for information extraction |
Country Status (11)
Country | Link |
---|---|
US (1) | US7299228B2 (en) |
EP (1) | EP1542138A1 (en) |
JP (1) | JP2005174336A (en) |
KR (1) | KR20050058189A (en) |
CN (1) | CN1627300A (en) |
AU (1) | AU2004229097A1 (en) |
BR (1) | BRPI0404954A (en) |
CA (1) | CA2487606A1 (en) |
MX (1) | MXPA04011788A (en) |
RU (1) | RU2004132977A (en) |
TW (1) | TW200527229A (en) |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3962382B2 (en) * | 2004-02-20 | 2007-08-22 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Expression extraction device, expression extraction method, program, and recording medium |
US7606797B2 (en) * | 2005-02-24 | 2009-10-20 | Kaboodle, Inc. | Reverse value attribute extraction |
US7630968B2 (en) * | 2005-02-24 | 2009-12-08 | Kaboodle, Inc. | Extracting information from formatted sources |
JP4645242B2 (en) * | 2005-03-14 | 2011-03-09 | 富士ゼロックス株式会社 | Question answering system, data retrieval method, and computer program |
WO2008151466A1 (en) * | 2007-06-14 | 2008-12-18 | Google Inc. | Dictionary word and phrase determination |
US8275803B2 (en) | 2008-05-14 | 2012-09-25 | International Business Machines Corporation | System and method for providing answers to questions |
US8332394B2 (en) | 2008-05-23 | 2012-12-11 | International Business Machines Corporation | System and method for providing question and answers with deferred type evaluation |
CN102138141B (en) * | 2008-09-05 | 2013-06-05 | 日本电信电话株式会社 | Approximate collation device, approximate collation method, program, and recording medium |
US8447632B2 (en) * | 2009-05-29 | 2013-05-21 | Hyperquest, Inc. | Automation of auditing claims |
US8255205B2 (en) | 2009-05-29 | 2012-08-28 | Hyperquest, Inc. | Automation of auditing claims |
US8073718B2 (en) | 2009-05-29 | 2011-12-06 | Hyperquest, Inc. | Automation of auditing claims |
US8346577B2 (en) | 2009-05-29 | 2013-01-01 | Hyperquest, Inc. | Automation of auditing claims |
US8892550B2 (en) | 2010-09-24 | 2014-11-18 | International Business Machines Corporation | Source expansion for information retrieval and information extraction |
RU2498401C2 (en) * | 2012-02-14 | 2013-11-10 | Общество С Ограниченной Ответственностью "Центр Инноваций Натальи Касперской" | Method to detect text objects |
US10621880B2 (en) | 2012-09-11 | 2020-04-14 | International Business Machines Corporation | Generating secondary questions in an introspective question answering system |
US9262938B2 (en) | 2013-03-15 | 2016-02-16 | International Business Machines Corporation | Combining different type coercion components for deferred type evaluation |
US9785321B2 (en) * | 2013-05-30 | 2017-10-10 | Empire Technology Development Llc | Controlling a massively multiplayer online role-playing game |
KR101586258B1 (en) | 2014-09-30 | 2016-01-18 | 경북대학교 산학협력단 | Conflict resolution method of patterns for generating linked data, recording medium and device for performing the method |
US9626594B2 (en) * | 2015-01-21 | 2017-04-18 | Xerox Corporation | Method and system to perform text-to-image queries with wildcards |
US10062208B2 (en) | 2015-04-09 | 2018-08-28 | Cinemoi North America, LLC | Systems and methods to provide interactive virtual environments |
WO2018165932A1 (en) * | 2017-03-16 | 2018-09-20 | Microsoft Technology Licensing, Llc | Generating responses in automated chatting |
US10620945B2 (en) * | 2017-12-21 | 2020-04-14 | Fujitsu Limited | API specification generation |
JP6605105B1 (en) * | 2018-10-15 | 2019-11-13 | 株式会社野村総合研究所 | Sentence symbol insertion apparatus and method |
US11023095B2 (en) | 2019-07-12 | 2021-06-01 | Cinemoi North America, LLC | Providing a first person view in a virtual world using a lens |
US10817576B1 (en) * | 2019-08-07 | 2020-10-27 | SparkBeyond Ltd. | Systems and methods for searching an unstructured dataset with a query |
JP7229144B2 (en) * | 2019-10-11 | 2023-02-27 | 株式会社野村総合研究所 | Sentence symbol insertion device and method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5606690A (en) * | 1993-08-20 | 1997-02-25 | Canon Inc. | Non-literal textual search using fuzzy finite non-deterministic automata |
US6785417B1 (en) * | 2000-08-22 | 2004-08-31 | Microsoft Corp | Method and system for searching for words in ink word documents |
2003
- 2003-12-11 US US10/733,541 patent/US7299228B2/en not_active Expired - Fee Related
2004
- 2004-10-29 TW TW093133116A patent/TW200527229A/en unknown
- 2004-11-09 EP EP04026563A patent/EP1542138A1/en not_active Withdrawn
- 2004-11-10 BR BR0404954-3A patent/BRPI0404954A/en not_active Application Discontinuation
- 2004-11-10 CA CA002487606A patent/CA2487606A1/en not_active Abandoned
- 2004-11-11 RU RU2004132977/09A patent/RU2004132977A/en not_active Application Discontinuation
- 2004-11-15 AU AU2004229097A patent/AU2004229097A1/en not_active Abandoned
- 2004-11-17 KR KR1020040093894A patent/KR20050058189A/en not_active Application Discontinuation
- 2004-11-26 MX MXPA04011788A patent/MXPA04011788A/en unknown
- 2004-12-07 JP JP2004354479A patent/JP2005174336A/en not_active Withdrawn
- 2004-12-10 CN CNA2004101022625A patent/CN1627300A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
AU2004229097A1 (en) | 2005-06-30 |
JP2005174336A (en) | 2005-06-30 |
TW200527229A (en) | 2005-08-16 |
CN1627300A (en) | 2005-06-15 |
EP1542138A1 (en) | 2005-06-15 |
US20050131896A1 (en) | 2005-06-16 |
RU2004132977A (en) | 2006-04-27 |
MXPA04011788A (en) | 2005-07-05 |
US7299228B2 (en) | 2007-11-20 |
BRPI0404954A (en) | 2005-08-30 |
KR20050058189A (en) | 2005-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7299228B2 (en) | Learning and using generalized string patterns for information extraction | |
Vajjala et al. | On improving the accuracy of readability classification using insights from second language acquisition | |
EP2664997B1 (en) | System and method for resolving named entity coreference | |
CA2483217C (en) | System for rating constructed responses based on concepts and a model answer | |
KR102196583B1 (en) | Method for automatic keyword extraction and computing device therefor | |
US5383120A (en) | Method for tagging collocations in text | |
Schofield et al. | Quantifying the effects of text duplication on semantic models | |
US20020046018A1 (en) | Discourse parsing and summarization | |
US20090282114A1 (en) | System and method for generating suggested responses to an email | |
US20070016863A1 (en) | Method and apparatus for extracting and structuring domain terms | |
CN102439595A (en) | Question-answering system and method based on semantic labeling of text documents and user questions | |
Abdi et al. | A question answering system in hadith using linguistic knowledge | |
US10191975B1 (en) | Features for automatic classification of narrative point of view and diegesis | |
US20160299891A1 (en) | Matching of an input document to documents in a document collection | |
Sun et al. | Syntactic parsing of web queries | |
KR101926669B1 (en) | Device and method for generating multiple choise gap fill quizzes using text embedding model | |
Kang et al. | Discourse structure analysis for requirement mining | |
Kuyoro et al. | Intelligent Essay Grading System using Hybrid Text Processing Techniques | |
Srivastava et al. | Different German and English coreference resolution models for multi-domain content curation scenarios | |
Anttila | Automatic Text Summarization | |
Patel et al. | On neurons invariant to sentence structural changes in neural machine translation | |
Duma | RDFa Editor for Ontological Annotation | |
Reis et al. | Concept maps construction based on exhaustive rules and vector space intersection | |
US20240086768A1 (en) | Learning device, inference device, non-transitory computer-readable medium, learning method, and inference method | |
Kaleem et al. | Word order variation and string similarity algorithm to reduce pattern scripting in pattern matching conversational agents |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FZDE | Discontinued |