CN100589100C - Method and system for classifying words and conception in text based on diagram classification - Google Patents

Info

Publication number
CN100589100C
Authority
CN
China
Prior art keywords
chart
node
noun
score
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200510053179A
Other languages
Chinese (zh)
Other versions
CN1691014A (en)
Inventor
A·A·梅尼泽斯
L·H·范德文蒂
M·L·班科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention is a method and system for identifying words, text fragments, or concepts of interest in a corpus of text. A graph is built that covers the corpus. The graph includes nodes and links, where each node represents a word or a concept and the links between nodes represent directed relation names. A score is then computed for each node in the graph. Scores can also be computed for larger sub-graph portions of the graph (such as tuples). The scores are used to identify desired sub-graph portions of the graph, which are referred to as graph fragments.

Description

Method and system for ranking words and concepts in a text using graph-based ranking
Technical field
The present invention relates to identifying and retrieving text. More specifically, it relates to identifying and retrieving text portions (or text fragments) of interest from a larger corpus of textual information by generating a graph that covers the text and scoring portions of that graph.
Background
A variety of applications benefit from the ability to identify text of interest in a large corpus. For example, document clustering and document summarization both attempt to identify the concepts associated with documents. Those concepts are used to group the documents into clusters or to summarize them. In fact, attempts have been made to cluster documents automatically and to summarize an entire cluster of documents for use in later processing (such as information retrieval).
Prior systems have attempted to order sentences according to how relevant they are to the concept or topic of a document. The sentences are then compressed, and sometimes slightly rewritten, to obtain a summary.
In the past, sentence ordering has been attempted in a number of different ways. Some prior systems attempt to order sentences based on verb specificity. Other approaches order sentences using heuristics based on the position of the sentence in the document and the frequency of the entities identified in the sentence.
All such prior systems have certain drawbacks. For example, all of them are largely extractive: they simply extract words and sentence fragments from the documents being summarized. The words and the word order are not changed. Instead, the words or sentence fragments are simply presented as the summary, as written in the original document and in the order in which they appear there. Such text fragments can, of course, be difficult for a human reader to decipher.
In addition, most prior approaches identify words or text fragments of interest by computing a score for each word in the text based on term frequency. The technique predominantly used in prior systems to compute this score is the term frequency * inverse document frequency (tf*idf) function, which is well known and documented in the art. Some prior systems use minor variations of the tf*idf function, but all algorithms in the tf*idf class are word-based.
In another area of technology, graphs have been built in order to rank web pages. The graphs are ranked using a hubs and authorities algorithm, which uses the web pages as the nodes of the graph and the links between web pages as the links of the graph. Such graph algorithms have not been applied to graphs of text.
Summary of the invention
The present invention is a method and system for identifying words, text fragments, or concepts of interest in a corpus of text. A graph is built that covers the corpus. The graph includes nodes and links, where nodes represent words or concepts and the links between nodes represent directed relation names. A score is then computed for each node in the graph. Scores can also be computed for larger sub-graph portions of the graph (such as tuples). The scores are used to identify desired sub-graph portions of the graph, which are referred to as graph fragments.
In one embodiment, a textual output is generated from the identified graph fragments. The graph fragments are provided to a text generation component, which generates a textual output indicative of the graph fragments provided to it.
Brief description of the drawings
Fig. 1 is a block diagram of one illustrative environment in which the present invention can be used.
Fig. 2 is a block diagram of one embodiment of a system in accordance with the present invention.
Fig. 3 is a flow diagram illustrating one embodiment of the operation of the system shown in Fig. 2.
Fig. 4 illustrates an example graph generated for an example input text.
Detailed description
The present invention relates to identifying words, text fragments, and concepts of interest in a larger corpus of text. Before the invention is described in more detail, one illustrative environment in which it can be used is described.
Fig. 1 illustrates an example of a suitable computing system environment 100 in which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.
The invention is operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including memory storage devices.
With reference to Fig. 1, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components, including the system memory, to the processing unit 120. The system bus 121 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, also known as the Mezzanine bus.
Computer 110 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by computer 110 and include both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by computer 110. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
System memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory, such as read-only memory (ROM) 131 and random-access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, Fig. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.
Computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, Fig. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156, such as a CD-ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid-state RAM, solid-state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface, such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.
The drives and their associated computer storage media discussed above and illustrated in Fig. 1 provide storage of computer-readable instructions, data structures, program modules, and other data for computer 110. In Fig. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.
A user may enter commands and information into computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball, or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port, or universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices, such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.
Computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. Remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device, or other common network node, and typically includes many or all of the elements described above relative to computer 110. The logical connections depicted in Fig. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets, and the Internet.
When used in a LAN networking environment, computer 110 is connected to LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, computer 110 typically includes a modem 172 or other means for establishing communications over WAN 173, such as the Internet. Modem 172, which may be internal or external, may be connected to the system bus 121 via user input interface 160. In a networked environment, program modules depicted relative to computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, Fig. 1 illustrates remote application programs as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computers may be used.
Fig. 2 is a block diagram of a text processing system 200 in accordance with one embodiment of the present invention. Text processing system 200 can be used in a wide variety of text manipulation applications. For instance, as is described in greater detail below, it can be used for document clustering, document summarization, summarization of document clusters, question answering, information retrieval, and so on. For the sake of simplicity, the present invention is described in terms of cluster summarization. However, the invention is not so limited. System 200 includes a graph builder 202, a scoring component 204, an optional discourse planning system 205, a sub-graph extraction component 206, and a generation component 208. Fig. 3 is a flow diagram illustrating the operation of system 200 shown in Fig. 2.
In operation, graph builder 202 first receives input text 210. This is indicated by block 212 in Fig. 3. Input text 210 can be, for example, a text corpus comprising one or more documents. Where system 200 is used to summarize a document cluster, input text 210 is a set of documents that has previously been clustered using any known clustering system.
In any case, graph builder 202 receives input text 210 and builds a graph 214 that covers the entire input text 210. Illustratively, this is done by first building a graph for each individual sentence in input text 210. The individual graphs are then connected together to form the overall graph 214. In doing so, the individual graphs are collapsed somewhat, in that a word or concept in the individual graphs corresponds to a single node in the overall graph 214, no matter how many times it occurs in the individual graphs. Generating the overall graph 214 is indicated by block 216 in Fig. 3. In one illustrative embodiment, graph 214 includes nodes and links. The nodes represent words, events, entities, or concepts in input text 210, and the links between nodes represent directed relation names. In one embodiment, a certain set of words can be excluded from graph 214. Such words are commonly referred to as stop words.
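The merging step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the patented implementation: the triple representation `(source, relation, target)`, the function name `merge_sentence_graphs`, and the stop-word list are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word list; the text only says that some words may be excluded.
STOP_WORDS = {"the", "a", "of"}

def merge_sentence_graphs(sentence_graphs, stop_words=STOP_WORDS):
    """Merge per-sentence (source, relation, target) triples into one graph.

    A word or concept becomes a single node however often it occurs.
    Returns the node set and a Counter giving the frequency of each
    directed, labelled link across all sentences.
    """
    nodes = set()
    links = Counter()
    for triples in sentence_graphs:
        for source, relation, target in triples:
            if source in stop_words or target in stop_words:
                continue  # drop links that touch an excluded word
            nodes.update((source, target))
            links[(source, relation, target)] += 1
    return nodes, links

# Two toy sentence graphs that share the node "Pinochet".
sentence_graphs = [
    [("leave", "Tsub", "Pinochet"), ("leave", "Tobj", "London_Bridge_Hospital")],
    [("carry", "Tsub", "Pinochet"), ("carry", "Tobj", "passport")],
]
nodes, links = merge_sentence_graphs(sentence_graphs)
```

Because nodes are kept in a set, "Pinochet" appears once in the merged graph even though it occurs in both sentences, while the link counter preserves how often each tuple was seen.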
In one illustrative embodiment, graph builder 202 is implemented by a natural language processing system that produces an abstract analysis of input text 210. The abstract analysis normalizes surface word order and assigns relation names using function words (such as "be", "have", "with", and so on). The natural language processing system comprising graph builder 202 can also perform anaphora resolution, which resolves both pronominal and lexical noun phrase co-reference. One embodiment of such an abstract analysis of input text 210 is referred to as a logical form, and one suitable system for generating the abstract analysis (the logical form) is set out in U.S. Patent No. 5,966,686, issued October 12, 1999, entitled "Method and System for Computing Semantic Logical Forms From Syntax Trees". The logical forms are directed acyclic graphs that cover the input text for each sentence. The graphs for the individual sentences are illustratively connected to one another to form the larger graph 214 that covers the entire input text 210.
Of course, graph builder 202 can be another suitable system as well. For example, graph builder 202 can be configured to produce a syntactic parse of each input sentence in input text 210 and then produce a dependency tree from that parse. A graph is then illustratively constructed from the dependency tree. Alternatively, graph builder 202 can construct graph 214 for input text 210 by defining adjacent or co-located words as the nodes of the graph and by placing links between the nodes, where the direction of a link is either assigned arbitrarily or computed from the parts of speech of the nodes. This can be done using heuristics or machine-learned methods.
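The adjacency-based alternative can be illustrated as follows. This is a rough sketch under stated assumptions: the function name `cooccurrence_links`, the `max_distance` parameter, and the choice to point every link from the earlier word to the later one (one possible "arbitrary" direction) are hypothetical and are not taken from the source.

```python
from collections import Counter

def cooccurrence_links(tokens, max_distance=1):
    """Link each token to the tokens that follow it within max_distance.

    Direction is assigned arbitrarily here: from the earlier word to the
    later word. A part-of-speech based rule could be substituted.
    """
    links = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + max_distance]:
            links[(left, right)] += 1
    return links

tokens = ["police", "arrested", "Pinochet"]
adjacent = cooccurrence_links(tokens)                # neighbours only
wider = cooccurrence_links(tokens, max_distance=2)   # also skips one word
```

With `max_distance=1` only directly adjacent words are linked; widening the window adds links between words that are merely close to one another.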
In any case, once graph builder 202 has generated graph 214 from input text 210, the nodes or sub-graph components of graph 214 are scored by scoring component 204. This is indicated by block 218 in Fig. 3. In one illustrative embodiment, a publicly available graph ranking algorithm is used to score the nodes in graph 214. One example of such a publicly available graph ranking algorithm is the Hubs and Authorities algorithm by Jon Kleinberg (see "Authoritative sources in a hyperlinked environment", Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998; an extended version appears in Journal of the ACM 46 (1999), and the work also appears as IBM Research Report RJ 10076, May 1997). It has been used, for example, to rank web pages, as set out in Sergey Brin and Lawrence Page, "The anatomy of a large-scale hypertextual Web search engine", in Ashman and Thistlewaite [2], pages 107-117, Brisbane, Australia. Briefly, the algorithm takes the directionality of the links in the graph into account to produce a ranking. Each node in the graph receives a weight that depends on how many nodes link to it and on how many nodes it links to. The output of the algorithm is a score for each node in the graph. The node scores can be used in place of term-frequency-based scoring, for example in text manipulation applications such as information retrieval, question answering, clustering, summarization, and so on.
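For readers unfamiliar with the Hubs and Authorities algorithm, the following is a minimal pure-Python sketch of its standard iterative form. The function name, the fixed iteration count, and the toy edge list are assumptions for illustration. The source does not say how hub and authority values are combined into a single node score, so the sketch simply returns both.

```python
import math

def hubs_and_authorities(edges, iterations=50):
    """Iteratively compute hub and authority scores for a directed graph.

    edges: iterable of (source, target) pairs.
    Authority of a node = sum of hub scores of nodes linking to it.
    Hub of a node = sum of authority scores of nodes it links to.
    Both vectors are L2-normalised after every update.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}

        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth

# Toy graph: two verbs point at "Pinochet", one of them also at "passport".
edges = [("leave", "Pinochet"), ("carry", "Pinochet"), ("carry", "passport")]
hub, auth = hubs_and_authorities(edges)
```

In this toy graph "Pinochet" ends up with the highest authority value because more nodes link to it, which matches the intuition in the paragraph above that a node's weight reflects its incoming and outgoing links.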
Once the node scores have been computed, scores for tuples in graph 214 can be computed. A tuple is a sub-graph component of graph 214 of the form nodeB → relation → nodeA, where node A is referred to as the target node of the tuple and node B is referred to as the initial node of the tuple. In one illustrative embodiment, the score of each tuple is a function of the scores of all nodes linking to node A, the score of node B, and the frequency count of the given tuple in text corpus 210. Tuple scores can be used in virtually any application that calls for matching tuples. For simplicity, however, they are described here with reference to document summarization only.
In accordance with one embodiment of the invention, the tuple score is calculated so that a tuple is weighted only with respect to its target node. For example, in the tuple nodeB → relation → nodeA, the weight of the tuple is calculated relative to all the other nodes that point to node A, and not with respect to other tuples or other nodes. One example of a specific formula for performing this calculation is as follows:
Equation 1
\[
\mathrm{TupleScore}(\mathrm{nodeB} \to \mathrm{relation} \to \mathrm{nodeA}) = \frac{\mathrm{NodeScore}(B) \cdot \mathrm{Count}(\mathrm{nodeB} \to \mathrm{relation} \to \mathrm{nodeA})}{\sum_{X,\,R \,:\, \mathrm{nodeX} \to R \to \mathrm{nodeA}} \mathrm{NodeScore}(X) \cdot \mathrm{Count}(\mathrm{nodeX} \to R \to \mathrm{nodeA})}
\]
where the sum runs over all nodes X and relations R such that nodeX → R → nodeA;
TupleScore() denotes the score of the given tuple;
NodeScore() denotes the score of the given node; and
Count() is the frequency of the identified tuple in the input text.
Of course, other scoring mechanisms and equations can be used as well.
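Equation 1 can be worked through with a short Python sketch. The function name `tuple_score`, the dictionary layout, and the numeric node scores below are hypothetical values chosen for the example; only the formula itself comes from the text.

```python
def tuple_score(source, relation, target, node_score, count):
    """Score a tuple source -> relation -> target following Equation 1.

    node_score: dict mapping node -> NodeScore.
    count: dict mapping (source, relation, target) -> frequency in the text.
    The denominator sums NodeScore(X) * Count over every tuple that
    points at the same target node.
    """
    numerator = node_score[source] * count[(source, relation, target)]
    denominator = sum(
        node_score[src] * freq
        for (src, _rel, dst), freq in count.items()
        if dst == target
    )
    return numerator / denominator if denominator else 0.0

# Hypothetical scores and counts for illustration only.
node_score = {"leave": 1.0, "carry": 3.0}
count = {
    ("leave", "Tsub", "Pinochet"): 1,
    ("carry", "Tsub", "Pinochet"): 2,
    ("carry", "Tobj", "passport"): 1,
}
carry_score = tuple_score("carry", "Tsub", "Pinochet", node_score, count)
leave_score = tuple_score("leave", "Tsub", "Pinochet", node_score, count)
```

Here the denominator for target "Pinochet" is 1.0 * 1 + 3.0 * 2 = 7, so the two tuples score 6/7 and 1/7. Because every tuple is normalised against the same target node, the scores of all tuples pointing at one node sum to 1.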
The scores generated by scoring component 204, along with graph 214, are provided to sub-graph extraction component 206. Sub-graph extraction component 206 uses the high-scoring nodes and tuples of graph 214 to identify important sub-graphs generated from input text 210. The sub-graphs are then extracted based on the NodeScores and TupleScores. The sub-graphs can also be ranked by sub-graph extraction component 206 according to their corresponding scores. Extracting the graph fragments that correspond to high-scoring nodes and sub-graphs, and ranking the graph fragments by score, are indicated by blocks 220 and 222 in Fig. 3. The ranked graph fragments provided by component 206 are indicated by block 224 in Fig. 2.
The graph fragments can be extracted in different ways. For example, they can be extracted from the individual graphs (or logical forms) that were generated from the individual sentences in input text 210 and that gave rise to the high-scoring nodes and tuples in the overall graph 214. Alternatively, they can be extracted directly from the overall graph 214.
In one illustrative embodiment, sub-graph extraction component 206 identifies the important sub-graphs by matching the logical forms generated from input text 210 against the high-scoring nodes and tuples. "High-scoring" means that a threshold is determined empirically, and nodes and tuples whose scores meet the threshold are identified as high-scoring. In addition, each sub-graph can be examined further in order to extract additional high-scoring nodes that are linked to that sub-graph. This process is illustratively iterated, using the high-scoring tuple as an anchor, for every high-scoring node that the sub-graph links to.
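One way to picture the iterative expansion is the sketch below, which grows a fragment outward from an anchor tuple and keeps following links as long as the neighbouring node meets the threshold. The function name `grow_fragment`, the data layout, and the scores are assumptions for illustration, and the real component matches against logical forms rather than a flat link list.

```python
def grow_fragment(anchor, links, node_score, threshold):
    """Expand a sub-graph from an anchor tuple to linked high-scoring nodes.

    anchor: a (source, relation, target) tuple already judged high-scoring.
    links: iterable of (source, relation, target) tuples in the graph.
    Returns the nodes and links of the grown fragment.
    """
    links = list(links)
    src, _rel, dst = anchor
    fragment_nodes = {src, dst}
    fragment_links = {anchor}
    frontier = [src, dst]
    while frontier:
        node = frontier.pop()
        for link in links:
            a, _r, b = link
            if node not in (a, b):
                continue
            other = b if a == node else a
            if node_score.get(other, 0.0) >= threshold:
                fragment_links.add(link)
                if other not in fragment_nodes:
                    fragment_nodes.add(other)
                    frontier.append(other)
    return fragment_nodes, fragment_links

# Hypothetical scores; "former" falls below the threshold and is left out.
links = [
    ("arrest", "Tobj", "dictator"),
    ("arrest", "Tsub", "police"),
    ("dictator", "Attrib", "former"),
]
node_score = {"arrest": 0.9, "dictator": 0.8, "police": 0.7, "former": 0.1}
frag_nodes, frag_links = grow_fragment(links[0], links, node_score, threshold=0.5)
```

Starting from the anchor "arrest-Tobj-dictator", the fragment picks up "police" because it is linked to "arrest" and scores above the threshold, and it stops at "former", which does not.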
In addition, nodes in the logical form can be related to other nodes. This can occur, for example, through pronominalization or by virtue of referring to the same entity or event. For instance, the terms "General Augusto Pinochet" and "Pinochet" are related by virtue of referring to the same entity. In one illustrative embodiment, these related nodes can also be used during the matching process.
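A simple way to let related nodes take part in matching is to map every mention to one canonical node before comparing tuples. The sketch below assumes that a coreference table has already been produced by some upstream anaphora resolution step; the table contents and the function name `canonicalize` are hypothetical.

```python
def canonicalize(links, coreference):
    """Rewrite link endpoints so related mentions share one canonical node.

    coreference: dict mapping a mention to its canonical node name.
    Mentions missing from the table are left unchanged.
    """
    return [
        (coreference.get(src, src), rel, coreference.get(dst, dst))
        for src, rel, dst in links
    ]

# Hypothetical output of an upstream coreference step.
coreference = {"he": "Pinochet", "Gen._Augusto_Pinochet": "Pinochet"}
links = [
    ("give", "Tind", "he"),
    ("arrest", "Possr", "Gen._Augusto_Pinochet"),
    ("leave", "Tsub", "Pinochet"),
]
resolved = canonicalize(links, coreference)
```

After this step all three links point at the single node "Pinochet", so a high score on that node is seen by tuples that originally mentioned only the pronoun or the full name.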
Furthermore, in one illustrative embodiment, given a particular node type, certain relations and their values can be extracted as part of the matching sub-graph. For example, for a node type that corresponds to an event, the core arguments of the event (such as the subject and/or object links, if present) can also be retained as part of the matching sub-graph. This improves the coherence of the sub-graph, particularly in embodiments in which the purpose of identifying the sub-graph is to pass it to a generation component.
The entire sub-graph matched as described above is referred to as a graph fragment. In one illustrative embodiment, a cut-off threshold is used to determine the minimum score that counts as a match, and graph fragments scoring above that minimum are kept for further processing.
In one illustrative embodiment, graph fragments 224 are ordered according to their node and tuple scores and are provided to generation component 208, which produces natural language output for graph fragments 224.
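The cut-off and ordering steps amount to a filter followed by a sort. The sketch below assumes each fragment already has a single combined score; how node and tuple scores are combined into that number is not specified in the text, so the values and the function name `rank_fragments` are purely illustrative.

```python
def rank_fragments(fragment_scores, cutoff):
    """Keep fragments scoring above the cut-off, highest score first.

    fragment_scores: dict mapping a fragment label to its combined score.
    """
    kept = [(name, score) for name, score in fragment_scores.items() if score > cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical combined scores for three fragments.
fragment_scores = {
    "leave-Tsub-Pinochet": 0.5,
    "arrest-Tobj-dictator": 0.8,
    "say-Tsub-president": 0.1,
}
ranked = rank_fragments(fragment_scores, cutoff=0.3)
```

The resulting list is what would be handed on to the generation component, or to the optional planning step, in score order.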
Alternatively, in one embodiment, an optional discourse planning system 205 is also provided. Planning system 205 receives graph fragments 224 and produces an optimal ordering of the graph fragments that takes into account not only the node and tuple scores of the graph fragments, but also the placement of similar nodes, the order in which two nodes occur (related to part of speech), and high-level considerations such as event timeline, topic, and focus. For example, assume that three sentences (S1, S2, and S3) are to be generated and that, if only the scores were considered, the sentence order would be S1 S2 S3. If, however, sentences S1 and S3 mention the same entity, planning system 205 will produce S1 S3 S2, and may also replace the entity in S3 with a synonym, or combine sentences S1 and S3 into one longer sentence. Grouping together sentences that involve common nodes improves the readability of the generated summary.
Similarly, assume that two sentences S1 and S2 both involve the word "arrest", but that it is used as a noun in S1 and as a verb in S2. Planning system 205 reorders the sentences to S2 S1. This produces a summary that first states, for example, "X got arrested yesterday ..." and then refers to "the arrest ...", which again improves the readability of the generated summary.
In any case, planning system 205 reorders graph fragments 224 based on the additional considerations and provides them to generation component 208 as reordered graph fragments 225. The optional step of reordering the graph fragments with discourse planning system 205 is indicated by block 224 in Fig. 3.
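The S1 S3 S2 example can be reproduced with a small greedy reordering. This sketch covers only one of the considerations listed above (keeping fragments that share an entity next to each other); the function name, the entity sets, and the greedy strategy are assumptions and are not the planning system's actual method.

```python
def reorder_by_shared_entities(fragments):
    """Greedily place fragments that share an entity next to each other.

    fragments: list of (name, entity_set) pairs in descending score order.
    After each placed fragment, the highest-ranked remaining fragment that
    shares an entity with it is placed next; otherwise score order is kept.
    """
    remaining = list(fragments)
    ordered = []
    while remaining:
        if ordered:
            last_entities = ordered[-1][1]
            chosen = next(
                (frag for frag in remaining if frag[1] & last_entities),
                remaining[0],
            )
        else:
            chosen = remaining[0]
        remaining.remove(chosen)
        ordered.append(chosen)
    return [name for name, _entities in ordered]

# S1 and S3 mention the same entity, S2 does not.
fragments = [("S1", {"Pinochet"}), ("S2", {"Chile"}), ("S3", {"Pinochet"})]
order = reorder_by_shared_entities(fragments)
```

A fuller planner would also weigh part of speech (verb before noun for "arrest"), timeline, topic, and focus, which this sketch leaves out.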
A set of graph fragments is provided to generation component 208. Generation component 208 then generates output text 226 based on the graph fragments it receives. This is indicated by block 228 in Fig. 3.
Generation component 208 need only be consistent with the type of graph fragments it receives. Component 208 can be rule-based, as in Aikawa, T., M. Melero, L. Schwartz, and A. Wu (2001), "Multilingual Sentence Generation", Proceedings of the 8th European Workshop on Natural Language Generation, Toulouse, and Aikawa, T., M. Melero, L. Schwartz, and A. Wu (2001), "Sentence Generation for Multilingual Machine Translation", Proceedings of the MT Summit VIII, Santiago de Compostela, Spain. It can also be machine-learned, as in Gamon, M., E. Ringger, and S. Corston-Oliver (2002), "Amalgam: A machine-learned generation module", Microsoft Research Technical Report MSR-TR-2002-57.
At this point, an example may be helpful. Assume that input text 210 includes the following group of sentences:
Pinochet was reported to have left London Bridge Hospital on Wednesday.
President Eduardo Frei Ruiz_Tagle said that Pinochet, now an unelected senator for life, carried a diplomatic passport giving him legal immunity.
The arrest of Gen. Augusto Pinochet shows the growing significance of international human_rights law.
Former Chilean dictator Gen. Augusto Pinochet has been arrested by British police, despite protests from Chile that he is entitled to diplomatic immunity.
The individual graphs (logical forms) for each individual sentence are as follows:
Pinochet was reported to have left London Bridge Hospital on Wednesday.
report2 ({Verb} (.))
Tsub _X2 ({Pron})
Tobj leave2 ({Verb})
Time Wednesday2 ({Noun} {on})
Tsub Pinochet2 ({Noun})
Tobj London_Bridge_Hospital2 ({Noun})
PLACENAME London1 ({Noun})
PLACETYPE bridge1 ({Noun})
PLACETYPE hospital1 ({Noun})
FactHyp hospital2 ({Noun})
President Eduardo Frei Ruiz_Tagle said that Pinochet, now an unelected senator for life, carried a diplomatic passport giving him legal immunity.
say1 ({Verb} (.))
Tsub President_Eduardo_Frei_Ruiz_Tagle1 ({Noun})
TITLE president1 ({Noun})
FIRSTNAME Eduardo1 ({Noun})
LASTNAME Frei1 ({Noun})
LASTNAME Ruiz_Tagle1 ({Noun})
FactHyp person1 ({Noun})
Tobj carry1 ({Verb})
Tsub Pinochet2 ({Noun})
Appostn senator2 ({Noun})
Time now1 ({Adv})
Attrib unelected2 ({Adj})
for life1 ({Noun})
Tobj passport1 ({Noun})
Attrib diplomatic1 ({Adj})
give1 ({Verb})
Tsub passport1
Tobj immunity1 ({Noun})
Attrib legal1 ({Adj})
Tind he1 ({Pron})
The arrest of Gen. Augusto Pinochet shows the growing significance of international human_rights law.
show2 ({Verb} (.))
Tsub arrest3 ({Noun})
Possr Gen._Augusto_Pinochet3 ({Noun})
TITLE Gen.1 ({Noun})
FIRSTNAME Augusto1 ({Noun})
LASTNAME Pinochet1 ({Noun})
FactHyp person1 ({Noun})
Tobj significance3 ({Noun})
Attrib grow3 ({Verb})
Tsub significance3
of law3 ({Noun})
Mod human_rights3 ({Noun})
Attrib international3 ({Adj})
Former Chilean dictator Gen. Augusto Pinochet has been arrested by British police, despite protests from Chile that he is entitled to diplomatic immunity.
arrest2 ({Verb} (.))
Tsub police3 ({Noun})
Attrib British3 ({Adj})
despite protest2 ({Noun})
Props entitle1 ({Verb})
Tsub _X1 ({Pron})
Tobj he1 ({Pron})
to diplomatic_immunity1 ({Noun})
Source Chile2 ({Noun} {from})
Tobj dictator2 ({Noun})
Appostn Gen._Augusto_Pinochet2 ({Noun})
TITLE Gen.1 ({Noun})
FIRSTNAME Augusto1 ({Noun})
LASTNAME Pinochet1 ({Noun})
FactHyp person1 ({Noun})
Attrib Chilean2 ({Adj})
former2 ({Adj})
Fig. 4 shows a graph 300 centered on the "Pinochet" node, which links together nodes from the logical forms of the input sentences. Graph 300 can also be represented as follows:
leave2 ({Verb})
Tsub Pinochet2 ({Noun})
Tobj London_Bridge_Hospital2 ({Noun})
carry1 ({Verb})
Tsub Pinochet2 ({Noun})
Tobj passport1 ({Noun})
Attrib diplomatic1 ({Adj})
Pinochet2 ({Noun})
Appostn senator2 ({Noun})
give1 ({Verb})
Tsub passport1
Tobj immunity1 ({Noun})
Tind he1 ({Pron} Refs:Pinochet)
show2 ({Verb} (.))
Tsub arrest3 ({Noun})
Possr Gen._Augusto_Pinochet3 ({Noun})
Tobj significance3 ({Noun})
arrest2 ({Verb} (.))
Tsub police3 ({Noun})
Tobj dictator2 ({Noun})
Appostn Gen._Augusto_Pinochet2 ({Noun})
entitle1 ({Verb})
Tsub _X1 ({Pron})
Tobj he1 ({Pron} Refs:Pinochet)
to diplomatic_immunity1 ({Noun})
dictator2 ({Noun})
Appostn Gen._Augusto_Pinochet2 ({Noun})
Gen._Augusto_Pinochet3 ({Noun})
TITLE Gen.1 ({Noun})
FIRSTNAME Augusto1 ({Noun})
LASTNAME Pinochet1 ({Noun})
FactHyp person1 ({Noun})
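As a rough illustration of how per-sentence logical forms might be joined into one graph, the sketch below stores each link as a (head, relation, dependent) tuple and indexes the tuples by node. The tuple layout, the lemma_POS node keys and the `build_graph` helper are assumptions made for this example; the patent does not prescribe a data structure.

```python
from collections import defaultdict

# Hypothetical encoding: each logical form is a list of
# (head, relation, dependent) tuples; nodes are "lemma_POS" strings.
sentence_graphs = [
    [("leave_Verb", "Tsub", "Pinochet_Noun"),
     ("leave_Verb", "Tobj", "London_Bridge_Hospital_Noun")],
    [("carry_Verb", "Tsub", "Pinochet_Noun"),
     ("carry_Verb", "Tobj", "passport_Noun"),
     ("passport_Noun", "Attrib", "diplomatic_Adj")],
]

def build_graph(graphs):
    """Merge per-sentence tuples; identical node keys collapse into one node."""
    incoming = defaultdict(set)  # node -> {(head, relation)}
    outgoing = defaultdict(set)  # node -> {(relation, dependent)}
    for tuples in graphs:
        for head, rel, dep in tuples:
            outgoing[head].add((rel, dep))
            incoming[dep].add((head, rel))
    return incoming, outgoing

incoming, outgoing = build_graph(sentence_graphs)
# Nodes that link to "Pinochet" across both sentences.
linked_to_pinochet = sorted(incoming["Pinochet_Noun"])
print(linked_to_pinochet)
```

Keying nodes by lemma and part of speech is what lets the two "Pinochet" mentions from different sentences meet in a single node here.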
It can be seen that the nodes linking to Pinochet in graph 300 are as follows:
leave2 ({Verb})
Tsub Pinochet2 ({Noun})
carry1 ({Verb})
Tsub Pinochet2 ({Noun})
Note that anaphora resolution is used to resolve "he" to "Pinochet".
give1 ({Verb})
Tind he1 ({Pron} Refs:Pinochet)
arrest3 ({Noun})
Possr Gen._Augusto_Pinochet3 ({Noun})
Note that the Appostn relation is "unpacked" to obtain two links (or as many links as there are Appostn relations). So, in addition to the link "arrest-Tobj-dictator", the link "arrest-Tobj-Gen._Augusto_Pinochet" is also identified from this logical form.
arrest2 ({Verb} (.))
Tsub police3 ({Noun})
Tobj dictator2 ({Noun})
Appostn Gen._Augusto_Pinochet2 ({Noun})
arrest2 ({Verb} (.))
Tobj Gen._Augusto_Pinochet2 ({Noun})
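The unpacking of Appostn relations described above can be sketched as follows. This is a minimal illustration under assumed names (`unpack_apposition` and the tuple encoding are hypothetical), not the patent's implementation.

```python
def unpack_apposition(tuples):
    """Return the tuples plus one extra link per appositive of each dependent."""
    appositives = {}
    for head, rel, dep in tuples:
        if rel == "Appostn":
            appositives.setdefault(head, []).append(dep)
    expanded = list(tuples)
    for head, rel, dep in tuples:
        if rel == "Appostn":
            continue
        # Copy the link to every node that stands in apposition to `dep`.
        for alias in appositives.get(dep, []):
            expanded.append((head, rel, alias))
    return expanded

logical_form = [
    ("arrest_Verb", "Tsub", "police_Noun"),
    ("arrest_Verb", "Tobj", "dictator_Noun"),
    ("dictator_Noun", "Appostn", "Gen._Augusto_Pinochet_Noun"),
]
expanded = unpack_apposition(logical_form)
```

After the call, `expanded` holds the three original tuples plus the derived link from "arrest" to "Gen._Augusto_Pinochet" via Tobj.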
Note that anaphora resolution is used to resolve "he" to "Pinochet".
entitle1 ({Verb})
Tobj he1 ({Pron} Refs:Pinochet)
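Once an anaphora resolver has attached a referent to a pronoun (the "Refs:" annotation above), the links can be redirected to the referent. The sketch below assumes the resolver's output is already available as a simple mapping; the resolver itself, the `refs` dictionary and the `resolve_pronouns` helper are hypothetical.

```python
def resolve_pronouns(tuples, refs):
    """Replace pronoun nodes with their antecedents where one is known."""
    return [(refs.get(head, head), rel, refs.get(dep, dep))
            for head, rel, dep in tuples]

# Assumed output of an external anaphora resolver for this example.
refs = {"he_Pron": "Pinochet_Noun"}

tuples = [
    ("give_Verb", "Tind", "he_Pron"),
    ("entitle_Verb", "Tobj", "he_Pron"),
]
resolved = resolve_pronouns(tuples, refs)
```

A real system would resolve each pronoun occurrence separately; a single global mapping is used here only to keep the example short.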
The nodes that Pinochet links to can also be seen, as follows:
Pinochet2 ({Noun})
Appostn senator2 ({Noun})
dictator2 ({Noun})
Appostn Gen._Augusto_Pinochet2 ({Noun})
Note that this last logical form illustrates the notion of "similar words" described above, because if the node under consideration is Gen._Augusto_Pinochet, "Pinochet" can be included as well. This is based on the LASTNAME relation:
Gen._Augusto_Pinochet3 ({Noun})
TITLE Gen.1 ({Noun})
FIRSTNAME Augusto1 ({Noun})
LASTNAME Pinochet1 ({Noun})
FactHyp person1 ({Noun})
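One way to picture the "similar words" grouping is to follow the LASTNAME relation from a full-name node. The helper name `similar_words` and the tuple encoding are assumptions made for illustration.

```python
def similar_words(node, tuples, relation="LASTNAME"):
    """Collect the node itself plus any nodes it reaches via `relation`."""
    group = {node}
    for head, rel, dep in tuples:
        if head == node and rel == relation:
            group.add(dep)
    return group

name_tuples = [
    ("Gen._Augusto_Pinochet_Noun", "TITLE", "Gen._Noun"),
    ("Gen._Augusto_Pinochet_Noun", "FIRSTNAME", "Augusto_Noun"),
    ("Gen._Augusto_Pinochet_Noun", "LASTNAME", "Pinochet_Noun"),
]
group = similar_words("Gen._Augusto_Pinochet_Noun", name_tuples)
```

Here `group` contains the full-name node and "Pinochet", so links to either can be treated as links to the same entity.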
The following node scores were computed for only a portion of the graph for the whole cluster, so the scores are indicative rather than exact:
Pinochet_Noun 8.86931560843612
arrest_Noun 5.65798261000217
dictator_Noun 4.66735025856776
leave_Verb 3.19016764263043
show_Verb 3.05887157398304
arrest_Verb 2.99724084165062
immunity_Noun 2.61908266128404
give_Verb 2.59211486749912
police_Noun 2.23721253134214
Gen._Augusto_Pinochet_Noun 2.14890018458375
senator_Noun 1.99746859744986
diplomatic_immunity_Noun 1.52760640157329
carry_Verb 1.4547668737008
passport_Noun 1.08547333802503
diplomatic_Adj 0.949668310003334
entitle_Verb 0.760364251949961
significance_Noun 0.518215630826775
London_Bridge_Hospital_Noun 0.493827515638096
The following are exemplary tuple scores. Note that the scores are relative to the left node. Therefore "arrest_Possr_Pinochet" has a higher score than "arrest_Tsub_police", but whether "arrest_Tsub_police" scores higher or lower than "carry_Tobj_passport" cannot be inferred from the weights.
arrest_Noun Possr Pinochet_Noun 0.9674310
arrest_Verb Tobj Pinochet_Noun 0.9137349
arrest_Verb Tsub police_Noun 0.5801700
carry_Verb Tsub Pinochet_Noun 0.9916259
carry_Verb Tobj passport_Noun 0.7846062
entitle_Verb Tobj Pinochet_Noun 0.9956231
entitle_Verb "to" diplomatic_immunity_Noun 0.8876522
Gen._Augusto_Pinochet_Noun Appostn dictator_Noun 0.7838148
give_Verb Tind Pinochet_Noun 0.8829976
give_Verb Tsub passport_Noun 0.8081048
give_Verb Tobj immunity_Noun 0.5551054
leave_Verb Tsub Pinochet_Noun 0.9449093
leave_Verb Tobj London_Bridge_Hospital_Noun 0.0713249
passport_Noun Attrib diplomatic_Adj 0.3981289
Pinochet_Noun Appostn senator_Noun 0.5996584
show_Verb Tsub arrest_Noun 0.9343253
show_Verb Tobj significance_Noun 0.1478469
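Because tuple scores are relative to the left node, a ranking only makes sense among tuples that share that node. The sketch below groups a few of the tuples listed above by left node before sorting; the dictionary layout and the `rank_by_left_node` helper are assumptions for illustration.

```python
from collections import defaultdict

# A subset of the exemplary tuple scores, keyed by (left, relation, right).
tuple_scores = {
    ("arrest_Verb", "Tobj", "Pinochet_Noun"): 0.9137349,
    ("arrest_Verb", "Tsub", "police_Noun"): 0.5801700,
    ("carry_Verb", "Tsub", "Pinochet_Noun"): 0.9916259,
    ("carry_Verb", "Tobj", "passport_Noun"): 0.7846062,
}

def rank_by_left_node(scores):
    """Sort tuples by score, but only within the group sharing a left node."""
    grouped = defaultdict(list)
    for (left, rel, right), score in scores.items():
        grouped[left].append((score, rel, right))
    return {left: sorted(items, reverse=True) for left, items in grouped.items()}

ranked = rank_by_left_node(tuple_scores)
best_for_arrest = ranked["arrest_Verb"][0]
```

Keeping the groups separate avoids comparing, say, an "arrest" tuple with a "carry" tuple, which the scores do not support.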
Fragments are ranked by score. In this example, selected fragments rooted in a verb part of speech are ordered before selected fragments rooted in a noun part of speech.
Note that Time and Tobj are also selected as part of the graph fragment because both are core arguments of "leave", even though "London_Bridge_Hospital" itself is a low-scoring tuple.
1. leave ({Verb} 3.19016764263043)
Time Wednesday ({Noun} {on})
Tsub Pinochet ({Noun})
Tobj London_Bridge_Hospital ({Noun})
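The selection rule illustrated by this fragment (keep core arguments regardless of score, keep other links only if they score well) might be sketched as below. The set of core relations, the 0.5 threshold and the `select_fragment` helper are all assumptions chosen for the example, not values given in the patent.

```python
# Assumed set of relations treated as core arguments.
CORE_RELATIONS = {"Tsub", "Tobj", "Time"}

def select_fragment(root, scored_tuples, threshold=0.5):
    """Keep links from `root` that are core arguments or score above threshold."""
    fragment = []
    for (head, rel, dep), score in scored_tuples.items():
        if head != root:
            continue
        if rel in CORE_RELATIONS or score >= threshold:
            fragment.append((rel, dep))
    return sorted(fragment)

scored = {
    ("leave_Verb", "Tsub", "Pinochet_Noun"): 0.9449093,
    ("leave_Verb", "Tobj", "London_Bridge_Hospital_Noun"): 0.0713249,
    ("passport_Noun", "Attrib", "diplomatic_Adj"): 0.3981289,
}
fragment = select_fragment("leave_Verb", scored)
```

The Tobj link survives despite its score of 0.0713249 because it is a core argument of "leave".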
Note that "significance" is selected because it is a core argument. Although "significance" is a noun, it has event properties, so the arguments of the noun (Attrib and "of") are selected as well.
2. show ({Verb} 3.05887157398304)
Tsub arrest ({Noun})
Possr Gen._Augusto_Pinochet ({Noun})
Tobj significance ({Noun})
Attrib grow ({Verb})
Tsub significance ({Noun})
of human_rights ({Noun})
Attrib international ({Adj})
Note that this uses the tuple score for "arrest Tobj Pinochet"; "dictator" and "Pinochet" are the same entity, as identified through coreference.
3. arrest ({Verb} 2.99724084165062)
Tsub police ({Noun})
Tobj dictator ({Noun})
Locn London ({Noun})
This is an example of a noun phrase, which can be used to expand nodes in the graph when high-scoring events are used or when the weight limit is reached.
4. Pinochet ({Noun} 8.86931560843612)
Appostn senator ({Noun})
Attrib unelected ({Adj})
The following is an example of reordering and of combining similar or identical nodes when the optional planning system 205 is used.
Because fragments 1 and 4 share the node "Pinochet", the following shows their combined graph fragment:
leave ({Verb})
Time Wednesday ({Noun} {on})
Tsub Pinochet ({Noun})
Appostn senator ({Noun})
Attrib unelected ({Adj})
Tobj London_Bridge_Hospital ({Noun})
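Combining two fragments that share a node can be pictured as grafting one tree onto the other at the shared label. The nested (label, children) representation and the `attach` helper below are assumptions made for this sketch.

```python
def attach(fragment, other):
    """Graft `other`'s children onto any node in `fragment` with the same label."""
    label, children = fragment
    if label == other[0]:
        return (label, children + other[1])
    return (label, [(rel, attach(child, other)) for rel, child in children])

# Fragment 1, rooted at "leave".
leave = ("leave", [
    ("Time", ("Wednesday", [])),
    ("Tsub", ("Pinochet", [])),
    ("Tobj", ("London_Bridge_Hospital", [])),
])
# Fragment 4, rooted at "Pinochet".
pinochet = ("Pinochet", [
    ("Appostn", ("senator", [("Attrib", ("unelected", []))])),
])

combined = attach(leave, pinochet)
```

In `combined`, the "Pinochet" node under Tsub now carries the Appostn "senator" subtree, matching the combined fragment shown above.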
The following shows graph fragments 2 and 3 reordered to reflect a preference, when the same node appears in different parts of speech, for ordering the verb first and then the noun:
arrest ({Verb})
Tsub police ({Noun})
Tobj dictator ({Noun})
Locn London ({Noun})
show ({Verb})
Tsub arrest ({Noun})
Possr Gen._Augusto_Pinochet ({Noun})
Tobj significance ({Noun})
Attrib grow ({Verb})
Tsub significance ({Noun})
of human_rights ({Noun})
Attrib international ({Adj})
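The verb-before-noun preference can be sketched as a small reordering pass over score-ranked fragments. The fragment dictionaries and the `reorder` helper are hypothetical; the patent leaves the planning strategy open.

```python
def reorder(fragments):
    """Move a verb-rooted fragment ahead of a fragment that mentions
    the same lemma as a noun."""
    ordered = list(fragments)
    for frag in fragments:
        lemma, pos = frag["root"]
        if pos != "Verb":
            continue
        for i, other in enumerate(ordered):
            if other is frag:
                break  # nothing earlier mentions this lemma as a noun
            if (lemma, "Noun") in other["nodes"]:
                ordered.remove(frag)
                ordered.insert(i, frag)
                break
    return ordered

# Fragments 2 and 3 in their original score order.
fragments = [
    {"root": ("show", "Verb"),
     "nodes": [("arrest", "Noun"), ("significance", "Noun")]},
    {"root": ("arrest", "Verb"),
     "nodes": [("police", "Noun"), ("dictator", "Noun")]},
]
roots = [frag["root"][0] for frag in reorder(fragments)]
```

With this input the "arrest" fragment moves ahead of "show", because "show" refers to the arrest as a noun.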
The following shows generated output 226. In this example, referring expressions are selected during generation. Typically, the most specific referring expression (Gen. Augusto Pinochet) comes first, followed by the short form (Pinochet), followed by pronominalization (if the entity is in a core argument position). Thus, one embodiment of generated output 226 is as follows:
Gen. Augusto Pinochet, an unelected senator, left London Bridge Hospital on Wednesday.
Pinochet has been arrested in London by the police.
His arrest shows the growing significance of international human_rights.
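The choice of referring expression described above (most specific form first, then the short form, then a pronoun where permitted) can be sketched as a simple function of how many times the entity has already been mentioned. The dictionary keys, the `core_position` flag and the function name are assumptions for illustration.

```python
def referring_expression(entity, mention_index, core_position):
    """Pick a surface form for the given (zero-based) mention of an entity."""
    if mention_index == 0:
        return entity["full"]       # most specific form first
    if mention_index == 1 or not core_position:
        return entity["short"]      # then the short form
    return entity["pronoun"]        # then a pronoun, if the position allows

pinochet = {
    "full": "Gen. Augusto Pinochet",
    "short": "Pinochet",
    "pronoun": "he",
}
forms = [
    referring_expression(pinochet, 0, True),
    referring_expression(pinochet, 1, True),
    referring_expression(pinochet, 2, True),
]
```

Inflecting the pronoun (for example "his" in a possessive position) is left to the generation component and is not modeled here.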
It can thus be seen that the present invention provides significant advantages over the prior art. The invention ranks events based on a graph generated from the input text. This has been found to be more accurate than approaches based on word frequency when deciding what to include in a summary. Another aspect of the invention generates a summary given the ranked graph fragments. For multi-document summaries, this provides better coherence and readability than sentence extraction or compression.
It will of course be appreciated that the present invention can be used in a wide variety of other applications as well. For example, identifying words, text fragments or events in an input text, by generating a graph for the input text and then computing scores for components of the graph, is useful in many settings. It can be used, for instance, when attempting to identify a relationship between two text inputs, as in information retrieval, indexing, document clustering, question answering and the like. In those cases, the scores for words or tuples of the first input are compared with the scores for words or tuples of the second input to determine the relationship between the two inputs. In information retrieval, the first input is a query and the second input is either an index or a document being compared to the query. In question answering, the first input is a question and the second input is text being examined to determine whether it answers the question. In document clustering, the two inputs are documents or summaries thereof, or summaries of clusters. Similarly, the scores generated for the graph covering the input text can be used to determine which terms in a document are used to index the input text, as well as any weights computed for those terms.
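The text does not say how the scores of two inputs are compared. As one possible reading, the sketch below treats each input's node scores as a sparse vector and uses cosine similarity. The choice of cosine similarity, the function name and the second set of scores are assumptions, not part of the patent.

```python
import math

def score_similarity(a, b):
    """Cosine similarity between two {node: score} dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[key] * b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# First input: a few node scores from the example above.
document = {"Pinochet_Noun": 8.87, "arrest_Noun": 5.66, "police_Noun": 2.24}
# Second input: made-up scores for a hypothetical query.
query = {"Pinochet_Noun": 1.0, "arrest_Noun": 1.0}
unrelated = {"weather_Noun": 1.0}

related_score = score_similarity(document, query)
unrelated_score = score_similarity(document, unrelated)
```

A query that shares high-scoring nodes with the document yields a value near 1, while an input with no shared nodes yields 0.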
Of course, the present invention can also be used as described to generate output text corresponding to the input text. The text can be a summary of a single document, a summary of a cluster, and so on. Thus, although the invention has been described primarily with respect to document summarization, it has broad applicability and is not limited to summarization.
Although the present invention has been described with reference to particular embodiments, those skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

Claims (25)

1. A method of identifying features of interest represented by a text input, characterized in that it comprises:
building a graph having nodes and links corresponding to the text input, a pair of nodes and the link between the pair of nodes being a tuple;
scoring sub-graph components of the graph by assigning a score to each node and each tuple in the graph, the score of each tuple being based on the score of an initial node in the tuple, the scores of nodes linking to a target node in the tuple, and the frequency of the tuple in the text input;
identifying graph fragments of interest based on the scores; and
performing text processing based on the identified graph fragments.
2. the method for claim 1 is characterized in that, described node is corresponding to word in the described text input or the notion of being represented by the input of described text.
3. method as claimed in claim 2 is characterized in that, makes up chart and also comprises the generation connection as oriented semantic relation title.
4. method as claimed in claim 3 is characterized in that, makes up chart and also is included as one group of abstract analysis of described text input generation.
5. method as claimed in claim 4 is characterized in that, generates one group of abstract analysis and comprises:
Generate one group of directed acyclic graph table based on described text input; And
Described one group of directed acyclic graph table is connected to each other.
6. the method for claim 1 is characterized in that, makes up chart and comprises:
Textual portions in the described text input is generated a syntactic analysis;
From described syntactic analysis, generate a dependency structure;
From described dependency structure, generate described chart.
7. the method for claim 1 is characterized in that, makes up chart and comprises:
Recognition node is word contiguous or colocated; And
Connection between recognition node.
8. method as claimed in claim 7 is characterized in that, identification connects and comprises:
At random distribute the directivity that connects.
9. method as claimed in claim 7 is characterized in that, identification connects and comprises the given phonological component that is associated based on described node, uses and inspires, and identification connects and distributes the described directivity that is connected.
10. method as claimed in claim 7 is characterized in that, identification connects and comprises based on the given language part that is associated with described node, but uses the machine learning method, and identification connects and distributes the described directivity that is connected.
11. the method for claim 1 is characterized in that, discerns interested chart fragment and comprises:
With the subgraph table component of described chart with have the node of enough scores and tuple to mate.
12. method as claimed in claim 11 is characterized in that, discerns interested chart fragment and comprises:
Identification is connected to subgraph table component that is mated and the node with enough scores.
13. method as claimed in claim 12 is characterized in that, identification chart fragment comprises:
Be identified in the node outside the subgraph table component of coupling, this node has a predetermined relation with node in the subgraph table component that is mated.
14. method as claimed in claim 13 is characterized in that, identification chart fragment comprises:
Given one predetermined concrete node type is discerned some relations.
15. method as claimed in claim 14 is characterized in that, the subgraph table component of all couplings and the node of being discerned and pass are described chart fragment.
16. method as claimed in claim 15 is characterized in that, execution contexts is handled and is comprised:
Extraction is to the subgraph table component group of given text input part branch identification, as the chart fragment.
17. The method of claim 16, characterized in that building the graph comprises:
generating a separate graph for each sentence in the text input; and
connecting the separate graphs together to form an overall graph.
18. The method of claim 17, characterized in that extracting comprises:
extracting sub-graph portions having a sufficient score from the overall graph.
19. The method of claim 18, characterized in that high-scoring sub-graph portions comprise sub-graph portions of the overall graph having a score that meets a threshold score value in the overall graph, wherein extracting sub-graph portions comprises:
extracting the portions of each separate graph that gave rise to the high-scoring sub-graph portions in the overall graph.
20. The method of claim 1, characterized in that performing text processing comprises one of summarization, information retrieval, question answering, document clustering and indexing.
21. The method of claim 1, characterized in that performing text processing comprises generating a text output based on the extracted graph fragments.
22. The method of claim 1, characterized in that it further comprises:
ordering the graph fragments based on the scores corresponding to the graph fragments.
23. The method of claim 22, characterized in that ordering further comprises:
ordering the graph fragments based on a factor other than the scores, wherein the factor comprises one of the position of nodes, the order in which nodes related by part of speech occur, and high-level considerations.
24. The method of claim 23, characterized in that the high-level considerations comprise one of an event timeline determined from the text input, and a topic and a focus determined for the text input.
25. The method of claim 1, characterized in that the features of interest comprise one of a word, a text fragment, a concept, an event, an entity and a topic.
CN200510053179A 2004-03-02 2005-03-02 Method and system for classifying words and conception in text based on diagram classification Active CN100589100C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US54977504P 2004-03-02 2004-03-02
US60/549,775 2004-03-02
US10/825,642 2004-04-15

Publications (2)

Publication Number Publication Date
CN1691014A CN1691014A (en) 2005-11-02
CN100589100C true CN100589100C (en) 2010-02-10

Family

ID=35346457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200510053179A Active CN100589100C (en) 2004-03-02 2005-03-02 Method and system for classifying words and conception in text based on diagram classification

Country Status (1)

Country Link
CN (1) CN100589100C (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: MICROSOFT TECHNOLOGY LICENSING LLC

Free format text: FORMER OWNER: MICROSOFT CORP.

Effective date: 20150506

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150506

Address after: Washington State

Patentee after: Microsoft Technology Licensing, LLC

Address before: Washington State

Patentee before: Microsoft Corp.