
US20020046018A1 - Discourse parsing and summarization - Google Patents

Discourse parsing and summarization

Info

Publication number
US20020046018A1
Authority
US
Grant status
Application
Patent type
Prior art keywords
tree
discourse
text
input
based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US09854301
Inventor
Daniel Marcu
Kevin Knight
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Southern California (USC)
Original Assignee
University of Southern California (USC)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2705 Parsing
    • G06F17/271 Syntactic parsing, e.g. based on context-free grammar [CFG], unification grammars
    • G06F17/2715 Statistical methods
    • G06F17/274 Grammatical analysis; Style critique
    • G06F17/2785 Semantic analysis
    • G06F17/279 Discourse representation
    • G06F17/28 Processing or translating of natural language
    • G06F17/2809 Data driven translation
    • G06F17/2818 Statistical methods, e.g. probability models
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval of unstructured textual data
    • G06F17/30716 Browsing or visualization
    • G06F17/30719 Summarization for human users

Abstract

A discourse structure for an input text segment is determined by generating a set of one or more discourse parsing decision rules based on a training set, and determining a discourse structure for the input text segment by applying the generated set of discourse parsing decision rules to the input text segment. A tree structure is summarized by generating a set of one or more summarization decision rules based on a training set, and compressing the tree structure by applying the generated set of summarization decision rules to the tree structure. Alternatively, summarization is accomplished by parsing an input text segment to generate a parse tree for the input segment, generating a plurality of potential solutions, applying a statistical model to determine a probability of correctness for each potential solution, and extracting one or more high-probability solutions based on the solutions' respective determined probabilities of correctness.

Description

    RELATED APPLICATION
  • [0001]
    This application claims the benefit of, and incorporates herein, U.S. Provisional Patent Application Ser. No. 60/203,643, filed May 11, 2000.
  • ORIGIN OF INVENTION
  • [0002] The research and development described in this application were supported by the NSA under grant number MDA904-97-0262 and by DARPA/ITO under grant number MDA904-99-C-2535. The US government may have certain rights in the claimed inventions.
  • FIELD OF THE INVENTION
  • [0003]
    The present application relates to computational linguistics and more particularly to techniques for parsing a text to determine its underlying rhetorical, or discourse, structure, and to techniques for summarizing, or compressing, text.
  • BACKGROUND AND SUMMARY
  • [0004]
    Computational linguistics is the study of the applications of computers in processing and analyzing language, as in automatic machine translation (“MT”) and text analysis. In conjunction with MT research and related areas in computational linguistics, researchers have developed and frequently use various types of tree structures to graphically represent the structure of a text segment (e.g., clause, sentence, paragraph or entire treatise). Two basic tree types include (1) the syntactic tree, which can be used to graphically represent the syntactic relations among components of a text segment, and (2) the rhetorical tree (equivalently, the rhetorical structure tree (RST) or the discourse tree), which can be used to graph the rhetorical relationships among components of a text segment. Rhetorical structure trees are discussed in detail in William C. Mann and Sandra A. Thompson, “Rhetorical structure theory: Toward a functional theory of text organization,” Text, 8(3):243-281 (1988) (hereinafter, “Mann and Thompson (1988)”). Discourse tree structures find application in many areas including machine translation, summarization, information retrieval, automatic test scoring and the like.
  • [0005]
    The example in FIG. 1 shows the types of structures in a discourse tree 100 for a text fragment. The leaves 102 of the tree correspond to elementary discourse units (“edus”) and the internal nodes correspond to contiguous text spans. Each node in a discourse tree is characterized by a “status” (i.e., either “nucleus” or “satellite”) and a “rhetorical relation,” which is a relation that holds between two non-overlapping text spans. In FIG. 1, nuclei 104 are represented by straight lines while satellites 106 are represented by arcs.
  • [0006]
    The distinction between nuclei and satellites comes from empirical observations that a nucleus expresses information that is more essential than a satellite to the writer's intention, and that the nucleus of a rhetorical relation is comprehensible independent of the satellite but not vice versa. When spans are equally important, the relation is said to be “multinuclear.”
  • [0007]
    Rhetorical relations reflect semantic, intentional and/or textual relations that hold between text spans. Examples of rhetorical relations include the following types indicated in capitals: one text span may ELABORATE on another text span; the information in two text spans may be in CONTRAST; and the information in one text span may provide JUSTIFICATION for the information presented in another text span. Other types of rhetorical relations include EVIDENCE, BACKGROUND, JOINT, and CAUSE. In FIG. 1, the internal nodes of discourse tree 100 are labeled with their respective rhetorical relation names 108.
  • [0008]
    In conventional practice, discourse trees either have been generated by hand by trained personnel or have been pieced together in a semi-automated manner using manually generated instructions for a computer program. Development of the discourse parsing systems and techniques described below was based in part on the recognition that manually generating discourse trees in either of these fashions is time-consuming, expensive and prone to inconsistencies and error. Accordingly, a computer-implemented discourse parsing system and automated discourse parsing techniques were developed for automatically generating a discourse tree for any previously unseen text segment based on a set of automatically learned decision rules.
  • [0009]
    Implementations of the disclosed discourse parsing system and techniques may include various combinations of the following features.
  • [0010]
    In one aspect, a discourse structure for an input text segment (e.g., a clause, a sentence, a paragraph or a treatise) is determined by generating a set of one or more discourse parsing decision rules based on a training set, and determining a discourse structure for the input text segment by applying the generated set of discourse parsing decision rules to the input text segment.
  • [0011]
    The training set may include a plurality of annotated text segments (e.g., built manually by human annotators) and a plurality of elementary discourse units (edus). Each annotated text segment may be associated with a set of edus that collectively represent the annotated text segment.
  • [0012]
    Generating the set of discourse parsing decision rules may include iteratively performing one or more operations (e.g., a shift operation and one or more different types of reduce operations) on a set of edus to incrementally build the annotated text segment associated with the set of edus. The different types of reduce operations may include one or more of the following six operations: reduce-ns, reduce-sn, reduce-nn, reduce-below-ns, reduce-below-sn, reduce-below-nn. The six reduce operations and the shift operation may be sufficient to derive the discourse tree of any input text segment.
  • [0013]
    Determining a discourse structure may include incrementally building a discourse tree for the input text segment, for example, by selectively combining elementary discourse trees (edts) into larger discourse tree units. Moreover, incrementally building a discourse tree for the input text segment may include performing operations on a stack and an input list of edts, one edt for each edu in a set of edus corresponding to the input text segment.
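The stack-and-input-list procedure above can be illustrated with a minimal shift-reduce sketch. Everything here is hypothetical scaffolding (the `Tree` class, the trivial `pick_action` policy, and the fixed ELABORATION relation); in the described system each action and relation would instead be chosen by the learned decision rules.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    status: str                  # "nucleus", "satellite", or "edu" for leaves
    relation: str = "leaf"       # rhetorical relation labeling this node
    children: list = field(default_factory=list)
    text: str = ""

def reduce_ns(left: Tree, right: Tree, relation: str) -> Tree:
    """reduce-ns: combine the top two trees, left as nucleus, right as satellite."""
    left.status, right.status = "nucleus", "satellite"
    return Tree(status="nucleus", relation=relation, children=[left, right])

def parse(edts: list, pick_action) -> Tree:
    """Apply shift/reduce operations until one discourse tree remains."""
    stack, inputs = [], list(edts)
    while inputs or len(stack) > 1:
        action = pick_action(stack, inputs)   # learned decision rules go here
        if action == "shift":
            stack.append(inputs.pop(0))       # move the next edt onto the stack
        else:                                 # one of the six reduce operations
            right, left = stack.pop(), stack.pop()
            stack.append(reduce_ns(left, right, "ELABORATION"))
    return stack[0]

# Toy policy: shift while input remains, then reduce; two edus in, one tree out.
policy = lambda stack, inputs: "shift" if inputs else "reduce-ns"
edus = [Tree(status="edu", text="Mars is red,"),
        Tree(status="edu", text="owing to iron oxide.")]
tree = parse(edus, policy)
```

With the toy policy, `tree` is a single ELABORATION node whose left child is the nucleus and whose right child is the satellite.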
  • [0014]
    Prior to determining the discourse structure for the input text segment, the input text segment may be segmented into edus, which are inserted into the input list. Segmenting the input text segment into edus may be performed by applying a set of automatically learned discourse segmenting decision rules to the input text segment. Generating the set of discourse segmenting decision rules may be accomplished by analyzing a training set.
  • [0015]
    Determining the discourse structure for the input text segment may further include segmenting the input text segment into elementary discourse units (edus); incrementally building a discourse tree for the input text segment by performing operations on the edus to selectively combine the edus into larger discourse tree units; and repeating the incremental building of the discourse tree until all of the edus have been combined.
  • [0016]
    In another aspect, text parsing may include generating a set of one or more discourse segmenting decision rules based on a training set, and determining boundaries in an input text segment by applying the generated set of discourse segmenting decision rules to the input text segment. Determining boundaries may include examining each lexeme in the input text segment in order, and, for example, assigning, for each lexeme, one of the following designations: sentence-break, edu-break, start-parenthetical, end-parenthetical, and none. More generally, determining boundaries in the input text segment may include recognizing sentence boundaries, edu boundaries, parenthetical starts, and parenthetical ends. Examining each lexeme in the input text segment may include associating features with the lexeme based on surrounding context.
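The per-lexeme labeling described above can be sketched as follows. The real segmenter learns its decisions from a training corpus; the hand-written heuristics below are placeholder stand-ins used only to show the label set and the role of surrounding context.

```python
LABELS = {"sentence-break", "edu-break", "start-parenthetical",
          "end-parenthetical", "none"}

def label_lexeme(lexemes, i):
    """Assign one boundary designation to lexemes[i] from its local context."""
    tok = lexemes[i]
    if tok in {".", "!", "?"}:
        return "sentence-break"
    if tok == "(":
        return "start-parenthetical"
    if tok == ")":
        return "end-parenthetical"
    # Placeholder for learned rules: break before a subordinating cue word.
    if tok == "," and i + 1 < len(lexemes) and lexemes[i + 1] in {"because", "although"}:
        return "edu-break"
    return "none"

lexemes = ["He", "left", ",", "because", "it", "rained", "."]
labels = [label_lexeme(lexemes, i) for i in range(len(lexemes))]
```

Here the comma before "because" is labeled an edu boundary and the final period a sentence boundary; all other lexemes receive "none".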
  • [0017]
    In another aspect, generating discourse trees may include segmenting an input text segment into edus, and incrementally building a discourse tree for the input text segment by performing operations on the edus to selectively combine the edus into larger discourse tree units. The incremental building of the discourse tree may be repeated until all of the edus have been combined into a single discourse tree. Moreover, the incremental building of the discourse tree is based on predetermined decision rules, such as automatically learned decision rules generated by analyzing a training set of annotated discourse trees.
  • [0018]
In another aspect, a discourse parsing system may include a plurality of automatically learned decision rules; an input list comprising a plurality of edts, each edt corresponding to an edu of an input text segment; a stack for holding discourse tree segments while a discourse tree for the input text segment is being built; and a plurality of operators for incrementally building the discourse tree for the input text segment by selectively combining the edts into a discourse tree segment according to the plurality of decision rules and moving the discourse tree segment onto the stack. The system may further include a discourse segmenter for partitioning the input text segment into edus and inserting the edus into the input list.
  • [0019]
    One or more of the following advantages may be provided by discourse parsing systems and techniques as described herein. The systems and techniques described here result in a discourse parsing system that uses a set of learned decision rules to automatically determine the underlying discourse structure of any unrestricted text. As a result, the discourse parsing system can be used, among other ways, for constructing discourse trees whose leaves are sentences (or units that can be identified at high levels of performance). Moreover, the time, expense, and inconsistencies associated with manually built discourse tree derivation rules are reduced dramatically.
  • [0020]
    The ability to automatically derive discourse trees is useful not only in its standalone form (e.g., as a tool for linguistic researchers) but also as a component of a larger system, such as a discourse-based machine translation system. Accordingly, the systems and techniques described herein represent an enabling technology for many different applications including text, paragraph or sentence summarization, machine translation, informational retrieval, test scoring and related applications.
  • [0021]
    The rhetorical parsing algorithm described herein implements robust lexical, syntactic and semantic knowledge sources. Moreover, the six reduce operations used by the parsing algorithm, along with the shift operation, are mathematically sufficient to derive the discourse structure of any input text.
  • [0022]
    Text summarization (also referred to as text compression) is the process of taking a longer unit of text (e.g., a long sentence, a paragraph, or an entire treatise) and converting it into a shorter unit of text (e.g., a short sentence or an abstract) referred to as a summary. Automated summarization—that is, using a computer or other automated process to produce a summary—has many applications, for example, in information retrieval, abstracting, automatic test scoring, headline generation, television captioning, and audio scanning services for the blind. FIG. 10 shows a block diagram of an automated summarization process. As shown therein, an input text 1000 is provided to a summarizer 1002, which generates a summary 1004 of the input text 1000. Ideally, whether produced manually or automatically, a summary will capture the most salient aspects of the longer text and present them in a coherent fashion. For example, when humans produce summaries of documents, they do not simply extract sentences, clauses or keywords, and then concatenate them to form a summary. Rather, humans attempt to summarize by rewriting the longer text, for example, by constructing new sentences that are grammatical, that cohere with one another, and that capture the most salient items of information in the original document.
  • [0023]
    Conventional attempts at automated summarization, in contrast, typically have focused on identifying relevant items of information in the text being summarized, extracting text segments (e.g., sentences, clauses or keywords) corresponding to those identified items, and then concatenating together the extracted segments. Moreover, these conventional approaches typically rely on manually generated sets of summarization rules.
  • [0024]
    Development of the summarizing systems and techniques described below was based in part on the recognition (1) that identification, extraction and concatenation of relevant text segments typically will not generate a coherent and/or grammatical summary and/or (2) that manually generated summarization rules are prone to error and inconsistencies, are time-consuming and expensive to generate, and generally result in non-ideal summaries. Accordingly, as described in detail below, automated summarization systems and techniques were developed that can generate a coherent summary of an input text by generating new, grammatical sentences that capture the salient aspects of the input text.
  • [0025]
    Implementations of the disclosed summarization systems and techniques may include various combinations of the following features.
  • [0026]
    In one aspect, a tree structure (e.g., a discourse tree or a syntactic tree) is summarized by generating a set of one or more summarization decision rules (e.g., automatically learned decision rules) based on a training set, and compressing the tree structure by applying the generated set of summarization decision rules to the tree structure. The tree structure to be compressed may be generated by parsing an input text segment such as a clause, a sentence, a paragraph, or a treatise. The compressed tree structure may be converted into a summarized text segment that is grammatical and coherent. Moreover, the summarized text segment may include sentences not present in a text segment from which the pre-compressed tree structure was generated.
  • [0027]
    Applying the generated set of summarization decision rules comprises performing a sequence of modification operations on the tree structure, for example, one or more of a shift operation, a reduce operation, and a drop operation. The reduce operation may combine a plurality of trees into a larger tree, and the drop operation may delete constituents from the tree structure.
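Of the three operations, the drop operation is the easiest to picture: it deletes a constituent from the tree. The sketch below uses a hypothetical `(label, children)` tuple encoding and drops every subtree rooted at a given label; the described system would instead select what to drop via its learned decision rules.

```python
def drop(tree, unwanted_label):
    """Return a copy of tree with all subtrees labeled unwanted_label deleted."""
    label, children = tree
    kept = [drop(child, unwanted_label) for child in children
            if child[0] != unwanted_label]
    return (label, kept)

# (S (NP) (VP (PP)) (PP)) with every PP constituent dropped:
sentence = ("S", [("NP", []), ("VP", [("PP", [])]), ("PP", [])])
compressed = drop(sentence, "PP")
# compressed == ("S", [("NP", []), ("VP", [])])
```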
  • [0028]
    The training set used to generate the decision rules may include pre-generated long/short tree pairs. Generating the set of summarization decision rules comprises iteratively performing one or more tree modification operations on a long tree until the paired short tree is realized. A plurality of long/short tree pairs may be processed to generate a plurality of learning cases. In that case, generating the set of decision rules may include applying a learning algorithm to the plurality of learning cases. Moreover, one or more features may be associated with each of the learning cases to reflect context.
  • [0029]
    In another aspect, a computer-implemented summarization method may include generating a parse tree (e.g., a discourse tree or a syntactic tree) for an input text segment, and iteratively reducing the generated parse tree by selectively eliminating portions of the parse tree. Iterative reduction of the parse tree may be performed based on a plurality of learned decision rules, and may include performing tree modification operations on the parse tree. The tree modification operations may include one or more of the following: a shift operation, a reduce operation (which, for example, combines a plurality of trees into a larger tree), and a drop operation (which, for example, deletes constituents from the tree structure).
  • [0030]
    In another aspect, summarization is accomplished by parsing an input text segment to generate a parse tree (e.g., a discourse tree or a syntactic tree) for the input segment, generating a plurality of potential solutions, applying a statistical model to determine a probability of correctness for each potential solution, and extracting one or more high-probability solutions based on the solutions' respective determined probabilities of correctness. Applying a statistical model may include using a stochastic channel model algorithm that, for example, performs minimal operations on a small tree to create a larger tree. Moreover, using a stochastic channel model algorithm may include probabilistically choosing an expansion template. Generating a plurality of potential solutions may include identifying a forest of potential compressions for the parse tree.
  • [0031]
    The generated parse tree may have one or more nodes, each node having N children (wherein N is an integer). In that case, identifying a forest of potential compressions may include generating 2^N - 1 new nodes, one node for each non-empty subset of the children, and packing the newly generated nodes into a whole. Alternatively, or in addition, identifying a forest of potential compressions may include assigning an expansion-template probability to each node in the forest.
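The 2^N - 1 count comes from enumerating every non-empty subset of a node's children, which can be sketched directly (the child labels below are arbitrary examples, not the system's actual node inventory):

```python
from itertools import combinations

def nonempty_subsets(children):
    """Yield every non-empty subset of children, smallest subsets first."""
    for size in range(1, len(children) + 1):
        for subset in combinations(children, size):
            yield subset

children = ["NP", "VP", "PP"]          # a node with N = 3 children
subsets = list(nonempty_subsets(children))
# len(subsets) == 2**3 - 1 == 7
```

Each subset would become one new node in the compression forest.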
  • [0032]
    Extracting one or more high-probability solutions may include selecting one or more trees based on a combination of each tree's word-bigram and expansion-template score. For example, a list of trees may be selected, one for each possible compression length. The potential solutions may be normalized for compression length. For example, for each potential solution, a log-probability of correctness for the solution may be divided by a length of compression for the solution.
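The normalization step can be sketched as follows; the candidate strings and log-probabilities are made-up illustrative values, not output of the described model:

```python
def normalized_score(log_prob, length):
    """Divide a compression's log-probability by its length in words."""
    return log_prob / length

# (compression, log-probability) pairs, hypothetical numbers:
candidates = [("a b", -8.0), ("a b c d", -12.0)]
best = max(candidates,
           key=lambda c: normalized_score(c[1], len(c[0].split())))
# -12.0 / 4 = -3.0 beats -8.0 / 2 = -4.0, so the longer compression wins
```

Without the division, shorter compressions would almost always score higher simply because they accumulate fewer probability-reducing terms.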
  • [0033]
    One or more of the following advantages may be provided by summarization systems and techniques as described herein.
  • [0034]
    The systems and techniques described here result in a summarization system that can take virtually any longer text segment (sentence, phrase, paragraph or treatise) and compress it into a shorter version that is both grammatical and coherent. In contrast to the conventional “extract and concatenate” summarization techniques, the disclosed summarizer generates new grammatical sentences that more closely resemble summarizations prepared by trained human editors.
  • [0035]
    Moreover, the disclosed summarizer generates summaries automatically, e.g., in a computer-implemented manner. Accordingly, the inconsistencies, errors, time and/or expense typically incurred with conventional approaches that require manual intervention are reduced dramatically.
  • [0036]
    The two different embodiments of the summarizer (channel-based and decision-based) both generate coherent, grammatical results but also potentially provide different advantages. On the one hand, the channel-based summarizer provides multiple different solutions at varying levels of compression. These multiple solutions may be desirable if, for example, the output of the summarizer was being provided to a user (e.g., human or computer process) that could make use of multiple outputs. On the other hand, the decision-based summarizer is deterministic and thus provides a single solution and does so very quickly. Accordingly, depending on the objectives of the user, the decision-based summarizer may be advantageous both for its speed and for its deterministic approach.
  • [0037]
    Moreover, the channel-based summarizer may be advantageous depending on a user's objectives because its performance can be adjusted, or fine-tuned, to a particular application by replacing or adjusting its statistical model. Similarly, performance of the decision-based summarizer can be fine-tuned to a particular application by varying the training corpus used to learn decision rules. For example, a decision-based summarizer could be tailored to summarize text or trees in a specific discipline by selecting a training corpus specific to that discipline.
  • [0038]
    The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
  • DRAWING DESCRIPTIONS
  • [0039]
    The above and other aspects will now be described in detail with reference to the accompanying drawings, wherein:
  • [0040]
    FIG. 1 shows an example of a discourse tree.
  • [0041]
    FIG. 2 is a flowchart of generating a discourse tree for an input text.
  • [0042]
    FIG. 3 is a block diagram of a discourse tree generating system.
  • [0043]
    FIG. 4 shows an example of shift-reduce operations performed in discourse parsing a text.
  • [0044]
    FIG. 5 shows the operational semantics of six reduce operations.
  • [0045]
    FIG. 6 is a flowchart of generating decision rules for a discourse segmenter.
  • [0046]
    FIG. 6A shows examples of automatically derived segmenting rules.
  • [0047]
    FIG. 7 is a graph of a learning curve for a discourse segmenter.
  • [0048]
    FIG. 8 is a flowchart of generating decision rules for a shift-reduce action identifier.
  • [0049]
    FIG. 8A shows examples of automatically derived shift-reduce rules.
  • [0050]
    FIG. 8B shows a result of applying Rule 1 in FIG. 8A on the edts that correspond to the units in text example (5.1).
  • [0051]
    FIG. 8C shows a result of applying Rule 2 in FIG. 8A on the edts that correspond to the units in text example (5.2).
  • [0052]
    FIG. 8D shows an example of a CONTRAST relation that holds between two paragraphs.
  • [0053]
    FIG. 8E shows a result of applying Rule 4 in FIG. 8A on the trees that subsume the two paragraphs in FIG. 8D.
  • [0054]
    FIG. 9 is a graph of a learning curve for a shift-reduce action identifier.
  • [0055]
    FIG. 10 is a block diagram of an automated summarization system.
  • [0056]
    FIG. 11 shows examples of parse (or syntactic) trees.
  • [0057]
    FIG. 12 shows examples of text from a training corpus.
  • [0058]
    FIG. 13 is a graph of adjusted log-probabilities for top-scoring compressions at various compression lengths.
  • [0059]
    FIG. 14 shows an example of incremental tree compression.
  • [0060]
    FIG. 15 shows examples of text compression.
  • [0061]
    FIG. 16 shows examples of summarizations of varying compression lengths.
  • [0062]
    FIG. 17 is a flowchart of a channel-based summarization process.
  • [0063]
    FIG. 18 is a flowchart of a process for training a channel-based summarizer.
  • [0064]
    FIG. 18A shows examples of rules that were learned automatically by the C4.5 program. FIG. 19 is a flowchart of a decision-based summarization process.
  • [0065]
    FIG. 20 is a flowchart of a process for training a decision-based summarizer.
  • DETAILED DESCRIPTION
  • [0066]
    Discourse Parsing
  • [0067]
    As described herein, a decision-based rhetorical parsing system (equivalently, a discourse parsing system) automatically derives the discourse structure of unrestricted texts and incrementally builds corresponding discourse trees based on a set of learned decision rules. The discourse parsing system uses a shift-reduce rhetorical parsing algorithm that learns to construct rhetorical structures of texts from a corpus of discourse-parse action sequences. The rhetorical parsing algorithm implements robust lexical, syntactic and semantic knowledge sources.
  • [0068]
    In one embodiment, the resulting output of the discourse parsing system is a rhetorical tree. This functionality is useful both in its standalone form (e.g., as a tool for linguistic researchers) and as a component of a larger system, such as in a discourse-based machine translation system, as described in Daniel Marcu et al., “The Automatic Translation of Discourse Structures,” Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 9-17, Seattle, Washington (April 29-May 3, 2000), and Daniel Marcu, “The Theory and Practice of Discourse Parsing and Summarization,” The MIT Press (2000), both of which are incorporated herein.
  • [0069]
    FIG. 2 shows a flowchart of a discourse parsing process 200 that generates a discourse tree from an input text. Upon receiving the input text in step 202, the process 200 breaks the text into elementary discourse units, or “edus.” Edus are defined functionally as clauses or clause-like units that are unequivocally the nucleus or satellite of a rhetorical relation that holds between two adjacent spans of text. Further details of edus are discussed below.
  • [0070]
    Next, in step 206, the edus are put into an input list. In step 208, the process 200 uses the input list, a stack, and a set of learned decision rules to perform the shift-reduce rhetorical parsing algorithm, which eventually yields the discourse structure of the text given as input. In step 210, performing the algorithm results in the generation of a discourse tree that corresponds to the input text.
  • [0071]
    FIG. 3 shows a block diagram of a discourse tree generating system 300 that takes in input text 301 and produces discourse tree 305. The system 300 as shown includes two sub-systems: (1) a discourse segmenter 302 that identifies the edus in a text, and (2) a discourse parser 304 (equivalently, a shift-reduce action identifier), which determines how the edus should be assembled into rhetorical structure trees.
  • [0072]
    The discourse segmenter 302, which serves as a front-end to the discourse parser 304, partitions the input text into edus. The discourse segmenter processes an input text one lexeme (word or punctuation mark) at a time and recognizes sentence and edu boundaries and beginnings and ends of parenthetical units.
  • [0073]
    The discourse parser 304 takes in the edus from the segmenter 302 and applies the shift-reduce algorithm to incrementally build the discourse tree 305. As indicated in FIG. 3, in this embodiment, each of the discourse segmenter 302 and the discourse parser 304 performs its operations based on a set of decision rules that were learned from analyzing a training set, as discussed in detail below. An alternative embodiment is possible, however, in which substantially the same results could be achieved using probabilistic rules.
  • [0074]
    Further details of the discourse parser and the parsing process that it performs are provided with reference to FIGS. 4 and 5. Details on generating discourse segmenter decision rules and discourse parser decision rules appear below with reference to FIGS. 6-9. What follows is a description of the training corpus that was used in generating decision rules for the discourse segmenter and for the discourse parser.
  • [0075]
    The Training Corpus
  • [0076]
    The training corpus (equivalently, the training set) used was a body of manually built (i.e., by humans) rhetorical structure trees. This corpus, which included 90 texts that were manually annotated with discourse trees, was used to generate learning cases of how texts should be partitioned into edus and how discourse units and segments should be assembled into discourse trees.
  • [0077]
A corpus of 90 rhetorical structure trees was used, which were built manually using rhetorical relations that were defined informally in the style of Mann et al., “Rhetorical structure theory: Toward a functional theory of text organization,” Text, 8(3):243-281 (1988): 30 trees were built for short personal news stories from the MUC7 co-reference corpus (Hirschman et al., MUC-7 Coreference Task Definition, 1997); 30 trees for scientific texts from the Brown corpus; and 30 trees for editorials from the Wall Street Journal (WSJ). The average number of words for each text was 405 in the MUC corpus, 2029 in the Brown corpus, and 878 in the WSJ corpus. Each MUC text was tagged by three annotators; each Brown and WSJ text was tagged by two annotators.
  • [0078]
The rhetorical structure assigned to each text is a (possibly non-binary) tree whose leaves correspond to elementary discourse units (edus), and whose internal nodes correspond to contiguous text spans. Each internal node is characterized by a rhetorical relation, such as ELABORATION and CONTRAST. Each relation holds between two non-overlapping text spans called NUCLEUS and SATELLITE. (There are a few exceptions to this rule: some relations, such as SEQUENCE and CONTRAST, are multinuclear.) As noted above, the distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer's purpose than the satellite. Each node in the tree is also characterized by a promotion set that denotes the units that are important in the corresponding subtree. The promotion sets of leaf nodes are the leaves themselves. The promotion sets of internal nodes are given by the union of the promotion sets of the immediate nuclei nodes.
  • [0079]
    As noted above, edus are defined functionally as clauses or clause-like units that are unequivocally the NUCLEUS or SATELLITE of a rhetorical relation that holds between two adjacent spans of text. For example, “because of the low atmospheric pressure” in the text (1), below, is not a fully fleshed clause. However, since it is the SATELLITE of an EXPLANATION relation, it is treated as elementary.
  • [0080]
    (1) [Only the midday sun at tropical latitudes is warm enough] [to thaw ice on occasion,] [but any liquid water formed in this way would evaporate almost instantly] [because of the low atmospheric pressure.]
  • [0081]
    Some edus may contain parenthetical units, i.e., embedded units whose deletion does not affect the understanding of the edu to which they belong. For example, the unit shown in italics in text (2), below, is parenthetic.
  • [0082]
    (2) This book, which I have received from John, is the best book that I have read in a while.
  • [0083]
    The annotation process involved assigning edu and parenthetical unit boundaries, assembling edus and spans into discourse trees, and labeling the relations between edus and spans with rhetorical relation names from a taxonomy of 71 relations. No explicit distinction was made between intentional, informational, and textual relations. In addition, two constituency relations were marked that were ubiquitous in the corpus and that often subsumed complex rhetorical constituents. These relations were ATTRIBUTION, which was used to label the relation between a reporting and a reported clause, and APPOSITION. The rhetorical tagging tool used—namely, the RST Annotation Tool downloadable from, and described at:
  • [0084]
    http://www.isi.edu/˜marcu/software.html
  • [0085]
    maintains logs of all tree-construction operations. As a result, in addition to the rhetorical structure of 90 texts, a corpus of logs was created that reflects the way that human judges determine edu and parenthetical unit boundaries. The following two publications—Daniel Marcu, Estibaliz Amorrortu, and Magdalena Romera, “Experiments in Constructing a Corpus of Discourse Trees,” The ACL'99 Workshop on Standards and Tools for Discourse Tagging, Maryland, June 1999; and Daniel Marcu, Magdalena Romera, and Estibaliz Amorrortu, “Experiments in Constructing a Corpus of Discourse Trees: Problems, Annotation Choices, Issues,” The Workshop on Levels of Representation in Discourse, pages 71-78, Edinburgh, Scotland, July 1999—both of which are incorporated by reference, discuss in detail the annotation tool and protocol and assess the inter-judge agreement and the reliability of the annotation.
  • [0086]
    The Discourse Parsing Model
  • [0087]
The discourse parsing process is modeled as a sequence of shift-reduce operations. The input to the parser is an empty stack and an input list that contains a sequence of elementary discourse trees (“edts”), one edt for each edu produced by the discourse segmenter. The status and rhetorical relation associated with each edt are “UNDEFINED”, and the promotion set is given by the corresponding edu. At each step, the parser applies a “Shift” or a “Reduce” operation. Shift operations transfer the first edt of the input list to the top of the stack. Reduce operations pop the two discourse trees located on the top of the stack; combine them into a new tree, updating the statuses, rhetorical relation names, and promotion sets associated with the trees involved in the operation; and push the new tree on the top of the stack.
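The control loop just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the node fields, the action-string format, and the `classify` callback (which stands in for the learned decision rules) are assumptions made for the sketch.

```python
# Minimal sketch of the shift-reduce parsing loop described above.
# `classify` stands in for the learned decision rules; node fields and the
# action-string format are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Tree:
    status: str = "UNDEFINED"           # NUCLEUS, SATELLITE, or UNDEFINED
    relation: str = "UNDEFINED"         # rhetorical relation name
    promotion: frozenset = frozenset()  # units important in this subtree
    children: tuple = ()

def reduce_trees(left, right, action):
    # action looks like "REDUCE-ELABORATION-NS": relation plus nuclearity
    parts = action.split("-")
    relation, nuc = "-".join(parts[1:-1]), parts[-1]
    left.status = "NUCLEUS" if nuc[0] == "N" else "SATELLITE"
    right.status = "NUCLEUS" if nuc[1] == "N" else "SATELLITE"
    # promotion set: union of the promotion sets of the immediate nuclei
    promo = (left.promotion if nuc == "NS" else
             right.promotion if nuc == "SN" else
             left.promotion | right.promotion)
    return Tree("UNDEFINED", relation, promo, (left, right))

def parse(edts, classify):
    stack = []
    while edts or len(stack) > 1:
        action = classify(stack, edts)
        if action == "SHIFT":
            stack.append(edts.pop(0))   # first edt of the input list -> stack
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(reduce_trees(left, right, action))
    return stack[0]                     # discourse tree of the whole text
```

Driven by a scripted action sequence instead of learned rules, the same loop reproduces the reductions illustrated in FIG. 4 (e.g., shifting units 18 and 19 and reducing them with a nucleus-satellite APPOSITION operation).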
  • [0088]
    Assume, for example, that the discourse segmenter partitions a text given as input as shown in text (3) below (only the edus numbered from 12 to 19 are shown):
  • [0089]
(3) . . . [Close parallels between tests and practice tests are common,12] [some educators and researchers say.13] [Test preparation booklets, software and worksheets are a booming publishing subindustry.14] [But some practice products are so similar to the tests themselves that critics say they represent a form of school-sponsored cheating.15]
  • [0090]
[“If they took these preparation booklets into my classroom,16] [I'd have a hard time justifying to my students and parents that it wasn't cheating,”17] [says John Kaminsky,18] [a Traverse City, Mich., teacher who has studied test coaching.19]
  • [0091]
FIG. 4 shows the actions taken by a shift-reduce discourse parser starting with step i. At step i, the stack contains 4 partial discourse trees, which span units [1,11], [12,15], [16,17], and [18], and the input list contains the edts that correspond to units whose numbers are higher than or equal to 19. At step i the parser decides, based on its predetermined decision rules, to perform a Shift operation. As a result, the edt corresponding to unit 19 becomes the top of the stack. At step i+1, the parser performs a “Reduce-Apposition-NS” operation, which combines edts 18 and 19 into a discourse tree whose nucleus is unit 18 and whose satellite is unit 19. The rhetorical relation that holds between units 18 and 19 is APPOSITION. At step i+2, the trees that span over units [16,17] and [18,19] are combined into a larger tree, using a “Reduce-Attribution-NS” operation. As a result, the status of the tree [16,17] becomes “nucleus” and the status of the tree [18,19] becomes “satellite.” The rhetorical relation between the two trees is ATTRIBUTION. At step i+3, the trees at the top of the stack are combined using a “Reduce-Elaboration-NS” operation. The effect of the operation is shown at the bottom of FIG. 4.
  • [0092]
To enable a shift-reduce discourse parser to derive any discourse tree, it is sufficient to implement one Shift operation and six types of Reduce operations, whose operational semantics are shown in FIG. 5. In other words, the Shift operation and the six Reduce operations shown in FIG. 5 are mathematically sufficient to derive the discourse tree of any unrestricted input text.
  • [0093]
    For each possible pair of nuclearity assignments “nucleus-satellite” (ns), “satellite-nucleus” (sn), and “nucleus-nucleus” (nn) there are two possible ways to attach the tree located at position top in the stack to the tree located at position top-1. To create a binary tree whose immediate children are the trees at top and top-1, an operation of type “reduce-ns”, “reduce-sn”, or “reduce-nn” is used. To attach the tree at position top as an extra-child of the tree at top-1, thus creating or modifying a non-binary tree, an operation of type “reduce-below-ns”, “reduce-below-sn”, or “reduce-below-nn” is used. FIG. 5 illustrates how the statuses and promotion sets associated with the trees involved in the reduce operations are affected in each case.
  • [0094]
    Because the labeled data in the training corpus used was relatively sparse, the relations that shared some rhetorical meaning were grouped into clusters of rhetorical similarity. For example, the cluster named “contrast” contained the contrast-like rhetorical relations of ANTITHESIS, CONTRAST, and CONCESSION. The cluster named “evaluation-interpretation” contained the rhetorical relations EVALUATION and INTERPRETATION. And the cluster named “other” contained rhetorical relations such as question-answer, proportion, restatement, and comparison, which were used very seldom in the corpus. The grouping process yielded 17 clusters, each characterized by a generalized rhetorical relation name. These names are as follows: APPOSITION-PARENTHETICAL, ATTRIBUTION, CONTRAST, BACKGROUND-CIRCUMSTANCE, CAUSE-REASON-EXPLANATION, CONDITION, ELABORATION, EVALUATION-INTERPRETATION, EVIDENCE, EXAMPLE, MANNER-MEANS, ALTERNATIVE, PURPOSE, TEMPORAL, LIST, TEXTUAL, and OTHER.
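The grouping above can be represented as a simple relation-to-cluster mapping. The fragment below lists only the relations named in this paragraph; the full 71-relation taxonomy is not reproduced here, and names are written as illustrative strings.

```python
# Fragment of the relation -> cluster mapping described above (illustrative;
# only the relations explicitly named in the text are shown).
CLUSTER_OF = {
    "ANTITHESIS": "contrast",
    "CONTRAST": "contrast",
    "CONCESSION": "contrast",
    "EVALUATION": "evaluation-interpretation",
    "INTERPRETATION": "evaluation-interpretation",
    "QUESTION-ANSWER": "other",
    "PROPORTION": "other",
    "RESTATEMENT": "other",
    "COMPARISON": "other",
}

def generalized_name(relation):
    """Cluster name for a relation, or None if it is not in this fragment."""
    return CLUSTER_OF.get(relation.upper())
```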
  • [0095]
    If a sufficiently large number of texts were labeled manually, however, the clustering described above would be unnecessary.
  • [0096]
In developing the discourse parser, one design parameter was to automatically derive rhetorical structure trees that were labeled with relation names that corresponded to the 17 clusters of rhetorical similarity. Since there are 6 types of reduce operations and since each discourse tree uses relation names that correspond to the 17 clusters of rhetorical similarity, it follows that the discourse parser needs to learn what operation to choose from a set of 6×17+1=103 operations (the 1 corresponds to the SHIFT operation).
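The 103-operation action space can be enumerated directly; the string format below is an illustrative assumption, chosen to echo the operation names used elsewhere in this description.

```python
# The action space described above: one SHIFT plus 6 reduce types crossed
# with the 17 cluster names (string format is illustrative).
REDUCE_TYPES = ["NS", "SN", "NN", "BELOW-NS", "BELOW-SN", "BELOW-NN"]
CLUSTERS = [
    "APPOSITION-PARENTHETICAL", "ATTRIBUTION", "CONTRAST",
    "BACKGROUND-CIRCUMSTANCE", "CAUSE-REASON-EXPLANATION", "CONDITION",
    "ELABORATION", "EVALUATION-INTERPRETATION", "EVIDENCE", "EXAMPLE",
    "MANNER-MEANS", "ALTERNATIVE", "PURPOSE", "TEMPORAL", "LIST",
    "TEXTUAL", "OTHER",
]
ACTIONS = ["SHIFT"] + [
    f"REDUCE-{cluster}-{kind}" for kind in REDUCE_TYPES for cluster in CLUSTERS
]
```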
  • [0097]
    The Discourse Segmenter
  • [0098]
FIG. 6 is a flowchart of a generalized process 600 for generating decision rules for the discourse segmenter. The first step in the process was to build, or otherwise obtain, the training corpus. As discussed above, this corpus was built manually using an annotation tool. In general, human annotators looked at text segments and, for each lexeme (word or punctuation mark), determined whether an edu boundary existed at the lexeme under consideration, marking it with a segment break if so.
  • [0099]
Next, in step 604, for each lexeme, a set of one or more features was associated to each of the edu boundary decisions, based on the context in which these decisions were made. The result of such association is a set of learning cases—essentially, discrete instances that capture the edu-boundary decision-making process for a particular lexeme in a particular context. More specifically, the leaves of the discourse trees that were built manually were used in order to derive the learning cases. To each lexeme in a text, one learning case was associated using the features described below. The classes to be learned, which are associated with each lexeme, are “sentence-break”, “edu-break”, “start-paren”, “end-paren”, and “none”. Further details of the features used in step 604 for learning follow.
  • [0100]
To partition a text into edus and to detect parenthetical unit boundaries, features that model both the local and global contexts were relied on. The local context consists of a window of size 5 (1+2+2) that enumerates the Part-Of-Speech (POS) tags of the lexeme under scrutiny and of the two lexemes found immediately before and after it. The POS tags are determined automatically, using the “Brill Tagger,” as described in Eric Brill, “Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging,” Computational Linguistics, 21(4):543-565 (1995), which is incorporated by reference. Because discourse markers, such as “because” and “and”, typically play a major role in rhetorical parsing, a list of features that specify whether a lexeme found within the local contextual window is a potential discourse marker was also considered; hence, for each lexeme under scrutiny, it is specified whether it is a special orthographical marker, such as a comma, dash, or parenthesis, or whether it is a potential discourse marker, such as “accordingly,” “afterwards,” and “and.” The local context also contains features that estimate whether the lexemes within the window are potential abbreviations. In this regard, a hard-coded list of 250 potential abbreviations can be used.
  • [0101]
The global context reflects features that pertain to the boundary identification process. These features specify whether there are any commas, closed parentheses, and dashes before the estimated end of the sentence, whether there are any verbs in the unit under consideration, and whether any discourse marker that introduces expectations was used in the sentence under consideration. These markers include phrases such as “Although” and “With.”
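The local-context portion of this feature extraction can be sketched as follows. The marker and abbreviation sets below are tiny stand-ins for the actual lists (including the 250-item abbreviation list), and the global-context features are omitted; this is an illustration, not the patented feature set.

```python
# Sketch of the segmenter's local-context features over a 5-lexeme window
# (illustrative; marker and abbreviation sets stand in for the real lists,
# and the global-context features are omitted).
DISCOURSE_MARKERS = {"accordingly", "afterwards", "and", "because"}
ORTHO_MARKERS = {",", "-", "(", ")"}
ABBREVIATIONS = {"mr.", "ms.", "dr.", "inc."}  # stands in for the 250-item list

def local_features(lexemes, pos_tags, i):
    """Features for lexemes[i]: POS tags and marker/abbreviation flags
    for the lexeme under scrutiny and the two lexemes before and after it."""
    feats = {}
    for off in range(-2, 3):
        j = i + off
        in_range = 0 <= j < len(lexemes)
        feats[f"pos[{off:+d}]"] = pos_tags[j] if in_range else "NONE"
        word = lexemes[j].lower() if in_range else ""
        feats[f"ortho[{off:+d}]"] = word in ORTHO_MARKERS
        feats[f"marker[{off:+d}]"] = word in DISCOURSE_MARKERS
        feats[f"abbrev[{off:+d}]"] = word in ABBREVIATIONS
    return feats
```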
  • [0102]
    The decision-based segmenter uses a total of twenty-five features, some of which can take as many as 400 values. When we represent these features in a binary format, we obtain learning examples with 2417 binary features/example.
  • [0103]
In step 606, a learning algorithm, such as the C4.5 algorithm described in J. Ross Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann Publishers (1993), was applied to learn a set of decision rules from the learning cases. The result is a set of discourse segmenter decision rules 608 that collectively define whether a previously unseen lexeme, given its particular context, represents an edu boundary in the text segment under consideration.
  • [0104]
FIG. 6A shows some of the rules that were learned by the C4.5 program using a binary representation of the features and learning cases extracted from the MUC corpus. Rule 1 specifies that if the POS tag of the lexeme that immediately precedes the lexeme under scrutiny is a closed parenthesis and the previous marker recognized during the processing of the current sentence was an open parenthesis, then the action to be taken is to insert an end of parenthetic unit. Rule 1 can correctly identify the end of the parenthetic unit at the location marked with the symbol ↑ in sentence (4.1) below.
  • [0105]
    (4.1) Surface temperatures typically average about −60 degrees Celsius (−76 degrees Fahrenheit) ↑ at the equator.
  • [0106]
Rule 2 can correctly identify the beginning of the parenthetic unit “44 years old” in sentence 4.2 because the unit is preceded by a comma and starts with a numeral (CD) followed by a plural noun (NNS).
  • [0107]
    (4.2) Ms. Washington, 44 years old, would be the first woman and the first black to head the five-member commission that oversees the securities markets.
  • [0108]
Rule 3 identifies the end of a sentence after the occurrence of a DOT (period, question mark, or exclamation mark) that is not preceded or followed by another DOT and that is not followed by a DOUBLEQUOTE. This rule will correctly identify the sentence end after the period in example 4.3, but will not insert a sentence end after the period in example 4.4. However, another rule that is derived automatically will insert a sentence break after the double quote that follows the ↑ mark in example 4.4.
  • [0109]
    (4.3) The meeting went far beyond Mr. Clinton's normal weekly gathering of business leaders. ↑ Economic advisor Gene Sperling described it as “a true full-court press” to pass the deficit-reduction bill, the final version of which is now being hammered out by House and Senate Negotiators.
  • [0110]
(4.4) The executives “are here, just as I am, not because anyone agrees with every last line and jot and tittle of this economic program,” Mr. Clinton acknowledged, but “because it does far more good than harm.↑” Despite resistance from some lawmakers in his own party, the president predicted the bill would pass.
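Rule 3 above can be sketched directly over a lexeme sequence. This is an illustration only; the learned rule actually operates on the binary feature representation, and the lexeme encoding below is an assumption.

```python
# Sketch of Rule 3: a sentence break after a DOT that is not adjacent to
# another DOT and is not followed by a double quote (illustrative only).
DOTS = {".", "?", "!"}

def rule3_sentence_break(lexemes, i):
    """True if a sentence break should be inserted after lexemes[i]."""
    if lexemes[i] not in DOTS:
        return False
    prev = lexemes[i - 1] if i > 0 else None
    nxt = lexemes[i + 1] if i + 1 < len(lexemes) else None
    return prev not in DOTS and nxt not in DOTS and nxt != '"'
```

On example 4.3 the period after “leaders” triggers a break; on example 4.4 the period before the closing double quote does not, matching the behavior described above.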
  • [0111]
Rule 4 identifies an edu boundary before the occurrence of an “and” followed by a verb in the past tense (VBD). This rule will correctly identify the marked edu boundary in sentence 4.5.
  • [0112]
    (4.5) Ashley Boone ran marketing and distribution ↑ and left the company late last year.
  • [0113]
Rule 5 inserts edu boundaries before the occurrence of the word “until”, provided that “until” is followed, though not necessarily immediately, by a verb. This rule will correctly insert an edu boundary in example 4.6.
  • [0114]
(4.6) Several appointees of President Bush are likely to stay in office at least temporarily,↑ until permanent successors can be named.
  • [0115]
    Rule 6 is an automatically derived rule that mirrors the manually derived rule specific to COMMA-like actions in the surface-based unit identification algorithm. Rule 6 will correctly insert an edu boundary after the comma marked in example 4.7, because the marker “While” was used at the beginning of the sentence.
  • [0116]
    (4.7) While the company hasn't commented on the probe,↑ persons close to the board said that Messrs. Lavin and Young, along with some other top Woolworth executives were under investigation by the special committee for their possible involvement in the alleged irregularities.
  • [0117]
    Rule 7 specifies that no elementary or parenthetical unit boundary should be inserted immediately before a DOT.
  • [0118]
As one can notice, the rules in FIG. 6A are more complex than typical manually derived rules. The automatically derived rules make use not only of orthographic and cue-phrase-specific information, but also of syntactic information, which is encoded as part-of-speech tags.
  • [0119]
In step 606, the C4.5 program was used in order to learn decision trees and rules that classify lexemes as boundaries of sentences, edus, or parenthetical units, or as non-boundaries. Learning was accomplished both from binary representations (when possible) and non-binary representations of the cases. (Learning from binary representations of features in the Brown corpus was too computationally expensive to terminate; the Brown data file had about 0.5 gigabytes.) In general, the binary representations yielded slightly better results than the non-binary representations, and the tree classifiers were slightly better than the rule-based ones.
  • [0120]
    Table 1 shows accuracy results of non-binary, decision-tree classifiers. The accuracy figures were computed using a ten-fold cross-validation procedure. In Table 1, B1 corresponds to a majority-based baseline classifier that assigns the class “none” to all lexemes, and B2 to a baseline classifier that assigns a sentence boundary to every “DOT” (that is, a period (.), question mark (?), and/or exclamation point (!)) lexeme and a non-boundary to all other lexemes.
    TABLE 1
    Performance of a discourse segmenter that uses
    a decision-tree, non-binary classifier.
    Corpus # cases B1 (%) B2 (%) Acc (%)
    MUC 14362 91.28 93.1 96.24 ± 0.06
    WSJ 31309 92.39 94.6 97.14 ± 0.10
    Brown 72092 93.84 69.8 97.87 ± 0.04
  • [0121]
FIG. 7 shows the learning curve that corresponds to the MUC corpus. It suggests that more data can increase the accuracy of the classifier.
  • [0122]
    The confusion matrix shown in Table 2 corresponds to a non-binary-based tree classifier that was trained on cases derived from 27 Brown texts and that was tested on cases derived from 3 different Brown texts, which were selected randomly. The matrix shows that the segmenter encountered some difficulty with identifying the beginning of parenthetical units and the intra-sentential edu boundaries; for example, it correctly identifies 133 of the 220 edu boundaries. The performance is high with respect to recognizing sentence boundaries and ends of parenthetical units. The performance with respect to identifying sentence boundaries appears to be close to that of systems aimed at identifying “only” sentence boundaries, such as described in David D. Palmer and Marti A. Hearst, “Adaptive multilingual sentence boundary disambiguation,” Computational Linguistics, 23(2):241-269 (1997) (hereinafter, “Hearst (1997)”), whose accuracy is in the range of 99%.
    TABLE 2
    Confusion matrix for the decision-tree,
    non-binary classifier (the Brown corpus).
    Action (a) (b) (c) (d) (e)
    sentence-break (a) 272 4
    edu-break (b) 133 3 84
    start-paren (c) 4 26
    end-paren (d) 20 6
    none (e) 2 38 1 4 7555
  • [0123]
    Training the Discourse Parser
  • [0124]
FIG. 8 shows a generalized flowchart for a process 800 for generating decision rules for the discourse parser. Put another way, the process 800 can be used to train the discourse parser about when, under what circumstances, and in what sequence it should perform the various shift-reduce operations.
  • [0125]
    In step 802, the process receives as input the training corpus of discourse trees and, for each discourse tree, a set of edus from the discourse segmenter. Next, in step 804, for each discourse tree/edu set, the process 800 determines a sequence of shift-reduce operations that reconstructs the discourse tree from the edus in that tree's corresponding set. Next, in step 806, the process 800 associates features with each entry in each sequence. Finally, in step 808, the process 800 applies a learning algorithm (e.g., C4.5) to generate decision rules 810 for the discourse parser. As noted above, the discourse parser will then be able to use these decision rules 810 to determine the rhetorical structure for any input text and, from it, generate a discourse tree as output.
  • [0126]
    Additional details of training the discourse parser follow.
  • [0127]
    Shift-Reduce Action Identifier: Generation of Learning Examples
  • [0128]
The learning cases were generated automatically, in the style of Magerman, “Statistical decision-tree models for parsing,” Proceedings of ACL'95, pages 276-283 (1995), by traversing in-order the final rhetorical structures built by annotators and by generating a sequence of discourse parse actions that used only SHIFT and REDUCE operations of the kinds discussed above. When a derived sequence is applied as described above with respect to the parsing model, it produces a rhetorical tree that is a one-to-one copy of the original tree that was used to generate the sequence. For example, the tree at the bottom of FIG. 4—the tree found at the top of the stack at step i+4—can be built if the following sequence of operations is performed: {SHIFT 12; SHIFT 13; REDUCE-ATTRIBUTION-NS; SHIFT 14; REDUCE-JOINT-NN; SHIFT 15; REDUCE-CONTRAST-SN; SHIFT 16; SHIFT 17; REDUCE-CONDITION-SN; SHIFT 18; SHIFT 19; REDUCE-APPOSITION-NS; REDUCE-ATTRIBUTION-NS; REDUCE-ELABORATION-NS.}
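The derivation of such a sequence from an annotated binary tree can be sketched as a traversal that emits a SHIFT for every leaf and a REDUCE once both children of an internal node have been processed. The node fields below are assumptions made for illustration; this is one way to realize the traversal described above, not the patented implementation.

```python
# Sketch of deriving a shift-reduce operation sequence from an annotated
# binary tree: SHIFT at each leaf edu, REDUCE once both children are
# processed (node fields are illustrative assumptions).
def operation_sequence(node, ops=None):
    ops = [] if ops is None else ops
    if "children" not in node:                   # a leaf: one edu
        ops.append(f"SHIFT {node['unit']}")
    else:
        left, right = node["children"]
        operation_sequence(left, ops)
        operation_sequence(right, ops)
        ops.append(f"REDUCE-{node['relation']}-{node['nuclearity']}")
    return ops
```

Replaying the emitted sequence with the parsing model reproduces the original tree, which is how each operation in the sequence becomes one learning case.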
  • [0129]
    The Shift-Reduce Action Identifier: Features used for Learning
  • [0130]
To make decisions with respect to parsing actions, the shift-reduce action identifier focuses on the three topmost trees in the stack and the first edt in the input list. These trees are referred to as the trees “in focus.” The identifier relies on the following classes of features: structural features, lexical (cue-phrase-like) features, operational features, and semantic-similarity-based features. Each is described in turn.
  • [0131]
    Structural Features.
  • [0132]
    Structural features include the following:
  • [0133]
    (1) Features that reflect the number of trees in the stack and the number of edts in the input list.
  • [0134]
    (2) Features that describe the structure of the trees in focus in terms of the type of textual units that they subsume (sentences, paragraphs, titles). These may include the number of immediate children of the root nodes, the rhetorical relations that link the immediate children of the root nodes, and the like. The identifier assumes that each sentence break that ends in a period and is followed by two ‘\n’ characters, for example, is a paragraph break; and that a sentence break that does not end in a punctuation mark and is followed by two ‘\n’ characters is a title.
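The paragraph/title assumption in item (2) can be sketched as a small classifier over a sentence that is known to be followed by two ‘\n’ characters. This is an illustration of the stated heuristic only, not the full structural feature set.

```python
# Sketch of the heuristic above for a sentence followed by two '\n'
# characters: ends in a period -> paragraph break; no final punctuation
# mark -> title (illustrative only).
def break_kind(sentence):
    s = sentence.rstrip()
    if s.endswith("."):
        return "paragraph-break"
    if not s or s[-1] not in ".!?":
        return "title"
    return "sentence-break"
```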
  • [0135]
    Lexical (Cue-Phrase-Like) and Syntactic Features.
  • [0136]
    Lexical features include the following:
  • [0137]
    (1) Features that denote the actual words and POS tags of the first and last two lexemes of the text spans subsumed by the trees in focus.
  • [0138]
    (2) Features that denote whether the first and last units of the trees in focus contain potential discourse markers and the position of these markers in the corresponding textual units (beginning, middle, or end).
  • [0139]
    Operational Features.
  • [0140]
Operational features include features that specify what the last five parsing operations performed by the parser were. These features could be generated because, for learning, sequences of shift-reduce operations were used rather than discourse trees.
  • [0141]
    Semantic-Similarity-Based Features.
  • [0142]
    Semantic-similarity-based features include the following:
  • [0143]
(1) Features that denote the semantic similarity between the textual segments subsumed by the trees in focus. This similarity is computed by applying, in the style of Hearst (1997), a cosine-based metric on the morphed segments. If two segments S1 and S2 are represented as sequences of (t, w(t)) pairs, where t is a token and w(t) is its weight, the similarity between the segments can be computed using the formula shown below, where w(t)S1 and w(t)S2 represent the weights of token t in segments S1 and S2 respectively.

sim(S_1, S_2) = \frac{\sum_{t \in S_1 \cap S_2} w(t)_{S_1} \, w(t)_{S_2}}{\sqrt{\sum_{t \in S_1} w(t)_{S_1}^2 \cdot \sum_{t \in S_2} w(t)_{S_2}^2}}
  • [0144]
    The weights of tokens are given by their frequencies in the segments.
  • [0145]
(2) Features that denote Wordnet-based measures of similarity between the bags of words in the promotion sets of the trees in focus. Fourteen Wordnet-based measures of similarity were used, one for each Wordnet relation (Fellbaum, Wordnet: An Electronic Lexical Database, The MIT Press, 1998). Each of these similarities is computed using a metric similar to the cosine-based metric. Wordnet-based similarities reflect the degree of synonymy, antonymy, meronymy, hyponymy, and the like between the textual segments subsumed by the trees in focus. The Wordnet-based similarities are computed over the tokens that are found in the promotion units associated with each segment. If the words in the promotion units of two segments S1 and S2 are represented as two sequences W1 and W2, the Wordnet-based similarities between the two segments can be computed using the formula shown below, where the function σ_R(w_1, w_2) returns 1 if there exists a Wordnet relation of type R between the words w_1 and w_2, and 0 otherwise.

sim_R(W_1, W_2) = \frac{\sum_{w_1 \in W_1, \, w_2 \in W_2} \sigma_R(w_1, w_2)}{|W_1| \times |W_2|}
  • [0146]
    The Wordnet-based similarity function takes values in the interval [0,1]: the larger the value, the more similar with respect to a given Wordnet relation the two segments are.
  • [0147]
In addition to these features that modeled the Wordnet-based similarities of the trees in focus, 14×13/2=91 relative Wordnet-based measures of similarity were used, one for each possible pair of Wordnet-based relations. For each pair of Wordnet-based measures of similarity wR1 and wR2, each relative measure (feature) takes the value <, =, or >, depending on whether the Wordnet-based similarity wR1 between the bags of words in the promotion sets of the trees in focus is lower than, equal to, or higher than the Wordnet-based similarity wR2 between the same bags of words. For example, if both the synonymy- and meronymy-based measures of similarity are 0, the relative similarity between the synonymy and meronymy of the trees in focus will have the value “=”.
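Both the per-relation similarity and the relative (<, =, >) features can be sketched as follows. The `related` predicate stands in for an actual Wordnet relation lookup, which is not reproduced here.

```python
# Sketch of the Wordnet-relation similarity above and of the relative
# features; `related` stands in for a real Wordnet relation lookup.
def relation_similarity(words1, words2, related):
    """Fraction of cross pairs (w1, w2) linked by the relation: in [0, 1]."""
    if not words1 or not words2:
        return 0.0
    hits = sum(1 for w1 in words1 for w2 in words2 if related(w1, w2))
    return hits / (len(words1) * len(words2))

def relative_feature(sim_r1, sim_r2):
    """'<', '=', or '>' comparing two relation-specific similarities."""
    return "<" if sim_r1 < sim_r2 else ">" if sim_r1 > sim_r2 else "="
```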
  • [0148]
    A binary representation of these features yields learning examples with 2789 features/example.
  • [0149]
Examples of Rules Specific to the Action Identifier
  • [0150]
FIG. 8A shows some of the rules that were learned by the C4.5 program using a binary representation of the features and learning cases extracted from the MUC corpus. Rule 1, which is similar to a typical rule derived manually, specifies that if the last lexeme in the tree at position top-1 in the stack is a comma and there is a marker “if” that occurs at the beginning of the text that corresponds to the same tree, then the trees at positions top-1 and top should be reduced using a REDUCE-CONDITION-SN operation. This operation will make the tree at position top-1 the satellite of the tree at position top. If the edt at position top-1 in the stack subsumes unit 1 in example 5.1 and the edt at position top subsumes unit 2, this reduce action will correctly replace the two edts with a new rhetorical tree, the one shown in FIG. 8B.
  • [0151]
    (5.1) [If you refer to someone as a butt-head,1] [ordinarily speaking, no one is going to take that as any specific charge of any improper conduct or insinuation of any character trait.2]
  • [0152]
Rule 2 makes the tree at the top of the stack the BACKGROUND-CIRCUMSTANCE satellite of the tree at position top-1 when the first word in the text subsumed by the top tree is “when”, which is a wh-adverb (WRB), when the second word in the same text is not a gerund or present participle verb (VBG), and when the cosine-based similarity between the text subsumed by the top node in the stack and the first unit in the list of elementary discourse units that have not been shifted to the stack is greater than 0.0793052. If the edt at position top-1 in the stack subsumes unit 1 in example 5.2 and the edt at position top subsumes unit 2, rule 2 will correctly replace the two edts with the rhetorical tree shown in FIG. 8C.
  • [0153]
    (5.2) [Mrs. Graham, 76 years old, has not been involved in day-to-day operations at the company since May 1991.1] [when Mr. Graham assumed the chief executive officer's title.2 ]
  • [0154]
Rule 3 specifies that in case the last word in the text subsumed by the tree at position top-1 in the stack is a plural noun (NNS), the first word in the text subsumed by the tree at the top of the stack is a preposition or subordinating conjunction (IN), and the hyponymy-based similarity between the two trees at the top of the stack is equal to their synonymy-based similarity, the action to be applied is REDUCE-BACKGROUND-CIRCUMSTANCE-NS. When this rule is applied in conjunction with the edts that correspond to the units marked in 5.3, the resulting tree has the same shape as the tree shown in FIG. 8C.
  • [0155]
    (5.3) [In an April 7 Wall Street Journal article, several experts suggested that IBM's accounting grew much more liberal since the mid-1980s1] [as its business turned sour.2]
  • [0156]
    When the tree at the top of the stack subsumes a paragraph and starts with the marker “but”, the action to be applied is REDUCE-CONTRAST-NN. For example, if the trees at the top of the stack subsume the paragraphs shown in FIG. 8D and are characterized by promotion sets P1 and P2, as a result of applying rule 4 in FIG. 8A, one would obtain a new tree, whose shape is shown in FIG. 8E; the promotion units of the root node of this tree are given by the union of the promotion units of the child nodes.
  • [0157]
The last rule in FIG. 8A reflects the fact that each text in the MUC corpus is characterized by a title. When there are no units left in the input list (noUnitsInList=0) and a tree that subsumes the whole text has been built (noTreesInStack<=2), the two trees that are left in the stack—the one that corresponds to the title and the one that corresponds to the text—are reduced using a REDUCE-TEXTUAL-NN operation.
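The shift-reduce scheme that these rules drive can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the action-selection policy, the tuple-based tree encoding, and the relation names used in the toy policy are placeholder assumptions.

```python
# Minimal sketch of the shift-reduce discourse-parsing loop: elementary
# discourse units are shifted from an input list onto a stack, and
# REDUCE actions combine the top two trees under a rhetorical relation.

def parse(edus, choose_action):
    """Build a discourse tree from a list of elementary discourse units.

    `choose_action` inspects the stack and remaining input and returns
    either ("SHIFT",) or ("REDUCE", relation, nuclearity)."""
    stack, inp = [], list(edus)
    while inp or len(stack) > 1:
        action = choose_action(stack, inp)
        if action[0] == "SHIFT":
            stack.append(inp.pop(0))          # move the next edu onto the stack
        else:                                  # REDUCE-<relation>-<nuclearity>
            right = stack.pop()
            left = stack.pop()
            _, relation, nuclearity = action
            stack.append((relation, nuclearity, left, right))
    return stack[0]

# Toy policy standing in for the learned classifier: shift everything,
# then reduce with a single relation.
def toy_policy(stack, inp):
    if inp:
        return ("SHIFT",)
    return ("REDUCE", "BACKGROUND-CIRCUMSTANCE", "NS")

tree = parse(["unit1", "unit2"], toy_policy)
```

In the full system, `choose_action` is the learned shift-reduce action identifier described above.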
  • [0158]
    Evaluation of Shift-Reduce-Action Identifier
  • [0159]
    Table 3 below displays the accuracy of the shift-reduce action identifiers, determined for each of the three corpora (MUC, Brown, WSJ) by means of a ten-fold cross-validation procedure. In table 3, the B3 column gives the accuracy of a majority-based classifier, which chooses action SHIFT in all cases. Since choosing only the action SHIFT never produces a discourse tree, column B4 presents the accuracy of a baseline classifier that chooses shift-reduce operations randomly, with probabilities that reflect the probability distribution of the operations in each corpus.
    TABLE 3
    Performance of the tree-based, shift-reduce action classifiers.
    Corpus # cases B3 (%) B4 (%) Acc (%)
    MUC 1996 50.75 26.9 61.12 ± 1.61
    WSJ 4360 50.34 27.3 61.65 ± 0.41
    Brown 8242 50.18 28.1 61.81 ± 0.48
  • [0160]
[0160]FIG. 9 shows the learning curve that corresponds to the MUC corpus. As in the case of the discourse segmenter, this learning curve also suggests that more data can increase the accuracy of the shift-reduce action identifier.

Evaluation of the Rhetorical Parser

By applying the two classifiers sequentially, one can derive the rhetorical structure of any text. The performance results presented above suggest how well the discourse segmenter and the shift-reduce action identifier perform with respect to individual cases, but provide no information about the performance of a rhetorical parser that relies on these classifiers.
    TABLE 4
    Performance of the rhetorical parser:
    labeled (R)ecall and (P)recision. The segmenter
    is either Decision-Tree-Based (DT) or Manual (M).
    Elementary units Hierarchical spans Span nuclearity Rhetorical relations
    Seg- Training Judges Parser Judges Parser Judges Parser Judges Parser
    Corpus menter corpus R P R P R P R P R P R P R P R P
MUC DT MUC 88.0 88.0 100.0 84.4 84.4 38.2 61.0 79.1 83.5 25.5 51.5 78.6 78.6 14.9 28.7
DT All  37.1 70.9 72.8 58.3 68.9 38.4 45.3
M MUC  96.9 87.5 82.3 68.8 78.2 72.4 62.8
M All  75.4 100.0 84.8 73.5 71.0 69.3 66.5 53.9
100.0 100.0
100.0
WSJ DT WSJ 85.1 86.8 79.9 80.1 34.0 65.8 67.6 77.1 21.6 54.0 73.1 73.3 13.0 34.3
DT All  18.1  95.8 40.1 66.3 30.3 58.5 17.3 36.0
M WSJ 83.4 84.2 63.7 79.9 56.3 57.9
M All  25.1  79.6 83.0 85.0 69.0 82.4 59.8 63.2
100.0 100.0
100.0 100.0
Brown DT Brown 89.5 88.5 80.6 79.5 57.3 63.3 67.6 75.8 44.6 57.3 69.7 68.3 26.7 35.3
DT All  60.5  79.4 44.7 59.1 33.2 51.8 15.7 25.7
M Brown 88.1 73.4 60.1 67.0 59.5 45.5
M All  44.2  80.3 80.8 77.5 60.0 72.0 51.8 44.7
100.0 100.0
100.0 100.0
  • [0161]
In order to evaluate the rhetorical parser as a whole, each corpus was partitioned randomly into two sets of texts: 27 texts were used for training and the remaining 3 texts were used for testing. The evaluation employs “labeled recall” and “labeled precision” measures, which are extensively used to study the performance of syntactic parsers. “Labeled recall” reflects the number of correctly labeled constituents identified by the rhetorical parser with respect to the number of labeled constituents in the corresponding manually built tree. “Labeled precision” reflects the number of correctly labeled constituents identified by the rhetorical parser with respect to the total number of labeled constituents identified by the parser.
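The two measures just defined can be computed as follows. This is a minimal sketch; representing a labeled constituent as a hashable (span, label) tuple is an assumption made for illustration.

```python
# Labeled recall and precision over constituents: recall is the fraction
# of reference constituents the parser recovered; precision is the
# fraction of the parser's constituents that are correct.

def labeled_scores(parser_constituents, reference_constituents):
    parser, ref = set(parser_constituents), set(reference_constituents)
    correct = len(parser & ref)
    recall = correct / len(ref) if ref else 0.0
    precision = correct / len(parser) if parser else 0.0
    return recall, precision

r, p = labeled_scores(
    {((0, 2), "CONTRAST"), ((0, 1), "SPAN")},
    {((0, 2), "CONTRAST"), ((1, 2), "ELABORATION")},
)
# one of two reference constituents is found, and one of the parser's
# two constituents is correct, so both scores are 0.5
```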
  • [0162]
    Labeled recall and precision figures were computed with respect to the ability of the discourse parser to identify elementary units, hierarchical text spans, text span nuclei and satellites, and rhetorical relations. Table 4 displays results obtained using segmenters and shift-reduce action identifiers that were trained either on 27 texts from each corpus and tested on 3 unseen texts from the same corpus; or that were trained on 27×3 texts from all corpora and tested on 3 unseen texts from each corpus. The training and test texts were chosen randomly. Table 4 also displays results obtained using a manual discourse segmenter, which identified correctly all edus. Since all texts in the corpora were manually annotated by multiple judges, an upper-bound of the performance of the rhetorical parser was computed by calculating, for each text in the test corpus and each judge, the average labeled recall and precision figures with respect to the discourse trees built by the other judges. Table 4 displays these upper-bound figures as well.
  • [0163]
The results in table 4 primarily show that errors in the discourse segmentation stage significantly affect the quality of the trees the parser builds. When a segmenter is trained only on 27 texts (especially for the MUC and WSJ corpora, which have shorter texts than the Brown corpus), it has very low performance. Many of the intra-sentential edu boundaries are not identified, and as a consequence, the overall performance of the parser is low. When the segmenter is trained on 27×3 texts, its performance increases significantly with respect to the MUC and WSJ corpora, but decreases with respect to the Brown corpus. This can be explained by the significant differences in style and discourse marker usage between the three corpora. When a perfect segmenter is used, the rhetorical parser determines hierarchical constituents and assigns them a nuclearity status at levels of performance that are not far from those of humans. However, the rhetorical labeling of discourse spans even in this case is about 15-20% below human performance. These results suggest that the features used are sufficient for determining the hierarchical structure of texts and the nuclearity statuses of discourse segments.
  • [0165]
    Alternative embodiments of the discourse parser and its parsing procedure are possible. For example, probabilities could be incorporated into the process that builds the discourse trees. Alternatively, or in addition, multiple trees could be derived in parallel and the best one selected in the end. In the current embodiment, the final discourse tree is generated in a sequence of deterministic steps with no recursion or branching. Alternatively, it is possible to associate a probability with each individual step and build the discourse tree of a text by exploring multiple alternatives at the same time. The probability of a discourse tree is given by the product of the probabilities of all steps that led to the derivation of that tree. In such a case, the discourse tree of a text will be taken to be the resulting tree of maximum probability. An advantage of such an approach is that it enables the creation of multiple trees, each one having associated a probability.
  • [0166]
    Summarization
  • [0167]
    Various summarizing systems and techniques are described in detail below. In general, two different embodiments of a summarizer are described. First, a “channel-based” summarizer that uses a probabilistic approach for summarization (equivalently, compression) is described, and second, a “decision-based” summarizer that uses learned decision rules for summarization is described.
  • [0168]
    Channel-Based Summarizer
  • [0169]
This section describes a probabilistic approach to the compression problem. In particular, a “noisy channel” framework is used. In this framework, a long text string is regarded as (1) originally being a short string, to which (2) someone added some additional, optional text. Compression is a matter of identifying the original short string. It is not critical whether the “original” string is real or hypothetical. For example, in statistical machine translation, a French string could be regarded as originally being in English, but having noise added to it. The French may or may not have been translated from English originally, but by removing the noise, one can hypothesize an English source, and thereby translate the string. In the case of compression, the noise consists of optional text material that pads out the core signal. For the larger case of text summarization, it may be useful to imagine a scenario in which a news editor composes a short document, hands it to a reporter, and tells the reporter to “flesh it out” . . . which results in the final article published in the newspaper. In summarizing the final article, the summarizer typically will not have access to the editor's original version (which may or may not exist), but the summarizer can guess at it, which is where probabilities come in.
  • [0170]
    In a noisy channel application, three problems must be solved:
  • [0171]
Source model. To every string s a probability P(s) must be assigned. P(s) represents the chance that s is generated as an “original short string” in the above hypothetical process. For example, it may be desirable for P(s) to be very low if s is ungrammatical.
  • [0172]
Channel model. To every pair of strings (s, t) a probability P(t | s) is assigned. P(t | s) represents the chance that when the short string s is expanded, the result is the long string t. For example, if t is the same as s except for the extra word “not,” then it may be desirable to have a very low P(t | s), because the word “not” is not optional, additional material.
  • [0173]
Decoder. Given a long string t, a short string s is searched for that maximizes P(s | t). This is equivalent to searching for the s that maximizes P(s) · P(t | s).
  • [0174]
    It is advantageous to break down the noisy channel problem this way, as it decouples the somewhat independent goals of creating a short text that (1) is grammatical and coherent, and (2) preserves important information. It is easier to build a channel model that focuses exclusively on the latter, without having to worry about the former. That is, one can specify that a certain substring may represent unimportant information without worrying that deleting the substring will result in an ungrammatical structure. That concern is left to the source model, which worries exclusively about well-formedness. In that regard, well-known prior work in source language modeling for speech recognition, machine translation, and natural language generation can be used. The same goes for actual compression (“decoding” in noisy-channel jargon)—one can re-use generic software packages to solve problems in all these application domains.
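The decoder's ranking criterion can be sketched as follows. This is an illustrative Python sketch, not the patent's decoder; the two toy model functions are invented stand-ins for the source and channel models.

```python
import math

# Sketch of the noisy-channel decoder's ranking criterion: among
# candidate short strings s, pick the one maximizing P(s) * P(t | s),
# computed in log space for numerical stability.

def decode(candidates, log_p_source, log_p_channel, t):
    """Return the candidate s maximizing log P(s) + log P(t | s)."""
    return max(candidates, key=lambda s: log_p_source(s) + log_p_channel(t, s))

# Toy models: the source prefers short strings; the channel rules out
# candidates whose words do not appear in t.
t = "a b c d e"
cands = ["a b e", "a b c d e"]
score_src = lambda s: -0.5 * len(s.split())
score_ch = lambda t, s: 0.0 if all(w in t.split() for w in s.split()) else -math.inf
best = decode(cands, score_src, score_ch, t)
```

The decomposition into two scores mirrors the division of labor described above: the source score guards well-formedness while the channel score guards information preservation.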
  • [0175]
    Statistical Models
  • [0176]
    In the experiments discussed here, relatively simple source and channel models were built and used. In a departure from the above discussion and from previous work on statistical channel models, probabilities Ptree(s) and Pexpand tree(t | s) were assigned to trees rather than strings. In decoding a new string, first it is parsed into a large syntactic tree t (for example, using the parser described in M. Collins, “Three generative, lexicalized models for statistical parsing,” Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97), 16-23 (1997)) and then various small syntactic trees are hypothesized and ranked.
  • [0177]
    Good source strings are ones that have both (1) a normal-looking parse tree, and (2) normal-looking word pairs. Ptree(s) is a combination of a standard probabilistic context-free grammar (PCFG) score, which is computed over the grammar rules that yielded the tree s, and a standard word-bigram score, which is computed over the leaves of the tree. For example, the tree s=(S (NP John) (VP (VB saw) (NP Mary))) is assigned a score based on these factors:
  • [0178]
Ptree(s) = P(TOP → S | TOP) ·
P(S → NP VP | S) · P(NP → John | NP) ·
P(VP → VB NP | VP) · P(VB → saw | VB) ·
P(NP → Mary | NP) ·
P(John | EOS) · P(saw | John) ·
P(Mary | saw) · P(EOS | Mary)
  • [0184]
The stochastic channel model performs minimal operations on a small tree s to create a larger tree t. For each internal node in s, an expansion template is chosen probabilistically based on the labels of the node and its children. For example, when processing the S node in the tree above, one may wish to add a prepositional phrase as a third child. This is done with probability P(S → NP VP PP | S → NP VP). Or one may choose to leave it alone, with probability P(S → NP VP | S → NP VP). After an expansion template is chosen, then for each new child node introduced (if any), a new subtree is grown rooted at that node—for example, (PP (P in) (NP Pittsburgh)). Any particular subtree is grown with probability given by its PCFG factorization, as above (no bigrams).
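The source-model product above can be computed in log space as follows. This is a minimal sketch under invented probability tables; the tiny PCFG and bigram values are placeholders, not estimates from any corpus.

```python
import math

# Sketch of the source model Ptree(s): the product of PCFG rule
# probabilities over the tree and word-bigram probabilities over its
# leaves, accumulated as a sum of logs.

PCFG = {("S", "NP VP"): 0.8, ("NP", "John"): 0.1,
        ("VP", "VB NP"): 0.3, ("VB", "saw"): 0.05, ("NP", "Mary"): 0.1}
BIGRAM = {("EOS", "John"): 0.2, ("John", "saw"): 0.1,
          ("saw", "Mary"): 0.1, ("Mary", "EOS"): 0.4}

def log_p_tree(rules, words):
    score = sum(math.log(PCFG[r]) for r in rules)
    # bigrams over the leaves, padded with an end-of-sentence marker
    bigrams = zip(["EOS"] + words, words + ["EOS"])
    score += sum(math.log(BIGRAM[b]) for b in bigrams)
    return score

lp = log_p_tree(
    [("S", "NP VP"), ("NP", "John"), ("VP", "VB NP"),
     ("VB", "saw"), ("NP", "Mary")],
    ["John", "saw", "Mary"],
)
```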
  • [0185]
    Example of Using Statistical Models for Compression
  • [0186]
    This example demonstrates how to tell whether one potential compression is more likely than another, according to the statistical models described above. FIG. 11 shows examples of parse trees. As shown, the tree t in FIG. 11 spans the string abcde. Consider the parse tree for compression s1, which is also shown in FIG. 11.
  • [0187]
    The factors Ptree(s1) and Pexpand tree(t | s1) are computed. Breaking this down further, the source PCFG and word-bigram factors, which describe Ptree (s1), are as follows:
    P (TOP → G | TOP) P (H → a | H)
    P (G → H A | G) P (C → b | C)
P (A → C D | A) P (D → e | D)
    P (a | EOS) P (e | b)
    P (b | a) P (EOS | e)
  • [0188]
    The channel expansion-template factors and the channel PCFG (new tree growth) factors, which describe Pexpand tree(t | s1) are:
    P (G → H A | G → H A)
    P (A → C B D | A → C D)
    P (B → Q R | B) P (Z → c | Z)
    P (Q → Z | Q) P (R → d | R)
  • [0189]
    A different compression will be scored with a different set of factors. For example, consider a compression of t that leaves t completely untouched. In that case, the source costs Ptree(t) are:
    P (TOP → G | TOP) P (H → a | H) P (a | EOS)
P (G → H A | G) P (C → b | C) P (b | a)
P (A → C D | A) P (Z → c | Z) P (c | b)
P (B → Q R | B) P (R → d | R) P (d | c)
P (Q → Z | Q) P (D → e | D) P (e | d)
    P (EOS | e)
  • [0190]
    The channel costs Pexpand tree(t | t) are:
    P (G → H A | G → H A)
    P (A → C B D | A → C B D)
    P (B → Q R | B → Q R)
    P (Q → Z | Q → Z)
  • [0191]
Now, the following values are compared: P(s1 | t) = Ptree(s1) · Pexpand tree(t | s1) / Ptree(t) versus P(t | t) = Ptree(t) · Pexpand tree(t | t) / Ptree(t), and the more likely one is selected. Note that Ptree(t) and all the PCFG factors can be canceled out, as they appear in any potential compression. Therefore, one need only compare compressions on the basis of the expansion-template probabilities and the word-bigram probabilities. The quantities that differ between the two proposed compressions are the factors listed above. Therefore, s1 will be preferred over t if and only if:
  • [0192]
P(e | b) · P(A → C B D | A → C D) >
P(c | b) · P(d | c) · P(e | d) ·
P(A → C B D | A → C B D) ·
P(B → Q R | B → Q R) · P(Q → Z | Q → Z)
  • [0196]
    Training Corpus
  • [0197]
In order to train the channel-based summarizing system, the Ziff-Davis corpus—a collection of newspaper articles announcing computer products—was used. Many of the articles in the corpus are paired with human-written abstracts. A set of 1067 sentence pairs were automatically extracted from the corpus. Each pair consisted of a sentence t = t1, t2, . . . , tn that occurred in the article and a possible compressed version of it, s = s1, s2, . . . , sm, which occurred in the human-written abstract. FIG. 12 shows a few examples of sentence pairs extracted from the corpus.
  • [0198]
    This corpus was chosen because it is consistent with two desiderata specific to summarization work: (i) the human-written Abstract sentences are grammatical; (ii) the Abstract sentences represent in a compressed form the salient points of the original newspaper Sentences. The uncompressed sentences were kept in the corpus as well, since an objective was to learn not only how to compress a sentence, but also when to do it.
  • [0199]
    Learning Model Parameters For Channel-Based Summarizer
  • [0200]
Expansion-template probabilities were collected from the parallel corpus. First, both sides of the parallel corpus were parsed, and then corresponding syntactic nodes were identified. For example, the parse tree for one sentence may begin
  • [0201]
    (S (NP . . . )
  • [0202]
    (VP . . . )
  • [0203]
    (PP . . . ))
  • [0204]
    while the parse tree for its compressed version may begin
  • [0205]
    (S (NP . . . )
  • [0206]
    (VP . . . )).
  • [0207]
If these two S nodes are deemed to correspond, then a joint event (S → NP VP, S → NP VP PP) is recorded. Afterwards the events are normalized so that the probabilities add up to one. Not all nodes have corresponding partners; some non-correspondences are due to incorrect parses, while others are due to legitimate reformulations that are beyond the scope of the simple channel model. Word-bigram probabilities were estimated using standard counting-based methods.
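The normalization step just described can be sketched as follows. This is an illustrative Python sketch; the string encoding of rules and the toy event list are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Sketch of turning collected joint events into expansion-template
# probabilities: counts of (short-rule, long-rule) pairs are normalized
# per short rule so that each conditional distribution sums to one.

def estimate_templates(joint_events):
    counts = Counter(joint_events)
    totals = defaultdict(int)
    for (short, long_), c in counts.items():
        totals[short] += c                      # total count per short rule
    return {(short, long_): c / totals[short]
            for (short, long_), c in counts.items()}

events = [("S -> NP VP", "S -> NP VP PP"),
          ("S -> NP VP", "S -> NP VP"),
          ("S -> NP VP", "S -> NP VP")]
probs = estimate_templates(events)
# the two expansions of "S -> NP VP" get probabilities 1/3 and 2/3
```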
  • [0208]
    Decoding
  • [0209]
    A vast number of potential compressions of a large tree t exist, but all of them can be packed efficiently into a shared-forest structure. For each node of t that has n children, the following operations are performed:
  • [0210]
generate 2^n − 1 new nodes, one for each non-empty subset of the children, and
  • [0211]
pack those nodes so that they can be referred to as a whole.
  • [0212]
    For example, consider the large tree t above. All compressions can be represented with the following forest:
G → H A  B → R  A → B D  H → a
G → H  Q → Z  A → C  C → b
G → A  A → C B D  A → B  Z → c
B → Q R  A → C B  A → D  R → d
B → Q  A → C D  D → e
  • [0213]
An expansion-template probability can be assigned to each node in the forest. For example, to the B → Q node, one can assign P(B → Q R | B → Q). If the observed probability from the parallel corpus is zero, then a small floor value of 10−6 is assigned. In reality, forests are produced that are much slimmer, as only methods of compressing a node that are locally grammatical according to the Penn Treebank are considered. (Penn Treebank is a collection of manually built syntactic parse trees available from the Linguistic Data Consortium at the University of Pennsylvania.) If a rule of the type A → C B has never been observed, then it will not appear in the forest.
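The subset-enumeration step underlying the forest can be sketched as follows. This is a minimal sketch; the (parent, children) rule representation is an assumption for illustration.

```python
from itertools import combinations

# Sketch of the forest-building step: for a node with n children,
# enumerate all 2**n - 1 non-empty, order-preserving subsets of the
# children, each yielding a candidate compressed production.

def compressions(parent, children):
    rules = []
    for k in range(1, len(children) + 1):
        for subset in combinations(children, k):   # preserves child order
            rules.append((parent, list(subset)))
    return rules

rules = compressions("A", ["C", "B", "D"])
# yields the seven productions A -> C, A -> B, A -> D, A -> C B,
# A -> C D, A -> B D, and A -> C B D
```

In practice, as noted above, productions never observed in the training corpus would be filtered out of this set.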
  • [0214]
    Next, a set of high-scoring trees is extracted from the forest, taking into account both expansion-template probabilities and word-bigram probabilities. A generic extractor such as described by I. Langkilde, “Forest-based statistical sentence generation,” Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics (2000) can be used for this purpose.
  • [0215]
The extractor selects the trees with the best combination of word-bigram and expansion-template scores. It returns a list of such trees, one for each possible compression length. For example, as shown in FIG. 16, for the sentence “Beyond that basic level, the operations of the three products vary,” the following “best” compressions are obtained, with negative log-probabilities shown in parentheses (smaller=more likely):
  • [0216]
    Length Selection
  • [0217]
It is useful to have multiple answers to choose from, as one user may seek 20% compression, while another seeks 60% compression. However, for purposes of evaluation, the summarizing system was designed to be able to select a single compression. If log-probabilities as shown in FIG. 16 are relied upon, then typically the shortest compression will be chosen. (Note above, however, how the three-word compression scores better than the two-word compression, as the models are not entirely happy removing the article “the”). To create a more reasonable competition, the log-probability is divided by the length of the compression, rewarding longer strings. This technique is often applied in speech recognition.
  • [0218]
    If this normalized score is plotted against compression length, typically a (bumpy) U-shaped curve results, as illustrated in FIG. 13. In a typical more difficult case, a 25-word sentence may be optimally compressed by a 17-word version. Of course, if a user requires a shorter compression than that, another region of the curve may be selected and inspected for a local minimum.
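The length-normalized selection described above can be sketched as follows. This is an illustrative Python sketch; the candidate strings and their negative log-probabilities are invented for the example.

```python
# Sketch of length normalization: divide each compression's negative
# log-probability by its word count and keep the minimum, which rewards
# longer strings relative to raw log-probability.

def pick_compression(scored):
    """`scored` maps candidate strings to negative log-probabilities."""
    return min(scored, key=lambda s: scored[s] / len(s.split()))

scored = {"the operations vary": 60.0,   # 60.0 / 3 = 20.0  <- minimum
          "operations vary": 45.0,       # 45.0 / 2 = 22.5
          "vary": 30.0}                  # 30.0 / 1 = 30.0
best = pick_compression(scored)
```

Without the normalization, the raw scores would favor the one-word candidate; with it, the three-word candidate wins, illustrating the U-shaped trade-off.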
  • [0219]
    [0219]FIGS. 17 and 18 are respectively generalized flowcharts of the channel-based summarization and training processes described above.
  • [0220]
    As shown in FIG. 17, the first step 1702 in the channel-based summarization process 1700 is to receive the input text. Although the embodiment described above uses sentences as the input text, any other text segment could be used instead, for example, clauses, paragraphs, or entire treatises.
  • [0221]
    Next, in step 1704, the input text is parsed to produce a syntactic tree in the style of FIG. 11, which is used in step 1706 as the basis of generating multiple possible solutions (e.g., the shared-forest structure described above). If a whole text is given as input, the text can be parsed to produce a discourse tree, and the algorithm described here will operate on the discourse tree.
  • [0222]
    Next, the multiple possible solutions generated in step 1706 are ranked using pre-generated ranking statistics from a statistical model. For example, step 1706 may involve assigning an expansion-template probability to each node in the forest, as described above.
  • [0223]
Finally, the best scoring candidate (or candidates) is (are) chosen as the final compression solution(s) in step 1710. As described above, the best scoring candidate may be the one having the smallest log-probability/length of compression ratio.
  • [0224]
    [0224]FIG. 18 shows a generalized process for training a channel-based summarizer. As shown therein, the process 1800 starts in step 1802 with an input training set (or corpus). As discussed above, this input training set comprises pairs of long-short text fragments, for example, long/short sentence pairs or treatise/abstract pairs. Typically, because a main purpose of the training set is to teach the summarizer how to properly compress text, the training set used will have been generated manually by experienced editors who know how to create relevant, coherent and grammatical summaries of longer text segments.
  • [0225]
    Next, in step 1804, the long-short text pairs are parsed to generate syntactic parse trees such as shown in FIG. 11, thereby resulting in corresponding long-short syntactic tree pairs. Each item of text in each pair is parsed individually in this manner. Also, the entire text is parsed using the discourse parser.
  • [0226]
    Next, in step 1806, the resulting parse tree pairs are compared—that is, the discourse or syntactic parse tree for a long segment is compared against the discourse or syntactic parse tree for its paired short segment—to identify similarities and differences between nodes of the tree pairs. A difference might occur, for example, if, in generating the short segment, an editor deleted a prepositional phrase from the long segment. In any event, the results of this comparison are “events” that are collected for each of the long/short pairs and stored in a database. In general, two different types of events are detected: “joint events” which represent a detected correspondence between a long and short segment pair and Context-Free Grammar (CFG) events, which relate only to characteristics of the short segment in each pair.
  • [0227]
    Next, in step 1808, the collected events are normalized to generate probabilities. These normalized events collectively represent the statistical learning model 1810 used by the channel-based summarizer.
  • [0228]
    Decision-based Summarizer
  • [0229]
A description of a decision-based, history model of sentence compression follows. As in the noisy-channel approach, it is assumed that a parse tree t is given as input. The goal is to “rewrite” t into a smaller tree s, which corresponds to a compressed version of the original sentence subsumed by t. Assume the trees t and s2 in FIG. 11 are in the corpus. In the decision-based summarizer model, the question presented is how tree t may be rewritten into s2. One possible solution is to decompose the rewriting operation into a sequence of shift-reduce-drop actions that are specific to an extended shift-reduce parsing paradigm.
  • [0230]
    In the decision-based model, the rewriting process starts with an empty Stack and an Input List that contains the sequence of words subsumed by the large tree t. Each word in the input list is labeled with the name of all syntactic constituents in t that start with that word (see FIG. 14). At each step, the rewriting module applies an operation that is aimed at reconstructing the smaller tree s2. In the context of the sentence-compression module, four types of operations are used:
  • [0231]
    SHIFT operations transfer the first word from the input list into the stack;
  • [0232]
    REDUCE operations pop the k syntactic trees located at the top of the stack; combine them into a new tree; and push the new tree on the top of the stack. Reduce operations are used to derive the structure of the syntactic tree of the short sentence.
  • [0233]
DROP operations are used to delete from the input list subsequences of words that correspond to syntactic constituents. A DROP X operation deletes from the input list all words that are spanned by constituent X in t.
  • [0234]
    ASSIGNTYPE operations are used to change the label of trees at the top of the stack. These actions assign POS tags to the words in the compressed sentence, which may be different from the POS tags in the original sentence.
  • [0235]
The decision-based model is more flexible than the channel model because it enables the derivation of a tree whose skeleton can differ quite drastically from that of the tree given as input. For example, the channel-based model was unable to obtain tree s2 from t. However, using the four operations listed above (SHIFT, REDUCE, DROP, ASSIGNTYPE), the decision-based model is able to rewrite a tree t into any tree s, as long as an in-order traversal of the leaves of s produces a sequence of words that occur in the same order as the words in the tree t. For example, the tree s2 can be obtained from the tree t by following this sequence of actions, whose effects are shown in FIG. 14: SHIFT; ASSIGNTYPE H; SHIFT; ASSIGNTYPE K; REDUCE 2 F; DROP B; SHIFT; ASSIGNTYPE D; REDUCE 2 G.
  • [0236]
    To save space, the SHIFT and ASSIGNTYPE operations are shown in FIG. 14 on the same line. However, it should be understood that the SHIFT and ASSIGNTYPE operations correspond to two distinct actions. The ASSIGNTYPE K operation rewrites the POS tag of the word b; the REDUCE operations modify the skeleton of the tree given as input. To increase readability, the input list is shown in FIG. 14 in a format that resembles the graphical representation of the trees in FIG. 11.
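The four operations can be simulated as follows. This is an illustrative Python sketch, not the patent's rewriting module: the tuple tree encoding is an assumption, and for simplicity each input entry here carries the constituents that span its word (a simplification of the “constituents that start with that word” labeling described above).

```python
# Minimal simulation of the four rewriting operations (SHIFT, REDUCE,
# DROP, ASSIGNTYPE) on a stack and an input list of (word, constituents)
# entries.

def rewrite(input_list, actions):
    stack, inp = [], list(input_list)
    for act in actions:
        if act[0] == "SHIFT":                 # move next word onto the stack
            word, _ = inp.pop(0)
            stack.append(("LEAF", word))
        elif act[0] == "ASSIGNTYPE":          # relabel the tree at the top
            _, tag = act
            _, word = stack.pop()
            stack.append((tag, word))
        elif act[0] == "REDUCE":              # pop k trees, push a new tree
            _, k, label = act
            children = stack[-k:]
            del stack[-k:]
            stack.append((label, children))
        elif act[0] == "DROP":                # delete a constituent's words
            _, const = act
            inp = [(w, cs) for (w, cs) in inp if const not in cs]
    return stack

# Replaying the action sequence from the example above on a toy input.
out = rewrite(
    [("a", {"H"}), ("b", {"K"}), ("c", {"B"}), ("d", {"B"}), ("e", {"D"})],
    [("SHIFT",), ("ASSIGNTYPE", "H"), ("SHIFT",), ("ASSIGNTYPE", "K"),
     ("REDUCE", 2, "F"), ("DROP", "B"), ("SHIFT",), ("ASSIGNTYPE", "D"),
     ("REDUCE", 2, "G")],
)
```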
  • [0237]
    Learning the Parameters of (Training) the Decision-Based Model
  • [0238]
    To train the decision-based model, each configuration of our shift-reduce-drop rewriting model is associated with a learning case. The learning cases are generated automatically by a program that derives sequences of actions that map each of the large trees in our corpus into smaller trees. The rewriting procedure simulates a bottom-up reconstruction of the smaller trees.
  • [0239]
    Overall, the 1067 pairs of long and short sentences yielded 46383 learning cases. Each case was labeled with one action name from a set of 210 possible actions: There are 37 different ASSIGNTYPE actions, one for each POS tag. There are 63 distinct DROP actions, one for each type of syntactic constituent that can be deleted during compression. There are 109 distinct REDUCE actions, one for each type of reduce operation that is applied during the reconstruction of the compressed sentence. And there is one SHIFT operation. Given a tree t and an arbitrary configuration of the stack and input list, the purpose of the decision-based classifier is to learn what action to choose from the set of 210 possible actions.
  • [0240]
    To each learning example, a set of 99 features was associated from the following two classes: operation features and original-tree-specific features.
  • [0241]
    Operational features reflect the number of trees in the stack, the input list, and the types of the last five operations performed. Operational features also encode information that denotes the syntactic category of the root nodes of the partial trees built up to a certain time. Examples of operational features include the following: numberTreesInStack, wasPreviousOperationShift, syntacticLabelOfTreeAtTheTopOfStack.
  • [0242]
    Original-tree-specific features denote the syntactic constituents that start with the first unit in the input list. Examples of such features include inputListStartsWithA_CC and inputListStartsWithA_PP.
  • [0243]
    The decision-based compression module uses the C4.5 program as described in J. Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann Publishers (1993), in order to learn decision trees that specify how large syntactic trees can be compressed into shorter trees. A ten-fold cross-validation evaluation of the classifier yielded an accuracy of 98.16% (±0.14). A majority baseline classifier that chooses the action SHIFT has an accuracy of 28.72%.
  • [0244]
    [0244]FIG. 18A shows examples of rules that were learned automatically by the C4.5 program. As seen therein, Rule 1 enables the deletion of WH prepositional phrases in the context in which they follow other constituents that the program decided to delete. Rule 2 enables the deletion of WHNP constituents. Since this deletion is carried out only when the stack contains only one NP constituent, it follows that this rule is applied only in conjunction with complex nounphrases that occur at the beginning of sentences. Rule 3 enables the deletion of adjectival phrases.
  • [0245]
    Employing the Decision-Based Model
  • [0246]
    To compress sentences, the shift-reduce-drop model is applied in a deterministic fashion. The sentence to be compressed is parsed and the input list is initialized with the words in the sentence and the syntactic constituents that “begin” at each word, as shown in FIG. 14. Afterwards, the learned classifier is asked in a stepwise manner what action to propose. Each action is then simulated, thus incrementally building a parse tree. The procedure ends when the input list is empty and when the stack contains only one tree. An in-order traversal of the leaves of this tree produces the compressed version of the sentence given as input.
  • [0247]
    Because the decision-based model is deterministic, it produces only one output. An advantage of this result is that compression using the decision-based model is very fast: it takes only a few milliseconds per sentence. One potential disadvantage, depending on one's objectives, is that the decision-based model does not produce a range of compressions, from which another system may subsequently choose. It would be relatively straightforward to extend the model within a probabilistic framework by applying, for example, techniques described in D. Magerman, “Statistical decision-tree models for parsing,” Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 276-283 (1995).
  • [0248]
    [0248]FIGS. 19 and 20 are respectively generalized flowcharts of the decision-based summarization and training processes described above.
  • [0249]
    As shown in FIG. 19, the first step 1902 in the decision-based summarization process 1900 is to receive the input text. Although the embodiment described above uses sentences as the input text, any other text segment could be used instead, for example, clauses, paragraphs, or entire treatises.
  • [0250]
    Next, in step 1904, the input text is parsed to produce a syntactic tree in the style of FIG. 11. If a full text is used, a discourse parser can be used to build the discourse tree of the text.
  • [0251]
    In step 1906, the shift-reduce-drop algorithm is applied to the syntactic/discourse tree generated in step 1904. As discussed above, the shift-reduce-drop algorithm applies a sequence of predetermined decision rules (learned during training of the decision-based model, and identifying under what circumstances, and in what order, to perform the various shift-reduce-drop operations) to produce a compressed syntactic/discourse tree 1908. The resulting syntactic/discourse tree can be used for various purposes; for example, it can be rendered into a compressed text segment and output to a user (e.g., either a human end-user or a computer process). Alternatively, the resulting syntactic/discourse tree can be supplied to a process that further manipulates the tree for other purposes. For example, the resulting compressed syntactic/discourse tree could be supplied to a tree rewriter to convert it into another form, e.g., to translate it into a target language. An example of such a tree rewriter is described in Daniel Marcu et al., “The Automatic Translation of Discourse Structures,” Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 9-17, Seattle, Wash. (Apr. 29-May 3, 2000).
  • [0252]
    [0252]FIG. 20 shows a generalized process for training a decision-based summarizer. As shown therein, the training process 2000 starts in step 2002 with an input training set as discussed above with reference to FIG. 17.
  • [0253]
    Next, in step 2004, the long-short text pairs are parsed to generate syntactic parse trees such as shown in FIG. 11, thereby resulting in corresponding long-short syntactic tree pairs.
  • [0254]
    Next, in step 2006, for each long-short tree pair, the training process 2000 determines a sequence of shift-reduce-drop operations that will convert the long tree into the short tree. As discussed above, this step is performed based on the following four basic operations, referred to collectively as the “shift-reduce-drop” operations—shift, reduce, drop, and assignType. These four operations are sufficient to rewrite any given long tree into its paired short tree, provided that the order of the leaves does not change.
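Because the leaf order is preserved, the drop decisions in such an operation sequence can be seeded by an order-preserving match of the two sentences' words. A minimal sketch (the greedy left-to-right matching is a simplifying assumption; the actual procedure operates on whole subtrees):

```python
def surviving_leaves(long_leaves, short_leaves):
    """Flag, for each word of the long sentence, whether it survives in
    the short one. Kept words are reached through shift/reduce actions;
    deleted words (and constituents covering only deleted words) are
    handled by drop actions."""
    flags, j = [], 0
    for word in long_leaves:
        if j < len(short_leaves) and word == short_leaves[j]:
            flags.append(True)   # word kept in the compression
            j += 1
        else:
            flags.append(False)  # word deleted by a drop
    if j != len(short_leaves):
        raise ValueError("short sentence is not an ordered subsequence")
    return flags
```

For example, matching ["a", "b", "c", "d"] against the compression ["b", "d"] marks the first and third words for dropping.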
  • [0255]
    The output of step 2006 is a set of learning cases—one learning case for each long-short tree pair in the training set. In essence, each learning case is an ordered set of shift-reduce-drop operations that when applied to a long tree will generate the paired short tree.
  • [0256]
    Next, in step 2008, the training process 2000 associates features (e.g., operational and original-tree-specific features) with the learning cases to reflect the context in which the operations are to be performed.
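One way to picture such a learning case is a feature dictionary describing the current stack/input configuration paired with the correct action. The particular features below (top-of-stack label, next input constituent, sizes) are illustrative stand-ins for the operational and original-tree-specific features described in the text:

```python
def make_learning_case(stack_labels, input_labels, action):
    """Pair a feature description of the current configuration with the
    action the training derivation performed in that configuration."""
    features = {
        "stack_top": stack_labels[-1] if stack_labels else "NONE",
        "stack_size": len(stack_labels),
        "next_input": input_labels[0] if input_labels else "NONE",
        "input_left": len(input_labels),
    }
    return features, action
```

Collected over every step of every long/short derivation, such (features, action) pairs form the input to the rule learner.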
  • [0257]
    Next, in step 2010, the training process 2000 applies a learning algorithm, for example, the C4.5 algorithm as described in J. Ross Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann Publishers (1993), to learn a set of decision rules 2012 from the learning cases. This set of decision rules 2012 can then be used by the decision-based summarizer to summarize any previously unseen text or syntactic tree into a compressed version that is both coherent and grammatical.

    Evaluation of the Summarizer Models

    To evaluate the compression algorithms, 32 sentence pairs were randomly selected from the parallel corpus. This random subset is referred to as the Test Corpus. The other 1035 sentence pairs were used for training as described above. FIG. 15 shows three sentences from the Test Corpus, together with the compressions produced by humans, the two compression algorithms described here (channel-based and decision-based), and a baseline algorithm that produces compressions with the highest word-bigram scores. The examples were chosen to reflect good, average, and bad performance cases. The first sentence in FIG. 15 (“Beyond the basic level, the operations of the three products vary widely.”) was compressed in the same manner by humans and by both the channel-based and decision-based algorithms (the baseline algorithm, though, chooses not to compress this sentence).
  • [0258]
    For the second example in FIG. 15, the output of the decision-based algorithm is grammatical, but the semantics are negatively affected. The noisy-channel algorithm deletes only the word “break”, which affects the correctness of the output less. In the last example in FIG. 15, the noisy-channel model is again more conservative and decides not to drop any constituents. In contrast, the decision-based algorithm compresses the input substantially, but fails to produce a grammatical output.
  • [0259]
    Each original sentence in the Test Corpus was presented to four judges, together with four compressions of it: the human generated compression, the outputs of the noisy-channel and decision-based algorithms, and the output of the baseline algorithm. The judges were told that all outputs were generated automatically. The order of the outputs was scrambled randomly across test cases.
  • [0260]
    To avoid confounding, the judges participated in two experiments. In the first experiment, they were asked to determine on a scale from 1 to 5 how well the systems did with respect to selecting the most important words in the original sentence. In the second experiment, they were asked to determine on a scale from 1 to 5 how grammatical the outputs were.
  • [0261]
    The sensitivity of the channel-based and decision-based algorithms to the training data was also investigated by carrying out the same experiments on sentences of a different genre: scientific text. To this end, the first sentence of each of the first 26 articles made available in 1999 on the cmplg archive was used. A second parallel corpus, referred to as the Cmplg Corpus, was created by generating compressed grammatical versions of these sentences. Because some of the sentences in this corpus were extremely long, the baseline algorithm could not produce compressed versions in reasonable time.
  • [0262]
    Table 5 shows the compression rate, and the mean and standard deviation of scores across all judges, for each algorithm and corpus. The results show that the decision-based algorithm is the most aggressive: on average, it compresses sentences to about half of their original size. The compressed sentences produced by both the channel-based algorithm and the decision-based algorithm are more “grammatical” and contain more important words than the sentences produced by the baseline. T-test experiments showed these differences to be statistically significant at p&lt;0.01, both for individual judges and for average scores across all judges. T-tests showed no statistically significant differences between the two algorithms. As Table 5 shows, the performance of each of the compression algorithms is much closer to human performance than to baseline performance; yet, humans perform statistically better than both algorithms at p&lt;0.01.
    TABLE 5
    Experimental results

    Corpus (avg. orig. length)   Measure          Baseline      Noisy-channel   Decision-based   Humans
    Test (21 words)              Compression      63.70%        70.37%          57.19%           53.33%
                                 Grammaticality   1.78 ± 1.19   4.34 ± 1.02     4.30 ± 1.33      4.92 ± 0.18
                                 Importance       2.17 ± 0.89   3.38 ± 0.67     3.54 ± 1.00      4.24 ± 0.52
    Cmplg (26 words)             Compression      n/a           65.68%          54.25%           65.68%
                                 Grammaticality   n/a           4.22 ± 0.99     3.72 ± 1.53      4.97 ± 0.08
                                 Importance       n/a           3.42 ± 0.97     3.24 ± 0.68      4.32 ± 0.54
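The per-cell statistics in Table 5 are straightforward to compute from raw data; a minimal sketch using Python's statistics module (the numbers in the usage example are made up, not taken from the corpus):

```python
import statistics

def compression_rate(original_words, compressed_words):
    """Percentage of the original sentence length that is retained;
    lower means more aggressive compression."""
    return 100.0 * len(compressed_words) / len(original_words)

def judge_summary(scores):
    """Mean and sample standard deviation over judge ratings, i.e. the
    'mean ± std' format used in Table 5."""
    return statistics.mean(scores), statistics.stdev(scores)
```

For instance, a 20-word sentence compressed to 12 words has a compression rate of 60.0%, and ratings of 4, 5, 4, 5 summarize to a mean of 4.5.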
  • [0263]
    When applied to sentences of a different genre, the performance of the noisy-channel compression algorithm degrades smoothly, while the performance of the decision-based algorithm drops sharply. This is due to a few sentences in the Cmplg Corpus that the decision-based algorithm over-compressed to only two or three words. This characteristic of the decision-based summarizer can be adjusted if the decision-based compression module is extended as described in D. Magerman, “Statistical decision-tree models for parsing,” Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 276-283 (1995), by computing probabilities across the sequences of decisions that correspond to a compressed sentence.
  • [0264]
    Similarly, noisy-channel modeling could be enhanced by taking into account subcategory and head-modifier statistics (in addition to simple word-bigrams). For example, the subject of a sentence may be separated from the verb by intervening prepositional phrases. In this case, statistics should be collected over subject/verb pairs, which can be extracted from parsed text.
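The contrast can be sketched as two statistics collectors: plain adjacent-word counts versus subject/verb co-occurrence counts harvested from parse output. The (head, relation, dependent) triple format and the "nsubj" relation name are assumptions borrowed from common dependency-parsing conventions, not from the text:

```python
from collections import Counter

def word_bigram_counts(sentences):
    """Simple adjacent-word statistics, as used by the word-bigram model."""
    counts = Counter()
    for words in sentences:
        for a, b in zip(words, words[1:]):
            counts[a, b] += 1
    return counts

def subject_verb_counts(dependency_triples):
    """Subject/verb statistics that survive intervening material such as
    prepositional phrases, because they come from parse relations rather
    than from adjacency."""
    counts = Counter()
    for head, relation, dependent in dependency_triples:
        if relation == "nsubj":      # hypothetical subject relation label
            counts[dependent, head] += 1
    return counts
```

In "the dog in the yard barks", the bigram model never sees the pair (dog, barks), while the relation-based collector does.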
  • [0265]
    Although only a few embodiments have been described in detail above, those having ordinary skill in the art will certainly understand that many modifications are possible in the preferred embodiment without departing from the teachings thereof. All such modifications are encompassed within the following claims.

Claims (67)

    What is claimed is:
  1. A computer-implemented method of determining discourse structures, the method comprising:
    generating a set of one or more discourse parsing decision rules based on a training set; and
    determining a discourse structure for an input text segment by applying the generated set of discourse parsing decision rules to the input text segment.
  2. The method of claim 1 wherein the training set comprises a plurality of annotated text segments and a plurality of elementary discourse units (EDUs), each annotated text segment being associated with a set of EDUs that collectively represent the annotated text segment.
  3. The method of claim 2 wherein the annotated text segments are built manually by human annotators.
  4. The method of claim 2 wherein generating the set of discourse parsing decision rules comprises iteratively performing one or more operations on a set of EDUs to incrementally build the annotated text segment associated with the set of EDUs.
  5. The method of claim 4 wherein the one or more operations iteratively performed comprise a shift operation and/or one or more reduce operations.
  6. The method of claim 5 wherein the reduce operations comprise one or more of the following six operations: reduce-ns, reduce-sn, reduce-nn, reduce-below-ns, reduce-below-sn, reduce-below-nn.
  7. The method of claim 5 wherein the six reduce operations and the shift operation are sufficient to derive the discourse tree of any input text segment.
  8. The method of claim 1 wherein determining a discourse structure comprises incrementally building a discourse tree for the input text segment.
  9. The method of claim 8 wherein incrementally building a discourse tree for the input text segment comprises selectively combining elementary discourse trees (EDTs) into larger discourse tree units.
  10. The method of claim 8 wherein incrementally building a discourse tree for the input text segment comprises performing operations on a stack and an input list of elementary discourse trees (EDTs), one EDT for each elementary discourse unit (EDU) in a set of EDUs corresponding to the input text segment.
  11. The method of claim 10 further comprising, prior to determining the discourse structure for the input text segment, segmenting the input text segment into EDUs and inserting the EDUs into the input list.
  12. The method of claim 1 wherein determining the discourse structure for the input text segment further comprises:
    segmenting the input text segment into elementary discourse units (EDUs);
    incrementally building a discourse tree for the input text segment by performing operations on the EDUs to selectively combine the EDUs into larger discourse tree units; and
    repeating the incremental building of the discourse tree until all of the EDUs have been combined.
  13. The method of claim 12 wherein segmenting the input text segment into EDUs is performed by applying a set of automatically learned discourse segmenting decision rules to the input text segment.
  14. The method of claim 13 further comprising generating the set of discourse segmenting decision rules by analyzing a training set.
  15. The method of claim 1 wherein the input text segment comprises a clause, a sentence, a paragraph, or a treatise.
  16. A computer-implemented text parsing method comprising:
    generating a set of one or more discourse segmenting decision rules based on a training set; and
    determining boundaries in an input text segment by applying the generated set of discourse segmenting decision rules to the input text segment.
  17. The method of claim 16 wherein determining boundaries comprises examining each lexeme in the input text segment in order.
  18. The method of claim 17 further comprising assigning, for each lexeme, one of the following designations: sentence-break, EDU-break, start-parenthetical, end-parenthetical, and none.
  19. The method of claim 17 wherein examining each lexeme in the input text segment comprises associating features with the lexeme based on surrounding context.
  20. The method of claim 16 wherein determining boundaries in the input text segment comprises recognizing sentence boundaries, elementary discourse unit (EDU) boundaries, parenthetical starts, and parenthetical ends.
  21. A computer-implemented method of generating discourse trees, the method comprising:
    segmenting an input text segment into elementary discourse units (EDUs); and
    incrementally building a discourse tree for the input text segment by performing operations on the EDUs to selectively combine the EDUs into larger discourse tree units.
  22. The method of claim 21 further comprising repeating the incremental building of the discourse tree until all of the EDUs have been combined into a single discourse tree.
  23. The method of claim 21 wherein the incremental building of the discourse tree is based on predetermined decision rules.
  24. The method of claim 23 wherein the predetermined decision rules comprise automatically learned decision rules.
  25. The method of claim 23 further comprising generating the predetermined decision rules by analyzing a training set of annotated discourse trees.
  26. The method of claim 21 wherein the operations performed on the EDUs comprise one or more of the following: shift, reduce-ns, reduce-sn, reduce-nn, reduce-below-ns, reduce-below-sn, reduce-below-nn.
  27. A discourse parsing system comprising:
    a plurality of automatically learned decision rules;
    an input list comprising a plurality of elementary discourse trees (EDTs), each EDT corresponding to an elementary discourse unit (EDU) of an input text segment;
    a stack for holding discourse tree segments while a discourse tree for the input text segment is being built; and
    a plurality of operators for incrementally building the discourse tree for the input text segment by selectively combining the EDTs into a discourse tree segment according to the plurality of decision rules and moving the discourse tree segment onto the stack.
  28. The system of claim 27 further comprising a discourse segmenter for partitioning the input text segment into EDUs and inserting the EDUs into the input list.
  29. A computer-implemented method comprising determining a discourse structure for an input text segment by applying a set of automatically learned discourse parsing decision rules to the input text segment.
  30. A computer-implemented summarization method comprising:
    generating a set of one or more summarization decision rules based on a training set; and
    compressing a tree structure by applying the generated set of summarization decision rules to the tree structure.
  31. The method of claim 30 wherein the tree structure comprises a discourse tree.
  32. The method of claim 30 wherein the tree structure comprises a syntactic tree.
  33. The method of claim 30 further comprising generating the tree structure to be compressed by parsing an input text segment.
  34. The method of claim 33 wherein the input text segment comprises a clause, a sentence, a paragraph, or a treatise.
  35. The method of claim 30 further comprising converting the compressed tree structure into a summarized text segment.
  36. The method of claim 35 wherein the summarized text segment is grammatical and coherent.
  37. The method of claim 35 wherein the summarized text segment includes sentences not present in a text segment from which the pre-compressed tree structure was generated.
  38. The method of claim 30 wherein applying the generated set of summarization decision rules comprises performing a sequence of modification operations on the tree structure.
  39. The method of claim 38 wherein the sequence of modification operations comprises one or more of the following: a shift operation, a reduce operation, and a drop operation.
  40. The method of claim 39 wherein the reduce operation combines a plurality of trees into a larger tree.
  41. The method of claim 39 wherein the drop operation deletes constituents from the tree structure.
  42. The method of claim 30 wherein the training set comprises pre-generated long/short tree pairs.
  43. The method of claim 42 wherein generating the set of summarization decision rules comprises iteratively performing one or more tree modification operations on a long tree until the paired short tree is realized.
  44. The method of claim 43 wherein a plurality of long/short tree pairs are processed to generate a plurality of learning cases.
  45. The method of claim 44 wherein generating the set of decision rules comprises applying a learning algorithm to the plurality of learning cases.
  46. The method of claim 44 further comprising associating one or more features with each of the learning cases to reflect context.
  47. A computer-implemented summarization method comprising:
    generating a parse tree for an input text segment; and
    iteratively reducing the generated parse tree by selectively eliminating portions of the parse tree.
  48. The method of claim 47 wherein the generated parse tree comprises a discourse tree.
  49. The method of claim 47 wherein the generated parse tree comprises a syntactic tree.
  50. The method of claim 47 wherein the iterative reduction of the parse tree is performed based on a plurality of learned decision rules.
  51. The method of claim 47 wherein iteratively reducing the parse tree comprises performing tree modification operations on the parse tree.
  52. The method of claim 51 wherein the tree modification operations comprise one or more of the following: a shift operation, a reduce operation, and a drop operation.
  53. The method of claim 52 wherein the reduce operation combines a plurality of trees into a larger tree.
  54. The method of claim 52 wherein the drop operation deletes constituents from the tree structure.
  55. A computer-implemented summarization method comprising:
    parsing an input text segment to generate a parse tree for the input segment;
    generating a plurality of potential solutions;
    applying a statistical model to determine a probability of correctness for each potential solution; and
    extracting one or more high-probability solutions based on the solutions' respective determined probabilities of correctness.
  56. The method of claim 55 wherein the generated parse tree comprises a discourse tree.
  57. The method of claim 55 wherein the generated parse tree comprises a syntactic tree.
  58. The method of claim 55 wherein applying a statistical model comprises using a stochastic channel model algorithm.
  59. The method of claim 58 wherein using a stochastic channel model algorithm comprises performing minimal operations on a small tree to create a larger tree.
  60. The method of claim 58 wherein using a stochastic channel model algorithm comprises probabilistically choosing an expansion template.
  61. The method of claim 55 wherein generating a plurality of potential solutions comprises identifying a forest of potential compressions for the parse tree.
  62. The method of claim 61 wherein the generated parse tree has one or more nodes, each node having N children (wherein N is an integer), and wherein identifying a forest of potential compressions comprises:
    generating 2^N-1 new nodes, one node for each non-empty subset of the children; and
    packing the newly generated nodes into a whole.
  63. The method of claim 61 wherein the generated parse tree has one or more nodes, and wherein identifying a forest of potential compressions comprises assigning an expansion-template probability to each node in the forest.
  64. The method of claim 55 wherein extracting one or more high-probability solutions comprises selecting one or more trees based on a combination of each tree's word-bigram and expansion-template score.
  65. The method of claim 64 wherein selecting one or more trees comprises selecting a list of trees, one for each possible compression length.
  66. The method of claim 55 further comprising normalizing each potential solution based on compression length.
  67. The method of claim 55 further comprising, for each potential solution, dividing a log-probability of correctness for the solution by a length of compression for the solution.
US09854301 2000-05-11 2001-05-11 Discourse parsing and summarization Abandoned US20020046018A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US20364300 true 2000-05-11 2000-05-11
US09854301 US20020046018A1 (en) 2000-05-11 2001-05-11 Discourse parsing and summarization


Publications (1)

Publication Number Publication Date
US20020046018A1 true true US20020046018A1 (en) 2002-04-18

Family

ID=22754752

Family Applications (2)

Application Number Title Priority Date Filing Date
US09854327 Active 2023-10-05 US7533013B2 (en) 2000-05-11 2001-05-11 Machine translation techniques
US09854301 Abandoned US20020046018A1 (en) 2000-05-11 2001-05-11 Discourse parsing and summarization


Country Status (6)

Country Link
US (2) US7533013B2 (en)
EP (1) EP1352338A2 (en)
JP (1) JP2004501429A (en)
CN (1) CN1465018A (en)
CA (1) CA2408819C (en)
WO (2) WO2001086489A3 (en)

US8166045B1 (en) 2007-03-30 2012-04-24 Google Inc. Phrase extraction using subphrase scoring
US8166021B1 (en) 2007-03-30 2012-04-24 Google Inc. Query phrasification
US20120109945A1 (en) * 2010-10-29 2012-05-03 Emilia Maria Lapko Method and system of improving navigation within a set of electronic documents
US20120143595A1 (en) * 2010-12-06 2012-06-07 Xin Li Fast title/summary extraction from long descriptions
US8214196B2 (en) 2001-07-03 2012-07-03 University Of Southern California Syntax-based statistical translation model
US8296127B2 (en) 2004-03-23 2012-10-23 University Of Southern California Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
US8380486B2 (en) 2009-10-01 2013-02-19 Language Weaver, Inc. Providing machine-generated translations and corresponding trust levels
US8433556B2 (en) 2006-11-02 2013-04-30 University Of Southern California Semi-supervised training for statistical word alignment
US8468149B1 (en) 2007-01-26 2013-06-18 Language Weaver, Inc. Multi-lingual online community
US8615389B1 (en) 2007-03-16 2013-12-24 Language Weaver, Inc. Generation and exploitation of an approximate language model
US20140067379A1 (en) * 2011-11-29 2014-03-06 Sk Telecom Co., Ltd. Automatic sentence evaluation device using shallow parser to automatically evaluate sentence, and error detection apparatus and method of the same
US8676563B2 (en) 2009-10-01 2014-03-18 Language Weaver, Inc. Providing human-generated and machine-generated trusted translations
US8694303B2 (en) 2011-06-15 2014-04-08 Language Weaver, Inc. Systems and methods for tuning parameters in statistical machine translation
US8805677B1 (en) * 2010-02-10 2014-08-12 West Corporation Processing natural language grammar
US8825466B1 (en) 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
US8831928B2 (en) 2007-04-04 2014-09-09 Language Weaver, Inc. Customizable machine translation service
US8886518B1 (en) 2006-08-07 2014-11-11 Language Weaver, Inc. System and method for capitalizing machine translated text
US8886515B2 (en) 2011-10-19 2014-11-11 Language Weaver, Inc. Systems and methods for enhancing machine translation post edit review processes
US8886517B2 (en) 2005-06-17 2014-11-11 Language Weaver, Inc. Trust scoring for language translation systems
US8914279B1 (en) * 2011-09-23 2014-12-16 Google Inc. Efficient parsing with structured prediction cascades
US8942973B2 (en) 2012-03-09 2015-01-27 Language Weaver, Inc. Content page URL translation
US8943080B2 (en) 2006-04-07 2015-01-27 University Of Southern California Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
US8990064B2 (en) 2009-07-28 2015-03-24 Language Weaver, Inc. Translating documents based on content
WO2015003143A3 (en) * 2013-07-03 2015-05-14 Thomson Reuters Global Resources Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
US20150205786A1 (en) * 2012-07-31 2015-07-23 Nec Corporation Problem situation detection device, problem situation detection method and problem situation detection-use program
US9122674B1 (en) 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
US9152622B2 (en) 2012-11-26 2015-10-06 Language Weaver, Inc. Personalized machine translation via online adaptation
US9213694B2 (en) 2013-10-10 2015-12-15 Language Weaver, Inc. Efficient online domain adaptation
WO2015191061A1 (en) * 2014-06-11 2015-12-17 Hewlett-Packard Development Company, L.P. Functional summarization of non-textual content based on a meta-algorithmic pattern
US9336185B1 (en) * 2012-09-18 2016-05-10 Amazon Technologies, Inc. Generating an electronic publication sample
US9336186B1 (en) * 2013-10-10 2016-05-10 Google Inc. Methods and apparatus related to sentence compression
US9355372B2 (en) 2013-07-03 2016-05-31 Thomson Reuters Global Resources Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
US20160283588A1 (en) * 2015-03-27 2016-09-29 Fujitsu Limited Generation apparatus and method
US9483568B1 (en) 2013-06-05 2016-11-01 Google Inc. Indexing system
US9501506B1 (en) 2013-03-15 2016-11-22 Google Inc. Indexing system
US9582501B1 (en) * 2014-06-16 2017-02-28 Yseop Sa Techniques for automatic generation of natural language text
US20170132529A1 (en) * 2000-09-28 2017-05-11 Intel Corporation Method and Apparatus for Extracting Entity Names and Their Relations
US9916306B2 (en) 2012-10-19 2018-03-13 Sdl Inc. Statistical linguistic analysis of source content

Families Citing this family (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7054803B2 (en) * 2000-12-19 2006-05-30 Xerox Corporation Extracting sentence translations from translated documents
US6990439B2 (en) * 2001-01-10 2006-01-24 Microsoft Corporation Method and apparatus for performing machine translation using a unified language model and translation model
US7734459B2 (en) * 2001-06-01 2010-06-08 Microsoft Corporation Automatic extraction of transfer mappings from bilingual corpora
US7146358B1 (en) * 2001-08-28 2006-12-05 Google Inc. Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
WO2003038663A3 (en) * 2001-10-29 2004-06-24 British Telecomm Machine translation
JP3959453B2 (en) * 2002-03-14 2007-08-15 沖電気工業株式会社 Translation mediation system and translation intermediary server
US7634398B2 (en) * 2002-05-16 2009-12-15 Microsoft Corporation Method and apparatus for reattaching nodes in a parse structure
US7925493B2 (en) * 2003-09-01 2011-04-12 Advanced Telecommunications Research Institute International Machine translation apparatus and machine translation computer program
JP3919771B2 (en) * 2003-09-09 2007-05-30 株式会社国際電気通信基礎技術研究所 Machine translation system, a control system, and a computer program
US8037102B2 (en) * 2004-02-09 2011-10-11 Robert T. and Virginia T. Jenkins Manipulating sets of hierarchical data
US7721271B2 (en) 2004-04-22 2010-05-18 Microsoft Corporation Language localization and intercepting data using translation tables
US9646107B2 (en) * 2004-05-28 2017-05-09 Robert T. and Virginia T. Jenkins as Trustee of the Jenkins Family Trust Method and/or system for simplifying tree expressions such as for query reduction
US7620632B2 (en) * 2004-06-30 2009-11-17 Skyler Technology, Inc. Method and/or system for performing tree matching
US7882147B2 (en) * 2004-06-30 2011-02-01 Robert T. and Virginia T. Jenkins File location naming hierarchy
US7801923B2 (en) 2004-10-29 2010-09-21 Robert T. and Virginia T. Jenkins as Trustees of the Jenkins Family Trust Method and/or system for tagging trees
US7627591B2 (en) * 2004-10-29 2009-12-01 Skyler Technology, Inc. Method and/or system for manipulating tree expressions
US7630995B2 (en) 2004-11-30 2009-12-08 Skyler Technology, Inc. Method and/or system for transmitting and/or receiving data
US7636727B2 (en) * 2004-12-06 2009-12-22 Skyler Technology, Inc. Enumeration of trees from finite number of nodes
US8316059B1 (en) 2004-12-30 2012-11-20 Robert T. and Virginia T. Jenkins Enumeration of rooted partial subtrees
JP4301515B2 (en) * 2005-01-04 2009-07-22 International Business Machines Corporation Text display method, information processing apparatus, information processing system, and program
US8615530B1 (en) 2005-01-31 2013-12-24 Robert T. and Virginia T. Jenkins as Trustees for the Jenkins Family Trust Method and/or system for tree transformation
US7681177B2 (en) 2005-02-28 2010-03-16 Skyler Technology, Inc. Method and/or system for transforming between trees and strings
JP4050755B2 (en) * 2005-03-30 2008-02-20 株式会社東芝 Communication support equipment, communication support method and communication support program
US8356040B2 (en) * 2005-03-31 2013-01-15 Robert T. and Virginia T. Jenkins Method and/or system for transforming between trees and arrays
US7899821B1 (en) 2005-04-29 2011-03-01 Karl Schiffmann Manipulation and/or analysis of hierarchical data
US8612203B2 (en) * 2005-06-17 2013-12-17 National Research Council Of Canada Statistical machine translation adapted to context
US20070010989A1 (en) * 2005-07-07 2007-01-11 International Business Machines Corporation Decoding procedure for statistical machine translation
US7779396B2 (en) * 2005-08-10 2010-08-17 Microsoft Corporation Syntactic program language translation
US8924212B1 (en) * 2005-08-26 2014-12-30 At&T Intellectual Property Ii, L.P. System and method for robust access and entry to large structured data using voice form-filling
US8296123B2 (en) * 2006-02-17 2012-10-23 Google Inc. Encoding and adaptive, scalable accessing of distributed models
US8959011B2 (en) 2007-03-22 2015-02-17 Abbyy Infopoisk Llc Indicating and correcting errors in machine translation systems
US9262409B2 (en) 2008-08-06 2016-02-16 Abbyy Infopoisk Llc Translation of a selected text fragment of a screen
US8214199B2 (en) * 2006-10-10 2012-07-03 Abbyy Software, Ltd. Systems for translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions
US20080086298A1 (en) * 2006-10-10 2008-04-10 Anisimovich Konstantin Method and system for translating sentences between languages
US9633005B2 (en) 2006-10-10 2017-04-25 Abbyy Infopoisk Llc Exhaustive automatic processing of textual information
US9047275B2 (en) 2006-10-10 2015-06-02 Abbyy Infopoisk Llc Methods and systems for alignment of parallel text corpora
US8195447B2 (en) 2006-10-10 2012-06-05 Abbyy Software Ltd. Translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions
US9645993B2 (en) 2006-10-10 2017-05-09 Abbyy Infopoisk Llc Method and system for semantic searching
US8548795B2 (en) * 2006-10-10 2013-10-01 Abbyy Software Ltd. Method for translating documents from one language into another using a database of translations, a terminology dictionary, a translation dictionary, and a machine translation system
US9235573B2 (en) 2006-10-10 2016-01-12 Abbyy Infopoisk Llc Universal difference measure
US8145473B2 (en) 2006-10-10 2012-03-27 Abbyy Software Ltd. Deep model statistics method for machine translation
JP5082374B2 (en) * 2006-10-19 2012-11-28 富士通株式会社 Phrase alignment program, translation program, phrase alignment device, and phrase alignment method
JP4997966B2 (en) * 2006-12-28 2012-08-15 富士通株式会社 Translated example sentence search program, translated example sentence search device, and translated example sentence search method
US7895030B2 (en) * 2007-03-16 2011-02-22 International Business Machines Corporation Visualization method for machine translation
US7877251B2 (en) * 2007-05-07 2011-01-25 Microsoft Corporation Document translation system
US9779079B2 (en) * 2007-06-01 2017-10-03 Xerox Corporation Authoring system
US8452585B2 (en) * 2007-06-21 2013-05-28 Microsoft Corporation Discriminative syntactic word order model for machine translation
US8812296B2 (en) 2007-06-27 2014-08-19 Abbyy Infopoisk Llc Method and system for natural language dictionary generation
US8103498B2 (en) * 2007-08-10 2012-01-24 Microsoft Corporation Progressive display rendering of processed text
US8229728B2 (en) * 2008-01-04 2012-07-24 Fluential, Llc Methods for using manual phrase alignment data to generate translation models for statistical machine translation
US20120284015A1 (en) * 2008-01-28 2012-11-08 William Drewes Method for Increasing the Accuracy of Subject-Specific Statistical Machine Translation (SMT)
US8244519B2 (en) * 2008-12-03 2012-08-14 Xerox Corporation Dynamic translation memory using statistical machine translation
CN101996166B (en) * 2009-08-14 2015-08-05 张龙哺 Method for recording bilingual sentences, translation method, and translation system
US9710429B1 (en) * 2010-11-12 2017-07-18 Google Inc. Providing text resources updated with translation input from multiple users
US20120143593A1 (en) * 2010-12-07 2012-06-07 Microsoft Corporation Fuzzy matching and scoring based on direct alignment
US8903707B2 (en) 2012-01-12 2014-12-02 International Business Machines Corporation Predicting pronouns of dropped pronoun style languages for natural language translation
US20150161109A1 (en) * 2012-01-13 2015-06-11 Google Inc. Reordering words for machine translation
CN102662935A (en) * 2012-04-08 2012-09-12 北京语智云帆科技有限公司 Interactive machine translation method and machine translation system
US8989485B2 (en) 2012-04-27 2015-03-24 Abbyy Development Llc Detecting a junction in a text line of CJK characters
US8971630B2 (en) 2012-04-27 2015-03-03 Abbyy Development Llc Fast CJK character recognition
CN102999486B (en) * 2012-11-16 2016-12-21 沈阳雅译网络技术有限公司 Phrase extraction rule based on a combination of
CN105808076A (en) * 2012-12-14 2016-07-27 ZTE Corporation Method and device for setting browser bookmarks, and terminal
US9298703B2 (en) 2013-02-08 2016-03-29 Machine Zone, Inc. Systems and methods for incentivizing user feedback for translation processing
US8996352B2 (en) * 2013-02-08 2015-03-31 Machine Zone, Inc. Systems and methods for correcting translations in multi-user multi-lingual communications
US9600473B2 (en) 2013-02-08 2017-03-21 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
US9031829B2 (en) 2013-02-08 2015-05-12 Machine Zone, Inc. Systems and methods for multi-user multi-lingual communications
JP6058513B2 (en) * 2013-10-01 2017-01-11 日本電信電話株式会社 Word order sorting apparatus, translation apparatus, method, and program
KR20150056690A (en) * 2013-11-15 2015-05-27 삼성전자주식회사 Method for recognizing a translatable situation and performing a translatable function and electronic device implementing the same
US9626358B2 (en) 2014-11-26 2017-04-18 Abbyy Infopoisk Llc Creating ontologies by analyzing natural language texts
RU2592395C2 (en) 2013-12-19 2016-07-20 Abbyy Infopoisk Llc Resolution of semantic ambiguity by statistical analysis
CN103645931B (en) * 2013-12-25 2016-06-22 盛杰 A method and apparatus for transcoding
JP2015127894A (en) * 2013-12-27 2015-07-09 International Business Machines Corporation Support apparatus, information processing method, and program
RU2586577C2 (en) 2014-01-15 2016-06-10 Abbyy Infopoisk Llc Filtering arcs of a parser graph
US9524293B2 (en) 2014-08-15 2016-12-20 Google Inc. Techniques for automatically swapping languages and/or content for machine translation
RU2596600C2 (en) 2014-09-02 2016-09-10 Abbyy Development Llc Methods and systems for processing images of mathematical expressions
US9372848B2 (en) 2014-10-17 2016-06-21 Machine Zone, Inc. Systems and methods for language detection
CN105117389B (en) * 2015-07-28 2018-01-19 Baidu Online Network Technology (Beijing) Co., Ltd. Translation method and apparatus
US20170103062A1 (en) * 2015-10-08 2017-04-13 Facebook, Inc. Language independent representations
CN106021239A (en) * 2016-04-29 2016-10-12 北京创鑫旅程网络技术有限公司 Method for real-time evaluation of translation quality

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0411906B2 (en) * 1985-03-25 1992-03-02 Tokyo Shibaura Electric Co
DE3616751A1 (en) * 1985-05-20 1986-11-20 Sharp Kk Translation System
JPH02301869A (en) * 1989-05-17 1990-12-13 Hitachi Ltd Method for maintaining and supporting natural language processing system
US5369574A (en) * 1990-08-01 1994-11-29 Canon Kabushiki Kaisha Sentence generating system
GB9312598D0 (en) * 1993-06-18 1993-08-04 Canon Res Ct Europe Ltd Processing a bilingual database
US6304841B1 (en) * 1993-10-28 2001-10-16 International Business Machines Corporation Automatic construction of conditional exponential models from elementary features
US5510981A (en) * 1993-10-28 1996-04-23 International Business Machines Corporation Language translation apparatus and method using context-based translation models
JP3377290B2 (en) * 1994-04-27 2003-02-17 シャープ株式会社 Machine translation apparatus with the idiom processing function
GB9423995D0 (en) * 1994-11-28 1995-01-11 Sharp Kk Machine translation system
DE69837979T2 (en) * 1997-06-27 2008-03-06 International Business Machines Corp. A system for extracting a multi-lingual terminology
US6533822B2 (en) 1998-01-30 2003-03-18 Xerox Corporation Creating summaries along with indicators, and automatically positioned tabs
GB9810795D0 (en) * 1998-05-20 1998-07-15 Sharp Kk Method of and apparatus for retrieving information and storage medium
GB9811744D0 (en) * 1998-06-02 1998-07-29 Sharp Kk Method of and apparatus for forming an index, use of an index, and a storage medium
US6092034A (en) * 1998-07-27 2000-07-18 International Business Machines Corporation Statistical translation system and method for fast sense disambiguation and translation of large corpora using fertility models and sense models
JP2000132550A (en) * 1998-10-26 2000-05-12 Matsushita Electric Ind Co Ltd Chinese generating device for machine translation
US6393389B1 (en) * 1999-09-23 2002-05-21 Xerox Corporation Using ranked translation choices to obtain sequences indicating meaning of multi-token expressions

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5477451A (en) * 1991-07-25 1995-12-19 International Business Machines Corp. Method and system for natural language translation
US5805832A (en) * 1991-07-25 1998-09-08 International Business Machines Corporation System for parametric text to text language translation
US5642520A (en) * 1993-12-07 1997-06-24 Nippon Telegraph And Telephone Corporation Method and apparatus for recognizing topic structure of language data
US5761631A (en) * 1994-11-17 1998-06-02 International Business Machines Corporation Parsing method and system for natural language processing
US5903858A (en) * 1995-06-23 1999-05-11 Saraki; Masashi Translation machine for editing a original text by rewriting the same and translating the rewrote one
US6205456B1 (en) * 1997-01-17 2001-03-20 Fujitsu Limited Summarization apparatus and method
US5991710A (en) * 1997-05-20 1999-11-23 International Business Machines Corporation Statistical translation system with features based on phrases or groups of words
US6112168A (en) * 1997-10-20 2000-08-29 Microsoft Corporation Automatically recognizing the discourse structure of a body of text

Cited By (222)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961692B1 (en) * 2000-08-01 2005-11-01 Fuji Xerox Co, Ltd. System and method for writing analysis using the linguistic discourse model
US20170132529A1 (en) * 2000-09-28 2017-05-11 Intel Corporation Method and Apparatus for Extracting Entity Names and Their Relations
US20020138248A1 (en) * 2001-01-26 2002-09-26 Corston-Oliver Simon H. Linguistically intelligent text compression
US20060184351A1 (en) * 2001-01-26 2006-08-17 Microsoft Corporation Linguistically intelligent text compression
US7069207B2 (en) * 2001-01-26 2006-06-27 Microsoft Corporation Linguistically intelligent text compression
US7398203B2 (en) 2001-01-26 2008-07-08 Microsoft Corporation Linguistically intelligent text compression
US20020186241A1 (en) * 2001-02-15 2002-12-12 Ibm Digital document browsing system and method thereof
US7454698B2 (en) * 2001-02-15 2008-11-18 International Business Machines Corporation Digital document browsing system and method thereof
US8214196B2 (en) 2001-07-03 2012-07-03 University Of Southern California Syntax-based statistical translation model
US7838491B2 (en) 2001-11-14 2010-11-23 Northwestern University Self-assembly and mineralization of peptide-amphiphile nanofibers
US7491690B2 (en) 2001-11-14 2009-02-17 Northwestern University Self-assembly and mineralization of peptide-amphiphile nanofibers
US20090156505A1 (en) * 2001-11-14 2009-06-18 Northwestern University Self-assembly and mineralization of peptide-amphiphile nanofibers
US8412515B2 (en) * 2002-01-14 2013-04-02 Microsoft Corporation System for normalizing a discourse representation structure and normalized data structure
US20080033715A1 (en) * 2002-01-14 2008-02-07 Microsoft Corporation System for normalizing a discourse representation structure and normalized data structure
US20080177033A1 (en) * 2002-02-15 2008-07-24 Stupp Samuel I Self-Assembly of Peptide-Amphiphile Nanofibers under Physiological Conditions
US7371719B2 (en) 2002-02-15 2008-05-13 Northwestern University Self-assembly of peptide-amphiphile nanofibers under physiological conditions
US20110008890A1 (en) * 2002-02-15 2011-01-13 Northwestern University Self-Assembly of Peptide-Amphiphile Nanofibers Under Physiological Conditions
US20040001893A1 (en) * 2002-02-15 2004-01-01 Stupp Samuel I. Self-assembly of peptide-amphiphile nanofibers under physiological conditions
US7745708B2 (en) 2002-02-15 2010-06-29 Northwestern University Self-assembly of peptide-amphiphile nanofibers under physiological conditions
US8063014B2 (en) 2002-02-15 2011-11-22 Northwestern University Self-assembly of peptide-amphiphile nanofibers under physiological conditions
US8651873B2 (en) 2002-02-22 2014-02-18 Educational Testing Service Portal assessment design system for educational testing
US20050170325A1 (en) * 2002-02-22 2005-08-04 Steinberg Linda S. Portal assessment design system for educational testing
US20100262603A1 (en) * 2002-02-26 2010-10-14 Odom Paul S Search engine methods and systems for displaying relevant topics
US7340466B2 (en) * 2002-02-26 2008-03-04 Kang Jo Mgmt. Limited Liability Company Topic identification and use thereof in information retrieval systems
US20070265996A1 (en) * 2002-02-26 2007-11-15 Odom Paul S Search engine methods and systems for displaying relevant topics
US7716207B2 (en) 2002-02-26 2010-05-11 Odom Paul S Search engine methods and systems for displaying relevant topics
US20060004732A1 (en) * 2002-02-26 2006-01-05 Odom Paul S Search engine methods and systems for generating relevant search results and advertisements
US20030167252A1 (en) * 2002-02-26 2003-09-04 Pliant Technologies, Inc. Topic identification and use thereof in information retrieval systems
US8234106B2 (en) 2002-03-26 2012-07-31 University Of Southern California Building a translation lexicon from comparable, non-parallel corpora
US20100042398A1 (en) * 2002-03-26 2010-02-18 Daniel Marcu Building A Translation Lexicon From Comparable, Non-Parallel Corpora
US20030188255A1 (en) * 2002-03-28 2003-10-02 Fujitsu Limited Apparatus for and method of generating synchronized contents information, and computer product
US7516415B2 (en) * 2002-03-28 2009-04-07 Fujitsu Limited Apparatus for and method of generating synchronized contents information, and computer product
US8706491B2 (en) * 2002-05-20 2014-04-22 Microsoft Corporation Applying a structured language model to information extraction
US20100318348A1 (en) * 2002-05-20 2010-12-16 Microsoft Corporation Applying a structured language model to information extraction
EP1535261A1 (en) * 2002-06-24 2005-06-01 Educational Testing Service Automated essay annotation system and method
EP1535261A4 (en) * 2002-06-24 2011-02-09 Educational Testing Service Automated essay annotation system and method
US7534761B1 (en) 2002-08-21 2009-05-19 North Western University Charged peptide-amphiphile solutions and self-assembled peptide nanofiber networks formed therefrom
US7305336B2 (en) * 2002-08-30 2007-12-04 Fuji Xerox Co., Ltd. System and method for summarization combining natural language generation with structural analysis
US20040044519A1 (en) * 2002-08-30 2004-03-04 Livia Polanyi System and method for summarization combining natural language generation with structural analysis
US20040117734A1 (en) * 2002-09-30 2004-06-17 Frank Krickhahn Method and apparatus for structuring texts
US8124583B2 (en) 2002-11-12 2012-02-28 Northwestern University Composition and method for self-assembly and mineralization of peptide-amphiphiles
US7554021B2 (en) 2002-11-12 2009-06-30 Northwestern University Composition and method for self-assembly and mineralization of peptide amphiphiles
US20060149036A1 (en) * 2002-11-12 2006-07-06 Stupp Samuel I Composition and method for self-assembly and mineralization of peptide amphiphiles
GB2411028A (en) * 2002-11-14 2005-08-17 Educational Testing Service Automated evaluation of overly repetitive word use in an essay
WO2004046956A1 (en) * 2002-11-14 2004-06-03 Educational Testing Service Automated evaluation of overly repetitive word use in an essay
US7683025B2 (en) 2002-11-14 2010-03-23 Northwestern University Synthesis and self-assembly of ABC triblock bola peptide amphiphiles
US20040167884A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Methods and products for producing role related information from free text sources
US20040167911A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Methods and products for integrating mixed format data including the extraction of relational facts from free text
US20040167910A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Integrated data products of processes of integrating mixed format data
US20050108256A1 (en) * 2002-12-06 2005-05-19 Attensity Corporation Visualization of integrated structured and unstructured data
US20040167883A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Methods and systems for providing a service for producing structured data elements from free text sources
US20040167885A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Data products of processes of extracting role related information from free text sources
US20040215634A1 (en) * 2002-12-06 2004-10-28 Attensity Corporation Methods and products for merging codes and notes into an integrated relational database
US20040167908A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Integration of structured data with free text for data mining
US20040167887A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Integration of structured data with relational facts from free text for data mining
US20040167870A1 (en) * 2002-12-06 2004-08-26 Attensity Corporation Systems and methods for providing a mixed data integration service
US7390526B2 (en) 2003-02-11 2008-06-24 Northwestern University Methods and materials for nanocrystalline surface coatings and attachment of peptide amphiphile nanofibers thereon
US20040258726A1 (en) * 2003-02-11 2004-12-23 Stupp Samuel I. Methods and materials for nanocrystalline surface coatings and attachment of peptide amphiphile nanofibers thereon
US20040230415A1 (en) * 2003-05-12 2004-11-18 Stefan Riezler Systems and methods for grammatical text condensation
US8548794B2 (en) 2003-07-02 2013-10-01 University Of Southern California Statistical noun phrase translation
US20050038643A1 (en) * 2003-07-02 2005-02-17 Philipp Koehn Statistical noun phrase translation
US7610190B2 (en) * 2003-10-15 2009-10-27 Fuji Xerox Co., Ltd. Systems and methods for hybrid text summarization
US20050086592A1 (en) * 2003-10-15 2005-04-21 Livia Polanyi Systems and methods for hybrid text summarization
US8138140B2 (en) 2003-12-05 2012-03-20 Northwestern University Self-assembling peptide amphiphiles and related methods for growth factor delivery
US20050208589A1 (en) * 2003-12-05 2005-09-22 Stupp Samuel I Branched peptide amphiphiles, related epitope compounds and self assembled structures thereof
US20050209145A1 (en) * 2003-12-05 2005-09-22 Stupp Samuel I Self-assembling peptide amphiphiles and related methods for growth factor delivery
US20090269847A1 (en) * 2003-12-05 2009-10-29 Northwestern University Self-assembling peptide amphiphiles and related methods for growth factor delivery
US8580923B2 (en) 2003-12-05 2013-11-12 Northwestern University Self-assembling peptide amphiphiles and related methods for growth factor delivery
US7452679B2 (en) 2003-12-05 2008-11-18 Northwestern University Branched peptide amphiphiles, related epitope compounds and self assembled structures thereof
US7544661B2 (en) 2003-12-05 2009-06-09 Northwestern University Self-assembling peptide amphiphiles and related methods for growth factor delivery
US20050138556A1 (en) * 2003-12-18 2005-06-23 Xerox Corporation Creation of normalized summaries using common domain models for input text analysis and output text generation
US20050137855A1 (en) * 2003-12-19 2005-06-23 Maxwell John T.Iii Systems and methods for the generation of alternate phrases from packed meaning
US7788083B2 (en) * 2003-12-19 2010-08-31 Palo Alto Research Center Incorporated Systems and methods for the generation of alternate phrases from packed meaning
US7657420B2 (en) 2003-12-19 2010-02-02 Palo Alto Research Center Incorporated Systems and methods for the generation of alternate phrases from packed meaning
US20070250305A1 (en) * 2003-12-19 2007-10-25 Xerox Corporation Systems and methods for the generation of alternate phrases from packed meaning
US20050187772A1 (en) * 2004-02-25 2005-08-25 Fuji Xerox Co., Ltd. Systems and methods for synthesizing speech using discourse function level prosodic features
US8296127B2 (en) 2004-03-23 2012-10-23 University Of Southern California Discovery of parallel text portions in comparable collections of corpora and training using comparable texts
US20050221266A1 (en) * 2004-04-02 2005-10-06 Mislevy Robert J System and method for assessment design
US20080270109A1 (en) * 2004-04-16 2008-10-30 University Of Southern California Method and System for Translating Information with a Higher Probability of a Correct Translation
US8666725B2 (en) 2004-04-16 2014-03-04 University Of Southern California Selection and use of nonstatistical translation components in a statistical machine translation framework
US8977536B2 (en) 2004-04-16 2015-03-10 University Of Southern California Method and system for translating information with a higher probability of a correct translation
US20050256848A1 (en) * 2004-05-13 2005-11-17 International Business Machines Corporation System and method for user rank search
US7562008B2 (en) * 2004-06-23 2009-07-14 Ning-Ping Chan Machine translation method and system that decomposes complex sentences into two or more sentences
US20060009961A1 (en) * 2004-06-23 2006-01-12 Ning-Ping Chan Method of decomposing prose elements in document processing
US20100174524A1 (en) * 2004-07-02 2010-07-08 Philipp Koehn Empirical Methods for Splitting Compound Words with Application to Machine Translation
US8768969B2 (en) * 2004-07-09 2014-07-01 Nuance Communications, Inc. Method and system for efficient representation, manipulation, communication, and search of hierarchical composite named entities
US20060010138A1 (en) * 2004-07-09 2006-01-12 International Business Machines Corporation Method and system for efficient representation, manipulation, communication, and search of hierarchical composite named entities
US7536408B2 (en) 2004-07-26 2009-05-19 Google Inc. Phrase-based indexing in an information retrieval system
US7599914B2 (en) 2004-07-26 2009-10-06 Google Inc. Phrase-based searching in an information retrieval system
US7603345B2 (en) 2004-07-26 2009-10-13 Google Inc. Detecting spam documents in a phrase based information retrieval system
US20080306943A1 (en) * 2004-07-26 2008-12-11 Anna Lynn Patterson Phrase-based detection of duplicate documents in an information retrieval system
US7426507B1 (en) 2004-07-26 2008-09-16 Google, Inc. Automatic taxonomy generation in search results using phrases
US9817886B2 (en) 2004-07-26 2017-11-14 Google Llc Information retrieval system for archiving multiple document versions
US7584175B2 (en) * 2004-07-26 2009-09-01 Google Inc. Phrase-based generation of document descriptions
US7580929B2 (en) 2004-07-26 2009-08-25 Google Inc. Phrase-based personalization of searches in an information retrieval system
US20100030773A1 (en) * 2004-07-26 2010-02-04 Google Inc. Multiple index based information retrieval system
US7580921B2 (en) 2004-07-26 2009-08-25 Google Inc. Phrase identification in an information retrieval system
US20060020473A1 (en) * 2004-07-26 2006-01-26 Atsuo Hiroe Method, apparatus, and program for dialogue, and storage medium including a program stored therein
US20060020571A1 (en) * 2004-07-26 2006-01-26 Patterson Anna L Phrase-based generation of document descriptions
US20060020607A1 (en) * 2004-07-26 2006-01-26 Patterson Anna L Phrase-based indexing in an information retrieval system
US7702618B1 (en) 2004-07-26 2010-04-20 Google Inc. Information retrieval system for archiving multiple document versions
US7711679B2 (en) 2004-07-26 2010-05-04 Google Inc. Phrase-based detection of duplicate documents in an information retrieval system
US8560550B2 (en) 2004-07-26 2013-10-15 Google, Inc. Multiple index based information retrieval system
US9037573B2 (en) 2004-07-26 2015-05-19 Google, Inc. Phase-based personalization of searches in an information retrieval system
CN1728143B (en) 2004-07-26 2010-06-09 咕果公司 Phrase-based generation of document description
US8108412B2 (en) 2004-07-26 2012-01-31 Google, Inc. Phrase-based detection of duplicate documents in an information retrieval system
US20100161625A1 (en) * 2004-07-26 2010-06-24 Google Inc. Phrase-based detection of duplicate documents in an information retrieval system
US20060294155A1 (en) * 2004-07-26 2006-12-28 Patterson Anna L Detecting spam documents in a phrase based information retrieval system
US8078629B2 (en) 2004-07-26 2011-12-13 Google Inc. Detecting spam documents in a phrase based information retrieval system
US9361331B2 (en) 2004-07-26 2016-06-07 Google Inc. Multiple index based information retrieval system
US8489628B2 (en) 2004-07-26 2013-07-16 Google Inc. Phrase-based detection of duplicate documents in an information retrieval system
US9384224B2 (en) 2004-07-26 2016-07-05 Google Inc. Information retrieval system for archiving multiple document versions
US20110131223A1 (en) * 2004-07-26 2011-06-02 Google Inc. Detecting spam documents in a phrase based information retrieval system
US20060031195A1 (en) * 2004-07-26 2006-02-09 Patterson Anna L Phrase-based searching in an information retrieval system
US9569505B2 (en) 2004-07-26 2017-02-14 Google Inc. Phrase-based searching in an information retrieval system
US9817825B2 (en) 2004-07-26 2017-11-14 Google Llc Multiple index based information retrieval system
US20080319971A1 (en) * 2004-07-26 2008-12-25 Anna Lynn Patterson Phrase-based personalization of searches in an information retrieval system
US7567959B2 (en) 2004-07-26 2009-07-28 Google Inc. Multiple index based information retrieval system
US8600728B2 (en) 2004-10-12 2013-12-03 University Of Southern California Training for a text-to-text application which uses string to tree conversion for training and decoding
US20060142995A1 (en) * 2004-10-12 2006-06-29 Kevin Knight Training for a text-to-text application which uses string to tree conversion for training and decoding
US20060095250A1 (en) * 2004-11-03 2006-05-04 Microsoft Corporation Parser for natural language processing
US7970600B2 (en) 2004-11-03 2011-06-28 Microsoft Corporation Using a first natural language parser to train a second parser
US20060116860A1 (en) * 2004-11-30 2006-06-01 Xerox Corporation Systems and methods for user-interest sensitive condensation
US7801723B2 (en) * 2004-11-30 2010-09-21 Palo Alto Research Center Incorporated Systems and methods for user-interest sensitive condensation
US7827029B2 (en) * 2004-11-30 2010-11-02 Palo Alto Research Center Incorporated Systems and methods for user-interest sensitive note-taking
US20060155530A1 (en) * 2004-12-14 2006-07-13 International Business Machines Corporation Method and apparatus for generation of text documents
US7890500B2 (en) 2004-12-21 2011-02-15 Palo Alto Research Center Incorporated Systems and methods for using and constructing user-interest sensitive indicators of search results
US20070240078A1 (en) * 2004-12-21 2007-10-11 Palo Alto Research Center Incorporated Systems and methods for using and constructing user-interest sensitive indicators of search results
US20060247165A1 (en) * 2005-01-21 2006-11-02 Stupp Samuel I Methods and compositions for encapsulation of cells
US20100169305A1 (en) * 2005-01-25 2010-07-01 Google Inc. Information retrieval system for archiving multiple document versions
US8612427B2 (en) 2005-01-25 2013-12-17 Google, Inc. Information retrieval system for archiving multiple document versions
US20060194183A1 (en) * 2005-02-28 2006-08-31 Yigal Attali Method of model scaling for an automated essay scoring system
US8632344B2 (en) 2005-02-28 2014-01-21 Educational Testing Service Method of model scaling for an automated essay scoring system
US8202098B2 (en) 2005-02-28 2012-06-19 Educational Testing Service Method of model scaling for an automated essay scoring system
US20070277250A1 (en) * 2005-03-04 2007-11-29 Stupp Samuel I Angiogenic heparin-binding epitopes, peptide amphiphiles, self-assembled compositions and related methods of use
US7851445B2 (en) 2005-03-04 2010-12-14 Northwestern University Angiogenic heparin-binding epitopes, peptide amphiphiles, self-assembled compositions and related methods of use
US20060277028A1 (en) * 2005-06-01 2006-12-07 Microsoft Corporation Training a statistical parser on noisy data by filtering
US8886517B2 (en) 2005-06-17 2014-11-11 Language Weaver, Inc. Trust scoring for language translation systems
US20070192309A1 (en) * 2005-10-12 2007-08-16 Gordon Fischer Method and system for identifying sentence boundaries
US9165039B2 (en) 2005-11-29 2015-10-20 Kang Jo Mgmt, Limited Liability Company Methods and systems for providing personalized contextual search results
US20070260598A1 (en) * 2005-11-29 2007-11-08 Odom Paul S Methods and systems for providing personalized contextual search results
US20090045971A1 (en) * 2006-03-06 2009-02-19 Koninklijke Philips Electronics N.V. Use of decision trees for automatic commissioning
US8416713B2 (en) 2006-03-06 2013-04-09 Koninklijke Philips Electronics N.V. Use of decision trees for automatic commissioning
US8943080B2 (en) 2006-04-07 2015-01-27 University Of Southern California Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections
WO2007117652A3 (en) * 2006-04-07 2008-05-02 Basis Technology Corp Method and system of machine translation
US7827028B2 (en) 2006-04-07 2010-11-02 Basis Technology Corporation Method and system of machine translation
US20070260449A1 (en) * 2006-05-02 2007-11-08 Shimei Pan Instance-based sentence boundary determination by optimization
US7552047B2 (en) * 2006-05-02 2009-06-23 International Business Machines Corporation Instance-based sentence boundary determination by optimization
US8886518B1 (en) 2006-08-07 2014-11-11 Language Weaver, Inc. System and method for capitalizing machine translated text
US8433556B2 (en) 2006-11-02 2013-04-30 University Of Southern California Semi-supervised training for statistical word alignment
US9122674B1 (en) 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
US8468149B1 (en) 2007-01-26 2013-06-18 Language Weaver, Inc. Multi-lingual online community
US8615389B1 (en) 2007-03-16 2013-12-24 Language Weaver, Inc. Generation and exploitation of an approximate language model
US8943067B1 (en) 2007-03-30 2015-01-27 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US9652483B1 (en) 2007-03-30 2017-05-16 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US8166045B1 (en) 2007-03-30 2012-04-24 Google Inc. Phrase extraction using subphrase scoring
US8166021B1 (en) 2007-03-30 2012-04-24 Google Inc. Query phrasification
US8682901B1 (en) 2007-03-30 2014-03-25 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US9355169B1 (en) 2007-03-30 2016-05-31 Google Inc. Phrase extraction using subphrase scoring
US8402033B1 (en) 2007-03-30 2013-03-19 Google Inc. Phrase extraction using subphrase scoring
US8600975B1 (en) 2007-03-30 2013-12-03 Google Inc. Query phrasification
US7693813B1 (en) 2007-03-30 2010-04-06 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US7702614B1 (en) 2007-03-30 2010-04-20 Google Inc. Index updating using segment swapping
US7925655B1 (en) 2007-03-30 2011-04-12 Google Inc. Query scheduling using hierarchical tiers of index servers
US9223877B1 (en) 2007-03-30 2015-12-29 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US8086594B1 (en) 2007-03-30 2011-12-27 Google Inc. Bifurcated document relevance scoring
US20100161617A1 (en) * 2007-03-30 2010-06-24 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US8090723B2 (en) 2007-03-30 2012-01-03 Google Inc. Index server architecture using tiered and sharded phrase posting lists
US8831928B2 (en) 2007-04-04 2014-09-09 Language Weaver, Inc. Customizable machine translation service
US8076295B2 (en) 2007-04-17 2011-12-13 Nanotope, Inc. Peptide amphiphiles having improved solubility and methods of using same
US20090042804A1 (en) * 2007-04-17 2009-02-12 Hulvat James F Novel peptide amphiphiles having improved solubility and methods of using same
US7925496B1 (en) * 2007-04-23 2011-04-12 The United States Of America As Represented By The Secretary Of The Navy Method for summarizing natural language text
US8825466B1 (en) 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
US9009023B2 (en) * 2007-06-29 2015-04-14 Fujitsu Limited Computer-readable medium having sentence dividing program stored thereon, sentence dividing apparatus, and sentence dividing method
US20090006080A1 (en) * 2007-06-29 2009-01-01 Fujitsu Limited Computer-readable medium having sentence dividing program stored thereon, sentence dividing apparatus, and sentence dividing method
US8117223B2 (en) 2007-09-07 2012-02-14 Google Inc. Integrating external related phrase information into a phrase-based indexing information retrieval system
US8631027B2 (en) 2007-09-07 2014-01-14 Google Inc. Integrated external related phrase information into a phrase-based indexing information retrieval system
US9122675B1 (en) 2008-04-22 2015-09-01 West Corporation Processing natural language grammar
US9454522B2 (en) 2008-06-06 2016-09-27 Apple Inc. Detection of data in a sequence of characters
US8738360B2 (en) * 2008-06-06 2014-05-27 Apple Inc. Data detection of a character sequence having multiple possible data types
US20090306964A1 (en) * 2008-06-06 2009-12-10 Olivier Bonnet Data detection
US9411800B2 (en) * 2008-06-27 2016-08-09 Microsoft Technology Licensing, Llc Adaptive generation of out-of-dictionary personalized long words
US20090326927A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Adaptive generation of out-of-dictionary personalized long words
US20100131274A1 (en) * 2008-11-26 2010-05-27 At&T Intellectual Property I, L.P. System and method for dialog modeling
US9129601B2 (en) * 2008-11-26 2015-09-08 At&T Intellectual Property I, L.P. System and method for dialog modeling
US20100169359A1 (en) * 2008-12-30 2010-07-01 Barrett Leslie A System, Method, and Apparatus for Information Extraction of Textual Documents
US8450271B2 (en) 2009-04-13 2013-05-28 Northwestern University Peptide-based scaffolds for cartilage regeneration and methods for their use
US20100266557A1 (en) * 2009-04-13 2010-10-21 Northwestern University Novel peptide-based scaffolds for cartilage regeneration and methods for their use
US8990064B2 (en) 2009-07-28 2015-03-24 Language Weaver, Inc. Translating documents based on content
US8676563B2 (en) 2009-10-01 2014-03-18 Language Weaver, Inc. Providing human-generated and machine-generated trusted translations
US8380486B2 (en) 2009-10-01 2013-02-19 Language Weaver, Inc. Providing machine-generated translations and corresponding trust levels
US8805677B1 (en) * 2010-02-10 2014-08-12 West Corporation Processing natural language grammar
US20110225104A1 (en) * 2010-03-09 2011-09-15 Radu Soricut Predicting the Cost Associated with Translating Textual Content
US20110282651A1 (en) * 2010-05-11 2011-11-17 Microsoft Corporation Generating snippets based on content features
US8788260B2 (en) * 2010-05-11 2014-07-22 Microsoft Corporation Generating snippets based on content features
US20120035912A1 (en) * 2010-07-30 2012-02-09 Ben-Gurion University Of The Negev Research And Development Authority Multilingual sentence extractor
US8594998B2 (en) * 2010-07-30 2013-11-26 Ben-Gurion University Of The Negev Research And Development Authority Multilingual sentence extractor
US8838440B2 (en) * 2010-09-14 2014-09-16 International Business Machines Corporation Generating parser combination by combining language processing parsers
US20120065960A1 (en) * 2010-09-14 2012-03-15 International Business Machines Corporation Generating parser combination by combining language processing parsers
US20120109945A1 (en) * 2010-10-29 2012-05-03 Emilia Maria Lapko Method and system of improving navigation within a set of electronic documents
US20120143595A1 (en) * 2010-12-06 2012-06-07 Xin Li Fast title/summary extraction from long descriptions
US9317595B2 (en) * 2010-12-06 2016-04-19 Yahoo! Inc. Fast title/summary extraction from long descriptions
US8694303B2 (en) 2011-06-15 2014-04-08 Language Weaver, Inc. Systems and methods for tuning parameters in statistical machine translation
US8914279B1 (en) * 2011-09-23 2014-12-16 Google Inc. Efficient parsing with structured prediction cascades
US8886515B2 (en) 2011-10-19 2014-11-11 Language Weaver, Inc. Systems and methods for enhancing machine translation post edit review processes
US9336199B2 (en) * 2011-11-29 2016-05-10 Sk Telecom Co., Ltd. Automatic sentence evaluation device using shallow parser to automatically evaluate sentence, and error detection apparatus and method of the same
US20140067379A1 (en) * 2011-11-29 2014-03-06 Sk Telecom Co., Ltd. Automatic sentence evaluation device using shallow parser to automatically evaluate sentence, and error detection apparatus and method of the same
US8942973B2 (en) 2012-03-09 2015-01-27 Language Weaver, Inc. Content page URL translation
US20150205786A1 (en) * 2012-07-31 2015-07-23 Nec Corporation Problem situation detection device, problem situation detection method and problem situation detection-use program
US9336185B1 (en) * 2012-09-18 2016-05-10 Amazon Technologies, Inc. Generating an electronic publication sample
US9916306B2 (en) 2012-10-19 2018-03-13 Sdl Inc. Statistical linguistic analysis of source content
US9152622B2 (en) 2012-11-26 2015-10-06 Language Weaver, Inc. Personalized machine translation via online adaptation
US9501506B1 (en) 2013-03-15 2016-11-22 Google Inc. Indexing system
US9483568B1 (en) 2013-06-05 2016-11-01 Google Inc. Indexing system
WO2015003143A3 (en) * 2013-07-03 2015-05-14 Thomson Reuters Global Resources Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
US9355372B2 (en) 2013-07-03 2016-05-31 Thomson Reuters Global Resources Method and system for simplifying implicit rhetorical relation prediction in large scale annotated corpus
US9213694B2 (en) 2013-10-10 2015-12-15 Language Weaver, Inc. Efficient online domain adaptation
US9336186B1 (en) * 2013-10-10 2016-05-10 Google Inc. Methods and apparatus related to sentence compression
WO2015191061A1 (en) * 2014-06-11 2015-12-17 Hewlett-Packard Development Company, L.P. Functional summarization of non-textual content based on a meta-algorithmic pattern
US9582501B1 (en) * 2014-06-16 2017-02-28 Yseop Sa Techniques for automatic generation of natural language text
US20160283588A1 (en) * 2015-03-27 2016-09-29 Fujitsu Limited Generation apparatus and method
US9767193B2 (en) * 2015-03-27 2017-09-19 Fujitsu Limited Generation apparatus and method

Also Published As

Publication number Publication date Type
CN1465018A (en) 2003-12-31 application
WO2001086491A3 (en) 2003-08-14 application
US7533013B2 (en) 2009-05-12 grant
JP2004501429A (en) 2004-01-15 application
WO2001086489A2 (en) 2001-11-15 application
EP1352338A2 (en) 2003-10-15 application
CA2408819A1 (en) 2001-11-15 application
WO2001086491A2 (en) 2001-11-15 application
US20020040292A1 (en) 2002-04-04 application
WO2001086489A3 (en) 2003-07-24 application
CA2408819C (en) 2006-11-07 grant

Similar Documents

Publication Publication Date Title
Barzilay et al. Information fusion in the context of multi-document summarization
Ratnaparkhi Maximum entropy models for natural language ambiguity resolution
Hirst et al. Lexical chains as representations of context for the detection and correction of malapropisms
Gaizauskas et al. Information extraction: Beyond document retrieval
Appelt Introduction to information extraction
Leacock et al. Using corpus statistics and WordNet relations for sense identification
Woods Conceptual indexing: A better way to organize knowledge
Grishman Computational linguistics: an introduction
Kraaij et al. Embedding web-based statistical translation models in cross-language information retrieval
Surdeanu et al. Using predicate-argument structures for information extraction
Moens Information extraction: algorithms and prospects in a retrieval context
US6424983B1 (en) Spelling and grammar checking system
US6658377B1 (en) Method and system for text analysis based on the tagging, processing, and/or reformatting of the input text
US6694055B2 (en) Proper name identification in chinese
McDonald Discriminative sentence compression with soft syntactic evidence
US5694523A (en) Content processing system for discourse
Cucerzan Large-scale named entity disambiguation based on Wikipedia data
Cimiano et al. Towards large-scale, open-domain and ontology-based named entity classification
Duwairi Machine learning for Arabic text categorization
US5680628A (en) Method and apparatus for automated search and retrieval process
Berger et al. OCELOT: a system for summarizing Web pages
Banko et al. Mitigating the paucity-of-data problem: Exploring the effect of training corpus size on classifier performance for natural language processing
US6115683A (en) Automatic essay scoring system using content-based techniques
Gildea et al. The necessity of parsing for predicate argument recognition
Pecina Lexical association measures and collocation extraction

Legal Events

Date Code Title Description
AS Assignment

Owner name: SOUTHERN CALIFORNIA, UNIVERSITY OF, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MARCU, DANIEL;KNIGHT, KEVIN;REEL/FRAME:011804/0846

Effective date: 20010511