WO2008153566A1 - Techniques for creating computer generated notes - Google Patents
Techniques for creating computer generated notes
- Publication number
- WO2008153566A1 (PCT/US2007/071015)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- token
- notes
- nodes
- sequence
- information
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Definitions
- the invention is related to information technology and, more particularly, to the use of computer generated notes to improve comprehension and utilization of digitized information.
- Note taking is a basic function of human knowledge acquisition from printed or digitized information sources, familiar to every student, professional, or worker who must select words or phrases of interest from a page or a document.
- computer-facilitated or computer-automated implementations of note taking - including the current invention - all produce value to a user by distillation and/or reduction of the original text of a document into a form more readily managed by a user.
- the user may perform or seek the reduction and/or distillation of a page or document for the purposes of review and study - or for the purpose of correlating the resulting notes together to produce facts, assertions and conclusions.
- notes generated by a human note taker may sometimes be phrases, sentences or paragraphs captured or paraphrased specifically to be quoted elsewhere
- manual note taking for the purpose of knowledge acquisition typically aims to capture from a page or document some fragments which convey meaning - the fragments having a significance subjectively determined by a user.
- the user may seek only a more or less minimal description of what the document or page "is about".
- a number of software program products have been developed over time to assist and facilitate the note taking function.
- Questia is an online research and library service with an extensive user interface that presents each page of a user selected digitized reference (such as a digitized encyclopedia) to the user. The user can then highlight and capture as a note any text fragment, phrase, paragraph or larger text fragment and store that fragment in an online project folder, preserving the location from which the fragment was copied. Questia then supports composition of research papers by allowing the easy pasting of the captured text fragments into a document, and then automatically generating and placing correctly formed bibliographic references.
- the present invention automatically generates notes from a page or document - or from any other digitized information source. None of the currently available products is able to do so. Further, as described more hereinafter, the novel features and uses of the present invention optimize the utility of the generated notes.
- the present invention discloses a method and apparatus for utilizing the nodes generated by the decomposition function described more hereinafter and in said Serial No. 11/273,568 as notes.
- a decomposition function creates nodes from documents, emails, relational database tables and other digitized information sources.
- Nodes are a particular data structure that stores elemental units of information. The nodes can convey meaning because they relate a subject term or phrase to an attribute term or phrase.
- the node contents take the form of a text fragment which conveys meaning, i.e., a note.
- the notes generated from each digital resource are associated with the digital resource from which they are captured. The notes are then stored, organized and presented in several ways which facilitate knowledge acquisition and utilization by a user.
- Figure 1 is a functional diagram of a computer generated note taking system in accordance with one aspect of the invention.
- Figure 2 is a high-level diagram showing how a user interacts with the system for computer generation of notes in accordance with one aspect of the invention.
- Figure 3 is a block diagram showing the decomposition function of Figure 1 in accordance with one aspect of the invention.
- Figure 4 illustrates the operation of the node to note conversion function of Figure 1 in accordance with one aspect of the invention.
- Figure 5 is a flow chart of the note-taking program in accordance with one aspect of the invention.
- Figure 6 illustrates a software architecture preferably used for the computer generation of notes.
- Figure 7 is a block diagram of a hardware architecture of an exemplary personal computer used in carrying out the invention.
- Figure 8 is a block diagram showing use of the note taking functionality in a network environment.
- Figure 9 is an illustration of an exemplary screen view of a note selection window and related control buttons in accordance with one aspect of the invention.
- Figure 10A illustrates the contents of a note created from a four-part node in accordance with one aspect of the invention.
- Figure 10B illustrates the contents of a quotation node in accordance with one aspect of the invention.
- Figure 1 is a flow chart of the process by which the nodes generated by the decomposition function described in said Serial No. 11/273,568 are converted into notes and then stored, organized, and presented to the user in accordance with one preferred embodiment of the invention.
- a digital resource 128 is input to a decomposition function 130, generating nodes 180a - 180n as described hereinafter and in said Serial No. 11/273,568 and said Serial No. 11/314,835.
- the nodes are self contained and require nothing else to convey meaning.
- the contents of each generated node 180 is extracted and converted into a note 160 by a note conversion function 163.
- a note 160 is a text object.
- notes 160a - 160n generated from the same digital resource 128 or discrete part thereof are together referred to as a note set 165.
- the note set 165 is placed in a note container 168, which is a data structure suitable for storing notes 160a - 160n and associating a note set 165 with the digital resource 128 from which the notes 160a - 160n were generated.
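The relationship between nodes, notes, the note set and the note container can be sketched in a few lines. This is a minimal illustration only: the class and function names (`Node`, `NoteContainer`, `node_to_note`, `build_container`) are hypothetical, and joining the node parts with spaces is an assumed conversion rule, not one stated in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Node:
    # Hypothetical three-part node: subject, bond (relation), attribute.
    subject: str
    bond: str
    attribute: str


@dataclass
class NoteContainer:
    # Associates a note set with the digital resource it was generated from.
    resource: str
    notes: List[str] = field(default_factory=list)


def node_to_note(node: Node) -> str:
    """Convert a node into a note, i.e. a plain text object (assumed rule)."""
    return " ".join(part for part in (node.subject, node.bond, node.attribute) if part)


def build_container(resource: str, nodes: List[Node]) -> NoteContainer:
    """Convert every node from one resource and keep the notes together."""
    return NoteContainer(resource, [node_to_note(n) for n in nodes])


container = build_container("report.txt", [Node("mammal", "is", "living organism")])
print(container.notes)
```

Keeping the resource identifier on the container rather than on each note mirrors the description of a note set being associated with its resource as a whole.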
- FIG 2 is a high-level diagram showing how a user interacts with the system for computer generation of notes in accordance with one aspect of the invention.
- user 305 utilizes a personal computer 200 with a display 210 to view a graphical user interface (172 of Figure 1) which displays the text of the resource 128 in a document window 260.
- the document window 260 is displayed on the display 210 in a window 270 for viewing using, e.g., an Internet browser.
- On the browser screen in addition to the document 260, there is a "take notes" button 410 preferably in the shape of a notes icon.
- the text of the resource 128 is extracted and passed to a decomposition function 130 which is shown more in detail in conjunction with Figure 3.
- the decomposition function then passes the output, described hereinafter, to a note conversion function 163, more particularly described in conjunction with Figure 4.
- a note taking program 170 then receives the output of the note conversion function 163 and displays the document 260 in the window together with a notes selection window 176 containing notes 160 and with one or more save notes buttons 181.
- FIG. 3 is a block diagram showing the decomposition function of Figure 1 in accordance with one aspect of the invention.
- the diagram is a somewhat simplified illustration of the document decomposition Function 130.
- a Document 260 is first subjected to processing by specific components of a Natural Language Parser 310.
- Although there are a number of Natural Language Parsers 310 available, and the available Natural Language Parsers 310 have widely differing implementations, one well known example is part of the GATE Natural Language Processor. GATE stands for "General Architecture for Text Engineering" and is a project of the University of Sheffield in the United Kingdom. GATE has a very large number of components, most of which have no bearing upon the present invention.
- One embodiment of the current invention utilizes a small subset of GATE components - a Serial Analyzer (called the "ANNIE Serial Analyzer”) 320, a Document of Sentences 330, a Tagger (called the “Hepple Tagger”) 340 - to extract Sentence + Token Sequence Pairs 360. It is the Sentence + Token Sequence Pairs 360 that are utilized by the Document Decomposition Function 130.
- the set of Sentence + Token Sequence Pairs 360 are produced in GATE as follows:
- the Serial Analyzer 320 extracts "Sentences" from an input Document 260.
- the "Sentences” do not need to conform to actual sentences in an input text, but often do.
- the sentences are "aligned" in a stack termed a Document of Sentences 330.
- Each Sentence in the Document of Sentences 330 is then run through the Tagger 340 which assigns to each word in the Sentence a part of speech token.
- the parts of speech are for the most part the same parts of speech well known to school children, although among Taggers 340, there is no standard for designating tokens.
- In the Hepple Tagger, a singular noun is assigned the token "NN", an adjective is assigned the token "JJ", an adverb is assigned the token "RB" and so on. Sometimes, additional parts of speech are created for the benefit of downstream uses. In the described embodiment, the part of speech "TO" created by the Hepple Tagger 340 is an example. The part of speech tokens are maintained in a token sequence which is checked for one-to-one correspondence with the actual words of the sentence upon which the token sequence is based. The Sentence + Token Sequence Pair 360 is then presented to the Node Generation Function 380.
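A Sentence + Token Sequence Pair can be illustrated without GATE. The sketch below uses a toy lexicon in place of a real tagger; the lexicon contents, the extra tag codes "DT" and "VBZ", the "UNK" placeholder and both function names are assumptions made for the example.

```python
# Toy lexicon standing in for a real part-of-speech tagger (hypothetical data).
LEXICON = {"the": "DT", "rose": "NN", "is": "VBZ", "red": "JJ",
           "very": "RB", "to": "TO"}


def tag_sentence(sentence):
    """Return (words, tokens) with one part-of-speech code per word."""
    words = sentence.split()
    # "UNK" is a placeholder code for words missing from the toy lexicon.
    tokens = [LEXICON.get(word.lower(), "UNK") for word in words]
    return words, tokens


def check_correspondence(words, tokens):
    """Raise if the token sequence is not one-to-one with the words."""
    if len(words) != len(tokens):
        raise ValueError("token sequence does not align with sentence")
    return list(zip(words, tokens))


words, tokens = tag_sentence("The rose is very red")
pairs = check_correspondence(words, tokens)
print(pairs)
```

The alignment check is the important part: every later step indexes words and tokens by the same position.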
- a significant element of the present invention are novel Patterns of Tokens ("Patterns”) 370 and Per-Pattern Token Seeking Behavior Constraints ("Constraints”) 375 which are applied to the Sentence + Token Sequence Pair 360 within the Node Generation Function 380 to produce Nodes 180, where such Nodes 180 are specifically intended to be converted into Notes 160, where said Notes 160 conform - with specific exceptions - to notes composed by hand and ad hoc by a human reviewer of the underlying Document 260, and where the set of said Notes 160 represents - with specific exceptions - an exhaustive extraction of all knowledge from said Document 260.
- part of speech patterns and token seeking rules are documented in the literature of Information Extraction, the domain with which the current invention is associated, and in the related field of Information Retrieval.
- Text analysis for the purpose of automated document classification or indexing for search engine-based retrieval is a primary use of part of speech patterns.
- Part of speech patterns and token seeking rules are used in text analysis to discover keywords, phrases, clauses, sentences, paragraphs, concepts and topics.
- while keyword, clause, sentence, and paragraph conform to the common understanding of the terms, the meanings of phrase, concept, and topic vary by implementation.
- the word phrase is defined using its traditional meaning in grammar.
- phrases include Prepositional Phrases (PP), Noun Phrases (NP), Verb Phrases (VP), Adjective Phrases, and Adverbial Phrases.
- the word phrase may be defined as any proper name (for example "New York City"). Most definitions require that a phrase contain multiple words, although at least one definition permits even a single word to be considered a phrase.
- Some search engine implementations utilize a lexicon (a pre-canned list) of phrases.
- the WordNet Lexical Database is a common source of phrases.
- the Notes 160 generated by the present invention cannot be classified as keywords, phrases, clauses, or sentences (or any larger text unit) using the well known definitions of these terms, except by serendipitous execution of the described functions.
- the word concept generally refers to one of two constructs.
- the first construct is concept as a cluster of related words, similar to a thesaurus, associated with a keyword. In a number of implementations, this cluster is made available to a user - via a Graphic User Interface (GUI) for correction and customization. The user can tailor the cluster of words until the resulting concept is most representative of the user's understanding and intent.
- the second construct is concept as a localized semantic net of related words around a keyword. Here, a local or public ontology and taxonomy is consulted to create a semantic net around the keyword.
- Some implementations of concept include images and other non-text elements.
- Topics in general practice need to be identified or "detected" by applying a specific set of operations against a body of text.
- Different methodologies for identification and/or detection of topics have been described in the literature.
- the Notes 160 generated by the current invention cannot be classified as concepts or topics using the well known definitions of these terms, except by serendipitous execution of the described functions.
- a token seeking rule which might be applied in this case - when processing the second sentence - might be to "go back" to find the noun in the first sentence to which the "He" (or the "it") in the second sentence applies.
- the Constraints 375 described herein do not mirror the token seeking rules present in the prior art except in the most abstract of characteristics.
- the Constraints 375 can not be used to identify keywords, phrases, clauses, sentences, concepts or topics.
- the Patterns 370 crafted for the present invention can not be used to identify keywords, phrases, clauses, sentences, concepts or topics in the formally accepted structures of instantiations of those terms. Further, the Patterns 370 and Constraints 375 required for the current invention differ from those required for Serial No. 11/273,568 and Serial No.
- Pattern 370 and Constraints 375 are designed and intended to produce optimally correlatable Nodes 180, such Nodes 180 ideally capturing a Relation (value of Bond 184) between the values of Subject 182 and Attribute 186.
- the present invention sets no such standard for Node 180 creation, but instead, establishes Patterns 370 and Constraints 375 which can ultimately produce Notes 160 at machine speed.
- word classification identifies words as instances of parts of speech (e.g. nouns, verbs, adjectives). Correct word classification often requires a text called a corpus because word classification is dependent upon not what a word is, but how it is used. Although the task of word classification is unique for each human language, all human languages can be decomposed into parts of speech.
- the human language decomposed by word classification in the preferred embodiment is the English language, and the means of word classification is a natural language parser (NLP) (e.g. GATE, a product of the University of Sheffield, UK).
- the NLP encodes a sequence of tokens, where each token is a code for the part of speech of the corresponding word in the sentence.
- where the resource contains at least one formatting, processing, or special character not permitted in plain text, the method is:
- the NLP encodes a sequence of tokens, where each token is a code for the part of speech of the corresponding word in the sentence.
- characters or words that contain characters not recognizable to the NLP are discarded from both the sentence and the sequence of tokens.
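Discarding unrecognizable words must remove the same positions from both the sentence and the token sequence so that the two stay aligned. A minimal sketch follows; the set of "recognizable" characters in `RECOGNIZABLE` and the function name are assumptions chosen for illustration, since the text does not define them.

```python
import re

# Assumed definition of "recognizable": ASCII letters, digits and a little
# punctuation. A real parser would apply its own rules.
RECOGNIZABLE = re.compile(r"[A-Za-z0-9'.,;:!?-]+")


def drop_unrecognized(words, tokens):
    """Remove words with unrecognized characters, and their tokens, in step."""
    if len(words) != len(tokens):
        raise ValueError("words and tokens must align one-to-one")
    kept = [(w, t) for w, t in zip(words, tokens) if RECOGNIZABLE.fullmatch(w)]
    return [w for w, _ in kept], [t for _, t in kept]


# "\x0c" (form feed) is a formatting character not permitted in plain text.
print(drop_unrecognized(["gold", "\x0cis", "rare"], ["NN", "VBZ", "JJ"]))
```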
- resources containing any English language text may be decomposed into nodes, including resources formatted as: (i) text (plain text) files, (ii) Rich Text Format (RTF) (a standard developed by Microsoft, Inc.).
- An alternative method is to first obtain clean text from RTF by the intermediate use of an RTF-to-text conversion utility (e.g. RTF-Parser-1.09, a product of Pete Sergeant), (iii) Extensible Markup Language (XML) (a project of the World Wide Web
- any dialect of markup language files including, but not limited to: HyperText Markup Language (HTML) and Extensible HyperText Markup Language (XHTML™) (projects of the World Wide Web Consortium), RuleML (a project of the RuleML Initiative), Standard Generalized Markup Language (SGML) (an international standard), and Extensible Stylesheet Language (XSL) (a project of the World Wide Web Consortium) as described more immediately hereinafter,
- PDF Portable Document Format
- MS WORD files e.g. DOC files used to store documents by MS WORD (a word processing software product of Microsoft, Inc.)
- MS Word-to-text parser e.g. the Apache POI project, a product of Apache.org.
- the POI project API also permits programmatically invoked text extraction from Microsoft Excel spreadsheet files (XLS).
- An MS Word file can also be processed by a NLP as a plain text file containing special characters, although XLS files can not.
- event-information capture log files including, but not limited to: transaction logs, telephone call records, employee timesheets, and computer system event logs.
- decomposition is applied only to the English language content enclosed by XML element opening and closing tags with the alternative being that decomposition is applied to the English language content enclosed by XML element opening and closing tags, and any English language tag values of the XML element opening and closing tags.
- This embodiment is useful in cases of the present invention that seek to harvest metadata label values in conjunction with content and informally propagate those label values into the nodes composed from the element content. In the absence of this capability, this embodiment relies upon the XML file being processed by a NLP as a plain text file containing special characters.
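The two XML options, content only or content plus label values, can be sketched with the Python standard library. Treating "tag values" as attribute values is an assumption made for this example, and `xml_text` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def xml_text(xml_string, include_tag_values=False):
    """Return (content, labels) extracted from an XML string.

    content: text enclosed by element opening and closing tags.
    labels:  attribute values, collected only when include_tag_values is True
             (an assumed reading of "tag values").
    """
    root = ET.fromstring(xml_string)
    content = [t.strip() for t in root.itertext() if t.strip()]
    labels = []
    if include_tag_values:
        for element in root.iter():
            labels.extend(element.attrib.values())
    return content, labels


sample = '<invoice status="overdue"><amount>500 dollars due</amount></invoice>'
print(xml_text(sample))
print(xml_text(sample, include_tag_values=True))
```

Returning the labels separately leaves the caller free to decide how they are propagated into the nodes built from the element content.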
- Email messages and email message attachments are decomposed using word classification in a preferred embodiment of the present invention.
- the same programmatically invoked utilities used to access and search email repositories on individual computers and servers are directed to the extraction of English language text from email message and email attachment files.
- the NLP used by the present invention is directed to the extraction of English language text from email message and email attachment files.
- Email attachments are decomposed as described earlier for each respective file format.
- the other means of decomposition is decomposition of the information from a resource using an intermediate format.
- the intermediate format is a first term or phrase paired with a second term or phrase.
- the first term or phrase has a relation to the second term or phrase. That relation is either an implicit relation or an explicit relation, and the relation is defined by a context.
- that context is a schema.
- the context is a tree graph.
- that context is a directed graph (also called a digraph).
- the context is supplied by the resource from which the pair of terms or phrases was extracted.
- the context is supplied by an external resource.
- the relation is an explicit relation defined by a context
- that relation is named by that context.
- the context is a schema
- the resource is a Relational Database (RDB).
- the relation from the first term or phrase to the second term or phrase is an implicit relation, and that implicit relation is defined in an RDB.
- the decomposition method supplies the relation with the pair of concepts or terms, thereby creating a node.
- the first term is a phrase, meaning that it has more than one part (e.g. two words, a word and a numeric value, three words)
- the second term is a phrase, meaning that it has more than one part (e.g. two words, a word and a numeric value, three words).
- the decomposition function takes as input the RDB schema.
- the method includes:
- the first term or phrase is the database name
- the second term or phrase is a database table name.
- database name is "ACCOUNTING”
- database table name is "Invoice”
- a node is produced ("Accounting - has - Invoice") by supplying the relation ("has”) between the pair of concepts or terms;
- the first term or phrase is the database table name
- the second term or phrase is the database table column name.
- database table name is "Invoice” and column name is "Amount Due”;
- For each table in the RDB, step (d) is followed, with the steps (a) where the database table names are iteratively used, (b) fixed as the relation, (c) where the individual column names are iteratively used, to produce a node;
- the entire schema of the RDB is decomposed, and because of the implicit relationship being immediately known by the semantics of the RDB, the entire schema of the RDB can be composed into nodes without additional processing of the intermediate format pair of concepts or terms.
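As a rough illustration of schema decomposition, the sketch below walks a schema held in a plain dictionary and emits one node per database/table pair and per table/column pair. The dictionary layout, the function name, the capitalization of the database name and the reuse of "has" as the table-to-column relation are assumptions; the text only shows the "Accounting - has - Invoice" example explicitly.

```python
def decompose_schema(db_name, schema):
    """Turn an RDB schema into (subject, bond, attribute) nodes.

    schema maps each table name to a list of its column names.
    """
    nodes = []
    database = db_name.capitalize()  # "ACCOUNTING" -> "Accounting"
    for table, columns in schema.items():
        nodes.append((database, "has", table))
        for column in columns:
            # Assumed relation: the same fixed "has" bond for table/column.
            nodes.append((table, "has", column))
    return nodes


schema = {"Invoice": ["No.", "Amount Due", "Status"]}
for node in decompose_schema("ACCOUNTING", schema):
    print(" - ".join(node))
```

In practice the schema would be read from the database catalog rather than typed in by hand.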
- the decomposition function takes as input the RDB schema plus at least two values from a row in the table.
- the method includes
- the third part of the compound is the column name of a second column in the table (example "Status"),
- a node is produced ("Invoice No. 500024 Status - is - Overdue") by supplying the relation ("is") between the pair of concepts or terms; (i) For each row in the table, the steps (b) fixed as the key column name, (c) varying with each row, (d) fixed as name of second column, (f) varying with the value in the second column for each row, with (g) the fixed relation ("is"), produces a node (h); (j) For each column in the table, step (i) is run; (k) For each table in the database, step (j) is run;
- the entire contents of the RDB can be decomposed, and because of the implicit relationship being immediately known by the semantics of the RDB, the entire contents of the RDB can be composed into nodes without additional processing of the intermediate format pair of terms or phrases.
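Row-level decomposition can be sketched the same way. The compound subject below is assembled as table name, key column name, key value and second column name, which is inferred from the example node "Invoice No. 500024 Status - is - Overdue"; the function name and the dictionary-per-row input format are hypothetical.

```python
def decompose_rows(table, key_column, rows):
    """Produce one node per non-key column value of every row.

    rows is a list of dictionaries mapping column names to values.
    """
    nodes = []
    for row in rows:
        key_value = row[key_column]
        for column, value in row.items():
            if column == key_column:
                continue
            # Compound subject, e.g. "Invoice No. 500024 Status".
            subject = f"{table} {key_column} {key_value} {column}"
            nodes.append((subject, "is", str(value)))
    return nodes


rows = [{"No.": 500024, "Status": "Overdue"}]
for node in decompose_rows("Invoice", "No.", rows):
    print(" - ".join(node))
```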
- the relation from the first term or phrase to the second term or phrase is an implicit relation, and that implicit relation is defined in a taxonomy.
- the decomposition function will capture all the hierarchical relations in the taxonomy.
- the decomposition method is a graph traversal function, meaning that the method will visit every vertex of the taxonomy graph. In a tree graph, a vertex (except for the root) can have only one parent, but many siblings and many children.
- the method includes:
- a node is produced ("mammal - is - living organism") by supplying the relation ("is") between the pair of concepts or terms;
- the parent/child relations of entire taxonomy tree can be decomposed, and because of the implicit relationship being immediately known by the semantics of the taxonomy, the entire contents of the taxonomy can be composed into nodes without additional processing of the intermediate format pair of concepts or terms.
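A taxonomy traversal that emits one node per parent/child edge might look like the following. It is a sketch under stated assumptions: the taxonomy is given as a dictionary from each vertex to its children, and the function name is hypothetical.

```python
def decompose_taxonomy(children_of, root):
    """Visit every vertex of a taxonomy tree and emit child - is - parent."""
    nodes = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children_of.get(parent, []):
            nodes.append((child, "is", parent))
            stack.append(child)
    return nodes


taxonomy = {
    "living organism": ["mammal"],
    "mammal": ["humans", "apes"],
}
for node in decompose_taxonomy(taxonomy, "living organism"):
    print(" - ".join(node))
```

Because each vertex other than the root has exactly one parent, the traversal needs no visited set.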
- the decomposition function will capture all the sibling relations in the taxonomy.
- the method includes:
- the value of the first child vertex is the first term or phrase (example "humans");
- a node is produced ("humans - related - apes") by supplying the relation
- the relation from the first term or phrase to the second term or phrase is an explicit relation, and that explicit relation is defined in an ontology.
- the decomposition function will capture all the semantic relations of semantic degree 1 in the ontology.
- the decomposition method is a graph traversal function, meaning that the method will visit every vertex of the ontology graph.
- semantic relations of degree 1 are represented by all vertices exactly 1 link ("hop") removed from any given vertex. Each link must be labeled with the relation between the vertices.
- the method includes:
- the degree one relations of entire ontology tree can be decomposed, and because of the explicit relationship being immediately known by the labeled relation semantics of the ontology, the entire contents of the ontology can be composed into nodes without additional processing of the intermediate format pair of terms or phrases.
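For an ontology, the relation is read from the label on each link rather than supplied by the decomposition function. The sketch below assumes the ontology is available as an adjacency mapping of labeled links; the sample vertices and labels are invented for the example.

```python
def decompose_ontology(graph):
    """Emit one node per labeled link, i.e. every degree 1 relation.

    graph maps each vertex to a list of (relation label, neighbor) links.
    """
    nodes = []
    for vertex, links in graph.items():
        for label, neighbor in links:
            if not label:
                # Each link must be labeled with the relation it expresses.
                raise ValueError(f"unlabeled link from {vertex!r} to {neighbor!r}")
            nodes.append((vertex, label, neighbor))
    return nodes


ontology = {
    "humans": [("use", "tools"), ("are", "mammals")],
    "tools": [],
    "mammals": [],
}
for node in decompose_ontology(ontology):
    print(" - ".join(node))
```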
- a node is comprised of parts.
- the node parts can hold data types including, but not limited to text, numbers, mathematical symbols, logical symbols, URLs, URIs, and data objects.
- the node data structure is sufficient to independently convey meaning, and is able to independently convey meaning because the node data structure contains a relation.
- the relation manifest by the node is directional, meaning that the relationships between the relata may be uni-directional or bi-directional.
- a uni-directional relationship exists in only a single direction, allowing a traversal from one part to another but no traversal in the reverse direction.
- a bi-directional relationship allows traversal in both directions.
- a node is a data structure comprised of three parts in one preferred embodiment, and the three parts contain the relation and two relata.
- the arrangement of the parts is:
- a node is a data structure and is comprised of four parts.
- the four parts contain the relation, two relata, and a source.
- One of the four parts is a source, and the source contains a URL or URI identifying the resource from which the node was extracted.
- the source contains a URL or URI identifying an external resource which provides a context for the relation contained in the node.
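A node with three or four parts could be represented as a small immutable record. The field names follow the Subject, Bond and Attribute terms used elsewhere in the description; the field order, the `source` field name and the `parts` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    subject: str                  # first relatum
    bond: str                     # the relation
    attribute: str                # second relatum
    source: Optional[str] = None  # URL or URI; present only in four-part nodes

    def parts(self):
        """Return the node as a 3-tuple, or a 4-tuple when a source is set."""
        base = (self.subject, self.bond, self.attribute)
        return base if self.source is None else base + (self.source,)


three = Node("mammal", "is", "living organism")
four = Node("mammal", "is", "living organism", source="http://example.com/taxonomy")
print(three.parts())
print(four.parts())
```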
- the four parts contain the relation, two relata, and a source, and the arrangement of the parts is:
- nodes 180A, 180B are generated using the products of decomposition by a natural language processor (NLP) 410, including at least one sentence of words and a sequence of tokens where the sentence and the sequence must have a one-to-one correspondence 415. All nodes 180A, 180B that match at least one syntactical pattern 420 can be constructed.
- a syntactical pattern 420 of tokens is selected (example: <noun> <preposition> <noun>);
- Steps (a)-(l) represent an example of a per pattern token seeking behavior constraint 375n of Figure 3.
- Steps (r)-(bb) represent another example of a per pattern token seeking behavior constraint 375 of Figure 3.
- the per pattern token seeking behavior constraints are not necessarily those normally associated with the semantic patterns of a language.
- a preferred embodiment of the present invention is directed to the generation of nodes using all sentences which are products of decomposition of a resource.
- the method includes an inserted step (q) which executes steps (a) through (p) for all sentences generated by the decomposition function of an NLP.
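The individual steps are not reproduced in this section, so the following is only a simplified illustration of matching a three-token pattern against every sentence. It requires the three tokens to be strictly adjacent, which is an assumption and does not model the per-pattern token seeking behavior constraints; the Penn-style codes "NN" and "IN" and the function names are also assumed.

```python
def match_pattern(words, tokens, pattern):
    """Build (subject, bond, attribute) nodes where three adjacent tokens match."""
    left, center, right = pattern
    nodes = []
    for i in range(1, len(tokens) - 1):
        if (tokens[i - 1], tokens[i], tokens[i + 1]) == (left, center, right):
            nodes.append((words[i - 1], words[i], words[i + 1]))
    return nodes


def nodes_for_all_sentences(pairs, pattern):
    """Apply the pattern to every Sentence + Token Sequence Pair."""
    nodes = []
    for words, tokens in pairs:
        nodes.extend(match_pattern(words, tokens, pattern))
    return nodes


pairs = [
    (["price", "of", "gold", "rose"], ["NN", "IN", "NN", "VBD"]),
    (["notes", "on", "paper"], ["NNS", "IN", "NN"]),
]
# <noun> <preposition> <noun>, written with assumed Penn-style codes.
print(nodes_for_all_sentences(pairs, ("NN", "IN", "NN")))
```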
- Nodes can be constructed using more than one pattern.
- the method is:
- the inserted step (a1) is preparation of a list of patterns.
- a list of patterns is shown at item 370 of Figure 3. This list can start with two patterns and extend to essentially all patterns usable in making a node, and include but are not limited to:
- nodes are constructed using more than one pattern, and the method for constructing nodes uses a sorted list of patterns.
- the inserted step (a2) sorts the list of patterns by the center token, then left token then right token (example: <adjective> before <noun> before <preposition>), meaning that the search order for the set of patterns (i) through (v) would become (iii)(ii)(iv)(v)(i), and that patterns with the same center token would become a group.
- (b)(c) Each sequence of tokens is searched for the first center token in the pattern list, i.e. <adjective>
- the located <adjective> token is called the current token; (e1) Using the current token,
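Sorting and grouping a pattern list by center token can be sketched as below. The three patterns are hypothetical stand-ins, since the actual list (i) through (v) is not reproduced here, and alphabetical ordering of token names is an assumption suggested by the adjective, noun, preposition example.

```python
from itertools import groupby

# Hypothetical patterns written as (left, center, right) token names.
patterns = [
    ("noun", "preposition", "noun"),
    ("noun", "adjective", "noun"),
    ("adjective", "noun", "noun"),
]


def sort_key(pattern):
    left, center, right = pattern
    return (center, left, right)  # center first, then left, then right


sorted_patterns = sorted(patterns, key=sort_key)

# Patterns sharing a center token form one group, searched together.
groups = {center: list(group)
          for center, group in groupby(sorted_patterns, key=lambda p: p[1])}

print(sorted_patterns)
print(list(groups))
```

Grouping by center token means a sequence only has to be scanned once per distinct center token rather than once per pattern.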
- Additional interesting nodes can be extracted from a sequence of tokens using patterns of only two tokens.
- the method searches for the right token in the patterns, and the bond value of constructed nodes is supplied by the node constructor.
- the bond value is determined by testing the singular or plural form of the subject (corresponding to the left token) value.
- the token to the left of the current token (called the left token) is examined; (g) If the left token does not match the pattern (<noun>), a. the attempt is considered a failure; b. searching of the sequence of tokens is continued from the current token position; c. until a next matching <adjective> token is located; d. or the end of the sequence of tokens is encountered; (h) if the left token does match the pattern,
- the method for constructing nodes searches for the left token in the patterns, the bond value of constructed nodes is supplied by the node constructor, and the bond value is determined by testing the singular or plural form of the subject (corresponding to the left token) value.
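A two-token search of this kind might look like the sketch below. It assumes tokens arrive as (word, tag) pairs, that the two-token pattern is a noun followed by an adjective, that the bond values are "IS" and "ARE", and that a trailing "s" marks a plural subject. All of these are illustrative assumptions, not details given in the text.

```python
from typing import List, Tuple

Token = Tuple[str, str]          # (word, part-of-speech tag), an assumed layout
Node = Tuple[str, str, str]      # (subject, bond, attribute)


def choose_bond(subject: str) -> str:
    """Pick a bond from the subject's number (naive plural test, an assumption)."""
    return "ARE" if subject.lower().endswith("s") else "IS"


def find_nodes(tokens: List[Token], left_tag: str, center_tag: str) -> List[Node]:
    """Scan for each center token and build a node when the left token matches."""
    nodes: List[Node] = []
    for i, (word, tag) in enumerate(tokens):
        if tag != center_tag:
            continue                      # not a candidate current token
        if i == 0 or tokens[i - 1][1] != left_tag:
            continue                      # failed attempt: resume the search
        subject = tokens[i - 1][0]
        nodes.append((subject, choose_bond(subject), word))
    return nodes


tokens = [("the", "determiner"), ("apples", "noun"), ("red", "adjective"),
          ("very", "adverb"), ("ripe", "adjective")]
nodes = find_nodes(tokens, left_tag="noun", center_tag="adjective")
```

Here the node constructor, not the text, supplies the bond: the sentence contains no verb between the two matched tokens, so the bond is derived from the subject alone.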
- Nodes are constructed using patterns where the left token is promoted to a left pattern containing two or more tokens, the center token is promoted to a center pattern containing no more than two tokens, and the right token is promoted to a right pattern containing two or more tokens.
- The NLP's use of the token "TO" to represent the literal "to" can be exploited. For example,
- Those filters include, but are not limited to:
- Subject, bond, or attribute start or end with a hyphen or an apostrophe;
- Subject, bond, or attribute have a hyphen plus space ("- ") or space plus hyphen (" -") or hyphen plus hyphen ("--") embedded in any of their respective values;
- Subject, bond, or attribute contain sequences greater than length three (3) of the same character (ex: "FFFF");
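The three filters listed above can be sketched as a single predicate. The function names are hypothetical, and the third embedded sequence is read here as two consecutive hyphens, which is an assumption about the intended character sequence.

```python
import re

_EDGE_CHARS = ("-", "'")                 # hyphen or apostrophe at either end
_EMBEDDED = ("- ", " -", "--")           # assumed reading of the embedded sequences
_REPEATED = re.compile(r"(.)\1{3,}")     # four or more of the same character in a row


def part_is_rejected(value: str) -> bool:
    """Return True if one node part trips any of the three filters."""
    if value.startswith(_EDGE_CHARS) or value.endswith(_EDGE_CHARS):
        return True
    if any(seq in value for seq in _EMBEDDED):
        return True
    return _REPEATED.search(value) is not None


def node_passes_filters(subject: str, bond: str, attribute: str) -> bool:
    """A node is kept only if none of its three parts is rejected."""
    return not any(part_is_rejected(p) for p in (subject, bond, attribute))
```

A run of exactly three identical characters is allowed, since the filter applies only to sequences longer than three.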
- the fourth part contains a URL or URI of the resource from which the node was extracted.
- the URL or URI from which the sentence was extracted is passed to the node generation function.
- The URL or URI is loaded into the fourth part, called the Sequence 188, of the node data structure.
- the RDB decomposition function will place in the fourth (sequence) part of the node the URL or URI of the RDB resource from which the node was extracted, typically, the URL by which the RDB decomposition function itself created a connection to the database.
- The URL might be the file path, for example: "c:\anydatabase.mdb". This embodiment is constrained to those RDBMS implementations where the URL for the RDB is accessible to the RDB decomposition function. Note that the URL of a database resource is usually not sufficient to programmatically access the resource.
- Figure 4 illustrates the operation of the node to note conversion function of Figure 1 in accordance with one aspect of the invention.
- the Note Conversion Function 163 is simply illustrated in Figure 4.
- the products of the Decomposition Function 130 are Nodes 180.
- An example Node 180 is given.
- the example Node 180 is composed of three parts.
- The first part of the example Node 180 is a Subject 182, which contains the value "GOLD".
- The second part of the example Node 180 is a Bond 184, which contains the value "IS".
- the third part of the example Node 180 is an Attribute 186, which contains the value "STANDARD".
- The Note Conversion Function 163 extracts the value from the Subject 182 ("GOLD"), converts it to text if the value is not already in text form, and places the text in the leftmost position of the Note 160, which is, in this embodiment, a text data object.
- The Note Conversion Function 163 then concatenates a space character to the current rightmost character of the Note 160 text value.
- The Note Conversion Function 163 then extracts the value from the Bond 184 ("IS"), converts it to text if the value is not already in text form, and appends the text to the right of the current Note 160 text value.
- The Note Conversion Function 163 then concatenates a space character to the current rightmost character of the Note 160 text value.
- The Note Conversion Function 163 then extracts the value from the Attribute 186 ("STANDARD"), converts it to text if the value is not already in text form, and appends the text to the right of the current Note 160 text value.
- the Conversion 163 of the Node 180 into a Note 160 is then complete, and the Note 160 is placed in the Note Container 168.
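The conversion walked through above amounts to joining the three node values with single spaces. A minimal sketch follows, assuming a hypothetical `Node` dataclass and a plain Python list standing in for the Note Container 168:

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Node:
    """Hypothetical three-part node: subject, bond, attribute."""
    subject: Any
    bond: Any
    attribute: Any


def node_to_note(node: Node) -> str:
    """Build the note text left to right, converting each value to text."""
    note = str(node.subject)      # subject text starts the note
    note += " "                   # space after the subject
    note += str(node.bond)        # bond follows
    note += " "                   # space after the bond
    note += str(node.attribute)   # attribute ends the note
    return note


note_container: List[str] = []    # stands in for the Note Container
note_container.append(node_to_note(Node("GOLD", "IS", "STANDARD")))
```

Calling `str()` on each part covers the case where a value is not already in text form, for example a numeric attribute.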
- Figure 5 is a flow chart of the functionality of the Note Taking Program 170.
- a User 305 clicks on the Notes Icon 410 to Start.
- The Note Taking Program 170 will first check that a Document 260 is displayed in the Program 270, e.g., an Internet Browser program. If a Document 260 is in fact displayed, the Note Taking Program 170 will invoke the Document Decomposition Function 130. As is illustrated in Figure 3, the Decomposition Function 130 will create Nodes 180, from which the Note Conversion Function 163 will create Notes 160. Referring again to Figure 5, the Note Conversion Function 163 will then place in Memory 220 a Note Container 168 with all Notes 160 that have been created.
- The Note Taking Program 170 will fetch the Note Container 168 from Memory 220, render a Note Selection Window 176 on the Display 210, render the Controls 181 on the Display 210, and populate the Note Selection Window 176 with Notes 160 rendered for display. Then the Note Taking Program 170 will enable all Controls 181 which bind to all Notes (as opposed to Selected Notes). When the User 305 selects one of the enabled functions and activates the Control 181, the Note Taking Program 170 executes the selected function and Ends. Alternatively, if the User 305 selects Notes 160 from the Note Selection Window 176, the Note Taking Program 170 will enable the Controls 181 that operate on selected Notes only. Alternatively, if there is no Document 260 displayed by the Program 270, the Note Taking Program 170 will check the default directory on Hard Disk 190 for extant Note Container 168 files.
- the Note Taking Program will prompt the User 305 to select a Note Container 168 file.
- the Note Taking Program will retrieve the Note Container 168 from Hard Disk 190 and render the Note Selection Window 176, the Controls 181, and the Notes 160 on the Display 210 for further interactive interface with the User 305.
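The branching in Figure 5 might be sketched as below. Everything concrete here is an assumption made for illustration: the function and parameter names, the `.notes` file extension, storing one note per line, and returning an empty list when no saved container exists (the text does not say what happens in that case).

```python
from pathlib import Path
from typing import Callable, List, Optional


def start_note_taking(document_text: Optional[str],
                      decompose_to_notes: Callable[[str], List[str]],
                      default_dir: Path,
                      choose_file: Callable[[List[Path]], Path]) -> List[str]:
    """Return the notes to show in the selection window."""
    if document_text is not None:
        # A document is displayed: decompose it and convert nodes to notes.
        return decompose_to_notes(document_text)
    # No document: look for saved note containers in the default directory.
    saved = sorted(default_dir.glob("*.notes"))
    if not saved:
        return []                      # assumed behavior, not specified in the text
    chosen = choose_file(saved)        # stands in for prompting the user
    return chosen.read_text(encoding="utf-8").splitlines()
```

Passing the decomposition and file-choice steps in as callables keeps the sketch independent of any particular browser, parser, or dialog toolkit.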
- Figure 6 illustrates a software architecture preferably used for the computer generation of notes.
- Figure 6 is a representation of how Documents 260, whether residing on a Personal Computer 200, in Personal Computer Memory 220, on Personal Computer Hard Disk 190, on Personal Computer Removable Media 250 or on a Network Server 350, can be presented to a User 305 using the present invention. Also shown are the components of the Personal Computer 200 used in the process, including Main Memory 220, Display 210, Hard Disk 190, and Removable Media Drive 250. Finally, the use of Hard Disk 190 and Removable Media to permanently store (persist) the Notes 160 contained in the Note Container 168 is illustrated.
- Figure 7 is a block diagram of a hardware architecture of a personal computer used in carrying out the invention.
- Figure 7 is an illustration of the internal components of a typical laptop or desktop personal computer 700.
- Programs 770 and data are stored on Hard Disk 190 or Removable Media Drive 750, and are placed into Main Memory 720 via the System Bus 730. User interface and results are rendered on the Display 710.
- Documents 760 may be stored on Hard Disk 190 or Removable Media read by a Removable Media Drive 750, and placed in Main Memory where the Documents 760 and their content can be manipulated by Computer Program 770, of which one embodiment of the present invention is an example, as is an Internet Browser such as Internet Explorer, a product of Microsoft Corporation.
- Figure 8 is a block diagram showing use of the note taking functionality in a networked environment.
- Figure 8 is an illustration of a Personal Computer 700 connected to Network Servers 850.
- the connections are through Communication Links 860.
- the types of connections that can be made include connection via a Broadband Network 810, which can directly connect to a Network Server 850 or can connect to a Network Server 850 through the Internet 840.
- a Personal Computer 700 can be connected to a Network Server 850 via Wireless Access 820 to the Internet 840.
- Documents 760 can be stored on a Network Server 850.
- Documents 760 can be retrieved from a Network Server 850 and transmitted over Communication Links 860 to the Personal Computer 700 for use in Software Program 770 such as that which is one embodiment of the present invention.
- Figure 9 is an illustration of an exemplary screen view of a notes selection window and related control buttons in accordance with one aspect of the invention.
- A note selection window 176 is shown associated with two save buttons 181A and 181B. If it is desirable only to save certain notes from the note selection window, those notes will be selected, using, typically, standard operating system functionality, followed by selection of the save selection button 181A. When button 181A is activated, the items that were identified for saving are stored on a hard disk, for example hard disk 190, using the save function 182 of Figure 1. If it is desirable to save all of the notes that have been generated, the save all button 181B can be selected.
- the Nodes 180B generated by the Document Decomposition Function 130 are composed of four parts, the fourth part of such Nodes 180B containing bibliographic information.
- the fourth part of such Nodes 180B is referred to as a Sequence or Source 188.
- The type of bibliographic information that may be captured in the fourth part of such Nodes 180B will vary depending upon the application programming interfaces (APIs) extant for each type of Document 260 and each type of Computer Program 270 used to display the Document 260.
- the bibliographic information captured in the fourth part of said Node 180B will include the URL or URI
- Nodes 180 which are acquired from a Document 260 by the Document Decomposition Function 130 are not clipped or cut and pasted from the text of a Document 260.
- Nodes 180 may be said to be associated with a location in the text of a Document 260, that location being the location in the text corresponding to the location in the Sentence + Token Sequence Pair 760 where the first token of a Pattern 770 was found by the Constraint 775 as it operated upon the Sentence + Token Sequence Pair 760 and was successfully able to complete the generation of a Node 180.
- that location will be captured in the fourth, Sequence 188 part of the Node 180B.
- A Note 160B made from a four-part Node 180B is composed of two parts, a Note Content part 161 and a Note Source part 162.
- the bibliographic material in the Note Source 162 also can be displayed in the Note Selection Window 176, and subsequently printed or emailed.
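One way to picture the relationship between a four-part node and a two-part note is the sketch below. The class names, field names, and the example URL are hypothetical, and the source is shown as a plain string although the text allows other kinds of bibliographic information.

```python
from dataclasses import dataclass


@dataclass
class FourPartNode:
    """Hypothetical node with a fourth, bibliographic part."""
    subject: str
    bond: str
    attribute: str
    source: str        # fourth part, e.g. the URL or URI of the resource


@dataclass
class SourcedNote:
    """Hypothetical two-part note."""
    content: str       # Note Content part
    source: str        # Note Source part


def to_sourced_note(node: FourPartNode) -> SourcedNote:
    """Join the first three parts into the content; carry the source over."""
    content = " ".join((node.subject, node.bond, node.attribute))
    return SourcedNote(content=content, source=node.source)


note = to_sourced_note(
    FourPartNode("GOLD", "IS", "STANDARD", "http://example.com/doc.html"))
```

Keeping the source in its own field lets a display or export step decide separately whether to show, print, or email it alongside the content.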
- the current invention excludes quotations found in text from the default Tagger 740 algorithm.
- The Tagger 740 will, when encountering either an open or a close quotation mark character, utilize a created part of speech token, "QS" for an open quotation and "QT" for a close quotation, to delimit the quotation in the Token Sequence.
- the Node Generation Function 780 when processing the Sentence + Token Sequence Pair 760 will use a special Constraint 775 when a "QS" token is encountered.
- the Constraint 775 will then seek the following
- the User 305 can elect to not respect quotations, in which case quoted text will be processed by the Tagger 740 and the Node Generation Function 780, as is other text in the Document 260.
- The User 305 can elect to respect quotations, but not to preserve quotations in Quotation Nodes 1010. Using this method, when an open quotation token is encountered by the Node Generation Function 780, the quotation-token-delimited words and tokens from the Sentence + Token Sequence Pair 760 will be processed into Nodes 180 by the Node Generation Function 780 independently of the other words and tokens in the Sentence + Token Sequence Pair 760.
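Separating quoted spans from the rest of a token sequence, so that each can be processed independently, could be sketched as follows. The (word, tag) token layout and the handling of an unterminated quotation are assumptions; only the "QS" and "QT" tag names come from the description.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]   # (word, tag), an assumed layout


def split_quotations(tokens: List[Token]) -> Tuple[List[Token], List[List[Token]]]:
    """Separate QS...QT delimited spans from the rest of the token sequence."""
    outside: List[Token] = []
    quotations: List[List[Token]] = []
    current: Optional[List[Token]] = None
    for word, tag in tokens:
        if tag == "QS" and current is None:
            current = []                      # open a new quoted span
        elif tag == "QT" and current is not None:
            quotations.append(current)        # close the span
            current = None
        elif current is not None:
            current.append((word, tag))       # token inside a quotation
        else:
            outside.append((word, tag))       # ordinary token
    if current is not None:
        outside.extend(current)               # unterminated quote: treat as plain text (assumption)
    return outside, quotations


tokens = [("He", "pronoun"), ("said", "verb"), ('"', "QS"),
          ("gold", "noun"), ("glitters", "verb"), ('"', "QT"),
          ("today", "adverb")]
outside, quotations = split_quotations(tokens)
```

Each list in `quotations` can then be handed to node generation on its own, while `outside` is processed as the remainder of the sentence. Skipping the split entirely corresponds to the option of not respecting quotations.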
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to text extracted from documents, emails, relational database tables and other digitized information sources and similar information resources. The extracted text is processed using a decomposition function to create nodes. Nodes are a particular data structure that stores elemental units of information. Nodes can convey meaning because they relate a subject term or phrase to an attribute term or phrase. Removed from the node data structure, the node contents are, or can become, a text fragment that conveys meaning, i.e., a note. The notes generated from each digital resource are associated with the digital resource from which they were captured. The notes are then stored, organized and presented in several ways that facilitate knowledge acquisition and utilization by a user.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP07798451A EP2035962A4 (fr) | 2006-06-12 | 2007-06-12 | Techniques for creating computer generated notes |
PCT/US2007/071015 WO2008153566A1 (fr) | 2007-06-12 | 2007-06-12 | Techniques for creating computer generated notes |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2007/071015 WO2008153566A1 (fr) | 2007-06-12 | 2007-06-12 | Techniques for creating computer generated notes |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008153566A1 true WO2008153566A1 (fr) | 2008-12-18 |
Family
ID=40129991
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2007/071015 WO2008153566A1 (fr) | Techniques for creating computer generated notes | 2006-06-12 | 2007-06-12 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008153566A1 (fr) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014197282A1 (fr) * | 2013-06-04 | 2014-12-11 | Microsoft Corporation | Capture services through communication channels |
WO2019212788A1 (fr) * | 2018-05-02 | 2019-11-07 | Microsoft Technology Licensing, Llc | Creating and updating digital notes via electronic messages |
CN113034995A (zh) * | 2021-04-26 | 2021-06-25 | 读书郎教育科技有限公司 | Method and system for generating dictation content on a student tablet |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040163043A1 (en) * | 2003-02-10 | 2004-08-19 | Kaidara S.A. | System method and computer program product for obtaining structured data from text |
US20040169683A1 (en) * | 2003-02-28 | 2004-09-02 | Fuji Xerox Co., Ltd. | Systems and methods for bookmarking live and recorded multimedia documents |
US6836768B1 (en) * | 1999-04-27 | 2004-12-28 | Surfnotes | Method and apparatus for improved information representation |
US20050289168A1 (en) * | 2000-06-26 | 2005-12-29 | Green Edward A | Subject matter context search engine |
WO2006053306A2 (fr) | 2004-11-12 | 2006-05-18 | Make Sence, Inc | Techniques for knowledge discovery by constructing knowledge correlations using concepts and terms |
-
2007
- 2007-06-12 WO PCT/US2007/071015 patent/WO2008153566A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6836768B1 (en) * | 1999-04-27 | 2004-12-28 | Surfnotes | Method and apparatus for improved information representation |
US20050289168A1 (en) * | 2000-06-26 | 2005-12-29 | Green Edward A | Subject matter context search engine |
US20040163043A1 (en) * | 2003-02-10 | 2004-08-19 | Kaidara S.A. | System method and computer program product for obtaining structured data from text |
US20040169683A1 (en) * | 2003-02-28 | 2004-09-02 | Fuji Xerox Co., Ltd. | Systems and methods for bookmarking live and recorded multimedia documents |
WO2006053306A2 (fr) | 2004-11-12 | 2006-05-18 | Make Sence, Inc | Techniques for knowledge discovery by constructing knowledge correlations using concepts and terms |
Non-Patent Citations (1)
Title |
---|
See also references of EP2035962A4 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014197282A1 (fr) * | 2013-06-04 | 2014-12-11 | Microsoft Corporation | Capture services through communication channels |
CN105493076A (zh) * | 2013-06-04 | 2016-04-13 | 微软技术许可有限责任公司 | Capture services through communication channels |
WO2019212788A1 (fr) * | 2018-05-02 | 2019-11-07 | Microsoft Technology Licensing, Llc | Creating and updating digital notes via electronic messages |
US10771420B2 (en) | 2018-05-02 | 2020-09-08 | Microsoft Technology Licensing, Llc | Creating and updating digital notes via electronic messages |
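CN113034995A (zh) * | 2021-04-26 | 2021-06-25 | 读书郎教育科技有限公司 | Method and system for generating dictation content on a student tablet |
CN113034995B (zh) * | 2021-04-26 | 2023-04-11 | 读书郎教育科技有限公司 | Method and system for generating dictation content on a student tablet |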
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | WWE | Wipo information: entry into national phase | Ref document number: 2007798451 Country of ref document: EP |
 | DPE2 | Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101) | |
 | WWE | Wipo information: entry into national phase | Ref document number: 608/DELNP/2009 Country of ref document: IN |
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07798451 Country of ref document: EP Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |