A METHOD, SYSTEM AND COMPUTER PROGRAM FOR INTELLIGENT TEXT ANNOTATION
Field of the Invention
The present invention generally relates to the field of word processing; more particularly, this invention applies to a method for annotating text documents.
Background of the Invention
With word processors one can annotate text by underlining or highlighting parts of a text, or by writing text in the margin or in a text box anywhere in a document. This facilitates and accelerates text interpretation. Furthermore, if a text annotation function is related to a semantic model, it can provide useful knowledge that helps users to interpret the text even better and makes it easier for them to take quick actions. This can also greatly enhance the accuracy of many text-related applications such as text categorization, topic detection and document search. Some applications, such as Smart Tags in Microsoft Word, when enabled, identify a type of data, such as names, dates, or telephone numbers, and contain the logic needed to provide one or more actions for each data type. The actions that can be taken depend on the type of data that Word recognizes and labels with a smart tag. For example, if a "person name" is recognized in the text, the user can take actions such as Open Contact, Schedule a Meeting, Add to Contacts, or Insert Address. If the user selects, for example, 'Add to Contacts', the corresponding Outlook function for updating contact lists is started and the user can very quickly paste the information into his address book.
However, it is desirable to improve text interpretation by adding more knowledge to the text and deciding the best actions to be performed. Today, no knowledge, or only very limited knowledge, is added to texts.
Summary of the Invention
It is an object of the present invention to intelligently annotate text to improve text interpretation.
This object is reached, according to claim 1, with a method executed on a computer for a developer annotating a text read by a user, said method comprising
- the developer creating a Topic Map comprising topics of interest associated to the user;
- the developer creating a data structure corresponding to a topic model for the user;
- the computer automatically reading the Topic Map and storing, for each topic, topic information comprising topic name and a knowledge structure;
- the computer automatically reading the text and for each topic found in the text, retrieving the stored topic information and filling a topic data structure;
- the computer automatically attaching the filled topic data structure as annotations to corresponding topics found in the text.
The object of the invention is also reached with the methods of the dependent claims.
The object is also achieved, according to claim 10, with the computer program product comprising programming code instructions for executing the steps of the method according to any one of claims 1 to 8 when said program is executed on a computer.
The object is also achieved, according to claim 11, with a system comprising means adapted for carrying out the method according to any one of the method claims.
The principle of the solution is to provide a semantic model for semantically structuring a text in order to transform the information contained in the text into useful knowledge.
The text annotations created with the method of the present invention will help the user to better understand the text, to navigate the knowledge associated with it, to relate its content to his body of knowledge and to take relevant quick actions more easily.
The solution provides the following additional advantages:
1. Embedding a body of knowledge, represented by a topic map which is of interest to a certain user, into a text to help interpret it properly, increasing the user's comprehension of it and guiding the user in taking the right quick actions related to it.
2. Supporting intelligent search in a text based on a semantic model.
3. Permitting knowledge navigation within the context of the text, thereby delivering the right information in the right context at the right time.
4. Covering not only the meta layer above the resources, but also connecting resources within the meta layer (related to a text).
5. Supporting text categorization based on the semantic structure of the text.
6. Using standardized knowledge structures from other origins, such as Topic Maps and dictionaries created from Topic Maps, which can be stored and reused.
7. Improving the performance of the method for creating annotations through the use of FSA-based dictionaries.
8. Including, in the knowledge brought by the text annotations, actions related to the annotated part of the text, which the reader of the text can activate from the text interface.
Brief Description of the Drawings
Fig. 1 describes the context and logical blocks according to the method for creating text annotations according to the preferred embodiment of the invention;
Fig. 2 is an example of a Topic Map built by a designer in relation with the user context preparation according to the preferred embodiment;
Fig. 3 is a sample of an FSA-based Topic Dictionary according to the preferred embodiment, part of the user context preparation;
Fig. 4 describes the entries in a Topic Dictionary and its associated Traversal Dictionary according to the preferred embodiment;
Fig. 5 is a Topic annotation class described in UML according to the preferred embodiment of the invention;
Fig. 6 describes an instantiated Topic Annotation class using the content of dictionaries and action database;
Fig. 7 illustrates the text annotations, generated according to the preferred embodiment, as the user can see them;
Fig. 8 is the general flowchart of the method according to the preferred embodiment;
Fig. 9 is the flowchart of a step of the flowchart of Fig. 8 describing the instantiation of the Topic Annotation classes according to the preferred embodiment;
Fig. 10 describes the context and logical blocks for the user accessing the text annotations created according to the method of the preferred embodiment.
Detailed Description of the preferred embodiment
Fig. 1 describes the context and logical blocks characterizing the method for creating annotated texts according to the preferred embodiment of the invention. The people (100) preparing text annotation for users will be designers or program developers working on a computer (110). The developer first prepares, through a Graphical User Interface, a Topic Map in which the developer enters inter-related information illustrating the interests of a specific user. This Topic Map will orient the structure and the content of the knowledge, organized by topics of interest to the user, which will be used in the text annotations for this user. The principle of the Topic Map is described later in the document in reference to the description of Fig. 2. A Topic Map database (130) is maintained to store already created Topic Maps representing the interests of users. A program or a set of programs, the Annotator (150), is run on the computer to help the developer by automatically executing some steps of the method for annotating text. The developer first asks the program to read one Topic Map and extract all the necessary information to create two dictionaries associated with the corresponding user, a Topic Dictionary (170) and a Traversal Dictionary (160). The dictionaries are described later in the document in reference to the description of Fig. 3 and 4. In the preferred embodiment of the invention, the text annotations include actions associated with the topics of the text. In this case, the developer creates actions associated with the topics of the topic map and stores them in a database, the Action database (135). The Action database, when it is created, and the dictionaries are also part of the context of the user for creating his text annotations. A context is associated with one user or with one population made of people having the same profile, such as one department of a company. To proceed with the creation of the annotations of a given text (taken from a database 180 for instance), the developer creates a data structure for the knowledge structure. This data structure may be an Annotation Topic class stored in a database (140), created in the UML language or any other modeling language. The developer can decide to create a class containing an 'action' object if actions are taken into account for creating the text annotations.
Then, the developer runs the Annotator (150), which identifies the topics of the text and, for each identified topic, automatically creates an instantiation of the Topic class, taking the information from the dictionaries. If an Action database (135) is used, the Annotator links handlers to the actions of the instantiated Topic class. The Annotator then creates the annotations in the text by attaching the instantiated classes to the corresponding topics found in the text. Then, the developer creates a GUI (the User GUI 190) which will allow the user reading the annotated text to access the annotations in a logical way, through menus for instance.
Given the standard structure of Topic Maps, a generic User GUI can be developed for all the texts of all the users. However, the developer may customize a User GUI for one user or for one specific text of one user.
It is noted that the Annotator is a program which helps in creating annotations for texts associated to any user or any population made of people having the same profile. As a matter of fact, the Annotator is able to read any ISO standard Topic Map and any topic object class written in any specified modeling language.
Even if the developer changes the modeling language for describing Topic classes, once the classes are defined the same Annotator program can be used to instantiate the classes and include annotations in the text.
Fig. 2 is an example of a Topic Map built by a designer in relation with the user context preparation according to the preferred embodiment. Topic Maps are a new ISO standard for describing knowledge structures and associating them with information resources, therefore enabling the structuring of unstructured information (http://www.topicmaps.org/xtm/1.0/). A topic map contains a body of knowledge and consists of a collection of topics, each of which represents some concept. Topics are related to each other by associations, which are typed n-ary combinations of topics. A topic may also be related to any number of resources by its occurrences (Fig. 2). Since topic maps define a good model for knowledge semantic structuring, the use of Topic Maps allows annotating texts intelligently to achieve the above-mentioned objectives.
The Topic Map of Fig. 2 illustrates the interests and relationships of an employee of International Business Machines Corporation working with other companies such as Microsoft and Intel and participating in special interest groups in some specific technical domains. The Topic Map of Fig. 2 includes Topics related to his hierarchy as an employee in the company and to his professional relationships outside his company.
Fig. 3 is a sample of an FSA-based Topic Dictionary according to the preferred embodiment, part of the user context preparation. A Topic Dictionary is used to discover topics in texts. The Annotator automatically builds the Topic Dictionary of a user using as input a Topic Map associated with the user. Fig. 3 describes one entry of the Topic Dictionary, which contains:
- A key. This key is a sequence of characters.
- A set of attributes associated with the key. These attributes are separated into logical groups. Each group of attributes (called a "Gloss") contains a specific type of information.
The dictionaries in the preferred embodiment are FSA-based, where keys are represented in the dictionary using a Finite State Automaton and glosses are attached to terminal nodes in the FSA. In this way, dictionary lookup can be done extremely fast by the Annotator. Given a topic map selected by the user (or constructed by him or specifically for him), that represents a body of knowledge that interests him, the two associated dictionaries are generated for the purpose of detecting entities (topics) in text documents and retrieving knowledge structures represented in the topic map related to that topic.
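The idea of keys stored in an automaton with glosses attached to terminal nodes can be illustrated with a minimal sketch. It assumes a plain character trie, which is a simple form of deterministic finite state automaton without the state sharing a production FSA dictionary would use. The names `TrieNode`, `FsaDictionary`, `add` and `lookup`, and the identifiers `topic-ibm` and `topic-intel`, are hypothetical and do not come from the described embodiment.

```python
from typing import Any, Dict, List, Optional


class TrieNode:
    """One state of the automaton; glosses are set only on terminal states."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.glosses: Optional[List[Any]] = None  # None means "not a terminal state"


class FsaDictionary:
    """Hypothetical key -> glosses dictionary backed by a character trie."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, key: str, gloss: Any) -> None:
        """Follow (or create) one transition per character, then attach the gloss."""
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        if node.glosses is None:
            node.glosses = []
        node.glosses.append(gloss)

    def lookup(self, key: str) -> Optional[List[Any]]:
        """Return the glosses of the terminal state reached by key, or None."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None  # no transition for this character
        return node.glosses


# Illustrative entries only (assumed identifiers).
topic_dictionary = FsaDictionary()
topic_dictionary.add("IBM", "topic-ibm")
topic_dictionary.add("Intel", "topic-intel")
```

A lookup costs one transition per character of the key, independent of the number of entries, which is one reason such dictionaries are fast.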
Fig. 4 describes the entries in the Topic Dictionary and the Traversal Dictionary according to the preferred embodiment.
Keys of the Topic Dictionary are topic names. The value (gloss) associated with a key is the topic identifier associated with that key (topic name). For example, with a dictionary that contains the first entry (400) in Fig. 4, when processing a text containing the word "IBM", the dictionary will detect this word and return the identifier for the topic with that name.
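The mapping from topic names to topic identifiers can be sketched as follows. This is only an illustration under stated assumptions: the topic map is reduced to a hypothetical in-memory mapping, an ordinary Python dict stands in for the FSA-based dictionary, and the identifiers (`t-ibm`, `t-intel`) and function names are invented for the example.

```python
from typing import Dict, List

# Hypothetical in-memory excerpt of a topic map: topic identifier -> topic names.
topic_map: Dict[str, List[str]] = {
    "t-ibm": ["IBM", "International Business Machines"],
    "t-intel": ["Intel"],
}


def build_topic_dictionary(tm: Dict[str, List[str]]) -> Dict[str, str]:
    """Return a mapping topic name (key) -> topic identifier (gloss)."""
    dictionary: Dict[str, str] = {}
    for topic_id, names in tm.items():
        for name in names:
            dictionary[name] = topic_id
    return dictionary


def detect_topics(text: str, dictionary: Dict[str, str]) -> List[str]:
    """Look up each whitespace-separated token and collect the identifiers found."""
    found: List[str] = []
    for token in text.split():
        token = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if token in dictionary:
            found.append(dictionary[token])
    return found


topic_dictionary = build_topic_dictionary(topic_map)
detected = detect_topics("IBM works with Intel.", topic_dictionary)
```

This simplified tokenizer only matches single-word names; detecting a multi-word name such as "International Business Machines" would need a longest-match scan over the text.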
The Traversal Dictionary is used to retrieve the knowledge structure related to a given topic. When building this dictionary, a property of Topic Maps was considered which states that each Topic Map construct (e.g. Topics, Topic Names, Occurrences, Associations, etc.) must have an identifier that is unique across the Map. A key in that dictionary is an identifier for a construct, and the glosses associated with a key contain the information for that construct. A number of gloss types were defined for the Traversal Dictionary to hold information related to Topic Maps:
Topic gloss: A topic gloss (410) contains:
- Topic Names Identifiers: used to retrieve the names for that topic.
- Types Identifiers: used to retrieve the types (classes) for that topic.
- Occurrences Identifiers: used to retrieve the occurrences for that topic.
- Roles Played Identifiers: used to retrieve the associations that the topic is participating in along with its roles.
Topic Name gloss: A topic name gloss (420) contains:
- Topic Name: the value of the topic name.
- Type Identifier: used to retrieve the type (class) for that topic name.
- Variants Identifiers: used to retrieve variant forms for that topic name.
Variant gloss: A variant gloss (430) contains:
- Topic Name Identifier: refers to the topic name having this variant.
- Variant value: the value of this variant form for the topic name.
Occurrence gloss: An occurrence gloss (440) contains:
- Occurrence value: the value of the occurrence. It can be a URI or a string for simple properties.
- Type Identifier: used to retrieve the type (class) for that occurrence.
Association Role gloss: An association role gloss (450) contains:
- Type Identifier: used to retrieve the type (class) for that role.
- Player Identifier: refers to the Topic that played that role.
- Association Identifier: used to retrieve the association that the player topic is participating in.
Association gloss: An association gloss (460) contains:
- Type Identifier: used to retrieve the type (class) for that association.
- Association Roles Identifiers: used to retrieve the participants in this association along with their roles.
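The gloss types listed above, and the way identifiers chain from one gloss to the next, can be sketched as follows. This is an illustrative assumption, not the stored format of the embodiment: the glosses are modeled as Python dataclasses, the Traversal Dictionary as a plain dict keyed by construct identifier, and all identifiers are invented. Variant and occurrence glosses are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TopicGloss:
    name_ids: List[str] = field(default_factory=list)
    type_ids: List[str] = field(default_factory=list)
    occurrence_ids: List[str] = field(default_factory=list)
    role_ids: List[str] = field(default_factory=list)  # roles played


@dataclass
class TopicNameGloss:
    value: str
    type_id: str = ""
    variant_ids: List[str] = field(default_factory=list)


@dataclass
class AssociationRoleGloss:
    type_id: str
    player_id: str
    association_id: str


@dataclass
class AssociationGloss:
    type_id: str
    role_ids: List[str] = field(default_factory=list)


# Hypothetical Traversal Dictionary: construct identifier -> gloss.
traversal: Dict[str, object] = {
    "t-ibm": TopicGloss(name_ids=["n-ibm"], role_ids=["r-1"]),
    "t-intel": TopicGloss(name_ids=["n-intel"], role_ids=["r-2"]),
    "n-ibm": TopicNameGloss("IBM"),
    "n-intel": TopicNameGloss("Intel"),
    "r-1": AssociationRoleGloss("t-partner", "t-ibm", "a-1"),
    "r-2": AssociationRoleGloss("t-partner", "t-intel", "a-1"),
    "a-1": AssociationGloss("t-partnership", ["r-1", "r-2"]),
}


def names_of(topic_id: str) -> List[str]:
    """Topic gloss -> topic name glosses -> name values."""
    return [traversal[n].value for n in traversal[topic_id].name_ids]


def associated_topics(topic_id: str) -> List[str]:
    """Role -> association -> other roles -> player topics -> names."""
    related: List[str] = []
    for role_id in traversal[topic_id].role_ids:
        association = traversal[traversal[role_id].association_id]
        for other_role_id in association.role_ids:
            player = traversal[other_role_id].player_id
            if player != topic_id:
                related.extend(names_of(player))
    return related
```

Each retrieval is a short series of successive lookups by identifier, which relies on the property that identifiers are unique across the map.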
An entry of the Action database (135), which is not illustrated in Fig. 4, consists of an action name associated with a topic type, the topic type being an attribute of the Topic object of a Topic class.
Fig. 5 is a Topic annotation class described in UML according to the preferred embodiment of the invention. In order to capture the nature of the Topic Map structure, which permits topics to be related to other topics through "associations", the developer creates a data structure that can expand dynamically as it is recursively filled with topic map knowledge items. This expansion allows for accommodating other topic knowledge items related to other topics in association with the original topic.
In the example of Fig. 5, the actions are taken into account as one topic object is related to one action object, the relation being one to many.
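A data structure with these properties can be sketched as a pair of mutually referencing classes. The sketch below is an assumption about what such a class might look like, not a transcription of Fig. 5: the class and field names (`TopicAnnotation`, `AssociationLink`, `Action`) are hypothetical, and the sample association type and role are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """An action offered for a topic (one topic, many actions)."""
    name: str


@dataclass
class AssociationLink:
    """Links a topic to another annotated topic, which allows recursive expansion."""
    association_type: str
    role: str
    related: "TopicAnnotation"


@dataclass
class TopicAnnotation:
    topic_id: str
    names: List[str] = field(default_factory=list)
    topic_type: str = ""
    occurrences: List[str] = field(default_factory=list)
    associations: List[AssociationLink] = field(default_factory=list)
    actions: List[Action] = field(default_factory=list)


# Illustrative instances only.
intel = TopicAnnotation("t-intel", names=["Intel"], topic_type="Company")
ibm = TopicAnnotation(
    "t-ibm",
    names=["IBM"],
    topic_type="Company",
    actions=[Action("Show stock value")],
)
# The structure grows dynamically: a related topic is itself a TopicAnnotation.
ibm.associations.append(AssociationLink("partnership", "partner", intel))
```

Because an `AssociationLink` holds a full `TopicAnnotation`, the knowledge of related topics can be nested to any depth that the topic map provides.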
Fig. 6 describes an instantiated Topic Annotation class using the content of the dictionaries and the Action database (135). The Annotator automatically instantiates the Topic classes for each topic identified in the text by reading the dictionaries, whose entries have been described above in relation with the description of Fig. 4, and by reading the actions associated with the topic types from the Action database (135). The step for instantiating the classes according to the topics successively identified in the text is described in detail later in the document in relation with the description of Fig. 9.
The instantiated class of Fig. 6 corresponds to the class of Fig. 5. For instance, according to the topic type value, only one action has been related to the topic. This information has been read from the Action database.
Fig. 7 illustrates the text annotations, generated according to the preferred embodiment, as the user can see them. The developer has created a User GUI to display the topic embedded knowledge. The user can interact with the annotated text in the following way: when the user moves his pointer over an annotated token in the text (which is underlined in Fig. 7), 'the topic type' is fetched from the instantiated 'Topic Annotation Data Structure' and displayed above the text by the application. If the user clicks on the annotated token, a menu containing the topic names, topic associations, topic occurrences and associated actions which were stored in the Topic Dictionary (see Fig. 3) is displayed. The user can select an item from this menu and a cascade of menus will be displayed, containing the items fetched from the instantiated 'Topic Annotation Data Structure' (which is linked to the identified topic), according to the user selection and interests (Fig. 7 illustrates four possible interests, including names 700, associations 710 or occurrences 720), thereby providing the user with comprehensive knowledge related to the annotated token in the text. If the user decides to select an action related to the topic identified in the text, the associated action handler immediately executes a command (e.g. sending an e-mail, or 'Show stock value' in 730).
Fig. 10 describes the context and logical blocks for the user (1000) reading the annotated text. To be able to access and interact with the text annotations created by the method of the preferred embodiment, the user (1000) starts the User GUI (190) which may, for instance, interface with his usual editor (1010) on his workstation (1020). The User GUI accesses the text in the TEXT Database (180) and accesses the instantiated Topic classes as stored by the Annotator in the Annotation Topic classes database (140). The User GUI displays the menus providing access to the knowledge related to the topics of the text.
Fig. 8 is the general flowchart of the method for creating annotated texts according to the preferred embodiment. It is noted that the steps of the general flowchart which are not automatically executed by the Annotator program are executed by a designer (800, 820, 830) or by a 'developer' (860) who has competence in designing the applications of steps (800, 820, 830). Most of the time, the same person performs both the 'developer' steps and the 'designer' steps.
The designer creates (800) a Topic Map of user interests. As already mentioned before, the Topic Maps which contain a body of knowledge that interests a user define a good model for knowledge semantic structuring. The Topic Map can be stored in a Topic Map database (130) . This step is preferably performed through a Graphical User Interface (120) running on the designer's workstation.
The designer starts the execution of the Annotator (150), which automatically transforms (810) the Topic Map into dictionaries. The knowledge represented in the topic map is transformed into the two associated dictionaries, the Topic Dictionary and the Traversal Dictionary described before in the document.
The designer creates a Topic class (820) using an object modeling language such as UML. The Topic class captures the nature of the Topic Map structure, which permits topics to be related to other topics through 'associations'.
If the designer has defined an 'action' object related to a topic in the Topic class, he links (830) actions to topic types and stores them in a database (135). This means that an action may be associated with different topics having the same type.
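Linking actions to topic types rather than to individual topics can be pictured with a very small sketch. It assumes the Action database is reduced to an in-memory mapping from topic type to action names; the type names and the helper `actions_for` are hypothetical, and only 'Show stock value' and sending an e-mail are actions mentioned in this description.

```python
from typing import Dict, List

# Hypothetical Action database: topic type -> names of the actions linked to it.
action_database: Dict[str, List[str]] = {
    "Company": ["Show stock value"],
    "Person": ["Send e-mail"],
}

# Hypothetical topics with their types.
topic_types: Dict[str, str] = {
    "IBM": "Company",
    "Intel": "Company",
}


def actions_for(topic_name: str) -> List[str]:
    """Resolve the actions of a topic through its type."""
    topic_type = topic_types.get(topic_name, "")
    return list(action_database.get(topic_type, []))
```

Both company topics resolve to the same action without any per-topic entry, so adding a new topic of an existing type requires no change to the Action database.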
The designer starts the execution of the Annotator program, which automatically instantiates (840) the Topic class created at the previous step for each topic identified in a text to be annotated. The Topic class data structure allows dynamic expansion when recursively filled in this step with Topic Map knowledge items stored in the Topic Dictionary and the Traversal Dictionary and, optionally, the Action database (135). This step is described in more detail in relation with the description of Fig. 9.
The developer starts the execution of the Annotator program, which automatically attaches (850) the corresponding instantiated classes to the topic names identified in the text. The Annotator may store the text which has been so modified in the TEXT database.
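One possible way to picture the attachment step is as standoff annotation: each occurrence of a topic name is recorded as a character span pointing at the filled data structure. This representation is an assumption made for illustration, since the description does not specify how annotations are physically attached; the function name and the dict-based annotations are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def attach_annotations(
    text: str, annotations: Dict[str, dict]
) -> List[Tuple[int, int, dict]]:
    """Return (start, end, annotation) for every occurrence of a known topic name."""
    if not annotations:
        return []
    # Longest names first so that a longer name wins over one of its prefixes.
    names = sorted(annotations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    )
    return [
        (m.start(), m.end(), annotations[m.group(0)])
        for m in pattern.finditer(text)
    ]


# Illustrative input only; the annotation payloads stand in for filled structures.
text = "IBM and Intel announced a joint project."
annotations = {"IBM": {"type": "Company"}, "Intel": {"type": "Company"}}
spans = attach_annotations(text, annotations)
```

Keeping the spans separate from the text leaves the original document unchanged, which is one design option; embedding markup directly in the stored text would be another.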
The developer creates a GUI for the user (860), the User GUI, which allows the user, when reading the annotated text on his computer, to display annotations in a logical way, preferably with menus as illustrated in Fig. 7. Using this User GUI, the user will interact with the annotated text, will navigate the knowledge associated with it, will make use of the embedded knowledge linked to it to interpret it and, if actions are part of the knowledge, will take quick informed actions by simply pointing and clicking on annotated tokens in the text.
Fig. 9 is the flowchart of a step of the flowchart of Fig. 8 describing the instantiation of the Topic Annotation classes (840) according to the preferred embodiment. The Annotator program parses (900) the text and performs a lookup for every parsed token in the Topic Dictionary. If a token is found in the Topic Dictionary, then a topic name is identified and its topic identifier is retrieved using the Topic Dictionary. A lookup for the topic identifier is performed in the Traversal Dictionary and its associated glosses are retrieved. Based on the identifiers extracted from these glosses, the Annotator performs a series of successive lookups in the Traversal Dictionary (920) to retrieve the information related to the knowledge structure of the topic, and then uses it to instantiate (930) the Annotation Topic class (or, more generally, to fill the Annotation Topic data structure).
Then, if the option of supporting actions in the text annotations has been taken, the Annotator fetches the actions associated with the topic type from the Action database (925) to instantiate the actions in the Annotation Topic class (930). Finally, the Annotator program links (935) a handler to each action of the instantiated Annotation Topic class to complete this instantiation. The instantiated Annotation Topic classes are preferably stored in the Annotation Topic class database (140). The handlers start the execution of existing or new programs on the computer when the user accesses, through the User GUI, the actions which are associated with a topic in the annotation displayed when reading the annotated text.
Step 840 is completed when all the tokens of the text have been identified (answer No to test 910) and all the Annotation Topic classes have been instantiated. Then the method for annotating a text returns to step 850.
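The flow of Fig. 9 can be summarized in a minimal sketch. It is written under several assumptions: plain dicts replace the FSA-based dictionaries and the Action database, whitespace splitting replaces the parser, the annotation is a dict rather than an instantiated class, and the handlers only return a string instead of starting a program. All identifiers and function names are hypothetical; the step numbers in the comments refer to the flowchart.

```python
from typing import Callable, Dict, List

topic_dictionary = {"IBM": "t-ibm"}  # topic name -> topic identifier
traversal_dictionary = {             # construct identifier -> gloss
    "t-ibm": {"names": ["n-ibm"], "types": ["t-company"]},
    "t-company": {"names": ["n-company"], "types": []},
    "n-ibm": {"value": "IBM"},
    "n-company": {"value": "Company"},
}
action_database = {"Company": ["Show stock value"]}  # topic type -> action names


def make_handler(action: str, topic_name: str) -> Callable[[], str]:
    """Stand-in for a handler; a real one would start a program."""
    def handler() -> str:
        return f"{action}: {topic_name}"
    return handler


def names_of(topic_id: str) -> List[str]:
    gloss = traversal_dictionary[topic_id]
    return [traversal_dictionary[n]["value"] for n in gloss["names"]]


def instantiate_annotations(text: str) -> List[dict]:
    annotations = []
    for token in text.split():                      # parse the text (900)
        token = token.strip(".,;:!?")
        topic_id = topic_dictionary.get(token)      # Topic Dictionary lookup
        if topic_id is None:
            continue                                # token is not a topic name
        gloss = traversal_dictionary[topic_id]      # successive lookups (920)
        type_names = [names_of(t)[0] for t in gloss["types"]]
        actions: Dict[str, Callable[[], str]] = {}
        for type_name in type_names:                # fetch actions (925)
            for action in action_database.get(type_name, []):
                actions[action] = make_handler(action, token)  # link handler (935)
        annotations.append({                        # fill the structure (930)
            "topic_id": topic_id,
            "names": names_of(topic_id),
            "types": type_names,
            "actions": actions,
        })
    return annotations


result = instantiate_annotations("IBM reported results.")
```

Calling `result[0]["actions"]["Show stock value"]()` invokes the linked handler, which mirrors what happens when the user selects the action in the User GUI.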