WO2011047644A1 - Method and system for generating a summary for an object - Google Patents
Method and system for generating a summary for an object
- Publication number
- WO2011047644A1 (PCT/DE2009/001453)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data structure
- tree data
- references
- text
- node
- Prior art date
Links
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Definitions
- the invention relates to a method and system for generating a summary for an object, such as an electronic document.
- Search engines are known in the prior art which, as the result of a search, display a brief overview for each search result.
- In document search engines, which represent a special type of search engine, these overviews are usually extracts from the text passages in which the searched keyword occurs.
- FIG. 1 shows a result list for a document search engine known from the prior art. It is noticeable that the displayed text excerpts match the document title, which is always displayed above the overview anyway. Ultimately, therefore, displaying the text passages in which the search word occurs is not very helpful for getting a meaningful overview.
- the object of the present invention is to provide a method and a system with which reliable and high-quality summaries are generated for objects, without having the disadvantages known from the prior art.
- a method for generating a summary describing at least one object wherein the at least one object is referenced by at least one tree data structure, wherein the at least one tree data structure comprises a number of nodes, of which at least one node references the at least one object, wherein the node of the at least one tree data structure is associated with a text comprising a number of words, and wherein the at least one tree data structure is storable in a storage device, and wherein the method comprises the steps of:
- Tree structures can be used to extract information that can be used to create efficient and high-quality summaries for objects.
- tree data structures may be: directory structures (e.g., file systems), mind maps, or other hierarchical structures capable of storing references to objects.
- a tree data structure may also be a computer network where the objects are stored on different computers and where the objects are in a hierarchical relationship (e.g., LDAP).
- An object is, for example, an electronic file in a directory of a directory structure, or a document that is referenced or linked from a mind map.
- An essential advantage of tree data structures (abbreviated BDS in the following) is that they can be analyzed directly and quickly without having to access the contents of the objects. The moment a BDS is created by a user, it can be analyzed immediately. Another advantage is that a summary for an object can be generated in near real time, which is particularly advantageous when a user moves a document from one directory to another, for example, since such a move may require the summary for the moved object to be regenerated.
- Identifying the nodes may include identifying the nodes that are in the tree data structure on the path between a root node of the tree data structure and the node of the tree data structure that references the object.
- At least one distance value can be stored which represents the distance of the text to the node referencing the object.
- the distance value may include the number of edges between the node referencing the object and the node to which the text is associated.
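The path identification and the edge-count distance described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Node` class, its field names and the sample texts are assumptions introduced here.

```python
from typing import Optional


class Node:
    """Hypothetical tree node: a text plus an optional reference to an object."""

    def __init__(self, text: str, link: Optional[str] = None) -> None:
        self.text = text
        self.link = link
        self.parent: Optional["Node"] = None
        self.children: list["Node"] = []

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def texts_with_distance(ref_node: Node) -> list[tuple[str, int]]:
    """Walk from the referencing node up to the root, counting edges."""
    result = []
    node, distance = ref_node, 0
    while node is not None:
        result.append((node.text, distance))
        node, distance = node.parent, distance + 1
    return result


root = Node("Research")
topic = root.add(Node("Summarization"))
leaf = topic.add(Node("Statement 3", link="paper.pdf"))
print(texts_with_distance(leaf))
```

The referencing node itself gets distance 0 here; whether it should be included at all is a design choice the text leaves open.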
- a step of reducing the tree data structure may be performed prior to identifying the nodes of the at least one tree data structure. As a result, the generation of summaries can be further accelerated, which is advantageous in particular when a very large number of BDSs have to be analyzed.
- Reducing may include:
- Deleting end nodes which do not represent a reference to an object, and / or
- the tree data structure may be transmitted over a communication network from a client device to a server device, wherein the transfer may be performed prior to reading out the nodes of the tree data structure.
- Before or after the transfer, the tree data structure may be converted to a normalized tree data structure format. This makes it possible to access all BDS in the same way.
- the normalized tree data structure format can be a tree data structure in XML format.
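A conversion into an XML representation might look like the sketch below. The element name `node`, the attributes `text` and `link`, and the nested-dictionary input format are assumptions made for illustration; the source does not specify an XML schema.

```python
import xml.etree.ElementTree as ET


def to_xml(node: dict) -> ET.Element:
    """Convert a nested dict {'text', 'link'?, 'children'?} into an XML element."""
    elem = ET.Element("node", {"text": node["text"]})
    if node.get("link"):
        elem.set("link", node["link"])
    for child in node.get("children", []):
        elem.append(to_xml(child))
    return elem


# Hypothetical mind map with one linked document.
mind_map = {
    "text": "Research",
    "children": [
        {"text": "Summarization", "children": [
            {"text": "Statement 3", "link": "paper.pdf"},
        ]},
    ],
}
xml_text = ET.tostring(to_xml(mind_map), encoding="unicode")
print(xml_text)
```

Once every source format is mapped to the same element structure, the later analysis steps only need to handle this one format.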
- the bookmarks can be arranged in a hierarchical structure.
- At least one distance value may be determined in the hierarchical structure, wherein a distance value represents the distance of the bookmark to a leaf node which represents a bookmark.
- the distance value may include the number of edges between a bookmark and a leaf node that represents a bookmark.
- a bookmark may be a predetermined section of the object, preferably a marked section of a textual object.
- Bookmarks that have a predetermined similarity to a heading contained in the object can be ignored. Predetermined sections of the object that extend over a predetermined number of lines of text can likewise be ignored.
- Combining the summaries may include discarding similar and/or identical summaries except for one summary, and/or storing a similarity value for the similar and/or identical summaries, wherein the similarity value is stored in relation to a stored summary.
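One possible reading of this step is sketched below: a single representative is kept per group of similar summaries, and for each discarded summary a similarity value relative to the kept one is recorded. The Jaccard word overlap and the threshold of 0.8 are assumptions chosen for the example; the source does not name a similarity measure.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, in the range 0..1."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0


def merge_summaries(summaries: list[str], threshold: float = 0.8):
    """Keep one representative per group of similar summaries.

    Returns (kept, dropped); dropped maps each discarded summary to
    (index of the kept summary it resembles, similarity value).
    """
    kept: list[str] = []
    dropped: dict[str, tuple[int, float]] = {}
    for text in summaries:
        for i, ref in enumerate(kept):
            score = jaccard(text, ref)
            if score >= threshold:
                dropped[text] = (i, score)
                break
        else:
            kept.append(text)
    return kept, dropped


kept, dropped = merge_summaries(
    ["tree proximity index", "Tree Proximity Index", "bookmark analysis"]
)
print(kept, dropped)
```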
- a summary generated for a first object can be assigned to a second object. This is particularly advantageous for objects that are scientific publications. The assignment can be made when the second object has a predetermined similarity to the first object.
- Determining a similarity between the first object and the second object may include: Identifying first references in the first object and identifying second references in the second object;
- the order of the first references in the first object and the order of the second references in the second object may be considered, the two objects being deemed similar if the order of a predetermined number of references matches.
- the texts can be subjected to a text transformation in order to generate a transformed text from the texts.
- the text transformation may include at least one of word stemming and stopword filtering.
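A text transformation of this kind can be sketched as follows. The stopword list and the suffix-stripping rule are deliberately crude stand-ins introduced for this example; a real system would more likely use an established stemmer and a language-specific stopword list.

```python
import re

# Illustrative stopword list and suffixes (assumptions, not from the source).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(text: str) -> list[str]:
    """Lower-case, tokenize, drop stopwords, then stem the remaining words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


print(transform("Generating summaries for the linked documents"))
# ['generat', 'summari', 'link', 'document']
```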
- the objects can be stored in a memory device.
- An object can be at least one of document, image, music, movie, internet page, and electronically storable file.
- An object can also be a physical object, e.g. a book that is referenced by a BDS, e.g. by means of its title.
- FIGS. 2 to 4 show examples of tree data structures in unreduced form and in reduced form
- Fig. 5 shows an example of bookmarks in hierarchical structure
- FIG. 6 shows an example of two documents for determining the similarity of the two documents.
- summaries for objects are generated to display the summary in, for example, a search result list along with the found objects.
- the classification of the objects is based on data obtained from tree data structures, such as mind maps or file systems, where the objects are linked from the BDS, and on the objects themselves.
- summaries for objects linked from a BDS are generated with the words that are near the link or the reference.
- Objects that are not linked from a BDS are analyzed (e.g., by identifying and evaluating bookmarks and markers of the object or document) and a summary is generated from the analysis result.
- both procedures can also be combined.
- In that case, both words that are located in a BDS near the reference to the referenced object and words resulting from the analysis of a document can be used.
- the method for generating summaries for objects can be implemented by software, which may include, e.g., client software and/or server software.
- a user may install client software to perform the method of the invention.
- the software identifies all relevant BDS or user-generated data on the user's computer that are suitable for generating summaries.
- A BDS is identified, e.g., via the file extension, via the header of files, or by being explicitly selected by the user.
- the software is started either automatically in the background when the computer boots, explicitly by the user, or by a call from a third-party application.
- Documents, e.g. documents in PDF format, are preferably identified automatically, either by file extension or by file header.
- the software can scan all storage media (hard disk, DVDs, network, etc.) or just look at the main memory, i.e. analyze only the BDS or documents that are currently open or otherwise being processed.
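Identification by file extension or file header can be sketched for PDF files as follows. PDF files begin with the byte sequence `%PDF-`; the function name and the fallback order (extension first, then header) are assumptions made for this example.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # PDF files start with this header


def looks_like_pdf(path: Path) -> bool:
    """Accept a file by its extension, else by its leading bytes."""
    if path.suffix.lower() == ".pdf":
        return True
    try:
        with path.open("rb") as fh:
            return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        return False  # unreadable or missing file
```

The same pattern would apply to BDS formats, with the extension and header bytes of the respective mind-map or directory format substituted.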
- the BDSs or documents are filtered as needed according to factors, e.g.
- the factors can be set arbitrarily or combined with each other. For example, only those BDS or objects could be transmitted to the server that were created in the last 2 months, contain at least 10 links to objects, were not changed in the last 3 days, and were explicitly marked by the user. If necessary, the BDS or objects are converted to another format. For example, proprietary mind map files or documents in PDF format could be converted to XML format.
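The combined filter from the example above could be expressed as a single predicate. The `Candidate` record and its field names are hypothetical, and "2 months" is approximated as 60 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    """Hypothetical metadata record for a BDS or object."""
    name: str
    created: datetime
    modified: datetime
    link_count: int
    marked_by_user: bool


def passes_filter(c: Candidate, now: datetime) -> bool:
    return (
        c.created >= now - timedelta(days=60)      # created in the last ~2 months
        and c.link_count >= 10                     # at least 10 links to objects
        and c.modified <= now - timedelta(days=3)  # unchanged in the last 3 days
        and c.marked_by_user                       # explicitly marked by the user
    )


now = datetime(2009, 10, 19)
stable = Candidate("ideas.mm", now - timedelta(days=30), now - timedelta(days=5), 12, True)
recent = Candidate("draft.mm", now - timedelta(days=30), now - timedelta(days=1), 12, True)
print(passes_filter(stable, now), passes_filter(recent, now))
```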
- the BDS or the objects are then transmitted to a server, wherein the server software can possibly run on the computer of the user on which the BDS or the objects are located.
- the transfer is an optional step, i.e. the method for generating the summaries can also be realized by means of the client software.
- the data received from the server can be transferred to a summary device, which creates summaries for the transferred data and in turn passes these to the server.
- the summarizer may be a computer specially designed to generate summaries.
- the BDSs are converted to another format (for example, from a proprietary format to XML).
- the server stores the data on disk, in memory, in a database or on another suitable medium.
- If necessary, the BDS or the objects are filtered again according to the factors already mentioned.
- 3. Reduce the tree data structure
- Simplifying the BDS can be done by reducing the BDS. Reducing the BDS can be done as follows:
- the link node is merged with the parent node.
- A description is considered non-descriptive when the node name is the same as the file name of the linked object, or is a number. An example of this is given in FIG. 4.
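The reduction can be sketched as a recursive pass over the tree. The sketch combines two rules mentioned in the text: end nodes without a reference to an object are deleted, and a link node with a non-descriptive name is merged into its parent. The dictionary layout and the function names are assumptions, and the source may define further reduction rules that are not reproduced here.

```python
from typing import Optional


def is_nondescript(text: str, link: str) -> bool:
    """Node text equals the linked file name, or is just a number."""
    file_name = link.replace("\\", "/").rsplit("/", 1)[-1]
    return text == file_name or text.strip().isdigit()


def reduce_tree(node: dict) -> Optional[dict]:
    """Return a reduced copy of node, or None if the whole subtree is dropped."""
    children = []
    link = node.get("link")
    for child in node.get("children", []):
        reduced = reduce_tree(child)
        if reduced is None:
            continue  # subtree contained no reference to an object
        is_leaf = not reduced["children"]
        if is_leaf and link is None and is_nondescript(reduced["text"], reduced["link"]):
            link = reduced["link"]  # merge the link node into its parent
        else:
            children.append(reduced)
    if not children and link is None:
        return None  # end node without a reference
    return {"text": node["text"], "link": link, "children": children}


tree = {"text": "Research", "children": [
    {"text": "Statement 1", "children": [{"text": "paper.pdf", "link": "docs/paper.pdf"}]},
    {"text": "Empty idea"},
    {"text": "Statement 2", "link": "other.pdf"},
]}
reduced = reduce_tree(tree)
print([c["text"] for c in reduced["children"]])
```

In this example the node "paper.pdf" disappears and its link moves up to "Statement 1", while "Empty idea" is removed entirely.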
- Certain branches of the BDS can be selected that should (or should not) be analyzed. This is especially important with file systems, so that the user can, e.g., choose to scan only directories and files in c:\my files\ and not those in c:\windows\.
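Branch selection for a file system can be sketched with include and exclude lists. The function name and the list-based configuration are assumptions for illustration; `PureWindowsPath` is used so that the example behaves the same on any platform.

```python
from pathlib import PureWindowsPath


def is_selected(path: str, include: list[str], exclude: list[str]) -> bool:
    """True if path lies under an included branch and under no excluded branch."""
    p = PureWindowsPath(path)

    def under(root: str) -> bool:
        try:
            p.relative_to(PureWindowsPath(root))
            return True
        except ValueError:
            return False

    return any(under(r) for r in include) and not any(under(r) for r in exclude)


include = [r"c:\my files"]
exclude = [r"c:\windows"]
print(is_selected(r"c:\my files\notes\ideas.mm", include, exclude))
print(is_selected(r"c:\windows\system32\driver.dll", include, exclude))
```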
- Bookmarks in documents can also be structured hierarchically (see Fig. 5). This hierarchical structure may also be reduced with the abovementioned methods to get a simplified form of the hierarchical structure of bookmarks.
- Preprocessing may include:
- Additional preprocessing steps may be provided in addition thereto or alternatively thereto.
- the objects that are referenced in a BDS can be identified as follows: in the BDS, those nodes are sought that link to an object or that refer to an object. For example, hyperlinks, file names and/or paths, links, and/or indirect references to objects such as BibTeX keys, file numbers, and similar unique keys or document names (or titles) are searched for. Once all the nodes that link to or reference objects are found, these objects must be identified to make clear what they are. This can be done in one embodiment as follows: a. If a hyperlink was found, i. the hyperlink itself can serve as an identifier
- the title is read from the linked website (the text between the tags <title> and </title>)
- the object type is identified by the file extension (for example, ".pdf") or the header of the file, and other methods can be used, depending on the file type
- Reading the file metadata (title or author, if available), depending on the operating system and file type.
- Reading out the title by determining the text with the largest font that is located in the upper third of the first page and spans fewer than four lines. This text is then adopted as the title (the numerical values mentioned here can, of course, be changed as desired, so that, for example, the search is performed in the upper quarter rather than in the upper third).
- iv. otherwise, generate a hash value (for example MD5) or use the file name and path of the file.
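Two of the identification options listed above, reading the title of a linked web page and falling back to an MD5 hash, can be sketched with the Python standard library. The class and function names are assumptions, and fetching the page or file content is left out.

```python
import hashlib
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text between <title> and </title>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def html_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fallback_identifier(content: bytes) -> str:
    """MD5 hex digest of the file content, used when no title is available."""
    return hashlib.md5(content).hexdigest()


page = "<html><head><title> The Tree Proximity Index </title></head></html>"
print(html_title(page))
```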
- the data (e.g. title, hash value, ...) that have been determined can be compared with existing data in a database (knowledge base). For example, if the title "The Tree Proximity Index - what is it good for?" was extracted from a document and an object titled "The Tree Proximity Index: what is it good for?" is already present in the database, it is probably the same object despite the small difference.
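A tolerant title comparison of this kind can be sketched by normalizing punctuation and case before comparing. The use of `difflib.SequenceMatcher` and the threshold of 0.9 are assumptions for this example; the source does not say how the comparison is performed.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lower-case the title and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"\w+", title.lower()))


def probably_same(a: str, b: str, threshold: float = 0.9) -> bool:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


extracted = "The Tree Proximity Index - what is it good for?"
stored = "The Tree Proximity Index: what is it good for?"
print(probably_same(extracted, stored))
```

After normalization the two example titles are identical, so they match regardless of the chosen threshold.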
- the step b. is also used for objects that are not referenced by a BDS but for which a summary is to be created.
- a summary is generated for the identified objects, preferably as described with reference to FIG. 4, as follows:
- node A e.g., statement 1
- this text is stored, e.g., in a database and assigned to the object, e.g. via a unique identifier. The type of data from which the information was extracted is also stored, for example 1 for PDF, 2 for BDS, etc.
- the text of the parent node of node A may then be read, and the procedure continues with step b. until the root of the BDS is reached. That is, the texts of those nodes are determined which are on the path between the root and the node referencing the object.
- For each node, a distance value is stored that indicates how far away it is from the node referencing the object. The distance value may be, for example, the number of edges between the node referencing the object and the node containing the text. For example, referring to FIG. 4, for the document which is linked behind "statement 3" (and stored, for example, in the database with the unique identifier "1234"), after the end of all the passes, the following would be stored:
- Variants deviating herefrom, for example in determining the distance value or in determining the nodes from which the texts for the summary are read out, are likewise encompassed by the method according to the invention.
- a document eg PDF file
- bookmarks can also be stored in a hierarchical tree structure (see Fig. 5).
- the hierarchical tree of bookmarks can be preprocessed, as described above for BDS.
- When saving in steps b. and c., the object to which the text refers is also stored.
- For each of these objects, the summary of a similar object can be adopted. That is, the summary of object A can be adopted for object B if object A and object B are very similar or nearly identical.
- the similarity of object A to object B can be determined as follows: a. In the documents to be compared, all references in the text are identified; b. The more identical references the two documents contain and the more of these references appear in the same order, the more similar the documents are. Fig. 6 illustrates this. Document A and Document B each have three references.
- the respective references are not in the exact same places but in the same order. If two documents each contain the same references in the same order, these two documents according to the invention are considered to be identical or nearly identical. In such a case, for example, it can be assumed that one of the two documents is a translation or an earlier version of a document, but the content of the documents is very similar. In such a case, the summary of a first document can also be adopted for the second document, if no summary can be generated for the second document. Documents can also be considered (very) similar if only some of the references match and / or occur in the same order. Other methods of determining the similarity of documents could also be used.
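The order-based comparison of references can be sketched using the longest common subsequence of the two reference sequences. The bracketed citation notation, the function names and the normalization by the longer sequence are assumptions for illustration; the source only states that shared references in the same order indicate similarity.

```python
import re


def references(text: str) -> list[str]:
    """Extract bracketed citation keys such as [A] in order of appearance."""
    return re.findall(r"\[([^\]]+)\]", text)


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two reference lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def reference_similarity(doc_a: str, doc_b: str) -> float:
    """Share of references that occur in both documents in the same order."""
    ra, rb = references(doc_a), references(doc_b)
    if not ra or not rb:
        return 0.0
    return lcs_length(ra, rb) / max(len(ra), len(rb))


doc_a = "Intro [A]. The method follows [B], with results as in [C]."
doc_b = "Einleitung [A]. Lange Passage. Methode nach [B]. Ergebnisse wie [C]."
doc_c = "Results [C], then [B], finally [A]."
print(reference_similarity(doc_a, doc_b), reference_similarity(doc_a, doc_c))
```

Documents A and B contain the same references in the same order at different positions and score 1.0, while the reversed order in document C lowers the score.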
- the method according to the invention can be used to display summaries of objects in search engines (alternatively or in addition to the common information, such as text extracts, author information, etc.).
- either all the data available for an object can be displayed, or only a few of these data, for example only summaries that originate from data of several users to an object (that is, if very similar summaries were created from data of several users).
- Summaries can also be displayed for objects that are not in the search hit list but are very similar to the displayed objects.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/DE2009/001453 WO2011047644A1 (fr) | 2009-10-19 | 2009-10-19 | Procédé et système permettant de générer un résumé pour un objet |
DE112009005436T DE112009005436A5 (de) | 2009-10-19 | 2009-10-19 | Verfahren und System zum Erzeugen einer Zusammenfassung für ein Objekt |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/DE2009/001453 WO2011047644A1 (fr) | 2009-10-19 | 2009-10-19 | Procédé et système permettant de générer un résumé pour un objet |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011047644A1 (fr) | 2011-04-28 |
Family
ID=42236390
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/DE2009/001453 WO2011047644A1 (fr) | 2009-10-19 | 2009-10-19 | Procédé et système permettant de générer un résumé pour un objet |
Country Status (2)
Country | Link |
---|---|
DE (1) | DE112009005436A5 (fr) |
WO (1) | WO2011047644A1 (fr) |
-
2009
- 2009-10-19 WO PCT/DE2009/001453 patent/WO2011047644A1/fr active Application Filing
- 2009-10-19 DE DE112009005436T patent/DE112009005436A5/de not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
DELORT J-Y ET AL: "Enhanced Web Document Summarization Using Hyperlinks", HYPERTEXT'03.THE 14TH. ACM CONFERENCE ON HYPERTEXT AND HYPERMEDIA. NOTTINGHAM, UK, AUG. 26 - 30, 2003; [ACM CONFERENCE ON HYPERTEXT AND HYPERMEDIA], NEW YORK, NY : ACM, US LNKD- DOI:10.1145/900051.900097, vol. CONF. 14, 26 August 2003 (2003-08-26), pages 208 - 215, XP002308306, ISBN: 978-1-58113-704-0 * |
Also Published As
Publication number | Publication date |
---|---|
DE112009005436A5 (de) | 2012-11-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE69637125T2 (de) | Optimaler zugriff auf elektronische dokumente | |
DE60209572T2 (de) | Verfahren und vorrichtung zur automatischen erkennung von datentypen für die datentypenabhängige verarbeitung | |
DE69432575T2 (de) | Dokumentenerkennungssystem mit verbesserter Wirksamkeit der Dokumentenerkennung | |
DE60304331T2 (de) | Abrufen übereinstimmender dokumente durch abfragen in einer nationalen sprache | |
DE102007037646B4 (de) | Computerspeichersystem und Verfahren zum Indizieren, Durchsuchen und zur Datenwiedergewinnung von Datenbanken | |
DE102006040208A1 (de) | Patentbezogenes Suchverfahren und -system | |
EP1311989A2 (fr) | Procede de recherche automatique | |
DE10162418A1 (de) | System zur Verarbeitung strukturierter Dokumente, damit sie sich zur Ablieferung über Netzwerke eignen | |
WO2011044865A1 (fr) | Procédé de détermination d'une similarité d'objets | |
DE102004057862A1 (de) | Verfahren zum Abrufen von Bilddokumenten unter Verwendung von Hierarchie- und Kontexttechniken | |
WO2010078859A1 (fr) | Procédé pour déterminer une similarité entre des documents | |
DE10057634C2 (de) | Verfahren zur Verarbeitung von Text in einer Rechnereinheit und Rechnereinheit | |
EP1030254A1 (fr) | Procédé et systeme de gestion de documents | |
DE10033548C2 (de) | Verfahren zur Vorschau von Internetseiten | |
WO2011047644A1 (fr) | Procédé et système permettant de générer un résumé pour un objet | |
DE102009042659A1 (de) | Verfahren zur automatisierten Katalogisierung von digitalen Rasterdaten mit räumlichem Bezug | |
WO2011044864A1 (fr) | Procédé et système de classification d'objets | |
EP1160688A2 (fr) | Procédé et système de lier automatiquement des ensembles de données d'au-moins une source de données et système de récupérer des données liées | |
EP1094405A2 (fr) | Mèthode de creation d'une interface dynamique d'une base de donnees | |
WO2013075745A1 (fr) | Procédé et système d'élaboration de modèles d'utilisateurs | |
DE102016121922A1 (de) | System und Verfahren zum Filtern von Information | |
WO2011044866A1 (fr) | Procédé et système de détermination d'une similarité entre des personnes | |
DE102008051858B4 (de) | Datenorganisations- und auswertungsverfahren | |
EP2050022A1 (fr) | Procédé pour la fabrication de matrices d'images extensibles | |
EP1170676A1 (fr) | Visualisation d'une structure d'informations de documents sur Internet |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09807595 Country of ref document: EP Kind code of ref document: A1 |
WWE | Wipo information: entry into national phase | Ref document number: 1120090054366 Country of ref document: DE Ref document number: 112009005436 Country of ref document: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 09807595 Country of ref document: EP Kind code of ref document: A1 |
REG | Reference to national code | Ref country code: DE Ref legal event code: R225 Ref document number: 112009005436 Country of ref document: DE Effective date: 20121129 |