WO2011047644A1 - Method and system for generating a summary for an object - Google Patents

Method and system for generating a summary for an object

Info

Publication number
WO2011047644A1
Authority
WO
WIPO (PCT)
Prior art keywords
data structure
tree data
references
text
node
Prior art date
Application number
PCT/DE2009/001453
Other languages
German (de)
English (en)
Inventor
Jöran BEEL
Béla GIPP
Jan-Olaf Stiller
Original Assignee
Beel Joeran
Gipp Bela
Jan-Olaf Stiller
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beel Joeran, Gipp Bela, Jan-Olaf Stiller
Priority to PCT/DE2009/001453 (WO2011047644A1)
Priority to DE112009005436T (DE112009005436A5)
Publication of WO2011047644A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34: Browsing; Visualisation therefor
    • G06F16/345: Summarisation for human users

Definitions

  • The invention relates to a method and system for generating a summary for an object, such as an electronic document.
  • Search engines are known in the prior art that display a brief overview of each search result.
  • In document search engines, which are a special type of search engine, these overviews are usually excerpts from the text passages in which the search keyword occurs.
  • FIG. 1 shows a result list for a document search engine known from the prior art. It is noticeable that the displayed text excerpts match the document title, which is always displayed above the overview anyway. Ultimately, therefore, displaying the text passages in which the search word occurs is not very helpful for getting a meaningful overview.
  • The object of the present invention is to provide a method and a system with which reliable and high-quality summaries are generated for objects, without the disadvantages known from the prior art.
  • a method for generating a summary describing at least one object wherein the at least one object is referenced by at least one tree data structure, wherein the at least one tree data structure comprises a number of nodes, of which at least one node references the at least one object, wherein the node of the at least one tree data structure is associated with a text comprising a number of words, and wherein the at least one tree data structure is storable in a storage device, and wherein the method comprises the steps of:
  • Tree structures can be used to extract information that can be used to create efficient and high-quality summaries for objects.
  • tree data structures may be: directory structures (e.g., file systems), mind maps, or other hierarchical structures capable of storing references to objects.
  • a tree data structure may also be a computer network where the objects are stored on different computers and where the objects are in a hierarchical relationship (e.g., LDAP).
  • An object is, for example, an electronic file in a directory of a directory structure, or a document that is referenced or linked from a mind map.
  • An essential advantage of tree data structures (abbreviated BDS in the following) is that they can be analyzed directly and quickly without having to access the contents of the objects. The moment a BDS is created by a user, it can be analyzed immediately. Another advantage is that a summary for an object can be generated in near real time. This is particularly advantageous when, for example, a user moves a document from one directory to another, which may require the summary for the moved object to be regenerated.
  • Identifying the nodes may include identifying the nodes that are in the tree data structure on the path between a root node of the tree data structure and the node of the tree data structure that references the object.
  • At least one distance value can be stored which represents the distance of the text to the node referencing the object.
  • the distance value may include the number of edges between the node referencing the object and the node to which the text is associated.
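The path and distance idea described above can be illustrated with a minimal Python sketch. The `Node` class, its field names and the example labels are assumptions made for this illustration; the text does not prescribe any particular data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A node of a tree data structure (BDS); all field names are illustrative."""
    text: str
    reference: Optional[str] = None  # identifier of a referenced object, if any
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def path_texts(ref_node: Node) -> List[Tuple[int, str]]:
    """Return (distance, text) pairs from the referencing node up to the root.

    The distance is the number of edges between the referencing node
    and the node that carries the text.
    """
    result = []
    node, distance = ref_node, 0
    while node is not None:
        result.append((distance, node.text))
        node, distance = node.parent, distance + 1
    return result


# Hypothetical three-level mind map whose leaf links to an object.
root = Node("Summarization")
topic = root.add(Node("Tree-based approaches"))
leaf = topic.add(Node("Statement 3", reference="1234"))
print(path_texts(leaf))
```

Walking from the referencing node towards the root avoids searching the whole tree, which is one plausible way to keep the analysis fast.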
  • a step of reducing the tree data structure may be performed prior to identifying the nodes of the at least one tree data structure. As a result, the generation of summaries can be further accelerated, which is advantageous in particular when a very large number of BDSs have to be analyzed.
  • Reducing may include:
  • Deleting end nodes which do not represent a reference to an object, and / or
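The deletion of end nodes that do not reference an object might be sketched as follows. The nested-dictionary layout and the key names `text`, `ref` and `children` are assumptions for this sketch. The sketch also removes inner nodes that become end nodes after pruning, which is one possible reading of the step and not something the text states explicitly.

```python
def prune(node: dict) -> bool:
    """Drop end nodes that reference no object; return True if `node` is kept."""
    # Prune the children first, so that emptied branches disappear as well.
    node["children"] = [c for c in node.get("children", []) if prune(c)]
    return bool(node.get("ref")) or bool(node["children"])


# Hypothetical tree: "Ideas" contains no reference at all, "Papers" does.
tree = {
    "text": "Projects",
    "children": [
        {"text": "Ideas", "children": [{"text": "todo"}]},
        {"text": "Papers", "children": [{"text": "Statement 1", "ref": "paper.pdf"}]},
    ],
}
prune(tree)
print([child["text"] for child in tree["children"]])
```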
  • the tree data structure may be transmitted over a communication network from a client device to a server device, wherein the transfer may be performed prior to reading out the nodes of the tree data structure.
  • Before or after the transfer, the tree data structure may be converted to a normalized tree data structure format. This makes it possible to access all BDS in the same way.
  • the normalized tree data structure format can be a tree data structure in XML format.
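A conversion into an XML representation could look roughly like the sketch below, which uses Python's standard `xml.etree.ElementTree` module. The element name `node` and the attribute names `text` and `ref` are assumptions; the text does not specify a schema.

```python
import xml.etree.ElementTree as ET


def to_xml(node: dict) -> ET.Element:
    """Convert a nested-dictionary tree into an XML element tree."""
    elem = ET.Element("node", {"text": node["text"]})
    if "ref" in node:
        elem.set("ref", node["ref"])  # reference to the linked object
    for child in node.get("children", []):
        elem.append(to_xml(child))
    return elem


# Hypothetical input tree with one link node.
tree = {"text": "Papers", "children": [{"text": "Statement 1", "ref": "paper.pdf"}]}
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
print(xml_text)
```

Once every source format is mapped to the same element structure, later steps only need to handle one representation.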
  • the bookmarks can be arranged in a hierarchical structure.
  • At least one distance value may be determined in the hierarchical structure, wherein a distance value represents the distance of the bookmark to a leaf node which represents a bookmark.
  • the distance value may include the number of edges between a bookmark and a leaf node that represents a bookmark.
  • a bookmark may be a predetermined section of the object, preferably a marked section of a textual object.
  • Those bookmarks that have a predetermined similarity to a heading contained in the object can be ignored. Likewise, those predetermined sections of the object that extend over a predetermined number of lines of text can be ignored.
  • Consolidating the summaries may include discarding the similar and/or identical summaries except for one summary, and/or storing a similarity value for the similar and/or identical summaries, wherein the similarity value is stored in relation to a stored summary.
  • a summary generated for a first object can be assigned to a second object. This is particularly advantageous for objects that are scientific publications. The assignment can be made when the second object has a predetermined similarity to the first object.
  • Determining a similarity between the first object and the second object may include: Identifying first references in the first object and identifying second references in the second object;
  • The order of the first references in the first object and the order of the second references in the second object may be considered, the two objects being deemed similar if the order of a predetermined number of references matches.
  • the texts can be subjected to a text transformation in order to generate a transformed text from the texts.
  • the text transformation may include at least one of word stemming and stopword filtering.
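Stopword filtering and word stemming can be illustrated with the following sketch. The stopword list is deliberately tiny, and `naive_stem` is a crude suffix stripper that only stands in for a real stemming algorithm; both are assumptions for illustration.

```python
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}  # illustrative only


def naive_stem(word: str) -> str:
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(text: str) -> list:
    """Lowercase, strip punctuation, drop stopwords, then stem each word."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    return [naive_stem(w) for w in words if w and w not in STOPWORDS]


print(transform("Generating summaries for the objects"))
# prints ['generat', 'summari', 'object']
```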
  • the objects can be stored in a memory device.
  • An object can be at least one of document, image, music, movie, internet page, and electronically storable file.
  • An object can also be a physical object, e.g. a book that is referenced by a BDS, for example by means of its title.
  • FIGS. 2 to 4 show examples of tree data structures in unreduced form and in reduced form
  • Fig. 5 shows an example of bookmarks in hierarchical structure
  • FIG. 6 shows an example of two documents for determining the similarity of the two documents.
  • summaries for objects are generated to display the summary in, for example, a search result list along with the found objects.
  • the classification of the objects is based on data obtained from tree data structures, such as mind maps or file systems, where the objects are linked from the BDS, and on the objects themselves.
  • summaries for objects linked from a BDS are generated with the words that are near the link or the reference.
  • Objects that are not linked from a BDS are analyzed (e.g., by identifying and evaluating bookmarks and markers of the object or document) and a summary is generated from the analysis result.
  • both procedures can also be combined.
  • In that case, both the words that are near the reference to the referenced object in a BDS and the words resulting from a document analysis can be used.
  • The method for generating summaries for objects can be implemented by software, which may include, e.g., client software and/or server software.
  • a user may install client software to perform the method of the invention.
  • the software identifies all relevant BDS or user-generated data on the user's computer that are suitable for generating summaries.
  • A BDS is identified, e.g., via the file extension, via the header of the file, or by being explicitly selected by the user.
  • The software is started either automatically in the background when the computer boots, explicitly by the user, or by a call from a third-party application.
  • Documents, e.g. documents in PDF format, are preferably identified automatically, either by file extension or by file header.
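Identification by file extension or file header might be sketched as follows for PDF documents. PDF files start with the byte sequence `%PDF-`; the function name `looks_like_pdf` and the example file names are assumptions for this sketch.

```python
import tempfile
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # leading bytes of a PDF file header


def looks_like_pdf(path: Path) -> bool:
    """Return True if the extension or the file header indicates a PDF."""
    if path.suffix.lower() == ".pdf":
        return True
    try:
        with path.open("rb") as fh:
            return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        return False  # unreadable or missing file


# Hypothetical file with a misleading extension but a PDF header.
with tempfile.TemporaryDirectory() as tmp:
    disguised = Path(tmp) / "paper.bin"
    disguised.write_bytes(b"%PDF-1.4\n")
    print(looks_like_pdf(disguised))
```

Checking the header as well as the extension catches renamed files, at the cost of opening each candidate file.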
  • The software can scan all storage media (hard disk, DVDs, network, etc.) or just look at the main memory, i.e. analyze only the BDS or documents that are currently open or otherwise being processed.
  • the BDSs or documents are filtered as needed according to factors, e.g.
  • The factors can be set arbitrarily or combined with each other. For example, only those BDS or objects may be transmitted to the server that were created in the last 2 months, contain at least 10 links to objects, were not changed in the last 3 days, and were explicitly marked by the user. If necessary, the BDS or objects are converted to another format. For example, proprietary mind map files or documents in PDF format could be converted to XML format.
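The combined filter from the example above (created in the last 2 months, at least 10 links, unchanged for 3 days, marked by the user) could be expressed as in the sketch below. The `Candidate` record and its field names are assumptions, and "2 months" is approximated as 60 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    """Metadata of a BDS or object; field names are illustrative."""
    name: str
    created: datetime
    modified: datetime
    link_count: int
    marked_by_user: bool


def should_transmit(c: Candidate, now: datetime) -> bool:
    """Apply all four example factors; every one of them must hold."""
    return (
        c.created >= now - timedelta(days=60)       # created in the last ~2 months
        and c.link_count >= 10                      # at least 10 links to objects
        and c.modified <= now - timedelta(days=3)   # not changed in the last 3 days
        and c.marked_by_user                        # explicitly marked by the user
    )


now = datetime(2009, 10, 19)
mind_map = Candidate("thesis.mm", now - timedelta(days=30), now - timedelta(days=5), 12, True)
print(should_transmit(mind_map, now))
```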
  • the BDS or the objects are then transmitted to a server, wherein the server software can possibly run on the computer of the user on which the BDS or the objects are located.
  • The transfer is an optional step, i.e. the method for generating the summaries can also be realized by means of the client software.
  • the data received from the server can be transferred to a summary device, which creates summaries for the transferred data and in turn passes these to the server.
  • the summarizer may be a computer specially designed to generate summaries.
  • the BDSs are converted to another format (for example, from a proprietary format to XML).
  • The server stores the data on disk, in memory, in a database or on another suitable medium.
  • If necessary, the BDS or the objects are filtered again according to the factors already mentioned.
  • 3. Reduce the tree data structure
  • Simplifying the BDS can be done by reducing the BDS. Reducing the BDS can be done as follows:
  • the link node is merged with the parent node.
  • A description is considered non-descriptive when the node name is the same as the file name of the linked object, or is a number. An example of this is given in FIG. 4.
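Merging a link node with its parent when the link node carries no descriptive text might look like the sketch below. The dictionary keys `text`, `refs` and `children`, and the choice to move the reference up to the parent, are assumptions; the text only states that the two nodes are merged.

```python
def basename(ref: str) -> str:
    """File name part of a path, for both "/" and "\\" separators."""
    return ref.replace("\\", "/").rsplit("/", 1)[-1]


def is_nondescriptive(node_text: str, ref: str) -> bool:
    """True if the node name is just a number or the linked file's name."""
    name = node_text.strip()
    return name.isdigit() or name == basename(ref)


def merge_link_nodes(node: dict) -> None:
    """Fold non-descriptive link leaves into their parent node, in place."""
    kept = []
    for child in node.get("children", []):
        merge_link_nodes(child)
        refs = child.get("refs", [])
        is_leaf = not child["children"]
        if is_leaf and len(refs) == 1 and is_nondescriptive(child["text"], refs[0]):
            node.setdefault("refs", []).extend(refs)  # parent now references the object
        else:
            kept.append(child)
    node["children"] = kept


# Hypothetical tree: one link node named after its file, one with real text.
tree = {
    "text": "Summarization",
    "children": [
        {"text": "paper.pdf", "refs": ["docs/paper.pdf"]},
        {"text": "Key finding", "refs": ["docs/other.pdf"]},
    ],
}
merge_link_nodes(tree)
print(tree["refs"], [c["text"] for c in tree["children"]])
```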
  • Certain branches of the BDS can be selected that should (or should not) be analyzed. This is especially important with file systems, so that the user can, e.g., choose to scan only directories and files in c:\my files\ and not those in c:\windows\.
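Selecting and excluding branches of a file system tree can be sketched with `pathlib.PureWindowsPath`, which compares Windows paths case-insensitively. The function names and the include/exclude lists are assumptions for this illustration.

```python
from pathlib import PureWindowsPath
from typing import Iterable


def _is_under(path: PureWindowsPath, root: PureWindowsPath) -> bool:
    """True if `path` equals `root` or lies somewhere below it."""
    return path == root or root in path.parents


def is_selected(path: str, include: Iterable[str], exclude: Iterable[str]) -> bool:
    """Excluded branches win; otherwise the path must lie in an included branch."""
    p = PureWindowsPath(path)
    if any(_is_under(p, PureWindowsPath(e)) for e in exclude):
        return False
    return any(_is_under(p, PureWindowsPath(i)) for i in include)


INCLUDE = [r"c:\my files"]
EXCLUDE = [r"c:\windows"]
print(is_selected(r"c:\my files\maps\thesis.mm", INCLUDE, EXCLUDE))
print(is_selected(r"C:\Windows\system32\drivers.sys", INCLUDE, EXCLUDE))
```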
  • Bookmarks in documents can also be structured hierarchically (see Fig. 5). This hierarchical structure may also be reduced with the abovementioned methods to get a simplified form of the hierarchical structure of bookmarks.
  • Preprocessing may include:
  • Additional preprocessing steps may be provided in addition thereto or alternatively thereto.
  • The objects that are referenced in a BDS can be identified as follows: in the BDS, those nodes are sought that link to or refer to an object. For example, hyperlinks, file names and/or paths, links, and/or indirect references to objects such as BibTeX keys, file numbers, and similar unique keys or document names (or titles) are searched for. Once all the nodes that link to or reference objects have been found, these objects must be identified so that it is clear what each of them is. In one embodiment, this can be done as follows: a. If a hyperlink was found, i. the hyperlink itself can serve as an identifier
  • The title is read from the linked website (the text between the tags <title> and </title>)
  • the object type is identified by the file extension (for example, ".pdf") or the header of the file, and other methods can be used, depending on the file type
  • Reading the file metadata (title or author, if available), depending on the operating system and file type.
  • Reading out the title by determining the text with the largest font that is in the upper third of the first page and spans fewer than four lines. This text is then adopted as the title (the numerical values mentioned here can, of course, be changed as desired, so that, for example, the upper quarter is searched instead of the upper third).
  • iv. Otherwise, a hash value (for example MD5) is generated, or the file name and path of the file are used.
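The fallback to a hash value when no title is available could be sketched as follows, using Python's standard `hashlib` module. The `identify` function and the `title:` / `md5:` prefixes are assumptions for this sketch, not a format defined by the text.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def file_md5(path: Path, chunk_size: int = 65536) -> str:
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify(path: Path, title: Optional[str] = None) -> str:
    """Prefer an extracted title; otherwise fall back to a content hash."""
    if title and title.strip():
        return "title:" + title.strip()
    return "md5:" + file_md5(path)


# Hypothetical document without usable metadata.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "untitled.pdf"
    doc.write_bytes(b"hello")
    print(identify(doc))
    print(identify(doc, title="The Tree Proximity Index"))
```

A content hash identifies byte-identical copies regardless of file name, but it will not match a re-saved or reformatted version of the same document.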
  • The data that have been determined (e.g. title, hash value, ...) can be compared with existing data in a database (knowledge base). For example, if an object with the title "The Tree Proximity Index - what is it good for?" was extracted from a document and an object titled "The Tree Proximity Index: what is it good for?" is already present in the database, it is probably the same object despite the small difference.
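Tolerant matching of titles against the knowledge base might be sketched with the standard `difflib` module. The normalization rule and the threshold of 0.9 are assumptions chosen for illustration; the text does not name a specific comparison method.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase the title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def same_object(title_a: str, title_b: str, threshold: float = 0.9) -> bool:
    """Treat two titles as the same object if they are similar enough."""
    ratio = SequenceMatcher(None, normalize(title_a), normalize(title_b)).ratio()
    return ratio >= threshold


print(same_object(
    "The Tree Proximity Index - what is it good for?",
    "The Tree Proximity Index: what is it good for?",
))
```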
  • the step b. is also used for objects that are not referenced by a BDS but for which a summary is to be created.
  • a summary is generated for the identified objects, preferably as described with reference to FIG. 4, as follows:
  • node A e.g., statement 1
  • This text is stored, e.g., in a database and assigned to the object, e.g. via a unique identifier. The type of data from which the information was extracted is also stored, for example 1 for PDF, 2 for BDS, etc.
  • The text of the parent node of node A may then be read and stored as in step b. This continues until the root of the BDS is reached. That is, the texts of those nodes are determined which are on the path between the root and the node referencing the object.
  • For each node, a distance value is stored that indicates how far the node is from the node referencing the object. The distance value may be, for example, the number of edges between the node referencing the object and the node containing the text. For example, referring to FIG. 4, for the document which is linked behind "statement 3" (and stored, for example, in the database with the unique identifier "1234"), the following would be stored after the end of all the passes:
  • Variants that deviate from this, for example in how the distance value is determined or in which nodes the texts for the summary are read from, are likewise encompassed by the method according to the invention.
  • a document eg PDF file
  • bookmarks can also be stored in a hierarchical tree structure (see Fig. 5).
  • the hierarchical tree of bookmarks can be preprocessed, as described above for BDS.
  • When saving in steps b. and c., the object to which the text refers is also stored.
  • For each of these objects, the summary of a similar object can be adopted. That is, a summary of object A can be adopted for object B if object A and object B are very similar or nearly identical.
  • The similarity of object A to object B can be determined as follows: a. In the documents to be compared, all references in the text are identified; b. The more identical references the two documents contain, and the more of these references appear in the same order, the more similar the documents are. Fig. 6 illustrates this. Document A and Document B each have three references.
  • The respective references are not in exactly the same places, but they are in the same order. If two documents contain the same references in the same order, the two documents are considered identical or nearly identical according to the invention. In such a case it can be assumed, for example, that one of the two documents is a translation or an earlier version of the other, and that the content of the documents is very similar. The summary of the first document can then also be adopted for the second document if no summary can be generated for the second document. Documents can also be considered (very) similar if only some of the references match and/or occur in the same order. Other methods of determining the similarity of documents could also be used.
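One way to score "same references in the same order" is the length of the longest common subsequence of the two reference lists. This is an assumption chosen for illustration, since the text does not name a specific measure. The reference labels below are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (order-preserving matches)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def reference_similarity(refs_a: Sequence[str], refs_b: Sequence[str]) -> float:
    """Share of references that both documents cite in the same order (0.0 to 1.0)."""
    if not refs_a or not refs_b:
        return 0.0
    return lcs_length(refs_a, refs_b) / max(len(refs_a), len(refs_b))


doc_a = ["ref-1", "ref-2", "ref-3"]
doc_b = ["ref-1", "ref-2", "ref-3"]          # same references, same order
doc_c = ["ref-3", "ref-2", "ref-1"]          # same references, reversed order
print(reference_similarity(doc_a, doc_b))    # prints 1.0
print(reference_similarity(doc_a, doc_c))
```

A threshold on this score could then decide whether the summary of one document is adopted for the other.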
  • The method according to the invention can be used to display summaries of objects in search engines (alternatively or in addition to the common information, such as text extracts, author information, etc.).
  • either all the data available for an object can be displayed, or only a few of these data, for example only summaries that originate from data of several users to an object (that is, if very similar summaries were created from data of several users).
  • Summaries can also be displayed for objects that are not in the search hit list but are very similar to the displayed objects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a system for generating summaries for objects, such as documents, in which a tree data structure referencing the object, or bookmarks and/or markings within a document, are analyzed. Summaries can therefore also be generated for objects other than documents in text form. The generated summaries and the objects associated with them can be stored in a storage device and made available to search engines, so that the summaries and the objects can be displayed together in a search result list.
PCT/DE2009/001453 2009-10-19 2009-10-19 Method and system for generating a summary for an object WO2011047644A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/DE2009/001453 WO2011047644A1 (fr) 2009-10-19 2009-10-19 Method and system for generating a summary for an object
DE112009005436T DE112009005436A5 (de) 2009-10-19 2009-10-19 Method and system for generating a summary for an object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/DE2009/001453 WO2011047644A1 (fr) 2009-10-19 2009-10-19 Method and system for generating a summary for an object

Publications (1)

Publication Number Publication Date
WO2011047644A1 (fr) 2011-04-28

Family

ID=42236390

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DE2009/001453 WO2011047644A1 (fr) 2009-10-19 2009-10-19 Method and system for generating a summary for an object

Country Status (2)

Country Link
DE (1) DE112009005436A5 (fr)
WO (1) WO2011047644A1 (fr)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DELORT J-Y ET AL: "Enhanced Web Document Summarization Using Hyperlinks", HYPERTEXT'03.THE 14TH. ACM CONFERENCE ON HYPERTEXT AND HYPERMEDIA. NOTTINGHAM, UK, AUG. 26 - 30, 2003; [ACM CONFERENCE ON HYPERTEXT AND HYPERMEDIA], NEW YORK, NY : ACM, US LNKD- DOI:10.1145/900051.900097, vol. CONF. 14, 26 August 2003 (2003-08-26), pages 208 - 215, XP002308306, ISBN: 978-1-58113-704-0 *

Also Published As

Publication number Publication date
DE112009005436A5 (de) 2012-11-29

Similar Documents

Publication Publication Date Title
DE69637125T2 (de) Optimal access to electronic documents
DE60209572T2 (de) Method and device for automatic recognition of data types for data-type-dependent processing
DE69432575T2 (de) Document recognition system with improved effectiveness of document recognition
DE60304331T2 (de) Retrieving matching documents by querying in a national language
DE102007037646B4 (de) Computer storage system and method for indexing, searching and data retrieval of databases
DE102006040208A1 (de) Patent-related search method and system
EP1311989A2 (fr) Automatic search method
DE10162418A1 (de) System for processing structured documents so that they are suitable for delivery over networks
WO2011044865A1 (fr) Method for determining a similarity of objects
DE102004057862A1 (de) Method for retrieving image documents using hierarchy and context techniques
WO2010078859A1 (fr) Method for determining a similarity between documents
DE10057634C2 (de) Method for processing text in a computer unit, and computer unit
EP1030254A1 (fr) Method and system for document management
DE10033548C2 (de) Method for previewing internet pages
WO2011047644A1 (fr) Method and system for generating a summary for an object
DE102009042659A1 (de) Method for automated cataloguing of digital raster data with spatial reference
WO2011044864A1 (fr) Method and system for classifying objects
EP1160688A2 (fr) Method and system for automatically linking data sets from at least one data source, and system for retrieving linked data
EP1094405A2 (fr) Method for creating a dynamic database interface
WO2013075745A1 (fr) Method and system for creating user models
DE102016121922A1 (de) System and method for filtering information
WO2011044866A1 (fr) Method and system for determining a similarity between persons
DE102008051858B4 (de) Data organization and evaluation method
EP2050022A1 (fr) Method for producing extensible image matrices
EP1170676A1 (fr) Visualization of an information structure of documents on the Internet

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 09807595; Country of ref document: EP; Kind code of ref document: A1

WWE WIPO information: entry into national phase
Ref document number: 1120090054366; Country of ref document: DE
Ref document number: 112009005436; Country of ref document: DE

122 EP: PCT application non-entry in European phase
Ref document number: 09807595; Country of ref document: EP; Kind code of ref document: A1

REG Reference to national code
Ref country code: DE; Ref legal event code: R225; Ref document number: 112009005436; Country of ref document: DE; Effective date: 20121129