WO2011044864A1

WO2011044864A1 - Method and system for classifying objects

Info

Publication number: WO2011044864A1
Application number: PCT/DE2009/001420
Authority: WO
Inventors: Jöran BEEL; Béla GIPP; Jan-Olaf Stiller
Original assignee: Beel Joeran; Gipp Bela; Jan-Olaf Stiller
Priority date: 2009-10-12
Filing date: 2009-10-12
Publication date: 2011-04-21

Abstract

The invention relates to a method and a system for classifying at least one object, wherein the object is related to at least one tree data structure, wherein the tree data structure has a number of nodes, wherein at least one text, comprising a number of words, is associated with a node of the tree data structure, wherein the method comprises at least the following steps and the system is designed to perform at least the following steps: reading out the texts associated with the nodes of the tree data structure; weighting the texts, wherein for each word of a text, a weighting value is generated and is associated with the word of the text, wherein different weighting values can be generated for a word that occurs in different texts; and generating a number of classification values, wherein each classification value is represented by a triplet, comprising an object identification that identifies the object, a word, and a weighting value associated with the word (object identification, word, weighting value).

Description

Method and system for classifying objects

Field of the invention

The invention relates to a method and a system for classifying objects which are referenced by at least one tree data structure or which are related to at least one tree data structure.

State of the art

Methods are known by which objects, e.g. documents, can be classified. For documents, for example, the document text is analyzed. It is assumed that the words most frequently found in the document are probably the ones best suited to describing it. There are numerous algorithms for this, such as TF-IDF and BM25. The problem here is that the full text is not always available and that authors often do not use strict terminology, so that the documents are not found later when a search uses synonyms. A further disadvantage is that words which do not appear in the document but would describe it better cannot be used to search for the document, since these words are not taken into account by the known classification methods.

Objects such as music or images are often "tagged", i.e. users assign keywords so that the objects can be searched for using those keywords. The disadvantage here is that objects which have not been tagged are not found or cannot be searched for. Furthermore, expert search engines are known which can be used to search for persons with specific knowledge. To do this, the system needs to know which areas a person is familiar with or knows very well. In known methods, users can register their knowledge in a database. However, this is very laborious and often very inaccurate, for example when people enter knowledge they do not actually have. Automated methods are also known in which e-mails or other written documents of the persons are analyzed. However, e-mails often contain a great deal of irrelevant information, so that the quality of the classification of persons is usually very low.

Object of the invention

The object of the present invention is to provide a method and a system with which objects can be classified reliably and with high quality, without having the disadvantages known from the prior art.

Inventive solution

This object is achieved by a method having the features of claim 1 and a system having the features of claim 28. Advantageous embodiments of the invention are specified in the following description and the other claims.

Accordingly, there is provided a method of classifying at least one object, wherein the object is related to at least one tree data structure, the at least one tree data structure having a number of nodes connected by edges, wherein a node of the at least one tree data structure comprises at least one text comprising a number of words, and wherein the at least one tree data structure can be stored in a memory device, and wherein the method comprises at least the following steps:

- reading out the texts associated with the nodes of the at least one tree data structure;

- weighting the texts, wherein for each word of a text a weighting value is generated and assigned to the word of the text, and wherein different weighting values can be generated for a word which occurs in different texts; and

- generating a number of classification values, each classification value being represented by a triple consisting of an object identification identifying the object, a word, and a weighting value associated with the word (object identification, word, weighting value).

The data source for classifying objects is a tree data structure in which the objects are referenced or with which an object is related. An example of a relationship between an object and a tree data structure is an author who has created the tree data structure. The author is then related to the tree data structure. In the following, the term tree data structure or tree data structures is abbreviated BDS. The terms "referencing" and "linking" or the terms "reference" and "link" are used synonymously below. Tree structures can be used to extract information that can significantly improve search engines (for example, search for documents, people, ...).

According to the invention, tree data structures can be: directory structures (eg file systems), mind maps or other hierarchical structures which are suitable for storing references to objects. A tree data structure may also be a computer network where the objects are stored on different computers and where the objects are in a hierarchical relationship. For example, an object is an electronic file in a directory of a directory structure or a document which is referenced or linked from a Mind Map.

An important advantage of BDS is that they can be analyzed directly and quickly without having to access the content of the objects to be classified. The moment a BDS is created by a user, it can be analyzed immediately. Another advantage is that the classification of objects can be determined in near real time, which is particularly advantageous when, for example, a user moves a document from one directory to another, which can result in reclassification of the moved object. A further advantage is that the storage space required to perform an efficient search for documents can be significantly reduced compared to the methods known from the prior art: the words of the document content can be disregarded, since only the words of the tree data structure are included in the classification.

The relationship of the object to the at least one tree data structure may be formed by at least one node, which represents a reference to the object, of the tree data structure.

The weighting value of a word can be generated from the number of edges between the node referencing the object and the node to which the text of the word is assigned. It is advantageous if the weighting value of a word is generated according to the calculation rule 1 / ((number of edges between object and word) + 1).

In generating the classification values, those texts associated with those nodes located in the tree data structure on the path between a root node and the node referencing the object may be taken into account. When generating the classification values, alternatively or additionally, those texts may be used which are associated with such nodes that are sibling nodes of those nodes that are on the path between a root node and the node referencing the object.

When referencing an object through nodes of multiple tree data structures, the weight values of identical words may be combined to produce a total weight value for the word. The combining of the weighting values may include at least adding the weighting values.
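As a minimal sketch of this combining step (in Python; the function name `combine_weights` and the sample identifiers are illustrative assumptions, not part of the application), identical words of the same object can be merged by adding their weighting values:

```python
from collections import defaultdict

def combine_weights(triples):
    """Add up the weighting values of identical (object, word) pairs."""
    totals = defaultdict(float)
    for object_id, word, weight in triples:
        totals[(object_id, word)] += weight
    return [(object_id, word, weight) for (object_id, word), weight in totals.items()]

# Hypothetical triples obtained from two different tree data structures.
triples_tree_a = [("doc-1", "citation", 1.0), ("doc-1", "analysis", 0.5)]
triples_tree_b = [("doc-1", "citation", 0.5)]

combined = combine_weights(triples_tree_a + triples_tree_b)
```

Addition is only the combination named in the text; other combinations could be substituted in the same place.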

In a further embodiment, the relationship of the object to the at least one tree data structure may be formed by an association of the object with the at least one tree data structure.

After reading out the nodes, the number of occurrences of each word and / or each compound word in the tree data structure can be determined.

In generating the weighting value of a word, the number of nodes included in a partial tree data structure may be taken into account, wherein the root of the partial tree data structure is formed by the node containing the word. The weighting value of a word can be generated according to the calculation rule

√((number of child nodes and children's child nodes) + 1).

The weighting value of a word can also be generated according to a calculation rule based on the number of direct child nodes.

For a word that occurs multiple times in a tree data structure, a total weighting value can be generated according to the calculation rule

highest weighting value + √(Σ smaller weighting values).

Several tree data structures can also be combined into a single tree data structure.

Before weighting the texts, the texts can be subjected to a text transformation in order to generate a transformed text from the texts. The text transformation may include at least one of word stemming and stopword filtering.
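The text transformation can be illustrated with a small Python sketch. The stop word list is taken from the examples named in this description; the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemming algorithm and is an assumption made only for illustration.

```python
STOP_WORDS = {"and", "or", "the", "how"}
SUFFIXES = ("ing", "es", "s")  # naive assumption, not a real stemming algorithm

def stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def transform(text):
    """Lowercase the text, drop stop words and stem the remaining words."""
    words = [w.strip('.,;:!?"').lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]

transformed = transform("The Journals and the Publications")
```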

The classification values may be stored in a memory device. The objects may also be stored in a memory device and related to the classification values in the memory device.

Before reading out the nodes of the at least one tree data structure, a step of reducing the tree data structure may be performed. As a result, the determination of similarity values between objects can be accelerated, which is advantageous in particular when a very large number of BDS have to be analyzed. In addition, the reduction can further increase the quality of the classification, since the reduction removes nodes that are irrelevant to the classification.

The tree data structure may be transmitted over a communication network from a client device to a server device, wherein the transfer may be performed prior to reading out the nodes of the tree data structure.

Before transferring or after transfer, the tree data structure may be converted to a normalized tree data structure format. This makes it possible to access all BDS in the same way. The normalized tree data structure format can be a tree data structure in XML format. An object can be at least one of document, image, music, movie, website, electronically storable file, and author. An object can also be a physical object, eg a book, which is referenced by a BDS on the basis of eg the title.

To solve the technical problem, the invention also provides a system for classifying objects, wherein the system is designed to carry out the method according to the invention.

Brief description of the figures

The invention is explained further with reference to the drawing, in which:

FIGS. 1 to 3 show examples of tree data structures in non-reduced form and reduced form;

FIGS. 4 to 6 show examples of tree data structures for explaining the weighting; and

FIG. 7 shows an example of a tree data structure for classifying words in relation to each other.

Description of a preferred embodiment

According to the invention, objects (e.g. web pages, people, documents, pictures, music, movies, words, etc.) are classified in order to make them findable by keyword-based search. The classification of the objects is based on data obtained from tree data structures, such as mind maps or file systems, where the objects are linked or referenced from the BDS or are related to the BDS. According to the invention, objects which are linked from a BDS are classified with the words which are in the vicinity of the link or reference. According to the invention, the author of a BDS is classified with the words of the BDS created by him. According to the invention, words connected by edges in the BDS are also related to one another.

The method of classifying objects may be implemented in software, which may include, e.g., client software and/or server software.

1. Software installation and data transfer to server

A user may install client software to perform the method of the invention. The software identifies all relevant BDS on the user's computer. A BDS is identified, e.g., via the file extension, via the header of files, or by being explicitly selected by the user. The software is started either automatically in the background when the computer boots, explicitly by the user, or by a call from a third-party application. The software can scan all storage media (hard disk, DVDs, network, etc.) or consider only the main memory, i.e. analyze only those BDS that are currently open or otherwise being processed.

The BDS are filtered as needed by factors, e.g.

- Size (file size, or number of nodes or referenced objects in the BDS)

- Last modified date or creation date

- Change frequency (number of changes divided by a period)

- Number of links to objects in a BDS (for example, that a mind map must contain at least 20 links to web pages before being considered)

- Location (only the BDS from specific directories)

- BDS type (only mind maps of a specific software, or just the file system, etc.).

The factors can be set arbitrarily or combined with each other. For example, only those BDS could be considered that were created in the last 2 months, contain at least 10 links to objects, have not been changed in the last 3 days, and were explicitly marked by the user to be transferred to the server. If necessary, the BDS are converted to another format. For example, proprietary mind map files could be converted to XML. The BDS are then transmitted to a server; the server software may also run on the user's computer on which the BDS are located.
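A combination of factors such as the one in the example (recent creation, a minimum number of links, no recent changes, explicit marking by the user) could be expressed as a simple predicate. The sketch below is in Python; the `TreeInfo` record, its field names and the reading of "2 months" as 60 days are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class TreeInfo:
    """Hypothetical summary of one BDS found on the user's computer."""
    path: str
    created: datetime
    modified: datetime
    link_count: int
    marked_by_user: bool

def is_relevant(info, now):
    """Combine several filter factors; all of them must hold."""
    return (
        now - info.created <= timedelta(days=60)      # created in the last 2 months
        and info.link_count >= 10                     # at least 10 links to objects
        and now - info.modified >= timedelta(days=3)  # unchanged for 3 days
        and info.marked_by_user                       # explicitly marked for transfer
    )

now = datetime(2009, 10, 12)
candidate = TreeInfo("research.mm", now - timedelta(days=30), now - timedelta(days=5), 12, True)
too_small = TreeInfo("notes.mm", now - timedelta(days=30), now - timedelta(days=5), 3, True)
```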

2. Save the data to server

If necessary, the BDS are converted to another format (for example, from a proprietary format to XML). The server stores the data on disk, in memory, in a database or on another suitable medium. If necessary, the BDS are filtered again according to the factors already mentioned.
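The conversion to XML mentioned in this step might look roughly as follows in Python, using the standard library module `xml.etree.ElementTree`. The element name `node` and the attribute names `text` and `link` are assumptions; the application does not define a concrete schema.

```python
import xml.etree.ElementTree as ET

def to_xml(node):
    """Convert a nested dict describing a tree into an XML element."""
    element = ET.Element("node", {"text": node["text"]})
    if node.get("link"):
        element.set("link", node["link"])
    for child in node.get("children", []):
        element.append(to_xml(child))
    return element

# Hypothetical mind map with one linked document.
tree = {
    "text": "Mind Mapping",
    "children": [{"text": "Studies", "link": "studies.pdf"}],
}
xml_text = ET.tostring(to_xml(tree), encoding="unicode")
```

Once every BDS is available in one such format, the later steps can read all of them with the same code.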

3. Reduce the tree data structure

In some cases, it is advantageous to simplify the BDS before classifying the objects that are referenced in the BDS. Reducing the BDS can be done as follows:

- Delete all end nodes that have no links to objects. FIG. 1 shows on the left a BDS in non-reduced form and on the right a BDS in reduced form, in which all end nodes which do not contain any links to objects have been deleted.

- Reduce the link nodes that have no sibling nodes to the next possible level, so that siblings arise. An example of this is given in FIG.

- Combine nodes that link to an object without meaningful description. In this case, the link node is merged with the parent node. An unintelligible description is, for example, if the node name is the same as the filename of the linked object or a number. An example of this is given in FIG.

- Filtering according to users or specific texts: for example, nodes that are marked as "private" or similar in the BDS are ignored, and/or nodes (and their subnodes) whose parent nodes are named "temp", "todo", "still to sort", "xxx", etc. are ignored. The words can be specified by the user or the programmer.

- Certain branches of the BDS can be selected that should (or should not) be analyzed. This is especially important with file systems, so that the user can, e.g., choose to scan only directories and files in c:\my files\ and not c:\windows\.

- Combination of the above methods to reduce the BDS.
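Two of the reduction rules above, deleting end nodes without links and ignoring branches with certain names, can be sketched in Python as follows. The nested-dict tree representation and the `IGNORED` word list are assumptions. The sketch also removes nodes that only become end nodes after their children were deleted, which is one possible reading of the rule.

```python
IGNORED = {"temp", "todo", "private"}  # hypothetical user-defined word list

def reduce_tree(node):
    """Return a copy of the tree without ignored branches and linkless end nodes."""
    kept = []
    for child in node.get("children", []):
        if child["text"].lower() in IGNORED:
            continue  # skip the ignored node together with its whole subtree
        child = reduce_tree(child)
        if child["children"] or child.get("link"):
            kept.append(child)  # keep inner nodes and nodes that link to an object
    return {**node, "children": kept}

tree = {"text": "Root", "children": [
    {"text": "A", "children": [
        {"text": "paper", "link": "a.pdf"},
        {"text": "note"},
    ]},
    {"text": "todo", "children": [{"text": "x", "link": "x.pdf"}]},
    {"text": "B", "children": [{"text": "empty"}]},
]}
reduced = reduce_tree(tree)
```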

4. Classification

The BDS is searched for those nodes that link to or reference an object. For example, hyperlinks, file names and/or paths, links, and/or indirect references to objects such as BibTeX keys, file numbers, and similar unique keys or document names (or titles) are searched for. Once all the nodes that link to or reference objects have been found, these objects must be identified in order to establish what each object is. In one embodiment, this can be done as follows:

a. If a hyperlink was found,

i. the hyperlink itself can serve as an identifier;

ii. in the case of a website (for example in HTML or XHTML format), the title is read from the linked website (the text between the tags <title> and </title>);

iii. in the case that a file has been linked (PDF, movie, ...), the procedure is as in the next step.

b. If a file has been linked, the object type is identified by the file extension or the header of the file. Depending on the file type, other methods can then be used. For example

i. Reading the file metadata (title or author, if available), depending on the operating system and file type.

ii. in the case of a formatted text document (e.g. Word document or PDF): read the title by determining the text that has the largest font on the first page, lies in the upper third of the page and extends over fewer than four lines. This text is then adopted as the title (the numerical values here can of course be changed freely, so that, for example, the search covers the upper quarter instead of the upper third).

iii. in the case of a JPEG: read the EXIF or IPTC metadata.

iv. otherwise: generate a hash value (for example MD5) or file name and path of the file.

c. If an indirect reference to an object has been found, for example a BibTeX key, all accessible storage media are searched for the corresponding BibTeX file and the metadata of the object is read there.

d. The data that has been determined (e.g. title, hash, ...) can be matched against existing data in a database (knowledge base). For example, if the title "The Tree Proximity Index - what is it good for?" was extracted from a document and an object titled "The Tree Proximity Index: what is it good for?" is already present in the database, it is probably the same object despite the small difference.
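The tolerant title matching from step d can be sketched in Python with the standard library class `difflib.SequenceMatcher`. The normalization rule and the threshold of 0.9 are assumptions; the application does not prescribe a particular similarity measure.

```python
import re
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase the title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def same_object(title_a, title_b, threshold=0.9):
    """Treat two titles as the same object if they are nearly identical."""
    ratio = SequenceMatcher(None, normalize(title_a), normalize(title_b)).ratio()
    return ratio >= threshold

match = same_object(
    "The Tree Proximity Index - what is it good for?",
    "The Tree Proximity Index: what is it good for?",
)
```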

In the next step, the classification of the identified and determined objects is carried out as follows:

a. The text of each node is read out of the BDS and processed by popular text mining methods, e.g. Stemming (reducing the words to their root) or Stop Word Filtering (filtering conjunctions, prepositions, and other less meaningful words such as "and", "or", "the", "how", etc.).

b. Each object is classified by the text of its node, of its parent node and that node's parent node, etc., as well as of its child nodes and their child nodes, etc.

c. Each word is weighted as shown with respect to FIG. 4:

The document linked by the node "Statement 1" is classified here with the following words, to which the following weights are assigned:

Statement 1 - weighting = 1

Branch 1 - weighting = 1/2

Reduced - weighting = 1/3

In this example, therefore, the rule

1 / ((number of edges between object and word) + 1)

is applied to determine the weighting of the words. Other rules can also be applied. Words of sibling nodes may also be considered. If an object is linked or referenced in several BDS in which the same words occur, the weights are combined, for example added.
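The weighting example can be reproduced with a short Python sketch. The `Node` class and the function name `classify` are illustrative assumptions, and for brevity the whole node text is used as one term instead of being split into words. Exact fractions are used so that the weights 1, 1/2 and 1/3 appear without rounding.

```python
from fractions import Fraction

class Node:
    """A tree node with a text and, optionally, a link to an object."""
    def __init__(self, text, link=None):
        self.text = text
        self.link = link      # identifier of the referenced object, if any
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

def classify(link_node):
    """Return (object id, term, weight) triples for the path from link_node to the root."""
    triples = []
    node, edges = link_node, 0
    while node is not None:
        # Rule from the text: 1 / ((number of edges between object and word) + 1)
        triples.append((link_node.link, node.text, Fraction(1, edges + 1)))
        node, edges = node.parent, edges + 1
    return triples

root = Node("Reduced")
branch = root.add(Node("Branch 1"))
statement = branch.add(Node("Statement 1", link="doc-1"))
triples = classify(statement)
```

Sibling nodes are left out here; they could be added by also visiting `node.parent.children` at each step.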

After an object has been identified and classified, its metadata (title, author, URL, hash value, etc.) is stored in a database. The classification of the object is also stored in the database. Preferably, the stored object is related to the stored classification.

5. Classification of authors

Authors of BDS can also be classified by the method according to the invention. The procedure can be the following:

a. Identifying the author: the name of the author (creator / owner of the BDS) is determined

i. via the metadata of the BDS; and/or

ii. via a user name entered by the user in the software or with which the user logged on to the system; and/or

iii. by creating a random ID during the installation of the software, which identifies the user (even if no further personal data is available).

b. The text of each node is read out of the BDS and processed using common text mining methods, e.g. stemming (reducing the words to their root) or stop word filtering (filtering conjunctions, prepositions and other less meaningful words like "and", "or", "the", "how", etc.).

c. The number of unique words and compound words is determined.

d. The words are now weighted. The basic idea is that the more child nodes a node has, the more meaningful this node is with regard to the author's expertise. An example: the author of the mind map shown in Fig. 5 will probably be quite familiar with "Mind Mapping" in general (root node): he knows some studies, knows what mind maps are used for and knows some software programs. On the other hand, he does not seem to know much about the FreeMind software except where it can be downloaded. "Mind Mapping" is therefore weighted most heavily. In this example, the rule √((number of child nodes and children's child nodes) + 1) is assumed as the weighting, i.e. the square root of (number of all child nodes and children's child nodes + 1). Other rules may be provided. In the example, the weighting would be:

Mind mapping = root (19) = 4.36

Studies = root (7) = 2.65

The root node thus has the highest weighting value.

e. If a node contains multiple words, they are considered individually and treated as separate nodes. In the example, the following weighting would be generated from the node "How does one best create Mind Maps?":

Mind Maps = root (3) = 1.73

create = root (3) = 1.73

Words like "how" or "one" would be filtered out (stop word filtering).

f. If words occur several times in a mind map, the total weighting value is calculated as the sum of the highest value plus the root of the smaller values, i.e.

total weighting value = highest weighting value + √(Σ smaller weighting values).

In the example: if "Mind Map" and "Mind Mapping" were considered the same word, the total weighting value would be 4.36 + root (1.73) = 5.68.

g. Instead of the root node, another node of the BDS can also receive the highest weighting value, as will now be explained using the example of FIG. 6. In this case, the author would (presumably) be well acquainted with "Citation Analysis" (root node), but his real field of expertise seems to lie in "Citation Proximity Analysis". Here, therefore, the node "Citation Proximity Analysis" is weighted most heavily, i.e. the word "proximity" is given a higher weighting value than the word "citation", even though "citation" is present in the root node. The highest weighting is given to the node, or the words of the node, that has the largest subtree overall; this may, for instance, be the node of the BDS which has the most direct child nodes.

h. If an author has created several mind maps, they are combined into a single map for the calculation of the classification values.

i. Finally, all words with their frequencies and the weighting value are stored in a database or other suitable storage medium and assigned to the author there.
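The weighting of steps d and f can be checked with a small Python sketch. The nested-dict tree, the helper names and the tree shape (a hypothetical tree with the same node counts as in the example, not a copy of Fig. 5) are assumptions. The text does not say how several smaller values are combined under the root; the sketch sums them first, which is one possible reading.

```python
import math

def subtree_size(node):
    """Number of all nodes below `node` (children, their children, and so on)."""
    return sum(1 + subtree_size(child) for child in node["children"])

def node_weight(node):
    """Rule from step d: root of (number of nodes below the node + 1)."""
    return math.sqrt(subtree_size(node) + 1)

def total_weight(weights):
    """Rule from step f: highest value plus the root of the (summed) smaller values."""
    ordered = sorted(weights, reverse=True)
    return ordered[0] + math.sqrt(sum(ordered[1:]))

def leaf(text):
    return {"text": text, "children": []}

# Hypothetical tree: "Studies" has 6 nodes below it, the root has 18.
studies = {"text": "Studies", "children": [leaf(f"Study {i}") for i in range(6)]}
mind_mapping = {
    "text": "Mind Mapping",
    "children": [studies] + [leaf(f"Topic {i}") for i in range(11)],
}

weights = {n["text"]: round(node_weight(n), 2) for n in (mind_mapping, studies)}
combined = round(total_weight([4.36, 1.73]), 2)
```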

6. Classification of words to each other

The inventive method can also be used to determine the context of words in a BDS. This may be done as described with reference to FIG. 7 as follows:

a. All possible word combinations are extracted from the BDS. In this case, the combinations are made up of parent / child nodes and sibling nodes. In Fig. 7, this would be:

- Citation Analysis | Performance determination

- Citation Analysis | Similarity calculation

- Performance determination | Scientist

- Performance determination | Journals

- Performance determination | Publications

- Scientist | Journals

- etc.

b. If necessary, methods such as stemming and stop word filtering are also used here. If necessary, all nodes with more than, e.g., three words are ignored.

c. Nodes can be split. For example, the node text "Benefits of Research Paper Recommender Systems" could be broken down into the three terms "benefits", "research paper recommender" and "systems". Each term is then considered as a separate node.

d. If the word combinations are not yet available in the system, they are stored in a database. In addition, the author of the BDS from which the data was extracted is also stored. If a word combination has just been newly entered, the counter value 1 is assigned to it. If the word combination already exists, its counter is incremented by 1. However, if the current BDS originates from an author of whom one (or more) other BDS have already been used for the calculation, the counter is only incremented by 0.1 (or some other value).
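Steps a and d can be sketched in Python as follows. The nested-dict tree, the function names and the use of an in-memory dictionary instead of a database are assumptions. Pairs are stored in sorted order so that "A | B" and "B | A" share one counter, which is also an assumption.

```python
from itertools import combinations

def word_pairs(node):
    """Yield parent/child pairs and sibling pairs of node texts."""
    children = node.get("children", [])
    for child in children:
        yield tuple(sorted((node["text"], child["text"])))
        yield from word_pairs(child)
    for a, b in combinations(children, 2):
        yield tuple(sorted((a["text"], b["text"])))

def update_counters(counters, seen_authors, tree, author, repeat_step=0.1):
    """New pairs start at 1; known pairs grow by 1, or by 0.1 for a repeat author."""
    step = repeat_step if author in seen_authors else 1.0
    for pair in word_pairs(tree):
        if pair in counters:
            counters[pair] += step
        else:
            counters[pair] = 1.0
    seen_authors.add(author)

# Tree shaped like the combinations listed for Fig. 7.
fig7_tree = {"text": "Citation Analysis", "children": [
    {"text": "Performance determination", "children": [
        {"text": "Scientist"}, {"text": "Journals"}, {"text": "Publications"},
    ]},
    {"text": "Similarity calculation"},
]}

counters, seen_authors = {}, set()
update_counters(counters, seen_authors, fig7_tree, "author-1")
```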

Industrial Applicability of the Invention

The inventive method can be used together in a search system (or independently as a search system). Based on a search term suitable authors and / or objects can be found and related search terms can be proposed. This can be done as follows:

S1 A user visits a website (or starts desktop software).

S2 There he can enter a search term in an input mask. He can search only for authors, only for objects or only for related words, or for combinations thereof.

S3 If he searches for authors, he is shown those authors who have the highest classification value for the searched keyword.

S4 If he searches for objects, he is shown those objects that have the highest classification value for the searched keyword.

S5 If he searches for similar words, he is shown the words that have the highest classification value for the searched keyword.

S6 If he carries out a combined search, authors / objects / words are displayed according to the combination (S3-S5).
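The lookup behind the search steps can be sketched in Python as a ranking over the stored classification triples. The function name `search`, the in-memory list and the lowercase storage of words are assumptions; a real system would query the database instead.

```python
def search(triples, keyword, limit=10):
    """Return (identifier, weight) pairs for `keyword`, highest weight first."""
    keyword = keyword.lower()
    hits = [(ident, weight) for ident, word, weight in triples if word == keyword]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:limit]

# Hypothetical stored classification values for objects and one author.
triples = [
    ("doc-1", "citation", 1.0),
    ("doc-2", "citation", 0.5),
    ("doc-2", "mapping", 1.0),
    ("author-7", "citation", 4.36),
]
results = search(triples, "Citation", limit=2)
```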

The method according to the invention can be combined with known methods, e.g. full-text analysis of documents, full-text search, etc.

Claims

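1. A computer-implemented method for classifying at least one object, wherein the object is related to at least one tree data structure, wherein the at least one tree data structure comprises a number of nodes, wherein a node of the at least one tree data structure comprises at least one text comprising a number of words, and wherein the at least one tree data structure can be stored in a memory device, comprising at least the following steps:

- reading out the texts associated with the nodes of the at least one tree data structure;

- weighting the texts, wherein for each word or word combination of a text a weighting value is generated which is assigned to the word of the text, wherein different weighting values can be generated for a word which occurs in different texts; and

- generating a number of classification values, each classification value being represented by a triple consisting of an object identification identifying the object, a word, and a weighting value associated with the word (object identification, word, weighting value).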

2. The method of claim 1, wherein the relationship of the object to the at least one tree data structure is formed by at least one node representing a reference to the object of the tree data structure.

3. The method of claim 2, wherein the weighting value of a word is generated from the number of edges between the object referencing node and the node to which the text of the word is associated.

4. The method of claim 3, wherein the weighting value of a word is generated according to the calculation rule

1 / ((number of edges between object and word) + 1).

5. The method according to any one of claims 2 to 4, wherein in generating the classification values those texts are taken into account which are associated with nodes located in the tree data structure on the path between a root node and the node referencing the object.

6. The method according to any one of claims 2 to 5, wherein in generating the classification values, account is taken of those texts associated with nodes which are sibling nodes of those nodes located on the path between a root node and the node referencing the object.

7. The method according to any one of claims 2 to 6, wherein upon referencing an object by nodes of a plurality of tree data structures, the weighting values of identical words are combined to produce a total weighting value for the word.

8. The method of claim 7, wherein combining the weighting values comprises at least adding the weighting values.

9. The method of claim 1, wherein the relationship of the object to the at least one tree data structure is formed by an association of the object to the at least one tree data structure.

10. The method of claim 9, wherein after reading the nodes, the number of occurrences of each word and / or each compound word in the tree data structure is determined.

11. The method of claim 9, wherein in generating the weighting value of a word account is taken of the number of nodes contained in a partial tree data structure, wherein the root of the partial tree data structure is formed by the node that contains the word.

12. The method of claim 11, wherein the weighting value of a word is generated according to the calculation rule

√((number of child nodes and children's child nodes) + 1).

13. The method of claim 11, wherein the weighting value of a word is generated according to the calculation rule

number of direct child nodes.

14. The method of claim 9, wherein a total weighting value is generated for a word that occurs multiple times in a tree data structure.

15. The method of claim 14, wherein the total weighting value for a word is generated according to the calculation rule

highest weighting value + √(Σ smaller weighting values).

16. The method according to any one of claims 9 to 15, wherein a plurality of tree data structures are combined into a single tree data structure.

17. The method according to any one of the preceding claims, wherein, prior to the weighting of the texts, the texts are subjected to a text transformation in order to produce a transformed text from each of the texts.

18. The method of claim 17, wherein the text transformation comprises at least one of word stemming and stopword filtering.

19. The method according to any one of the preceding claims, wherein the classification values are stored in a memory device.

20. The method of claim 19, wherein the objects are stored in a storage device and related to the classification values in the storage device.

21. The method of claim 1, wherein prior to reading the nodes of the at least one tree data structure, a step of reducing the tree data structure is performed.

22. The method of claim 21, wherein the reducing comprises:

- deleting end nodes which do not represent a reference to an object, and/or

- reducing nodes representing a reference to an object to the next higher level of the tree data structure such that each level of the tree data structure has at least two nodes, and/or

- filtering the tree data structure according to predetermined filter criteria.

23. The method of claim 1, wherein the tree data structure is transmitted via a communication network from a client device to a server device, wherein the transmission is performed prior to reading out the nodes of the tree data structure.

24. The method of claim 23, wherein prior to transmitting, the tree data structure is converted to a normalized tree data structure format.

25. The method of claim 23, wherein after the transfer, the tree data structure is converted to a normalized tree data structure format.

26. The method of claim 24, wherein the normalized tree data structure format describes the tree data structure in XML format.

27. A method according to any one of the preceding claims, wherein an object is at least one of document, image, music, movie, website and author of a tree data structure.

28. A system for classifying at least one object, wherein the object is related to at least one tree data structure, the at least one tree data structure having a number of nodes, wherein a node of the at least one tree data structure is associated with at least one text comprising a number of words, the system comprising memory means for storing the tree data structure and processing means coupled to the memory means and adapted to carry out a method comprising at least the following steps:

- reading out the texts associated with the nodes of the at least one tree data structure;

- weighting the texts, wherein for each word of a text a weighting value is generated which is assigned to the word of the text, wherein different weighting values can be generated for a word which occurs in different texts; and

- generating a number of classification values, each classification value being represented by a triple consisting of an object identification identifying the object, a word, and a weighting value associated with the word (object identification, word, weighting value).

29. A data carrier product with a program code stored thereon, which is loadable into a computer and / or in a computer network and is configured to carry out a method according to one of claims 1 to 27.