EP2102770A2 - Method and device for organising electronic documents, and corresponding computer software product and multimedia electronic equipment

Info

Publication number: EP2102770A2
Application number: EP07871877A
Authority: EP
Inventors: Jérôme BESOMBES; Franck Meyer
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2006-12-22
Filing date: 2007-12-05
Publication date: 2009-09-23
Also published as: WO2008081129A2; FR2910661A1; WO2008081129A3

Abstract

The invention relates to a method and a device for organising electronic documents, as well as to a corresponding computer software product and corresponding multimedia electronic equipment. The method organises electronic documents from a set of documents including labelled documents, a labelled document being associated with a set of labels. At least one labelled document is associated with a set of labels including at least two labels. The method comprises the following steps: a) labelling (P2) non-labelled documents from the set of documents by prediction according to the labelled documents and the descriptive data of the documents; from a current directory associated with a set of labelled documents: b) creating (P3; 63) a sub-directory per distinct label among the set of labels associated with the documents of the current directory; for each sub-directory created: c) associating (P3; 65) documents of the current directory whose label set includes at least one label corresponding to said created sub-directory; d) modifying (P3; 66) the sets of labels of the documents of the new sub-directory by subtracting the label corresponding to the sub-directory; from each created sub-directory that becomes a current directory: e) repeating the steps b), c) and d) until sub-directories only associated with documents having an empty label set are obtained.

Description

Method and device for organizing electronic documents, computer program product and corresponding multimedia electronic equipment.

1. DOMAIN OF THE INVENTION

The field of the invention is that of computerized systems (for example personal computers, connected or not to the Internet).

More specifically, the invention relates to a technique for organizing electronic documents in such a computerized system, to help a user to create and / or modify a tree organization of these electronic documents. The management of electronic documents made available to a user of a computerized system becomes a central issue. Indeed, the ever increasing number of these documents, as well as their sources (result of a request to a search engine, reception of mails, loading of digital cameras, purchases of music files, etc.), makes it difficult to use without the establishment, by the user, of an organization of his data.

2. PRIOR ART

Like any computer file, electronic documents can be organized manually as a tree. In a file management system such as the Windows File Explorer (registered trademark), the user creates directories and subdirectories, then distributes all his documents by successively moving each document, or groups of selected documents. This method has the advantage of producing an organization that matches the wishes of the user exactly: each document is placed in the chosen directory, which is itself created and placed in the tree by the user. Each directory then contains a set of documents and/or sub-directories, representing a homogeneous group according to one or more criteria specific to the user. As examples, photos can be classified by theme (events, people, ...), emails by date or by sender, music files by performer, ...

In a complementary manner, many electronic document management systems allow a user to label the documents. A document can be associated with one or more labels (typically a word representative of the data contained in the document). For example, photo sharing websites (Yahoo Flickr (registered trademark), AOL Pictures (registered trademark), ...) or personal photo management software (Google Picasa (registered trademark), Microsoft Digital Image Suite 2006 (registered trademark)) offer this possibility, which goes beyond the strict framework of conventional file management. A set of image files labeled with a common label naturally forms a coherent whole and, consequently, a set of labels constitutes an organization of the labeled documents. The systems mentioned above all offer the possibility of displaying all the documents labeled with a given label, in the same way that documents from the same directory can be displayed. In addition, this organization of data by labels has the property of multi-membership: the same document can be tagged with several different labels. This document will appear each time one of its labels is used as a criterion for displaying the data. Unlike the manual organization of data (in the aforementioned case of a file management system such as Windows Explorer (registered trademark) XP), this organization is purely logical: the documents remain in their initial storage unit and the labeling does not cause any file movement. The organization is visible only on the display and, in particular, a document to which multiple labels are assigned is not copied, although it may appear in several separate sets of documents.

Two distinct known data organization techniques have been presented above: a manual technique via a conventional hierarchical file management system and a manual technique using labels attached to files (with logical organization). The manual technique via a conventional hierarchical file management system has the advantage of offering the user an organization of his data that corresponds to his wishes, since he has built each directory and sub-directory and has moved each document into the appropriate directory. The major drawback of this technique appears when the number of documents becomes large. Indeed, to build such an organization, each document must be identified by the user, in order to determine into which directory it will have to be moved (possibly creating this directory). Beyond a few dozen documents, it becomes particularly tedious to identify each document, especially if they are distributed over different storage systems.

The other aforementioned known technique, using labels, does not remove the need to consider each document one by one, since at least one label must be associated with each document by the user. In addition, the current systems offer only one level of organization: they make it possible to display the set of documents associated with a particular label (which can be seen as a directory), but this representation is not recursive. It is impossible to select, within a directory corresponding to a label $x_1$, all the documents that are also labeled with a label $x_2$ (which would constitute a subdirectory of the directory constituted by the label $x_1$).

There is currently no way to manage documents with multiple labels, to classify them automatically by managing multi-membership and to infer labels from unlabeled documents, and to maintain the hierarchy thus built through a real-time HMI (Human Machine Interface).

3. STATEMENT OF THE INVENTION

In a particular embodiment of the invention, there is provided a method of organizing electronic documents from a set of documents comprising labeled documents, a labeled document being associated with a set of labels. At least one labeled document is associated with a set of labels comprising at least two labels. The method comprises the following steps: a) labeling the unlabeled documents of the set of documents, by prediction according to the labeled documents and descriptive data of the documents; from a current directory associated with a set of labeled documents: b) creation of a sub-directory per distinct label among all the labels associated with the documents in the current directory; for each subdirectory created: c) association of the documents of the current directory whose label set includes at least one label corresponding to said created subdirectory; d) modification of the label sets of the documents of the new subdirectory by subtracting the label corresponding to the subdirectory; from each sub-directory created, which becomes the current directory: e) iteration of steps b), c) and d) until subdirectories associated only with documents whose set of labels is empty are obtained.

Thus, in this particular embodiment, the invention is based on a completely new and inventive approach combining an automatic labeling of documents with an automatic construction of a tree organization from the labels of the labeled documents. Indeed, unlike the known manual technique based on the use by a user of a conventional hierarchical file management system (see discussion above), it is not the user who carries out the tedious task of building a tree organization from a current directory. The construction of the tree structure is performed automatically and recursively by levels, taking into account the labels associated with the electronic documents. The present invention thus saves time in the construction of the tree organization and optimizes the resources of the computerized system in which the technique of organizing electronic documents according to the invention is implemented.
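By way of a purely illustrative sketch, the recursive construction of steps b) to e) can be written as follows in Python. The function name build_tree, the nested-dictionary representation of a directory and the sample documents are assumptions introduced for this example; they are not part of the described method.

```python
def build_tree(docs):
    """Build a directory tree from documents and their remaining labels.

    docs maps a document identifier to the set of labels still to be used.
    Returns {"docs": [...], "subdirs": {label: subtree}} (hypothetical layout).
    """
    node = {"docs": sorted(docs), "subdirs": {}}
    # Step b): one sub-directory per distinct label of the current directory.
    labels = sorted(set().union(*docs.values()))
    for label in labels:
        # Step c): keep the documents whose label set includes this label.
        # Step d): subtract the label from their label sets.
        members = {d: ls - {label} for d, ls in docs.items() if label in ls}
        # Step e): the sub-directory becomes the current directory.
        node["subdirs"][label] = build_tree(members)
    return node


documents = {"a.jpg": {"holidays", "beach"}, "b.jpg": {"holidays"}}
tree = build_tree(documents)
```

The recursion stops by itself: each level removes one label, so a sub-directory whose documents all have an empty label set creates no further sub-directories. In this sketch a document carrying two labels appears under both orderings of its labels (here holidays/beach and beach/holidays), which reflects the multi-membership property.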

In addition, thanks to the automatic labeling of non-labeled documents, the user does not have to label all the documents, which saves him a tedious job when the number of documents becomes large.

In the present description, the fact that a document is associated with a subdirectory and the fact that this document is contained (or placed) in this subdirectory are considered synonymous. Advantageously, associating a given document with a subdirectory means that said given document is physically moved from an initial storage unit to said subdirectory.

In this case, the tree organization is generated automatically but is then used like a tree created manually by the user with a conventional hierarchical file management system. According to an advantageous variant, associating a given document with a subdirectory means that said given document remains in an initial storage unit and that a unique identifier of said given document is placed in said subdirectory. In this variant, the tree organization is a logical organization whose subdirectories contain only identifiers of electronic documents. The identifier of each document is unique and defines the complete path to the storage location of this document. For example, with a centralized storage system such as a hard disk, the identifier defines the path from the root of the hard disk to the subdirectory containing the document (i.e., the path in the storage system tree). With a decentralized storage system, the identifier is for example a URL indicating the path of the document on the Internet.

Advantageously, step b) is preceded by the following step: if the number of different labels associated with the documents in the current directory is greater than a predetermined threshold S, then the labels are sorted in decreasing order of frequency and the first S labels are selected, only the selected labels being taken into account to perform the following steps. In this way, the processing resources (processor and memory in particular) necessary for the construction of the tree organization are reduced.
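As an illustration only, the selection of the S most frequent labels can be sketched as follows. The function name select_frequent_labels and the alphabetical tie-break between labels of equal frequency are assumptions of this example; the text does not specify how ties are resolved.

```python
from collections import Counter


def select_frequent_labels(docs, s):
    """Return the labels kept for step b) in the current directory.

    docs maps a document identifier to its set of labels; s is the threshold S.
    """
    counts = Counter(label for labels in docs.values() for label in labels)
    if len(counts) <= s:
        # At most S distinct labels: all of them are taken into account.
        return set(counts)
    # Decreasing frequency; ties broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    return set(ranked[:s])
```

For example, with three documents labeled {a, b}, {a, c} and {a, b} and S = 2, only the labels a and b would be used to create sub-directories.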

Advantageously, if there are first and second labels such that both of the following conditions are satisfied: for all the labeled electronic documents whose label set includes said second label, said label set also includes said first label; and for all the labeled electronic documents whose label set includes said first label, said label set also includes said second label; then one of said first and second labels is deleted, only for the duration of said steps b) to e), each labeled electronic document regaining all its labels at the end of said steps b) to e). Taking this first type of label inclusion into account makes it possible to simplify, by pruning, the tree organization obtained and to optimize the processing resources necessary for the construction of this tree organization.

Advantageously, if there are first and second labels such that the following condition is satisfied: for all the labeled electronic documents whose label set includes said second label, said label set also includes said first label; and the following condition is not satisfied: for all the labeled electronic documents whose label set includes said first label, said label set also includes said second label; then, during said steps b) to e), said second label gives rise to the creation of a subdirectory only under the subdirectory corresponding to the first label.

This consideration of a second type of inclusion of labels also makes it possible to simplify, by pruning, the tree organization obtained and to optimize the processing resources necessary for the construction of this tree organization.
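The two types of label inclusion described above can be detected by comparing, for each pair of labels, the sets of documents that carry them. The following sketch is given under stated assumptions: the function name label_inclusions and the returned pair format are hypothetical.

```python
def label_inclusions(docs):
    """Detect the two types of label inclusion.

    docs maps a document identifier to its set of labels.
    Returns (equivalent, included):
      equivalent: pairs (a, b) carried by exactly the same documents
                  (first type: one of the two can be dropped during b) to e));
      included:   pairs (second, first) where every document with `second`
                  also has `first`, but not the reverse (second type:
                  `second` appears only under the sub-directory of `first`).
    """
    by_label = {}
    for doc, labels in docs.items():
        for label in labels:
            by_label.setdefault(label, set()).add(doc)

    equivalent, included = [], []
    names = sorted(by_label)
    for first in names:
        for second in names:
            if first == second:
                continue
            if by_label[second] == by_label[first]:
                if first < second:  # report each equivalent pair once
                    equivalent.append((first, second))
            elif by_label[second] < by_label[first]:  # strict subset
                included.append((second, first))
    return equivalent, included
```

With documents labeled {holidays, beach}, {holidays} and {john, birthday}, the labels birthday and john are equivalent, and beach is strictly included in holidays.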

Advantageously, said labeling step comprises a step of generating a matrix V as follows:

$V_{i,j} = U_{i,j}$ if $U_{i,j} \neq 0$

$V_{i,j} = P(i, j)$ if $U_{i,j} = 0$

with i: an electronic document among all the electronic documents; j: a label among a set of possible labels; P: a prediction function based on the descriptive data of the electronic documents; and U: a matrix of electronic documents and labels assigned by a user, such that "$U_{i,j} \neq 0$" means that the user has decided whether or not to associate the label j with the electronic document i, and "$U_{i,j} = 0$" means that the user has expressed no opinion on whether or not the label j should be associated with the electronic document i.

The use of such a matrix V simplifies the calculations related to automatic labeling.

Advantageously, $U_{i,j} = 1$ if the user has associated the label j with the electronic document i, and $U_{i,j} = -1$ if the user has decided not to associate the label j with the electronic document i. P(i, j) returns a value between -1 and +1: the more negative the value of P(i, j), the less likely it is that the electronic document i is associated with the label j, and conversely the more positive the value of P(i, j), the more likely it is that the electronic document i is associated with the label j.

Advantageously, said labeling step comprises a step of thresholding said matrix V into a matrix W as follows: if $V_{i,j}$ > alpha then $W_{i,j} = 1$, otherwise $W_{i,j} = 0$, with alpha a determined threshold between 0 and 1. Only the non-zero elements $W_{i,j}$ are used in said steps b) to e).

Thus, the alpha threshold makes it possible to determine which are the labels actually used in the construction of the tree. According to an advantageous characteristic, the method comprises a step of limiting the number of labels automatically allocated to an electronic document.

In a particular embodiment, this limitation is implemented thanks to the aforementioned alpha thresholding. In this case, the alpha threshold can be modified by the user, who can thus dynamically limit the number of labels automatically assigned to a document during the automatic labeling step.

Advantageously, said steps b) to e) are followed by the following steps: modification by a user, via a man / machine interface, of at least one set of labels among the sets of labels associated with the labeled electronic documents; deleting a tree organization resulting from the execution of said steps b) to e); new execution of said steps b) to e), taking into account said at least one set of labels modified by the user.

Thus, a first mechanism is offered to the user to modify a tree organization already obtained, in order to correct or evolve it. This first mechanism is based on a new calculation of the tree.

According to an advantageous variant, said steps b) to e) are followed by the following steps: modification by a user, via a man / machine interface, of at least one set of labels among the sets of labels associated with the labeled electronic documents; for each labeled electronic document whose associated set of labels has been modified by the user, said modified document:

* deleting said modified document from each directory or sub-directory with which it is associated; and then

* reclassifying said modified document starting from a root directory: said modified document is associated with each sub-directory whose corresponding label is included in the set of labels associated with said modified document; then, for each sub-directory with which the modified document has been associated, the association is repeated recursively in the sub-directories of lower levels, if they exist.

Thus, a second mechanism is offered to the user to modify a tree organization already obtained. This second mechanism is based on a local adaptation of the tree, without it being completely recalculated.
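A minimal sketch of this local adaptation is given below. It assumes, for illustration only, that each directory is a dictionary with a set of document identifiers under "docs" and its sub-directories under "subdirs", keyed by label; the function names remove_document and reclassify are hypothetical.

```python
def remove_document(node, doc):
    """Delete doc from a directory and from all of its sub-directories."""
    node["docs"].discard(doc)
    for child in node["subdirs"].values():
        remove_document(child, doc)


def reclassify(node, doc, labels):
    """Associate doc with node, then with each existing sub-directory whose
    label is in `labels`, recursively. No new sub-directory is created."""
    node["docs"].add(doc)
    for label, child in node["subdirs"].items():
        if label in labels:
            # The label of the sub-directory is consumed on the way down.
            reclassify(child, doc, labels - {label})


# Existing tree: b.jpg was labeled "holidays"; the user changes it to "beach".
tree = {
    "docs": {"a.jpg", "b.jpg"},
    "subdirs": {
        "beach": {"docs": {"a.jpg"}, "subdirs": {}},
        "holidays": {
            "docs": {"a.jpg", "b.jpg"},
            "subdirs": {"beach": {"docs": {"a.jpg"}, "subdirs": {}}},
        },
    },
}
remove_document(tree, "b.jpg")
reclassify(tree, "b.jpg", {"beach"})
```

Only the directories on the paths of the modified document are touched; the rest of the tree is left as it was, which is the point of this second mechanism compared with a full recalculation.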

Advantageously, said step of modification by a user of at least one set of labels is carried out by: a cut/paste move, carried out by the user via a graphical interface, of a representation of the document to be modified to a target sub-directory associated with documents whose label sets include one or more desired labels; and an automatic assignment to said document to be modified of the label corresponding to said target sub-directory, replacing the label or labels previously assigned to said document to be modified.

In this way, the user can easily modify labels. Advantageously, said step of modification by a user of at least one set of labels is carried out by: a selection, made by the user via the man/machine interface, of at least one sub-directory to be deleted; and, for each sub-directory to be deleted, an automatic deletion, from the label sets of the labeled electronic documents, of the label corresponding to the sub-directory to be deleted, as well as of the label corresponding to each sub-directory located between the sub-directory to be deleted and a root directory.

Again, the user can easily modify labels. In another embodiment, the invention relates to a computer program product downloadable from a communication network and/or recorded on a computer-readable medium and/or executable by a processor, this computer program product comprising program code instructions for executing the steps of the aforementioned method (in at least one of its various embodiments), when said program is executed on a computer.

In another embodiment, the invention relates to a device for organizing electronic documents from a set of documents comprising labeled documents, a labeled document being associated with a set of labels. At least one labeled document is associated with a set of labels comprising at least two labels. The device comprises: a) means for labeling the non-labeled documents of the set of documents, by prediction according to the labeled documents and descriptive data of the documents; b) means for creating, applied to a current directory associated with a set of labeled documents, a sub-directory per distinct label among the set of labels associated with the documents of the current directory; c) means for associating, with each sub-directory created, the documents of the current directory whose label set includes at least one label corresponding to said sub-directory created; d) means for modifying the sets of labels of the documents of the new subdirectory by subtracting the label corresponding to the subdirectory; e) the means b), c) and d) being applied iteratively for each sub-directory created, which becomes the current directory, until subdirectories associated only with documents whose set of labels is empty are obtained.

More generally, the organization device comprises means for implementing the organization method as described above (in any one of its various embodiments).

In another embodiment, the invention relates to multimedia electronic equipment comprising means for storing multimedia documents, and means for implementing the method of organizing multimedia documents as described previously (in any one of its various embodiments).

4. LIST OF FIGURES

Other features and advantages of embodiments of the invention will become apparent on reading the following description, given by way of indicative and nonlimiting example (all the embodiments of the invention are not limited to the characteristics and advantages of the embodiments described hereinafter), and the accompanying drawings, in which:

FIG. 1 shows an overall diagram of a particular embodiment of the organization method according to the invention;

FIG. 2 presents a diagram of a vectorization phase of the electronic documents, included in a particular embodiment of the organization method according to the invention;

FIG. 3 presents a diagram of a phase of creation of Q predictive models, included in the automatic labeling phase of FIG. 5;

FIG. 4 presents a diagram of a phase of construction of a matrix V, included in the automatic labeling phase of FIG. 5;

FIG. 5 is a diagram of a phase of automatic labeling of electronic documents, included in a particular embodiment of the organization method according to the invention;

FIG. 6 is a diagram of a construction phase of a tree structure, included in a particular embodiment of the organization method according to the invention;

FIGS. 7, 8 and 9 show the results of three successive steps of construction of an exemplary tree, by implementing the construction phase illustrated in FIG. 6;

FIG. 10 illustrates the result of taking into account label inclusions in the tree of FIG. 9;

FIG. 11 illustrates the result of taking into account a limitation of the number of labels in the tree of FIG. 10;

FIG. 12 illustrates an example of a user moving a document in the tree of FIG. 11;

FIG. 13 illustrates the tree resulting from the displacement illustrated in FIG. 12;

FIG. 14 shows an exemplary HMI of a system implementing a particular embodiment of the organization method according to the invention; and

FIG. 15 shows the structure of an organization device according to a particular embodiment of the invention.

5. DETAILED DESCRIPTION

In all the figures of this document, identical elements are designated by the same numerical reference.

5.1 Overall presentation

In a particular embodiment, illustrated by the overall diagram of FIG. 1, the organization method according to the invention comprises the following steps (also called phases thereafter):

Phase 0: Prior encoding of documents (vectorization).

This first phase is not illustrated in Figure 1. It consists in encoding each document to represent it in a vector form. This type of encoding is known to those skilled in the art when performing automatic classification tasks.

Phase 1: Definition of labels for certain documents.

This phase 1 is referenced P1 in FIG. 1. The user defines labels and associates them with certain documents of his choice. Note that the user is not obliged to label all his documents; a small proportion will suffice. Thus, from a set of stored documents 5, a new set of documents 6 is obtained, of which a first subset 6A contains non-labeled documents and a second subset 6B contains labeled documents.

Phase 2: Automatic labeling of all documents by the system.

This phase 2 is referenced P2 in FIG. 1. The system implementing the method automatically labels the documents (it generalizes all the possible labels to the documents already labeled and to those without a label) thanks to a multi-label classification module. We obtain a set 7 of documents that are all labeled.

Phase 3: Automatic generation of the classification plan by the system

This phase 3 is referenced P3 in Figure 1. The system implementing the method automatically generates a tree 8 (also called tree organization) for classifying documents automatically. The nodes of this tree, corresponding to directories, are generated and named automatically.

Phase 4: Modification of the classification scheme by interaction between the user and the system.

This phase 4 is referenced P4 in FIG. 1.

The user can then modify the tree to correct or change it. In particular, he can use the conventional interactions of a file management system of the Windows Explorer (registered trademark) type. For example:

• he can move documents from one directory or subdirectory to another;

• he can change the labels of certain documents;

• he can add new documents, labeled or not, which will be automatically placed in the tree.

The system will adapt the classification automatically, either by local adaptation or by going back to phase 2.

The system therefore makes it possible to limit the user's workload in order to organize his documents:

• By labeling only a part of the documents, the user can then generate a classification tree automatically. This classification tree will take into account the multi-label aspect of the documents;

• By modifying the classification obtained in an interactive way, the user will cause an adaptation of the classification tree so that it corresponds to his wishes. Here too, a minimum of actions will be necessary because the system adjusts its classification tree (also called the classification plan) automatically.

Compared to a fully automatic system, this system has the advantage of making it possible to correct and manipulate a result interactively, in order to actually obtain the desired classification tree.

5.2 Vectorization of documents

We propose a particular embodiment which, without being limiting as to the techniques employed, shows the feasibility of the invention. This embodiment is firstly based on a vector representation of the documents. From this representation, each document is seen by the system as a vector of $\mathbb{R}^N$.

The set of documents will therefore be represented, after the vectorization phase, by a matrix D containing in each row the vector description of one document, in each column a property measured on these documents, and at the intersection of a row i and a column j the value $D_{i,j}$ corresponding to the property j for the document i. In other words, the vectorization phase consists of transforming a document (already in the form of a computer file) into a vector of numerical values (a vector in $\mathbb{R}^N$). For example, for an image, this can consist of a set of measurements, i.e. functions applied to the bitmap representation of the image. For a text document, this can be done by calculating, for a set of predefined words, the word counts of each document. For a record of a database, this can consist of a numerical recoding of the record, using a complete disjunctive coding of any non-numerical values. The vectorization of objects is considered a known domain of the state of the art, specific to each field of application (signal processing, text mining, image processing, data mining, ...), and is not specified further here.
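For the text-document case mentioned above, a word-count vectorization can be sketched as follows. The vocabulary, the sample texts and the function name vectorize_text are assumptions chosen for this illustration; any other vectorization known to the skilled person could be used instead.

```python
import re


def vectorize_text(text, vocabulary):
    """Return the vector of counts of each predefined word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in vocabulary]


# Hypothetical predefined words (the N columns of the matrix D).
vocabulary = ["beach", "holiday", "invoice"]
texts = ["Holiday at the beach, the best holiday.", "Invoice for March"]

# Matrix D: one row per document, one column per measured property.
D = [vectorize_text(text, vocabulary) for text in texts]
```

Here N = 3, and each row of D is the vector in R^N describing one document.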

A diagram presenting the vectorization phase is given in FIG. 2. Consider a document storage unit 21. As long as there is a non-vectorized document (step 22), a non-vectorized document d is selected (step 23), then a vectorization method is chosen depending on the type of the document (step 24). We obtain a vector $V_d$ of $\mathbb{R}^N$, which can be stored in a vectorized document storage unit 26.

5.3 Labeling of certain documents by the user

The system implementing the organization method allows the user to label certain documents via an HMI which can for example be similar to that described below in relation with FIG. 14.

At the system level, this means associating certain documents with labels.

For example, an image of holidays by the sea may be labeled by the user with the labels "holidays", "beach" and "Atlantic", while a family anniversary image may be labeled with the labels "Anniversary", "John", etc.

The HMI device and the associated document management, which allow the labeling of documents by a user, are considered simple and easily implemented by those skilled in the art and are therefore not further described. Note that in the embodiment described below, any document d has, for each label x, an associated value of -1, 0 or +1, which respectively corresponds to "the user has explicitly said that the document d does not have this label x", "the assignment of the document d to the label x is unknown", and "the user has given the document d the label x".

5.4 Automatic Labeling of Documents

We now suppose that the user has a set $E = \{d_1, ..., d_Q\}$ of documents, of which a subset $T = \{t_1, ..., t_k\}$ is labeled with various labels. We denote by $L = \{x_1, ..., x_M\}$ the set of all the labels used.

For each document d of T and for each label x of L, document d is in one of the three following cases:

• either it has been labeled by the user with the label x, and we denote $d_x = +1$;

• or the user has explicitly or implicitly asked the system to remove the label x from the document d, and we denote $d_x = -1$ (see section 5.6 below to see how negative labels are introduced);

• or the document currently has no information from the user with respect to the label x, and we denote $d_x = 0$.

A convenient way to represent the documents and the labels assigned by the user is a matrix with the documents in rows and the different labels in columns. Let U be the matrix of documents and labels assigned by the user. We define a second matrix, which will correspond to the labels assigned to the documents automatically by the system. We denote by V the matrix of documents and labels generated automatically by the system.

The matrix V is generated as follows:

$V_{i,j} = U_{i,j}$ if $U_{i,j} \neq 0$

$V_{i,j} = P(i, j)$ if $U_{i,j} = 0$

with i: a document, j: a label, P: a prediction function. P(i, j) is a prediction function, implemented by an automatic label prediction module according to the description of the documents (matrix D). P(i, j) gives the prediction that the document i has or does not have the label j. P(i, j) returns a value between -1 and +1. For negative values, the document i is predicted not to have the label j. For positive values, the document i is predicted to have the label j. The more negative the value of P(i, j), the less likely it is that the document has the label j, and conversely the more positive the value, the more likely it is that the document has the label j.

Several methods make it possible to implement a prediction function for any document and any label. A simple method is to construct, for each label j, a classification model $C_j$ of the neural network, decision tree or Bayesian network type. Each label j is therefore associated with a model created on the basis of the examples described by the matrix D, with the label j as target variable. For each null value $U_{i,j}$, the classification model $C_j$ is used: it takes as input the description of the document i in D and returns the predicted value for j, which is assigned in the matrix V.
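The generation of V can be sketched as follows. The prediction function used here is only a stand-in chosen to keep the example self-contained: it scores a document by its cosine similarity to the mean vector of the documents that the user labeled +1 for label j, and it ignores the -1 examples. It is not one of the classification models named in the text (neural network, decision tree, Bayesian network), and the function names cosine, predict and build_v are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity clamped to [-1, 1]; 0.0 if a vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if not norm_a or not norm_b:
        return 0.0
    return max(-1.0, min(1.0, dot / (norm_a * norm_b)))


def predict(D, U, i, j):
    """Stand-in for P(i, j): similarity of document i to the centroid of
    the documents that the user labeled +1 for label j (assumption)."""
    positives = [D[k] for k in range(len(D)) if U[k][j] == 1]
    if not positives:
        return 0.0  # no positive example for this label: no opinion
    centroid = [sum(column) / len(positives) for column in zip(*positives)]
    return cosine(D[i], centroid)


def build_v(D, U):
    """V[i][j] = U[i][j] if the user gave an opinion, else P(i, j)."""
    return [
        [U[i][j] if U[i][j] != 0 else predict(D, U, i, j)
         for j in range(len(U[i]))]
        for i in range(len(U))
    ]


D = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]   # descriptive data
U = [[1, -1], [0, 0], [-1, 1]]             # user labels (+1, -1, 0)
V = build_v(D, U)
```

The rows of V for the first and third documents are copied from U; only the second document, for which the user expressed no opinion, receives predicted values.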

To implement a predictive model (also called classification model) associated with each label j using the descriptive data matrix D, the skilled person can for example refer to the following document: "Ian H. Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations Morgan Kaufmann 1999 ".

Figure 3 illustrates the creation of Q predictive models (also called classifiers), one for each label, in order to be able to assign labels to documents based only on their initial descriptive information (derived from their prior representation in the form of vectors). We have the matrices D and U. To construct a predictive model $P_x$ associated with the label x, we use the descriptive data of the matrix D as descriptive variables and the data of the x-th column of the matrix U as target variable. This predictive model $P_x$ is able to predict the probability of the label x for any document, based on the descriptive data of this document. For example, the frame referenced $31_1$ symbolizes the generation of a predictive model associated with the label 1, from the matrix D and the column 1 (referenced $30_1$) of the matrix U. Similarly, the frame referenced $31_Q$ symbolizes the generation of a predictive model associated with the label Q, from the matrix D and the column Q (referenced $30_Q$) of the matrix U. The frame referenced 32 in Figure 3 symbolizes the Q predictive models obtained, one specialized for each label.

Figure 4 illustrates the construction of the matrix V integrating the labels provided by the user and those inferred by the system. The set (referenced 32 as in FIG. 3) of the Q predictive models 32_1, ..., 32_Q is used to generate the matrix V from the matrices D and U.

Figure 5 shows an overall diagram of this phase of automatic labeling of documents. It is assumed that the vectorization phase has already been performed (pre-coding of documents). There is a set of documents (referenced 6 as in Figure 1), of which a first subset 6_A contains unlabeled documents and a second subset 6_B contains labeled documents. In a step 51, the matrix U of the documents with at least one label is constituted. In a step 52, automatic predictive models are created from the matrix U and the matrix D. In a step 53, the matrix V is generated, integrating the user labels and the labels predicted by the system. In a step 54, a matrix W is generated by thresholding the matrix V with the parameter alpha (see section 5.5 below). In a step 55, the matrix W is optionally adapted by modifying the alpha threshold, to limit the number of labels per document.

5.5 Building the tree

To build the tree, we use the result of thresholding the matrix V into a matrix W. An alpha threshold (modifiable by the user) determines which labels are considered positive. By default, alpha = 0.5. The matrix V is thresholded according to the following principle:

Begin thresholding
For each cell V_{i,j}: if V_{i,j} > alpha then W_{i,j} = 1, otherwise W_{i,j} = 0
End for
End thresholding.
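The thresholding can be written in a few lines of Python. The function name threshold and the sample matrix are hypothetical; the strict comparison V_{i,j} > alpha follows the principle stated above.

```python
def threshold(V, alpha=0.5):
    """Return W with W[i][j] = 1 if V[i][j] > alpha, otherwise 0."""
    return [[1 if v > alpha else 0 for v in row] for row in V]

# Hypothetical matrix V: 3 documents, 2 labels.
V = [[1, 0.7],
     [-0.4, -1],
     [0.9, 0.5]]

W = threshold(V)
```

With the default alpha, a cell exactly equal to 0.5 is not retained, since the comparison is strict.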

The construction of the tree is done recursively, level by level. At the first level, all the documents are in a single directory which constitutes the root R of the tree under construction. Each document is associated with one or more labels according to the matrix W.

5.5.1 Complete tree

A variant of the Bordat algorithm (see the following document: "Bordat, J.P., Practical calculation of the Galois lattice of a correspondence, Math. Sci. Hum., 1986, No. 96, pp. 31-47") is then applied to the contents of the matrix W, as described below.

The set of labels L = {x_1, ..., x_m} gives rise to a set of subdirectories {R_1, ..., R_m} directly under the root directory (that is, to child nodes), each new subdirectory R_i corresponding to exactly one label x_i of L. At this stage, a gamma threshold sets the maximum number of directories or subdirectories that can exist at a given node of the tree. If the number of labels exceeds gamma, the labels are sorted in descending order of frequency and the first gamma labels are selected. The other labels are temporarily hidden. A complementary directory named "Other" is created to receive each document having a temporarily hidden label. Each directory contains the documents labeled with the corresponding label. The same document can therefore belong to several subdirectories if several different labels are associated with it. A new labeling function is then defined for each new directory as follows. For any document d in the directory R_i, the associated labels are L_{R_i}(d) = L(d) - {x_i}. That is, in any directory R_i of the tree corresponding to a label x_i, the set of labels of each document d in R_i is the set of labels of d in the parent directory of R_i, from which the label x_i has been removed. If a document has an empty set of labels in a directory of the tree (which is eventually the case for every document at some level of the tree, since each document loses one of its labels when going down from one level to the next), it remains in this current directory and is not moved into child subdirectories. This process is then iterated on each subdirectory R_i, taking into account the new labels. A directory containing only documents without a label does not give rise to any subdirectory.
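The recursion described above can be sketched as follows in Python, leaving aside the gamma limit and the "Other" directory. The nested-dictionary representation and the names build_tree and paths are hypothetical choices made for this illustration; the label sets are those of the six-document example used in this section.

```python
def build_tree(docs):
    """docs maps a document name to its current set of labels.

    Returns {"docs": docs, "children": {label: subtree}}: one child per
    distinct label, holding the documents that carry this label, each with
    this label removed from its set.
    """
    labels = sorted(set().union(*docs.values()))
    children = {}
    for x in labels:
        sub = {d: ls - {x} for d, ls in docs.items() if x in ls}
        children[x] = build_tree(sub)
    return {"docs": docs, "children": children}

def paths(tree, prefix=()):
    """Yield the label path of every subdirectory of the tree."""
    for label, sub in tree["children"].items():
        yield prefix + (label,)
        yield from paths(sub, prefix + (label,))

docs = {
    "d": {"x1", "x2", "x5"},
    "d1": {"x1", "x3"},
    "d2": {"x2", "x5"},
    "d3": {"x1", "x4"},
    "d4": {"x1"},
    "d5": {"x3"},
}

tree = build_tree(docs)
all_paths = list(paths(tree))
```

A directory whose documents all have an empty label set produces no child, which is what ends the recursion.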

Figure 6 shows an overall diagram of this tree construction phase. We have a set of labeled documents (referenced 7 as in Figure 1). In a step 61, a copy of the set of labeled documents is placed in a root directory R. For each label x not contained in another label x' (condition step 62), a new sub-directory R_x is constructed, within the limit of gamma sub-directories (step 63). For each document d labeled x according to the matrix W (condition step 64), the document d is placed in the sub-directory R_x (step 65) and then the label x is deleted from the set of labels associated with the document contained in the sub-directory R_x (step 66). Finally, the construction of the tree structure is applied recursively to the sub-directory R_x (step 67).
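The gamma limit on the number of sub-directories per node can be sketched as follows in Python. The name select_labels is hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch, since the description does not say how ties are resolved.

```python
from collections import Counter

def select_labels(docs, gamma):
    """Split the labels of a directory into the gamma most frequent ones
    and the remaining, temporarily hidden ones."""
    counts = Counter(x for labels in docs.values() for x in labels)
    # Descending frequency; ties broken alphabetically (assumption).
    ranked = sorted(counts, key=lambda x: (-counts[x], x))
    return ranked[:gamma], ranked[gamma:]

docs = {
    "d": {"x1", "x2", "x5"},
    "d1": {"x1", "x3"},
    "d2": {"x2", "x5"},
    "d3": {"x1", "x4"},
    "d4": {"x1"},
    "d5": {"x3"},
}

kept, hidden = select_labels(docs, gamma=3)

# Documents having at least one hidden label go to the "Other" directory.
other = sorted(d for d, labels in docs.items() if labels & set(hidden))
```

Only the labels in kept would give rise to sub-directories at this node.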

Let us take the following example of a set D of six documents, D = {d, d_1, ..., d_5}, labeled with five labels (x_1, x_2, x_3, x_4, x_5). More precisely, we have the following sets of labels: L(d) = {x_1, x_2, x_5}, L(d_1) = {x_1, x_3}, L(d_2) = {x_2, x_5}, L(d_3) = {x_1, x_4}, L(d_4) = {x_1} and L(d_5) = {x_3}. Figure 7 shows the root directory R containing these six documents. The set of labels associated with each document is also represented.

As illustrated in FIG. 8, the first step of construction of the tree structure gives five sub-directories R_1, R_2, R_3, R_4 and R_5, corresponding respectively to the labels x_1, x_2, x_3, x_4 and x_5. R_1 contains the documents d, d_1, d_3 and d_4 with the labels L_{R_1}(d) = {x_2, x_5}, L_{R_1}(d_1) = {x_3}, L_{R_1}(d_3) = {x_4} and L_{R_1}(d_4) = {}. R_2 contains the documents d and d_2 with the labels L_{R_2}(d) = {x_1, x_5} and L_{R_2}(d_2) = {x_5}. R_3 contains the documents d_1 and d_5 with the labels L_{R_3}(d_1) = {x_1} and L_{R_3}(d_5) = {}. R_4 contains the document d_3 with the labels L_{R_4}(d_3) = {x_1}. R_5 contains the documents d and d_2 with the labels L_{R_5}(d) = {x_1, x_2} and L_{R_5}(d_2) = {x_2}.

The construction of the tree continues on the next level. Thus, the subdirectory R_1 gives rise to the subdirectories R_12, R_13, R_14 and R_15. R_12 contains the document d, of label x_5. R_13 contains the document d_1, without label. R_14 contains the document d_3, without label. R_15 contains the document d, of label x_2. The subdirectory R_2 gives rise to the subdirectories R_21 and R_25. R_21 contains the document d, of label x_5. R_25 contains the documents d, of label x_1, and d_2, without label. The subdirectory R_3 gives rise to the subdirectory R_31. R_31 contains the document d_1, without label. The subdirectory R_4 gives rise to the subdirectory R_41 containing the document d_3, without label. The subdirectory R_5 gives rise to the subdirectory R_51 containing the document d, of label x_2, and to the subdirectory R_52 containing the documents d, of label x_1, and d_2, without label.

The process continues with the next level. The subdirectory R_12 gives rise to the subdirectory R_125 containing the document d. The subdirectory R_13 contains the document d_1 without label; it does not give rise to any new subdirectory. The same holds for the subdirectories R_14, R_31 and R_41. The subdirectory R_15 gives rise to the subdirectory R_152 containing the document d. The subdirectory R_21 gives rise to the subdirectory R_215 containing the document d. The subdirectory R_25 gives rise to the subdirectory R_251 containing the document d. The subdirectory R_51 gives rise to the subdirectory R_512 containing the document d. The subdirectory R_52 gives rise to the subdirectory R_521 containing the document d. None of the documents contained in the subdirectories of this level has a label. The process stops at this level.

Figure 9 shows the complete tree at the end of the process.

5.5.2 Pruning

5.5.2.1 Inclusion of labels

To simplify the tree structure obtained, we now show how to remove some redundant parts. Suppose that for a label x and a label x' the following property holds: all the documents labeled with x' are also labeled with x. We say in this case that x contains x'. If x' also contains x, the labels x and x' are equivalent. In this case, the process of building the tree does not distinguish between x and x', for example by deleting one of the two labels. This deletion is valid only during construction, at the end of which the documents recover all their labels. In our example, the label x_2 is equivalent to the label x_5, which implies the deletion of the directories R_5, R_51, R_52, R_512, R_521, R_125, R_15, R_152, R_215, R_25 and R_251 (deletion of the label x_5 during the construction of the tree).

On the other hand, if x contains x' without x' containing x, then the usefulness of the label x' lies in distinguishing among the elements labeled x. Thus, in the process of creating the tree, the label x' causes the construction of a new subdirectory only in the part of the tree located under the directories derived from the label x.

In our example, the directories R_4 and R_41 must be deleted, since the label x_1 contains the label x_4. On the other hand, the directory R_14 resulting from x_4 is kept, since it is located under the directory R_1 resulting from x_1.
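The two inclusion relations can be detected directly from the label sets, as in the following Python sketch. The helper names documents_by_label and contains are hypothetical; the data are the label sets of the six-document example.

```python
def documents_by_label(docs):
    """Map each label to the set of documents that carry it."""
    index = {}
    for d, labels in docs.items():
        for x in labels:
            index.setdefault(x, set()).add(d)
    return index

def contains(index, x, y):
    """True if every document labeled y is also labeled x (x contains y)."""
    return index[y] <= index[x]

docs = {
    "d": {"x1", "x2", "x5"},
    "d1": {"x1", "x3"},
    "d2": {"x2", "x5"},
    "d3": {"x1", "x4"},
    "d4": {"x1"},
    "d5": {"x3"},
}

index = documents_by_label(docs)
labels = sorted(index)

# Pairs of equivalent labels (each contains the other), listed once.
equivalent = [(x, y) for x in labels for y in labels
              if x < y and contains(index, x, y) and contains(index, y, x)]

# Pairs (x, y) where x contains y but y does not contain x.
strict = [(x, y) for x in labels for y in labels
          if x != y and contains(index, x, y) and not contains(index, y, x)]
```

On these data the sketch finds the equivalence between x2 and x5 and the strict inclusion of x4 in x1, which are the two cases used for pruning in the example.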

Finally, the structure of the tree organization taking into account the inclusion of labels is given in figure 10.

5.5.2.2 Limitation of the number of labels

In order to simplify the structure, it may seem useful to limit the number of labels automatically assigned to a document by the system during the first phase of execution. Thus, it will be possible, for example, to limit this number to the maximum number of labels attributed to an already labeled document.

Let us return to our example, assuming that the maximum number of labels assigned to a document at the start is 2. We have seen that 3 labels are automatically assigned to the document d (namely x_1, x_2 and x_5). It is therefore appropriate in this example to delete one of the labels assigned to the document d.

The limitation can be implemented with the alpha thresholding. Starting from the matrix V, the alpha threshold is raised (in small steps, for example 0.001) as long as the average number of positive labels per document in the resulting matrix W is greater than a number set by the user. Since W_{i,j} = 1 only when V_{i,j} > alpha, a higher threshold retains fewer predicted labels.
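A minimal Python sketch of this adjustment is given below. It assumes that the threshold is moved upward, because a cell is positive only when V_{i,j} > alpha, so a higher alpha keeps fewer labels. It also assumes that alpha stays below 1, so that the labels given by the user (value +1) are never dropped. The names average_labels and adjust_alpha and the sample matrix are hypothetical.

```python
def average_labels(V, alpha):
    """Average number of cells per document with V[i][j] > alpha."""
    return sum(1 for row in V for v in row if v > alpha) / len(V)

def adjust_alpha(V, max_average, alpha=0.5, step=0.001):
    """Raise alpha in small steps until the average number of positive
    labels per document no longer exceeds max_average."""
    k = 0
    # Keep alpha strictly below 1 so that user labels (+1) stay positive.
    while (average_labels(V, alpha + k * step) > max_average
           and alpha + (k + 1) * step < 1.0):
        k += 1
    return alpha + k * step

# Hypothetical matrix V: 2 documents, 3 labels.
V = [[1, 0.7, 0.6],
     [0.9, -0.2, 1]]

new_alpha = adjust_alpha(V, max_average=2)
```

The step is applied as a multiple of a counter rather than accumulated, which limits floating-point drift over many iterations.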

Figure 11 shows the resulting tree after limiting the number of labels (the label x_1 was removed from the set of labels associated with the document d).

5.6 Changes to the tree structure

We have seen that the organization of the data takes place in two successive phases: the first phase labels the unlabeled documents and the second builds the tree. The positions occupied in the tree by initially unlabeled documents depend on how the process has labeled them. The user may then wish to correct the labels of one or more documents. In this case, the system takes these corrections into account (deletion and/or addition of labels for documents chosen by the user) according to one of the following processes:

1) Taking into account with immediate modification of the tree. The current tree is completely deleted. Labels assigned automatically to documents that are not labeled by the user are deleted. The process is executed again (automatic labeling of documents then construction of the tree) including the new labels.

2) Local consideration with delayed readjustment of the tree.

Each document whose associated label set has been modified is removed from all the subdirectories in which it appears. It is then reclassified on its own from the top of the hierarchy, by assigning it to each subdirectory for which it has the corresponding label. For each subdirectory to which the document has been assigned, the process is repeated recursively in its subdirectories, if they exist. After a certain number of modifications, the complete process is executed again (delayed readjustment). This number of modifications depends on the power of the machine that executes the organization process.
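The local reclassification can be sketched as follows in Python, on a tree stored as nested dictionaries. This representation, the helper names node, remove_document and reclassify, and the small hand-built tree are all hypothetical. In keeping with the description, the sketch only uses subdirectories that already exist and creates no new ones.

```python
def node(docs, children=None):
    """A directory: documents with their remaining labels, and child directories."""
    return {"docs": docs, "children": children or {}}

def remove_document(tree, doc):
    """Remove doc from the directory and from all its subdirectories."""
    tree["docs"].pop(doc, None)
    for child in tree["children"].values():
        remove_document(child, doc)

def reclassify(tree, doc, labels):
    """Place doc in this directory, then in each existing subdirectory whose
    label it carries, removing that label on the way down."""
    tree["docs"][doc] = set(labels)
    for x, child in tree["children"].items():
        if x in labels:
            reclassify(child, doc, labels - {x})

# Hypothetical tree for documents a = {x1, x2} and b = {x1}.
tree = node(
    {"a": {"x1", "x2"}, "b": {"x1"}},
    {
        "x1": node({"a": {"x2"}, "b": set()}, {"x2": node({"a": set()})}),
        "x2": node({"a": {"x1"}}, {"x1": node({"a": set()})}),
    },
)

# The user replaces the label x1 of document b with x2.
remove_document(tree, "b")
reclassify(tree, "b", {"x2"})
```

Because no directory is created or deleted, the result may differ from a full rebuild, which is why a complete reconstruction is still run later.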

5.6.1 Moving a document

In an implementation of the system in the form of a graphical interface, the following mechanism can be integrated: in order to modify the labels assigned to a document, the user performs a move/paste of a representative (icon) of this document to a subdirectory containing documents with the desired labels (if such a subdirectory exists).

In our example, we have seen that the labels x_2 and x_5 are assigned to the document d automatically. If the user selects the document d in the directory R_2 and moves it to the directory R_1 (see Figure 12), then the label x_1 is automatically assigned (with the value +1) instead of the labels x_2 and x_5 (which take the value -1) (see Figure 13).
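The effect of such a move on the matrix U can be sketched as follows in Python. The function name move_document and the dictionary layout of U and W are hypothetical. The sketch assumes that the labels "previously assigned" are those currently positive in W, and that the target directory corresponds to a single label.

```python
def move_document(U, W, labels, doc, target_label):
    """Record a move of doc to the directory of target_label:
    +1 for the target label, -1 for every other label currently positive in W."""
    for j, x in enumerate(labels):
        if x == target_label:
            U[doc][j] = 1
        elif W[doc][j] == 1:
            U[doc][j] = -1

labels = ["x1", "x2", "x3", "x4", "x5"]

# Document d was not labeled by the user; the system assigned x2 and x5.
U = {"d": [0, 0, 0, 0, 0]}
W = {"d": [0, 1, 0, 0, 1]}

# The user moves d from the directory R_2 to the directory R_1.
move_document(U, W, labels, "d", "x1")
```

The updated row of U then acts as a user decision when the labeling and the tree construction are run again.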

It should be noted that moving an initially unlabeled document provides a new labeled document at the start of the process. It is therefore possible that the automatic labeling of other documents is modified (through a modification of the neighborhood spheres). This modification is desirable, since it generalizes a correction made by the user to documents other than the one explicitly moved.

5.6.2 Deleting part of the tree

One or more subdirectories may be considered useless by the user. In this case, it is possible to delete them in the following way: the user selects a sub-directory to be deleted. This subdirectory was created from the labels {x_{i_1}, ..., x_{i_q}} (that is, the labels that gave rise to the selected directory and to the directories located on the path leading from the root to the selected directory). The system then automatically deletes the entire tree, deletes the labels {x_{i_1}, ..., x_{i_q}} from all the label sets associated with documents, and executes a new automatic construction of the tree structure taking into account the new labeling of the documents.

5.6.3 Modification of the labels

At any time, the user has the possibility to modify (add, delete, replace) one or more label (s) corresponding to one or more documents. The system takes this change into account as follows:

• Case of the addition of a new label y: the label is added to the list of labels L and to the matrices U, V and W.

• Case of the addition of an existing label y to a document d that did not have this label: U_{d,y} = +1, and the label is associated with the document in question.

• Case of the deletion of a label y from the set of labels associated with a document d (which previously had this label, either by an assignment of the user or by a prediction of the system): U_{d,y} = -1, and the label is negatively associated with the document in question.

5.7 Example of a system HMI

FIG. 14 shows an exemplary HMI of a system implementing a particular embodiment of the organization method according to the invention. The following three areas can be distinguished:

• An area 141 containing the tree;

• An area 142 containing a set of possible labels; and

• An area 143 containing representatives (icons) of the documents contained in the selected directory or sub-directory.

In this example, the sub-directory "CLOUDS" is selected (for this reason it appears in gray in the tree of the area 141).

5.8 Device implementing the organization method

FIG. 15 shows the simplified structure of an organization device according to a particular embodiment of the invention, which comprises a RAM 153 and a processing unit 151, equipped for example with a microprocessor and driven by a computer program 152 implementing the organization method according to the invention (for example the particular embodiment described above in relation to FIGS. 1 to 14). At initialization, the code instructions of the computer program 152 are for example loaded into the RAM before being executed by the processor of the processing unit 151. The processing unit 151 receives electronic documents 150 as input and allows the user, via an HMI, to label some of them. The processing unit 151 processes all the documents (some having been labeled by the user, and others not), according to the instructions of the program 152, in order to obtain a tree organization 154 (also called a tree structure or classification tree). The processing unit 151 outputs this tree organization 154 and allows the user, via an HMI, to modify it.

The device according to the invention can be integrated in a multimedia electronic equipment which comprises means for storing multimedia documents (images, music files, written documents, etc.). This equipment is for example a music file player with a graphical interface, a handheld computer or electronic organizer, a mobile phone with or without a photographic device.

Thus, the method of organizing electronic documents according to the invention can be implemented on this type of equipment for classifying the stored multimedia contents.

Claims

1. A method for organizing electronic documents from a set of documents comprising labeled documents, a labeled document being associated with a set of labels, characterized in that at least one labeled document is associated with a set of labels comprising at least two labels and in that the method comprises the following steps: a) labeling (P2) of the non-labeled documents of the set of documents, by prediction according to the labeled documents and descriptive data of the documents; from a current directory associated with a set of labeled documents: b) creation (P3; 63) of a sub-directory per distinct label among the set of labels associated with the documents of the current directory; for each sub-directory created: c) association (P3; 65) of the documents of the current directory whose label set includes at least one label corresponding to said created sub-directory; d) modification (P3; 66) of the label sets of the documents of the new sub-directory by subtracting the label corresponding to the sub-directory; from each sub-directory created, which becomes the current directory: e) iteration of steps b), c) and d) until sub-directories associated only with documents whose set of labels is empty are obtained.

2. Method according to claim 1, characterized in that associating a given document with a subdirectory means that said given document is physically moved from an initial storage unit to said given subdirectory.

3. Method according to claim 1, characterized in that associating a given document with a subdirectory means that said given document remains in an initial storage unit and a unique identifier of said given document is placed in said given subdirectory.

4. Method according to any one of claims 1 to 3, characterized in that step b) is preceded by the following step: - if the number of different labels associated with the documents of the current directory is greater than a predetermined threshold S, the labels are sorted in decreasing order of frequency and the first S labels are selected, only the selected labels being taken into account to perform the following steps.

5. Method according to any one of claims 1 to 4, characterized in that, if there are first and second labels such that: - the following condition is satisfied: for all the labeled electronic documents whose set of labels includes said second label, said set of labels also includes said first label, and - the following condition is satisfied: for all the labeled electronic documents whose set of labels includes said first label, said set of labels also includes said second label, then one of said first and second labels is deleted only during said steps b) to e), each labeled electronic document recovering all its labels at the end of said steps b) to e).

6. Method according to any one of claims 1 to 5, characterized in that, if there are first and second labels such that: - the following condition is satisfied: for all the labeled electronic documents whose set of labels includes said second label, said set of labels also includes said first label, and - the following condition is not satisfied: for all the labeled electronic documents whose set of labels includes said first label, said set of labels also includes said second label, then during said steps b) to e), said second label causes the creation of a subdirectory only under the subdirectory derived from the first label.

7. Method according to any one of claims 1 to 6, characterized in that said labeling step comprises a step of generating a matrix V as follows:

V_{i,j} = U_{i,j} if U_{i,j} ≠ 0
V_{i,j} = P(i, j) if U_{i,j} = 0
with i: an electronic document among the set of electronic documents, j: a label among a set of possible labels, P: a prediction function based on descriptive data of the electronic documents, and U: a matrix of the documents and of the labels assigned by a user, such that "U_{i,j} ≠ 0" means that the user has decided whether or not to associate the label j with the electronic document i, and "U_{i,j} = 0" means that the user has not expressed any opinion as to whether the label j should or should not be associated with the electronic document i.

8. Method according to claim 7, characterized in that:

U_{i,j} = 1, if the user has associated the label j with the electronic document i;

U_{i,j} = -1, if the user has decided not to associate the label j with the electronic document i;

P(i, j) returns a value between -1 and +1: the more negative the value of P(i, j), the less likely it is that the electronic document i is associated with the label j, and conversely the more positive the value of P(i, j), the more likely it is that the electronic document i is associated with the label j.

9. Method according to claim 8, characterized in that said labeling step comprises a step of thresholding said matrix V into a matrix W as follows: if V_{i,j} > alpha then W_{i,j} = 1, else W_{i,j} = 0, with alpha a determined threshold between 0 and 1, and in that only the non-zero elements W_{i,j} are used in said steps b) to e).

10. Method according to any one of claims 1 to 9, characterized in that it comprises a step of limiting the number of labels automatically allocated to an electronic document.

11. Method according to any one of claims 1 to 10, characterized in that said steps b) to e) are followed by the following steps: modification by a user, via a man / machine interface, of at least one set of labels among the sets of labels associated with the labeled electronic documents; deleting a tree organization resulting from the execution of said steps b) to e); new execution of said steps b) to e), taking into account said at least one set of labels modified by the user.

12. Method according to any one of claims 1 to 10, characterized in that said steps b) to e) are followed by the following steps: modification by a user, via a man/machine interface, of at least one set of labels among the sets of labels associated with the labeled electronic documents; for each labeled electronic document whose associated set of labels has been modified by the user, called the modified document:

* deletion of said modified document from each directory or sub-directory with which it is associated;

* then reclassification of said modified document starting from a root directory: said modified document is associated with each sub-directory whose corresponding label is included in the set of labels associated with said modified document, then, for each sub-directory with which the modified document has been associated, the association is repeated recursively in the sub-directories of lower levels, if they exist.

13. Method according to any one of claims 11 and 12, characterized in that said step of modification by a user of at least one set of labels is performed by: a move/paste, performed by the user via a graphical interface, of a representative of the document to be modified to a target subdirectory associated with documents whose label sets comprise one or more desired labels; and an automatic assignment to said document to be modified of the label corresponding to said target subdirectory, replacing the label or labels previously assigned to said document to be modified.

14. Method according to any one of claims 11 and 12, characterized in that said step of modification by a user of at least one set of labels is performed by: a selection, made by the user via the man/machine interface, of at least one sub-directory to be deleted; for each sub-directory to be deleted, an automatic deletion, from the label sets of the labeled electronic documents, of the label corresponding to the sub-directory to be deleted, as well as of the label corresponding to each sub-directory located between the sub-directory to be deleted and a root directory.

15. Computer program product downloadable from a communication network and/or recorded on a computer-readable medium and/or executable by a processor, characterized in that it comprises program code instructions for executing the steps of the method according to at least one of claims 1 to 14 when said program is run on a computer.

16. Device for organizing electronic documents from a set of documents comprising labeled documents, a labeled document being associated with a set of labels, characterized in that at least one labeled document is associated with a set of labels comprising at least two labels and in that the device comprises: a) means for labeling the non-labeled documents of the set of documents, by prediction according to the labeled documents and descriptive data of the documents; b) means for creating, applied to a current directory associated with a set of labeled documents, a sub-directory per distinct label among the set of labels associated with the documents of the current directory; c) means for associating, applied to each sub-directory created, the documents of the current directory whose label set includes at least one label corresponding to said sub-directory created; d) means for modifying the label sets of the documents of the new sub-directory by subtracting the label corresponding to the sub-directory; e) the means b), c) and d) being applied iteratively to each sub-directory created, which becomes the current directory, until sub-directories associated only with documents whose set of labels is empty are obtained.

17. Multimedia electronic equipment comprising means for storing multimedia documents, characterized in that it comprises means for implementing the method of organizing documents according to any one of claims 1 to 14.