WO2020229760A1

WO2020229760A1 - Method for multidimensional indexing of textual content

Info

Publication number: WO2020229760A1
Application number: PCT/FR2020/050766
Authority: WO
Inventors: Mirisaee HAMID; Cédric LAGNIER; Eric Gaussier; Agnès GUERRAZ; Guillaume EMERY
Original assignee: Skopai; Universite Grenoble Alpes
Priority date: 2019-05-15
Filing date: 2020-05-11
Publication date: 2020-11-19
Also published as: FR3096157A1

Abstract

The invention relates to a method for multidimensional indexing of digital textual content, comprising: - a first step of extracting the words from the textual content (6) to build a digital word table; - a second step of filtering consisting in deleting the non-significant words from the digital word table; - a third step consisting in vectorising each of the words to build a vector table from a vector model (5); - a fourth step of calculating a single vector according to the vectors of the vector table. According to the invention, there is also a step of: - building a table from the digital vectors neighbouring the single vector; - calculating a second vector representation of the textual content by combining the neighbouring vectors.

Description

DESCRIPTION

TITLE: MULTIDIMENSIONAL CONTENT INDEXING PROCESS

TEXT

FIELD OF THE INVENTION

The present invention relates to a method for automatic multidimensional indexing of digital textual content. Indexing records the concepts contained in a document in an organized and easily accessible form. It allows the recorded information to be searched with documentary research tools, and allows large volumes of documents to be analyzed automatically in order to carry out classifications, groupings by similarity of content, scheduling and, more generally, all types of automatic processing that make it possible to use large volumes of writings in an efficient and relevant manner.

Indexing dates from the 16th century and initially consisted of establishing a "table" of the significant terms of a work or a collection of works to facilitate access. Very quickly, the limits of such an approach, carried out empirically by documentalists, appeared. At the head of volume V of his Diversities (1610), Jean-Pierre Camus, the bishop of Belley, expressed his hostility to the practice of indexing, then designated as "tabular representation", and to the mode of reading that it induces: "Indexing is a popular mistake, which infects only weak brains, who call it the soul of the book, and it is the instrument of their stupidity. These people can be called Doctores tabularii, which sapiunt tantum per Indices. Will you ask them what they know? They ask you for a book to show it, and immediately at the Table to find what they are looking for, the skilled call it the Donkey Bridge." Jean-Pierre Camus: "The tables of the author's previous volumes, made by I do not know who, and without his knowing it, displease him, knowing that it is necessary to remove as much as possible what foments laziness, laziness mother of ignorance."

The development of information technology has made it possible to partially overcome the problem of cognitive bias induced by the personal culture of human documentalists, by automating processing using totally objective approaches. The introduction of digital XML-type formats has also led to the enrichment of texts with metadata facilitating the automatic indexing of digital documents.

A new step was taken with the development of vector indexing techniques, which pave the way for automatic processing such as similarity searches and nearest-neighbor searches, and which accelerate access to a large collection of data through their positions in a multidimensional space.

By way of illustration, the thesis of Thierry Urruty "Optimization of multidimensional indexing: application to multimedia descriptors" defended in 2007 at the University of Lille 1 presents the general principles of the processing of multimedia contents which have been the subject of multidimensional indexing. The relevance of these approaches is strongly dependent on the quality of the digital processing operations for constructing the digital representations of a textual document, and the present invention relates more particularly to this essential step in the automatic processing of content.

Several multidimensional indexing techniques have been developed. They are based on the same principle: first group the data of the database a priori, so that data close in space are in the same group, then develop algorithms which exploit a posteriori the structure put in place to carry out efficient searches in the database.

These techniques can be classified into three families: techniques based on the partitioning of data, known under the English names R*-tree, SR-tree, X-tree, etc., techniques based on the partitioning of space, kd-b-tree, LSDh-tree, PyramidTree, etc., and techniques based on compression, VA-File and its variants.

Several works have shown that these techniques are inefficient in high-dimensional spaces for various reasons. On the one hand, data groups are generally poorly formed, because the data structuring procedures are very sensitive to the order of insertion of the vectors and to the distribution of the data. On the other hand, the search procedures are unable to confine the search to a small subset of the data which it suffices to access to construct the result set. This last problem is mainly due to the complexity of the organization, generally a tree structure, of the groups of data.

The article "When is 'nearest neighbor' meaningful?" in the Proceedings of the 7th International Conference on Database Theory, 217-235, Jerusalem, Israel, January 1999, by K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, and the article "A quantitative analysis and performance study for similarity search methods in high-dimensional spaces" in the Proceedings of the 24th International Conference on Very Large Data Bases, 194-205, New York City, New York, USA, August 1998, by R. Weber, H.-J. Schek and S. Blott, have even shown that, in certain cases, the performance of the known multidimensional indexing techniques is lower than that of a simple sequential search.

TECHNOLOGICAL BACKGROUND OF THE INVENTION

Such a method is known for example from document EP3118751. It includes obtaining raw text, for example HTML source code extracted from a website, and preparing this raw text to form usable textual content (formatting, lemmatization). Keywords are then extracted from the actionable textual content.

Also known is the European patent issued under the number EP1828933 describing a document indexing method comprising steps of storing these documents in at least one tree structure of directories nested one inside the other from a root directory, a storage space, characterized in that it further comprises the steps of:

- for each document stored in the storage space, indexing by a computer the semantic content of the document and the names of the nested directories in which the document is stored,

- storing the result of the indexing in a multidimensional indexing base in association with the document present in the storage space.

Also known is the French patent issued under the number FR2835940 which relates to a method of searching for the k nearest neighbors of a query vector q in a multidimensional database of N vectors comprising a phase of structuring the base into clusters for the grouping of vectors and a search phase, characterized in that it comprises the following steps:

- calculation of the distance Dppc (C, p) of the center of gravity of a cluster C from the base to the nearest neighbor vector among the vectors of the cluster, p being a natural number greater than or equal to k,

- calculation of the distance distc (C, q) of the request vector q to the center of gravity of the cluster C,

- calculation of the sum distPc (C, q, p) of the distances distc (C, q) and Dppc (C, p),

- calculation, on all the clusters C of the base, of the smallest value distPc (C, q, p),

- elimination from the base of the clusters C whose distance mindist (C, q), which is the smallest distance between the request vector q and the enclosing sphere of the cluster C, is greater than the smallest value distPc.

Disadvantages of the Prior Art

The problem to be solved concerns the calculation of a vector representation of a document with textual content that is not limited to the use of this textual content alone, so as to allow positioning in a homogeneous multidimensional space with respect to the positioning of other documents with textual content.

In the solutions of the prior art, each document is processed on the basis of its own content, in order to calculate a vector representation which is then the object of comparison with the vector representation of other documents, by Euclidean distance calculations in a common multidimensional space.

Most of the documents analyzed are developed independently, each writer of a document having their own vocabulary, their own cognitive biases and their own thematic context, which results in content whose constituent terms and structure are not harmonized. The automatic processing operations applied on the basis of the multidimensional indexing of the prior art are therefore unreliable and lead to very approximate or even erroneous results.

OBJECT OF THE INVENTION

The present invention, based on the word embedding formalism, therefore seeks, by arithmetic calculation on vectors, to establish at least one vector representative of a textual content, this vector not necessarily forming part of the vectors associated with a keyword of the lexical field of the document. In other words, the present invention proposes to automatically index, by vectors which may be representative of keywords, a document or a collection of documents. These vectors and these keywords are representative of the content of the documents without precisely corresponding to the words they contain.

BRIEF DESCRIPTION OF THE INVENTION

With a view to achieving this aim, the object of the invention provides, in its most general sense, a method for multidimensional indexing of digital textual content comprising:

A first step of extracting words from said textual content to constitute a digital word table;

A second filtering step consisting in removing the non-significant words from said digital word table;

A third step consisting in vectorizing each of the words in order to construct a vector table from a vector model;

A fourth step of calculating a single vector which is a function of the vectors of said vector table.

According to the invention, the following is also carried out: the constitution of a table of neighboring digital vectors of said single vector;

the calculation of a second vector representation of the textual content by combining the neighboring vectors.

Preferably, the table of neighboring digital vectors of said single vector is established by:

the constitution of a first table of digital vectors neighboring said single vector;

calculating a set of N-tuples of vectors by combinations of said vectors from the first table;

calculating, for each of said N-tuples of vectors, a unique new vector to form the table of neighboring digital vectors.

According to a variant of this preferred embodiment, the method comprises an additional step of selecting at least one vector, from among the new unique vectors, having the highest occurrence to form a table of neighboring digital vectors.

According to a first variant, said table of vectors further comprises an indicator Oi depending on the number of occurrences of the word Mi associated with the vector Vi, in said textual content.

According to a second variant, not exclusive of the previous one, said table of vectors further comprises an indicator Fi depending on the number of appearances of the word Mi associated with the vector Vi, in said vector model.

Advantageously, said fourth step of calculating a single vector which is a function of the vectors of said table of vectors consists in calculating the average of said vectors.

Preferably, said fourth step of calculating a single vector as a function of the vectors of said vector table consists in calculating the weighted barycenter as a function of said indicators Oi and/or Fi of said vectors.

According to a particular embodiment, said second filtering step consists of removing from said digital word table the words of the plain text not included in the input dictionary of the vector model to form the textual content.

In a particular example of application, the method further comprises the following steps:

- identify in the linguistic model a first number of vectors closest to the single vector;

- identify in the linguistic model a second number of vectors closest to the second vector representation;

- retain the vectors common to the first and to the second number of vectors to form at least in part a list of key vectors.

Advantageously, the list of key vectors also includes vectors resulting from a graph analysis of the textual content.

According to a particular embodiment, the key vectors of the list of key vectors are associated with a degree of relevance.

Advantageously, the degree of relevance is a cosine similarity between the key vector and the single vector or the second vector representation.

The invention also relates to a method of grouping textual contents, characterized in that the above-mentioned multidimensional indexing is carried out for each of said textual contents and in that a grouping indicator is associated with the textual contents whose second vector representations have between them a Euclidean distance less than a threshold value.

The invention also relates to a method for searching for contents similar to a reference document, characterized in that for a collection of textual contents as well as for said reference document, the aforementioned multidimensional indexing is carried out and in that one searches for the textual contents whose associated second vector representation is closest to the second vector representation associated with said reference document.

The invention also relates to a method of graphically representing the positioning of documents with textual content, characterized in that for a collection of textual content, the aforementioned multidimensional indexing is carried out and in that a graphic symbol is displayed for each of said documents, the distance between the graphic symbols of two documents on the graphic interface being a function of the Euclidean distance between the second vector representations of each of said documents.

BRIEF DESCRIPTION OF THE FIGURES

Other characteristics and advantages of the invention will emerge from the detailed description of the invention which will follow with reference to the appended figures in which:

FIG. 1 represents a computer environment making it possible to implement a method for extracting keywords in accordance with the invention;

FIG. 2 represents the flowchart of an indexing method in accordance with the invention.

DETAILED DESCRIPTION OF THE INVENTION

Figure 1 shows a computer environment for implementing a method according to the invention.

A computer processing unit 1 is configured to execute a computer program. It is connected to data storage means 2 and connected to a computer network 3, for example the Internet network. The computer processing unit 1 also has all the conventional input-output interfaces (screen, keyboard, communication ports, etc.).

These computer resources provide in particular access to documentary resources 4, such as websites accessible via the network 3 or text files recorded in the storage means 2. These documentary resources 4 constitute raw texts which can form the input data of the method which is the subject of the present description.

Multidimensional linguistic model

A vector linguistic model 5 is also available, for example recorded in the storage means 2 of the computer environment of FIG. 1. This model associates each word of a dictionary with a respective vector.

The vector model aims to represent documents and queries as vectors in an n-dimensional space.

As was specified in the introduction, this linguistic model, which can be in the form of a simple word-vector data table, associates linguistically close words with equally close vectors in the multidimensional space in which these vectors are defined. The dimension of the vector space in which the vectors are defined can be very large, typically several hundred. The proximity of two vectors in this space can be determined by a measure of similarity of these two vectors, for example the cosine similarity.
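
By way of illustration only, the word-vector table and the cosine similarity measure can be sketched as follows. The three-dimensional toy model, its words and values, and the names `MODEL` and `cosine_similarity` are assumptions made for this sketch and are not taken from the application; a real model would have several hundred dimensions.

```python
import math

# Hypothetical toy model: each dictionary word is mapped to one vector.
MODEL = {
    "car": [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for null vectors
    return dot / (norm_u * norm_v)

# Linguistically close words are expected to have a higher similarity.
close = cosine_similarity(MODEL["car"], MODEL["truck"])
far = cosine_similarity(MODEL["car"], MODEL["banana"])
```

With these toy values, `close` is larger than `far`, which is the property the linguistic model is expected to provide.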

The vector linguistic model 5 may be a pre-existing model which is publicly accessible. But advantageously, when the method of extracting keywords targets a specific field of application, the vector linguistic model 5 has been previously developed from a corpus of documents of this field of application. Reference may be made to the document of the literature cited in the introduction to this application to obtain the details of the implementation making it possible to constitute, by learning, such a vector linguistic model from a corpus of selected documents.

Word extraction

In a preliminary step of the extraction process, a raw text 4 is provided which is prepared to form an exploitable textual content 6 of the digital word table type [M1; M2; ...; Mi], the format and content of which are suitable for its future processing. This supply step can be implemented by an extraction software module recorded in the storage means 2 and executed on the processing unit 1. This module accesses the plain text 4 from, for example, a Web address or access paths to the storage means 2 which are provided to it. This module can consist of or include an indexing robot ("web crawler", according to the English terminology usual in this field) which automatically explores the network to collect documentary resources 4 of interest. The plain text 4 is prepared, during a filtering step operated by the extraction software module, by conventional operations of eliminating non-significant words such as coordinating conjunctions, by lemmatization, or by any other operation making it possible to establish textual content comprising only words known to the linguistic model 5.
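
The extraction and filtering operations described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list, the model dictionary and the function names are hypothetical, and lemmatization is deliberately left out.

```python
import re

# Hypothetical stop-word list and model dictionary, for illustration only.
STOP_WORDS = {"the", "and", "or", "but", "of", "a"}
MODEL_DICTIONARY = {"index", "document", "vector", "model"}

def extract_words(raw_text):
    """First step: split the raw text into lowercase words (letters only)."""
    return re.findall(r"[^\W\d_]+", raw_text.lower())

def filter_words(words):
    """Second step: keep significant words known to the model, once each."""
    kept = [w for w in words if w not in STOP_WORDS and w in MODEL_DICTIONARY]
    return list(dict.fromkeys(kept))  # drop duplicates, keep first-seen order

raw_text = "The index of a document and the vector model of the document."
word_table = filter_words(extract_words(raw_text))
```

Here `word_table` plays the role of the digital word table [M1; M2; ...; Mi]; a production module would typically add lemmatization before the dictionary lookup.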

Generally, the dictionary of words forming the entry of the linguistic model 5 is established in a determined language, which does not make it possible to process raw texts expressed in other languages. To deal with this situation of raw texts in a foreign language, the invention provides for a translation step. This translation step is based on a plurality of vector translation models, one for each language that is to be processed. Each translation model is similar to the vector linguistic model 5, and associates a word with a vector in a multidimensional space, for example by means of a mapping table. The translation models and the linguistic model are consistent with each other, that is to say that two identical words in different languages are respectively linked to identical or very close vectors. There are many pre-existing and freely available translation models. To process a plain text in a foreign language, the translation model corresponding to this language is used to transform all the words into vectors, then the linguistic model is applied to perform the inverse transformation, i.e. transforming the vectors into words. It is thus possible to return to a plain text expressed in the language of the linguistic model, and to apply to it the preliminary processing making it possible to provide the textual content.
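
A possible reading of this translation step is sketched below, assuming that the inverse transformation is a nearest-vector lookup by cosine similarity in the linguistic model. The two toy models, their values and the function `translate` are hypothetical and serve only to illustrate the principle.

```python
import math

# Hypothetical aligned models: equivalent words have very close vectors.
FRENCH_MODEL = {"chat": [0.9, 0.1], "maison": [0.1, 0.9]}
ENGLISH_MODEL = {"cat": [0.88, 0.12], "house": [0.12, 0.88], "dog": [0.7, 0.4]}

def cosine_similarity(u, v):
    """Cosine similarity of two non-null vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def translate(words, source_model, target_model):
    """Map each known source word to the target word with the closest vector."""
    translated = []
    for word in words:
        vector = source_model.get(word)
        if vector is None:
            continue  # words unknown to the translation model are skipped
        nearest = max(target_model,
                      key=lambda w: cosine_similarity(vector, target_model[w]))
        translated.append(nearest)
    return translated

english_words = translate(["chat", "maison", "inconnu"], FRENCH_MODEL, ENGLISH_MODEL)
```

Skipping unknown words is a choice made in this sketch; the application does not state how such words are handled.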

We therefore have, at the end of this preliminary step, usable data 6, designated by “textual content” in the remainder of this description, which may be in the form of a string or digital table of unique words Mi recorded in the storage means 2, and whose format and content are regular. The textual content 6 associated with a plain text 4 can be composed of a large number of words, several hundred or even several thousand. The textual content 6 associated with an original plain text can therefore be systematically processed by the following steps of the method.

The preliminary step implemented by the extraction module can perform other operations on the original plain text 4, such as for example determining the TF-IDF ("term frequency-inverse document frequency") numerical coefficient of the words composing the textual content. This coefficient, the calculation of which is well known in the field, aims to numerically measure the importance of a word in a document. These coefficients can be recorded together with the words extracted from the processed raw text 4, in the form of an adequate data structure constituting the textual content 6.

The TF and IDF factors make it possible to consider the local and global weights of a term. A distinction is made between the frequency of occurrence of a term in a document (term frequency, TF) and the inverse of the frequency of occurrence of this same term in the entire collection considered (inverse document frequency, IDF). The TF*IDF measure makes it possible to approximate the representativeness of a term in a document, especially in corpora of documents of homogeneous sizes.
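
The application does not fix a particular TF-IDF formula. The sketch below assumes one common variant, TF = count / document length and IDF = log(N / df), where N is the number of documents and df the number of documents containing the term; the function name and the toy corpus are illustrative.

```python
import math
from collections import Counter

def tf_idf(document, corpus):
    """Return {word: tf * idf} for one tokenized document.

    The document is assumed to belong to the corpus, so that every word
    appears in at least one document (df >= 1).
    """
    counts = Counter(document)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / len(document)                    # local weight
        df = sum(1 for doc in corpus if word in doc)  # documents containing word
        idf = math.log(n_docs / df)                   # global weight
        scores[word] = tf * idf
    return scores

corpus = [
    ["vector", "index", "vector"],
    ["index", "document"],
    ["index", "tree"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus, "index" appears in every document and therefore receives a zero weight, while "vector", which is specific to the first document, receives a positive one.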

The extraction step results in a table of words Mi optionally associated with an occurrence indicator Oi as a function of the number of occurrences of the word Mi considered in the original document, as well as a frequency indicator Fi as a function of the frequency of appearance of the word Mi considered in the corpus constituting the vector model 5.

In a following step of the extraction method, an attempt is made to establish a first representation VU of the textual content 6. For this, the vector Vi corresponding to each unique word Mi composing the textual content 6 is determined using the vector linguistic model 5. Then, the word vectors Vi are combined together numerically to form this first vector representation VU of the textual content 6.

The combination can correspond to a simple average, but preferably this numerical combination is a barycenter calculation in which each vector Vi of a word Mi is weighted by a measure of importance of the corresponding word, for example Oi and/or Fi, i.e. the numerical TF and/or IDF coefficients of this word Mi in the plain text 4, which may have been established by the extraction software module during the preliminary step of the method.

At the end of this step, there is therefore a first single vector VU representative of the textual content 6 processed.
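
The calculation of the first single vector VU can be sketched as follows, assuming two-dimensional toy vectors and occurrence counts used as the indicators Oi. The function `weighted_barycenter` and all the values are hypothetical.

```python
def weighted_barycenter(vectors, weights):
    """Weighted mean of equal-length vectors; weights must not sum to zero."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

# Hypothetical vector table Vi and occurrence indicators Oi.
word_vectors = {"index": [1.0, 0.0], "document": [0.0, 1.0]}
occurrences = {"index": 3, "document": 1}

words = list(word_vectors)
vectors = [word_vectors[w] for w in words]

# Weighted barycenter: frequent words pull VU toward their own vector.
VU = weighted_barycenter(vectors, [occurrences[w] for w in words])

# Simple average: the special case where all weights are equal.
VU_simple = weighted_barycenter(vectors, [1] * len(words))
```

With these values, VU lies closer to the vector of "index" than the simple average does, and neither result coincides with a vector of the table, which illustrates that VU need not correspond to an existing word.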

It is noted that this first unique vector VU, calculated numerically, does not necessarily correspond to an existing word in the linguistic model, but it nevertheless follows very directly from the words of the textual content 6.

In order to try to break away from the lexical field precisely used in the plain text 4 and which is found in the textual content 6, a method conforming to the present description provides several additional steps seeking to provide a second vector representation VU_alpha of the textual content, which is precisely freed from the words extracted from the textual content 6.

Enrichment of the vector representation

Thus, during a new step of the method, a list of neighboring vectors VV_j of the first vector representation VU is established.

For this, we can rely on the vector linguistic model 5, for example by establishing a similarity coefficient between the first unique representative vector VU and each vector composing this model 5. As we have seen, this similarity coefficient can be calculated in practice as a measure of cosine similarity. This makes it possible to very easily determine the list of vectors of this model 5 located in a neighborhood of the first representative vector VU, that is to say those whose degree of similarity with VU is greater than a predetermined threshold (equivalently, whose distance to VU is less than a threshold). Alternatively, this list of vectors can have a predetermined size, and in this case the neighboring vectors VV_j are chosen as the vectors of the model 5 whose degrees of similarity with the first representative vector VU are the highest.
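
The two ways of building the neighbor list can be sketched as follows, under the assumption that the neighbors of VU are the model vectors with the highest cosine similarity to VU. The toy model and the function names are hypothetical.

```python
import math

MODEL = {"car": [0.9, 0.1], "truck": [0.8, 0.3], "banana": [0.1, 0.9]}

def cosine_similarity(u, v):
    """Cosine similarity of two non-null vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def neighbors_by_threshold(vu, model, threshold):
    """Words whose vectors are at least `threshold`-similar to vu."""
    return [w for w, v in model.items() if cosine_similarity(vu, v) >= threshold]

def nearest_neighbors(vu, model, size):
    """The `size` words whose vectors are most similar to vu."""
    ranked = sorted(model, key=lambda w: cosine_similarity(vu, model[w]),
                    reverse=True)
    return ranked[:size]

VU = [1.0, 0.0]
in_neighborhood = neighbors_by_threshold(VU, MODEL, 0.9)
closest = nearest_neighbors(VU, MODEL, 1)
```

An exhaustive scan of the model is used here for clarity; for a large dictionary an approximate nearest-neighbor index would normally replace it.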

There are many other ways to build this list. In an advantageous example, a first list of the N vectors of the linguistic model 5 closest to the first representative vector VU is formed first. For each of the vectors of the first list, the M closest vectors in the linguistic model 5 are searched for in turn, and a second list is formed which brings together these N*M vectors. N and M can for example be between 5 and 20, typically 10. In this way, the second list collects the vectors present in the neighborhood of the neighborhood of the first representative vector VU, which captures a large variety of vectors and therefore detaches the list from the textual content 6 or from the original plain text 4. This recurrence could moreover be continued a greater number of times in order to further diversify the second list of vectors, or other approaches could be applied, in addition to or in replacement of those proposed, to further increase this diversity.

The second list of vectors may have a particularly large size, and include insignificant vectors. For this reason, in a preferred mode of implementation of the method, the list of neighboring vectors VV_j of the first vector representation VU established in this example does not correspond exactly to the second list. The list of neighboring vectors VV_j of the first vector representation VU is preferably established by choosing from the second list the group of vectors having the greatest number of occurrences. It is thus possible to choose, by way of example, 5 to 10 vectors to form the list of neighboring vectors VV_j of the first representative vector VU of a textual content 6.
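
The "neighborhood of the neighborhood" construction and the selection by number of occurrences can be sketched as follows. The toy model, the small values of N and M, and the function names are assumptions made for this sketch; in particular, a vector is allowed here to be its own nearest neighbor, a point on which the application is silent.

```python
import math
from collections import Counter

MODEL = {"a": [1.0, 0.0], "b": [0.9, 0.2], "c": [0.7, 0.7], "d": [0.0, 1.0]}

def cosine_similarity(u, v):
    """Cosine similarity of two non-null vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def nearest(vector, model, size):
    """The `size` words whose vectors are most similar to `vector`."""
    ranked = sorted(model, key=lambda w: cosine_similarity(vector, model[w]),
                    reverse=True)
    return ranked[:size]

def neighbor_list(vu, model, n, m, keep):
    """Second-order neighbors of vu, restricted to the most frequent ones."""
    first_list = nearest(vu, model, n)
    second_list = []
    for word in first_list:
        second_list.extend(nearest(model[word], model, m))  # N*M entries in total
    counts = Counter(second_list)
    return [word for word, _ in counts.most_common(keep)]

VV = neighbor_list([1.0, 0.1], MODEL, n=2, m=2, keep=2)
```

The words returned in `VV` stand for the neighboring vectors VV_j; their vectors are read back from the model when the second representation is computed.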

In another advantageous example for forming a table of neighboring vectors VV_j, a first table of the vectors of the linguistic model 5 closest to the first representative vector VU is formed first, just as in the previous example. Combinations of vectors from the first table are then calculated to form N-tuples of N vectors VV_j_alpha from the first table, and for each N-tuple the mean or barycenter is calculated in the form of a unique new vector VN_alpha. All the possible N-tuples in this first table may be determined, or only part of them. We denote by K the number of determined N-tuples, and therefore of determined new unique vectors VN_alpha. For each of the vectors VN_alpha, the M closest vectors in the linguistic model 5 are searched for in turn, and a second list is formed which brings together these K*M vectors. A large variety of vectors is thus collected. As in the previous example, the list of neighboring vectors VV_j of the first vector representation VU is preferably established by choosing from the second list the group of vectors having the largest number of occurrences. It is thus possible to choose, by way of example, 5 to 10 vectors to form the list of neighboring vectors VV_j of the first representative vector VU of a textual content 6.
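
The formation of the new unique vectors VN_alpha from N-tuples can be sketched as follows, assuming that the N-tuples are unordered combinations without repetition and that the simple mean is used. The function names and the toy first table are hypothetical.

```python
from itertools import combinations

def mean_vector(vectors):
    """Component-wise average of a non-empty group of equal-length vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def tuple_vectors(first_table, n):
    """One new vector VN_alpha per N-tuple drawn from the first table."""
    return [mean_vector(group) for group in combinations(first_table, n)]

# Hypothetical first table of three neighboring vectors of VU.
first_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_vectors = tuple_vectors(first_table, 2)  # K = 3 pairs
```

Each vector of `new_vectors` would then be used as a query for the M closest vectors of the model, exactly as VU is used in the first example.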

Whatever method is chosen to develop the list of neighboring vectors VV_j, the vectors forming this list can then be combined numerically with each other, for example using a simple average, to establish a second vector representation in the form of a second unique representative vector VU_alpha of the textual content.

This vector VU_alpha, just like the first vector representation VU, does not necessarily correspond to a word from the input dictionary of the vector linguistic model 5. Also, in an example application, to establish at least one keyword from these representative vectors VU, VU_alpha, it is necessary to project them into the frame of reference defined by the vector linguistic model 5 in order to obtain at least one representative vector associated in this model with at least one word from the input dictionary. This word or these words will form the keyword extracted from the textual content 6, which is representative thereof and which can make it possible, for example, to index it.

Vector and keyword extraction

To this end, the method can comprise an additional step aimed at forming a list of key vectors, contained in the vector linguistic model 5, this list of key vectors comprising vectors close to the first and second representations VU, VU_alpha.

For this, we can for example identify in the vector linguistic model 5, respectively, a first number and a second number of vectors closest to the first representation VU and to the second representation VU_alpha. Again, the proximity calculation in this processing can be based on cosine similarity. Then the vectors common to this first and second number of vectors are retained, that is to say that the intersection of these two sets is taken to form at least in part the list of key vectors. The first and second number of vectors can be chosen quite freely, for example between 10 and 200.
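
The intersection of the two neighborhoods can be sketched as follows. The toy model, the two representative vectors and the function names are hypothetical, and the small neighborhood sizes stand in for the values between 10 and 200 mentioned in the text.

```python
import math

MODEL = {"a": [1.0, 0.0], "b": [0.9, 0.2], "c": [0.7, 0.7], "d": [0.0, 1.0]}

def cosine_similarity(u, v):
    """Cosine similarity of two non-null vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def nearest(vector, model, size):
    """The `size` words whose vectors are most similar to `vector`."""
    ranked = sorted(model, key=lambda w: cosine_similarity(vector, model[w]),
                    reverse=True)
    return ranked[:size]

def key_vectors(vu, vu_alpha, model, first_number, second_number):
    """Words whose vectors are close to both representations."""
    near_vu = nearest(vu, model, first_number)
    near_alpha = set(nearest(vu_alpha, model, second_number))
    return [w for w in near_vu if w in near_alpha]  # order of proximity to vu

VU, VU_ALPHA = [1.0, 0.1], [0.6, 0.8]
keys = key_vectors(VU, VU_ALPHA, MODEL, 3, 3)
```

With neighborhoods that are too small the intersection can be empty, which is why the two numbers are best chosen generously.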

In a following step of the method, this list of key vectors, or part of it, can be transformed into a list of keywords, by relying on the vector linguistic model 5. This list can thus form the keywords indexing the textual content which has just been processed.

Preferably, however, a more limited number of keywords is provided than the number of vectors making up the list of key vectors. The method then comprises a step of selecting at least one key vector from the list. In order to carry out this selection, the key vectors can be ordered in decreasing order of proximity to the first and second representative vectors VU and VU_alpha. The selection then consists in taking first the key vectors having the closest proximity. This ensures the relevance of the keywords chosen. In other words, at least one key vector is chosen from the list of key vectors and at least one keyword representative of the textual content 6 is established by determining, using the linguistic model, the keyword or keywords corresponding to the chosen key vector or vectors.
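
The ordering and selection of key vectors can be sketched as follows. The text does not say how the proximities to VU and VU_alpha are combined; this sketch assumes the mean of the two cosine similarities as the degree of relevance. The key vectors, the function name and all values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two non-null vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def select_key_words(key_vectors, vu, vu_alpha, count):
    """Return the `count` most relevant (word, relevance) pairs."""
    def relevance(word):
        vector = key_vectors[word]
        # Assumed combination: average proximity to both representations.
        return (cosine_similarity(vector, vu)
                + cosine_similarity(vector, vu_alpha)) / 2
    ranked = sorted(key_vectors, key=relevance, reverse=True)
    return [(word, relevance(word)) for word in ranked[:count]]

# Hypothetical list of key vectors, keyed by their word in the model.
KEY_VECTORS = {"y": [0.0, 1.0], "x": [1.0, 0.5]}
selected = select_key_words(KEY_VECTORS, [1.0, 0.0], [0.8, 0.6], count=1)
```

Other combinations, such as the minimum of the two similarities or the similarity to VU_alpha alone, would fit the description equally well.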

The list of key vectors can be completed by other methods, so that the selection is as rich as possible. For example, the list of keywords can be extended with keywords resulting from a graph analysis of the textual content 6, as presented in the introduction of this application.

A method in accordance with the present description can find many other applications.

It can for example be applied to the grouping of textual content. In this example, the multidimensional indexing method which has just been presented is applied to available textual contents, and a grouping indicator R is associated with the contents whose second vector representations VU_alpha have a Euclidean distance between them less than a threshold value D, which can be predetermined.
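
The grouping application can be sketched as follows, assuming that the grouping indicator R is propagated transitively: two contents share a group if a chain of contents, each closer than D to the next, connects them. This interpretation, the function name and the toy representations are assumptions; `math.dist` requires Python 3.8 or later.

```python
import math

def group_contents(representations, threshold):
    """Map each content name to a group label R.

    `representations` maps a content name to its second vector
    representation VU_alpha; `threshold` is the distance D.
    """
    labels = {}
    next_label = 0
    for name in representations:
        if name in labels:
            continue
        labels[name] = next_label
        stack = [name]
        while stack:  # spread the label to every content reachable within D
            current = stack.pop()
            for other, vector in representations.items():
                if other not in labels and \
                        math.dist(representations[current], vector) < threshold:
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels

docs = {"doc1": [0.0, 0.0], "doc2": [0.1, 0.0], "doc3": [5.0, 5.0]}
groups = group_contents(docs, threshold=0.5)
```

Here "doc1" and "doc2" receive the same indicator, while "doc3" forms a group of its own.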

It can also be applied to the search for content similar to a reference document. For a collection of textual contents as well as for said reference document, the aforementioned multidimensional indexing is then carried out, and the textual contents whose associated second vector representation VU_alpha is closest to the single vector associated with said reference document are sought.
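
The search for similar contents can be sketched as follows, assuming that "closest" is measured by Euclidean distance, as in the grouping application. The collection, the reference vector and the function name are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math

def most_similar(reference, collection, count):
    """Names of the `count` contents whose representation is closest to `reference`."""
    ranked = sorted(collection,
                    key=lambda name: math.dist(reference, collection[name]))
    return ranked[:count]

# Hypothetical second vector representations VU_alpha of three contents.
collection = {"doc1": [0.0, 0.0], "doc2": [1.0, 1.0], "doc3": [0.2, 0.1]}
reference_vector = [0.1, 0.1]  # representation of the reference document
similar = most_similar(reference_vector, collection, 2)
```

Cosine similarity could be substituted for the Euclidean distance by sorting in decreasing order of similarity instead.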

In another example application, it is possible to form a graphic representation of the positioning of documents with textual content. For a collection of documents with textual content, the above-mentioned multidimensional indexing is carried out and a graphic symbol is displayed for each of said documents, the distance between the graphic symbols of two documents on the graphic interface being a function of the distance, for example Euclidean, or of the similarity between the second vector representations VU_alpha of each of said documents.

Of course, the invention is not limited to the embodiment described and variant embodiments can be provided without departing from the scope of the invention as defined by the claims.

It should be noted that the methods described here are intended to be implemented by instructions stored on a computer-readable medium and executed by a machine, an apparatus or a device for executing instructions, such as a computer or a machine based on or containing a processor.

Claims

1. Method for multidimensional indexing of digital textual content, the method being implemented by a device for executing instructions, and comprising:

A first step of extracting words from said textual content (6) to constitute a digital table of words (Mi);

A second filtering step consisting in deleting the non-significant words from said digital word table (Mi);

A third step consisting in vectorizing each of the words (Mi) to construct a table of vectors (Vi) from a vector linguistic model (5);

A fourth step of calculating a single vector (VU) as a function of the vectors of said vector table (Vi), the single vector forming a first vector representation (VU) of the textual content;

characterized in that it further comprises:

constituting a table of neighboring digital vectors (VVj) of said single vector (VU);

calculating a second vector representation (VU_alpha) of the textual content by combining the neighboring vectors (VVj).

2. Method for multidimensional indexing of digital textual content according to claim 1, in which the table of neighboring digital vectors (VVj) of said single vector (VU) is established by:

constituting a first table of digital vectors neighboring said single vector (VU);

calculating a set of N-tuples of vectors (VV_alpha) by combinations of said vectors from the first table;

calculating, for each of said N-tuples of vectors (VV_alpha), a unique new vector (VN_alpha) to form the table of neighboring digital vectors (VVj).

3. Method for multidimensional indexing of digital textual content according to the preceding claim, characterized in that it comprises an additional step of selecting at least one vector, from among the new unique vectors (VN_alpha), exhibiting the highest occurrence to form a table of neighboring digital vectors (VVj).

4. Method for multidimensional indexing of a digital textual content according to one of the preceding claims, characterized in that said vector table (Vi) further comprises an occurrence indicator (Oi) depending on the number of occurrences of the word Mi associated with the vector Vi, in said textual content.

5. Method for multidimensional indexing of a digital textual content according to one of the preceding claims, characterized in that said vector table (Vi) further comprises a frequency indicator (Fi) depending on the number of appearances of the word Mi associated with the vector Vi, in the corpus constituting said vector model (5).

6. Method for multidimensional indexing of a digital textual content according to one of claims 1 to 3, characterized in that said fourth step of calculating a single vector (VU) depending on the vectors of said table of vectors (Vi) consists in calculating the average of said vectors (Vi).

7. Method for multidimensional indexing of a digital textual content according to claim 4 or 5, characterized in that said fourth step of calculating a single vector (VU) as a function of the vectors of said table of vectors (Vi) consists in calculating the barycenter weighted according to said indicators of occurrence (Oi) and / or frequency (Fi) of said vectors (Vi).

8. Method for multidimensional indexing of a digital textual content according to one of the preceding claims, in which said second filtering step consists in deleting from said digital word table (Mi) the words of the plain text not included in the input dictionary of the vector model (5) to form the textual content (6).

9. Method for multidimensional indexing of digital textual content according to one of the preceding claims, comprising the following steps:

- identifying in the linguistic model (5) a first number of vectors closest to the single vector (VU);

- identifying in the linguistic model (5) a second number of vectors closest to the second vector representation (VU_alpha);

10. Method for multidimensional indexing of a digital textual content according to the preceding claim, in which the list of key vectors also comprises vectors resulting from a graph analysis of the textual content (6).

11. Method for multidimensional indexing of digital textual content according to one of the two preceding claims, in which the key vectors of the list of key vectors are associated with a degree of relevance.

12. Method for multidimensional indexing of digital textual content according to the preceding claim, in which the degree of relevance is a cosine similarity between the key vector and the single vector (VU) or the second vector representation (VU_alpha).

13. A method of grouping textual contents, characterized in that a multidimensional indexing according to at least one of claims 1 to 11 is carried out for each of said textual contents, and in that a grouping indicator (R) is associated with the textual contents whose second vector representations (VU_alpha) have a Euclidean distance between them less than a threshold value (D).

14. Method of searching for content similar to a reference document, characterized in that, for a collection of textual contents as well as for said reference document, a multidimensional indexing according to at least one of claims 1 to 11 is carried out, and in that a search is made for the textual contents whose associated second vector representation (VU_alpha) is closest to the second vector representation (VU_alpha) associated with said reference document.

15. A method of graphically representing the positioning of documents with textual content, characterized in that, for a collection of textual contents, a multidimensional indexing according to at least one of claims 1 to 11 is carried out, and in that a graphic symbol is displayed for each of said documents, the distance between the graphic symbols of two documents on the graphic interface being a function of the Euclidean distance between the second vector representations (VU_alpha) of each of said documents.