EP4127965A1 - Computer-implemented method for the analogical search of documents - Google Patents
Computer-implemented method for the analogical search of documents
- Publication number
- EP4127965A1 (application EP21716109.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- map
- documents
- self
- database
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
Definitions
- the present invention relates to the field of computer-implemented methods for searching for documents. More precisely, the present invention relates to a computer-implemented method for the analogical search of documents in a set of documents. It specifically relates to a computer-implemented method which makes it possible to identify the documents in a set of documents which most closely match a search term that is not necessarily contained in the documents of the set.
- search methods use a series of key words or terms matched to a known body of information, such as the content of indexed documents.
- a user enters a search word or phrase and a search engine matches the word or phrase to a list of words derived from the corpus.
- the search engine then displays the documents that match the word or phrase in question.
- conventional search methods fail to take into account the two main problems inherent in searching.
- the first problem relates to imprecise searches.
- in an imprecise search, a user enters incorrect information into the query. If the user does not enter the correct words in the query, the retrieved documents will not represent the user's intention. For example, if a user enters a particular search word in a query, but the desired information is indexed under a synonym of the search word, then the user will not find the desired information.
- the second problem relates to vague searches.
- in a vague search, the user does not know enough about the subject of the desired information to form a precise search query. For example, if the user does not know the language to describe a particular medical condition, then the user will not be able to correctly enter the search query.
- US Pat. No. 8,886,579 B2 relates to the semantic processing of text by neural networks, that is to say the analysis of the meaning of a text by focusing on the relationship between its words and what they represent in the real world and in their context.
- the patent US9183288B2 presents an efficient method of structuring the data for an efficient and reliable search by exploiting the semantic relations between the documents. It uses semantic analysis techniques to create a vector space model of the documents contained in a domain corpus, then creates a hierarchical structure of the documents through an agglomeration process. Each document in a domain corpus is matched against the other documents in the same domain corpus to determine which documents are the most similar.
- An aim of the present invention is therefore to provide a computer-implemented method for the analog search of text documents making it possible to overcome the limitations mentioned previously. According to the invention, these objects are achieved by virtue of the objects of the independent claim. More specific aspects of the present invention are set out in the dependent claims as well as in the description.
- an aim of the invention is achieved by virtue of a computer-implemented method for the analogical search of the textual documents of a set of documents E included in a first database whose content corresponds the most to a search term R, comprising the steps: a. Generation of a second database comprising a list of words produced by lemmatization of the documents of the first database; b. Generation of a descriptor vector V of digital values for each document of the first database using a vectorization function F of the textual information; c. Learning of a self-organized map C comprising a network of P neurons p on the basis of the descriptor vectors V, each neuron p of the self-organized map C corresponding to a weight vector W of digital values; d. Allocation of each document from the first database to the neuron p of the self-organized map C whose corresponding weight vector W has the smallest distance from the descriptor vector V of the document to be allocated; e. Generation, using the vectorization function F and the second database, of a search vector K of numerical values for the search term R; f. Determination of the neuron pbest of the self-organized map C whose weight vector W has the smallest distance from the search vector K; and g. Determination of the documents from the first database allocated to the neuron pbest of the self-organized map C.
- thanks to these steps it is possible to identify the document or documents of a set of documents whose content is closest to a search term.
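- to make steps a. to g. concrete, the following minimal Python sketch runs the whole chain on a toy corpus; the corpus, the naive word-count vectorization and all variable names are illustrative assumptions, not taken from the patent itself:

```python
import numpy as np

documents = ["the spy crosses the bridge", "a love story in paris",
             "the agent defuses a bomb on the bridge"]          # first database (set E)
vocabulary = sorted({w for d in documents for w in d.split()})  # second database (step a, naive lemmatization)

def vectorize(text):                                            # vectorization function F (step b, raw counts)
    return np.array([text.split().count(w) for w in vocabulary], dtype=float)

V = np.array([vectorize(d) for d in documents])                 # descriptor vectors V

P = 4                                                           # neurons of the self-organized map C (step c)
rng = np.random.default_rng(0)
W = rng.random((P, V.shape[1]))
for t in range(100):                                            # crude training loop (no neighborhood; see SOM section)
    v = V[rng.integers(len(V))]
    c = np.argmin(np.linalg.norm(W - v, axis=1))                # winning neuron
    W[c] += 0.1 * (v - W[c])

allocation = [int(np.argmin(np.linalg.norm(W - v, axis=1))) for v in V]   # step d

R = "bridge spy"                                                # search term R
K = vectorize(R)                                                # search vector K (step e)
pbest = int(np.argmin(np.linalg.norm(W - K, axis=1)))           # step f
print([i for i, p in enumerate(allocation) if p == pbest])      # step g: matching document IDs
```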
- the search result is based on the similarity of the documents of the set of documents to the search term. The most "similar" documents can be identified even if the search term is not contained in any of the documents in the document set. Note that the learning of the self-organized map C in step c. is unsupervised.
- the self-organized map C is composed of a low-dimensional network of P neurons p.
- in a one-dimensional arrangement, each neuron has two neighbors.
- the arrangement of the neurons is done in a rectangular way where each neuron has four neighbors (rectangular topology) or in a hexagonal way where each neuron has six neighbors (hexagonal topology). Neurons are recognized by their number and their location in the network.
- the descriptor vectors V are projected from their initial space, or input space, towards the map, or output space.
- Each neuron p on the map is associated with a weight vector W, also called the prototype or referent vector, belonging to the input space.
- let P be the total number of neurons in the map.
- the weight vector W of the neuron p, of dimension N, is denoted by: $W_p = (w_{p,1}, w_{p,2}, \dots, w_{p,N})$
- the objective of learning the map consists in updating the weight vectors so as to best approximate the distribution of the descriptor vectors V while reproducing the self-organization of the neurons on the map.
- the map is learned in sequential mode, also called incremental, or in deferred mode (batch).
- Each iteration t of sequential learning comprises two stages.
- the first step consists in choosing at random a descriptor vector V(t) from the set of descriptor vectors, and in presenting it to the network in order to determine its winning neuron.
- the winning neuron (Best Matching Unit) of an observation is the neuron whose weight vector is closest to it within the meaning of a given distance (for example a Euclidean distance). If c is the winning neuron of the vector V(t), c is determined as follows: $c = \arg\min_{p} \lVert V(t) - W_p(t) \rVert$
- the winning neuron is activated. Its weight vector W is updated to approximate the descriptor vector presented to the network. This update does not only concern the winning neuron, as in competitive learning methods ("winner takes all"), but also the neighboring neurons, which then see their weight vectors adjust to this descriptor vector.
- the amplitude of this adjustment is advantageously determined by the value of a learning step α(t) and the value of a neighborhood function h(t).
- the parameter α(t) regulates the speed of learning. It is initialized with a large value at the beginning, then decreases with the iterations in order to slow down as the learning process progresses.
- the function h(t) defines membership in the neighborhood. It depends both on the location of the neurons on the map and on a certain neighborhood radius. In the first iterations, the neighborhood radius is large enough to update a large number of neurons neighboring the winning neuron, but this radius gradually narrows to contain only the winning neuron with its immediate neighbors, or even the winning neuron only.
- the rule for updating the weight vectors is as follows: $W_p(t+1) = W_p(t) + \alpha(t)\, h_{c,p}(t)\, [V(t) - W_p(t)]$, where c is the winning neuron of the descriptor vector V(t) presented to the network at iteration t and h is the neighborhood function which defines the proximity between the neurons c and p.
- a neighborhood function between the winning neuron c and a neuron p of the map is worth 1 if the neuron p is inside the square centered on the neuron c, and 0 in the other cases. The radius of this square is called the neighborhood radius. It is wide at the start, then narrows with the iterations to contain, at the end of the learning, only the neuron c with its immediate neighbors, or even just the neuron c.
- a more flexible and common neighborhood function is the Gaussian function defined below: $h_{c,p}(t) = \exp\left(-\frac{\lVert r_c - r_p \rVert^2}{2\sigma(t)^2}\right)$, where $r_c$ and $r_p$ are respectively the location of the neuron c and of the neuron p on the map, and σ(t) is the radius of the neighborhood at iteration t of the learning process.
- the amplitude of the adjustment is graduated according to the distance from the winning neuron which reserves for itself the maximum amplitude.
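- as a hedged illustration of the sequential learning just described (random draw, BMU search, decreasing learning step α(t), shrinking Gaussian neighborhood h(t)), a possible Python implementation is sketched below; the linear decay schedules are one arbitrary choice among many:

```python
import numpy as np

def train_som_sequential(V, rows, cols, iters=1000, a0=0.5, seed=0):
    """Sequential (incremental) SOM learning on a rectangular grid.

    V: (n, N) descriptor vectors; the map has rows*cols neurons.
    a0 is the initial learning step alpha(0)."""
    rng = np.random.default_rng(seed)
    P, N = rows * cols, V.shape[1]
    W = rng.random((P, N))                               # random initialization
    grid = np.array([(i // cols, i % cols) for i in range(P)], dtype=float)
    sigma0 = max(rows, cols) / 2.0                       # initial neighborhood radius
    for t in range(iters):
        v = V[rng.integers(len(V))]                      # pick a descriptor vector at random
        c = np.argmin(np.linalg.norm(W - v, axis=1))     # winning neuron (BMU), Euclidean distance
        frac = t / iters
        alpha = a0 * (1.0 - frac)                        # learning step decreases with t
        sigma = sigma0 * (1.0 - frac) + 1e-3             # neighborhood radius narrows with t
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian neighborhood h(t)
        W += alpha * h[:, None] * (v - W)                # update rule
    return W
```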
- the result of the unsupervised learning of the self-organized map C is the nonlinear projection of the set of descriptor vectors V on the map.
- Each descriptor vector V is attributed to its winning neuron, which makes it possible to allocate each document of the set of documents to a neuron of the self-organized map.
- this projection preserves the topology of the data through the use of the neighborhood function. Two neighboring neurons on the map will represent close observations in the data space. A variant of learning is said to be “in deferred mode”.
- in this mode, each weight vector is a weighted average of the descriptor vectors $(V_i,\ i \in \{1, \dots, n\})$.
- the corresponding weights are the values of the neighborhood function h(t).
- the rule for updating the prototype vectors is given by: $W_p(t+1) = \frac{\sum_{i=1}^{n} h_{c(V_i),p}(t)\, V_i}{\sum_{i=1}^{n} h_{c(V_i),p}(t)}$, where h is the value of the neighborhood function between the winning neuron $c(V_i)$ of the vector $V_i$ and the neuron p.
- the updating of the prototype vectors can be formulated in another way by using the fact that the observations which have the same winning neuron have the same value for the neighborhood function and belong to the Voronoi region whose center is their winning neuron: $W_p(t+1) = \frac{\sum_{i=1}^{P} n_i\, h_{i,p}(t)\, \bar{V}_i}{\sum_{i=1}^{P} n_i\, h_{i,p}(t)}$, where $n_i$ is the number of descriptor vectors V belonging to the Voronoi region represented by the neuron i and $\bar{V}_i$ is the average of the descriptor vectors of this same region.
- each weight vector constitutes the center of gravity of the observations that it represents and we then fall back on the moving-centers algorithm, which guarantees a better approximation of the observation density function.
- this algorithm does not present any convergence problems.
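- the deferred-mode (batch) update above can be sketched as follows; this assumes the same rectangular grid coordinates as in the previous sketch and recomputes all winning neurons once per epoch:

```python
import numpy as np

def batch_epoch(V, W, grid, sigma):
    """One deferred-mode (batch) epoch: every weight vector becomes a
    weighted average of the descriptor vectors, the weights being the
    neighborhood values attached to their winning neurons."""
    # BMU of every observation (Euclidean distance)
    bmu = np.argmin(np.linalg.norm(V[:, None, :] - W[None, :, :], axis=2), axis=1)
    d2 = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=2)
    h = np.exp(-d2 / (2.0 * sigma ** 2))       # h[c, p] between neurons c and p
    Hw = h[bmu]                                # (n, P): neighborhood weight of each observation
    num = Hw.T @ V                             # weighted sum of descriptor vectors
    den = Hw.sum(axis=0)[:, None]              # sum of the weights
    return num / np.maximum(den, 1e-12)        # new weight matrix W
```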
- the self-organized map is a two-dimensional or three-dimensional map.
- the initialization of the map C before the learning process as such can be carried out in several ways.
- a first initialization method consists in assigning an initial weight vector W to each node of the self-organized map C.
- This initial assignment of the weight vectors can be, for example, a random assignment of a number to each scalar component of the weight vectors, without any input stimulation.
- random refers to equal probability for any of a set of possible outcomes.
- the numerical value of these randomly assigned scalar values can be approximately limited to the lower and upper bound by the corresponding extrema observed in the descriptor vectors V.
- Another method of initializing the weight vectors W includes a systematic variation, for example a linear variation, in the range of each dimension of each weight vector, to approximately cover the corresponding range observed in the descriptor vectors V.
- in another variant, the weight vectors W are initialized by the values of the vectors ordered along a two-dimensional subspace spanned by the two principal eigenvectors of the descriptor vectors V, obtained by orthogonalization methods well known in the art, for example by the so-called Gram-Schmidt orthogonalization.
- in yet another variant, the initial values are fixed on samples chosen at random from the descriptor vectors V.
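- the initialization variants described above can be sketched as follows (random initialization bounded by the extrema of V, and linear initialization along the two principal eigenvectors, here obtained via an SVD rather than an explicit Gram-Schmidt step):

```python
import numpy as np

def init_random_bounded(V, P, seed=0):
    """Random initialization, bounded dimension-wise by the extrema of V."""
    rng = np.random.default_rng(seed)
    lo, hi = V.min(axis=0), V.max(axis=0)
    return lo + rng.random((P, V.shape[1])) * (hi - lo)

def init_linear_pca(V, rows, cols):
    """Linear initialization along the two principal eigenvectors of V."""
    Vc = V - V.mean(axis=0)
    # SVD yields the principal directions, playing the role of the
    # orthogonalization step mentioned in the text
    _, s, Vt = np.linalg.svd(Vc, full_matrices=False)
    e1, e2 = Vt[0] * s[0], Vt[1] * s[1]
    xs = np.linspace(-1, 1, cols)
    ys = np.linspace(-1, 1, rows)
    return np.array([V.mean(axis=0) + x * e1 + y * e2 for y in ys for x in xs])
```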
- the determination of the winning neuron of the self-organized map C for each descriptor vector V can be done according to several criteria well known to those skilled in the art. This can for example be done on the basis of a distance, for example the minimum Euclidean distance between all the weight vectors W of the self-organized map C and the vector V.
- Other methods can be used for determining the winning neuron, such as those using the correlation between vectors, which has the advantage of offering more robustness to the shift between vectors, or the angular difference between vectors, which offers the advantage of emphasizing the mutual length of the vectors as long as the information is carried by these quantities; the Minkowski distance measure, which is a generalization of the Euclidean distance measure and which is advantageous when the vectors carry data of a qualitative nature, can also be implemented.
- the distance between two vectors is a Euclidean distance.
- the Euclidean distance between vectors is a measure which can be determined very quickly whatever the dimension of the self-organized map C, which allows a rapid implementation of the present method and therefore a rapid search for the document(s) whose content most closely resembles the search term.
- the determination of a Euclidean distance between two vectors requires only few computational resources. It can therefore be done on ordinary desktop computers.
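- the alternative criteria for determining the winning neuron mentioned above (Euclidean, Minkowski as its generalization, and an angular/correlation-based measure) can be written, for example, as:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def minkowski(a, b, q=3.0):
    """Generalization of the Euclidean distance (q = 2 recovers it)."""
    return np.sum(np.abs(a - b) ** q) ** (1.0 / q)

def angular(a, b):
    """Angular (cosine-based) difference, insensitive to vector length."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def winning_neuron(W, v, dist=euclidean):
    """Index of the neuron whose weight vector minimizes the chosen distance."""
    return min(range(len(W)), key=lambda p: dist(W[p], v))
```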
- the textual content of the documents of the first database is normalized before the generation of the V descriptor vectors.
- the normalization method is commonly used by anyone familiar with the state of the art of preprocessing textual documents.
- the operations typically carried out during normalization are, in a non-exhaustive manner, the aggregation of words, the transformation of upper case letters into lower case letters for common nouns, the removal of punctuation characters, and the transformation of linking pronouns. Normalization allows redundant or unnecessary information to be removed from the text of documents.
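- a minimal normalization sketch, covering only lowercasing, accent stripping and punctuation removal (real pipelines also aggregate words and handle linking pronouns), could look like this:

```python
import re
import unicodedata

def normalize(text):
    """Toy normalization: strip accents, lowercase, remove punctuation.
    Deliberately simpler than the full preprocessing described above."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()                       # upper case -> lower case
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation characters
    return re.sub(r"\s+", " ", text).strip()

print(normalize("L'espion, sur le pont!"))    # -> "l espion sur le pont"
```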
- the vectorization function F is a so-called “term frequency-inverse document frequency” function.
- the so-called “TF-IDF” (term frequency-inverse document frequency) method is a weighting method often used in information retrieval and in particular in text mining. This statistical measure has the advantage of allowing an evaluation of the importance of a term contained in a document, relative to a collection or a set of documents. The importance, also called weight, increases with the number of occurrences of the word in the document and also varies with the frequency of the word in the set of documents.
- the self-organized map C is two- or three-dimensional. A two- or three-dimensional map makes it possible to reduce the complexity of the calculations to be carried out during the search without losing too much information. They also make it possible to produce a search result which groups together several documents in a simple way while maintaining their semantic proximity.
- the self-organized map C is graphically displayable through a graphical interface, which is for example the graphical interface of a personal computer. Thanks to this, the contents of the map C are accessible through a graphical interface and can be explored directly by a user.
- the contents of the first database can be displayed on the self-organized map. This makes it possible to see the proximity of the various documents contained in the first database. So if, for example, a document has been identified as being particularly relevant, it is possible to find documents whose content is similar to this one very easily because they are positioned close together on the self-organized map C.
- a graphical representation of a document map CD of size equal to the self-organized map C is superimposable on the graphical representation of the self-organized map C.
- the documents determined in step g. are identified graphically on the document map CD.
- This second layer, which can be superimposed on the graphic representation of the self-organized map, comprises a graphic representation of the documents identified as being the closest to the search term R.
- the content of a document determined in step g. is accessible by selecting this document on the graphic representation of the document map CD via a computer pointing device, such as for example a computer mouse.
- a graphic representation of a distance map CH of dimension equal to the self-organized map C can be superimposed on the graphic representation of the self-organized map C, and a distance value dd is attributed to each neuron of the distance map CH, the distance value dd corresponding to the sum of the Euclidean distances between the weight vector W of the neuron considered and the weight vectors W of its direct neighboring neurons.
- the learning algorithm of the self-organized map C has the effect of grouping documents that are close in the sense of a distance measure onto the same neurons.
- Neurons are encoded in a matrix as vectors of real data. These neurons are ordered by the algorithm in such a way that documents that are close in data space are as close as possible on the map C.
- the proximity between the neurons does not, however, give any information on the real distance which separates the descriptor vectors V in the original space.
- documents allocated to nearby neurons on the map C can in reality correspond to very distant data in the original space and therefore in reality be very different.
- This limitation can be partly reduced by the use of the distance map CH which can be superimposed on the self-organized map C in the graphics interface.
- This distance map CH is a map of dimension equal to the map C and which gives a measure of the real distance between the weight vectors W of the latter.
- This measurement can advantageously be displayed on a graphic representation of the distance map CH by a suitable “lookup table”, for example a suitable coloring or a suitable gray level.
- a color code is assigned to each neuron of the distance map CH based on the dd value. This makes it possible to determine visually, very quickly, the real level of resemblance of two documents positioned in close proximity on the self-organized map C.
- a graphical representation of a word map CW of dimension equal to the self-organized map C is superimposable on the graphical representation of the self-organized map C, in which, at each neuron of the graphic representation of the word map CW, are displayed the words of the vocabulary whose component of the weight vector W of the corresponding neuron of the self-organized map C is greater than a predetermined value.
- a road map represents cities in their context, including monuments, roads, forests and in general anything that gives a city context.
- in the same way, the representation of the documents on a map can be contextualized by placing on the word map CW the most significant words near the neurons of the map C corresponding to the documents that contain these words.
- a graphic representation of the word map CW is advantageously superimposed on the graphic representation of the self-organized map C in order to give additional information to the user.
- the words displayed in the word map CW are organized on different planes corresponding to different ranges of weight values, the words having the highest weight values being displayed in the foreground of the graphical representation of the word map CW.
- the number of words to be displayed on the graphic representation of the word map CW can be very high and would lead to an unreadable display if they were all placed on the same plane.
- the display of words on the map is advantageously lightened by offering a zoom system comparable to that available for road maps. To do this, the words are distributed on different display planes according to their relevance.
- the documents from the first database are indexed in a third database, and the documents comprising the search term R are determined. This allows, in addition to the analogical search, a search based on an indexing of the documents of the first database. It is thus possible to search for and identify documents that explicitly contain the search term R.
- documents comprising the search term R are identified on the graphic representation of the document map CD. This makes it possible to quickly obtain information on the resemblance of two documents which were identified by the textual search. This is because two nearby documents on the document map CD have similar content.
- the distance map CH can also be superimposed on the document map CD in this case. This makes it possible to provide the user with an indication of the real resemblance of two documents identified during the textual search.
- the documents of the first database are documents encoded in digital form advantageously originating from word processing, from text recognition systems, for example by means of the so-called Optical Character Recognition method, or from any system capable of producing structured digital files.
- This has the advantage that the textual content of the documents can be easily processed to create the vocabulary as well as the descriptor vectors.
- FIG. 1 shows a functional diagram of a method according to an embodiment of the present invention;
- FIG. 2 shows an example of a self-organized map C;
- FIG. 3 shows a functional diagram of the adaptation of the self-organized map;
- FIG. 4 illustrates the system for generating document, word and distance cards
- FIG. 5 shows a graphical representation of the document map CD;
- FIG. 6 represents a functional diagram of the search for the documents closest to the search term;
- FIG. 7 shows a graphic representation of the distance map;
- FIG. 8 illustrates the relationship between the map, neurons, word weights and vocabulary
- FIG. 9a illustrates the first step in the calculation of the continuous coordinates for a word of the vocabulary
- FIG. 9b illustrates the second step in the calculation of the continuous coordinates for a word of the vocabulary
- FIG. 10 illustrates the positioning of a specific word on the word map after calculating the continuous coordinates
- FIG. 11 illustrates the distribution of the words on different planes of the word map
- FIG. 12 shows an example of word positioning on a map of 10,000 neurons;
- FIG. 13 shows the graphical interface;
- FIG. 14 shows the superposition of the word, distance and document maps on the self-organized map;
- FIG. 15 illustrates access to a summary of a document by selecting it on the document card
- FIG. 16 shows the graphical interface with the summary of the documents selected on the document card
- FIG. 17 illustrates the system for indexing and textual searching for documents
- FIG. 18 shows an example of a result of an analog search in the graphical interface
- the invention presented here consists in making it possible to identify one or more documents among a set of documents using an analog search.
- the invention employs a system for analyzing documents deposited in a centralized storage location.
- Document analysis automatically produces a vocabulary used to encode the documents in the form of characteristic descriptor vectors. These descriptor vectors are mapped in the form of a self-organized map, preferably two-dimensional, which can finally be used to perform searches in the document base and identify one or more documents that most closely match a search term.
- the invention is based on the possibility of mapping the documents on a self-organized map, for example a two-dimensional self-organized map, in such a way that two documents whose descriptor vectors are close in the space of the descriptor vectors are placed at nearby locations on the map.
- This property has the advantage of grouping documents from their content alone and without supervision.
- a user can advantageously use the self-organized map via an appropriate graphical interface in order to search for and find documents by simply exploring the map.
- the cartography thus relies on a non-linear projection of points from the space constituted by the descriptor vectors of documents towards a two-dimensional map.
- a document reading system 110 takes as input textual documents, from a set of documents E, encoded in digital form, which can be obtained from word processing, from text recognition systems, for example by means of the method known by the abbreviation OCR (Optical Character Recognition), and in general any system capable of producing structured digital files.
- the textual content of the documents of the set of documents E is recorded in a first database, here called the “raw documents” database 120 which is used as the source of the document processing system 130, the object of which is, in a second step, to process the contents extracted from the documents to group together similar words and to remove the punctuation characters.
- the documents thus processed are stored in the “standardized documents” database 140 which is used, in a third step, by the vocabulary generation system 150 to produce a list of words called “vocabulary”, established by applying restrictions to the list of words produced by the document processing system 130.
- the vocabulary thus obtained is stored in a second database, here called the “vocabulary” database 160, which will be used, in a fourth step, by the document descriptor vector generation system 170 which transforms each document of the database 140 into a document descriptor vector; these vectors are finally stored in the "descriptor vectors" database 175.
- the descriptor vectors stored in the database 175 are used, in a fifth step, by the document map generation and processing system 180 which produces a self-organized map C which allows the analogical search of one or more documents of the set of documents E on the basis of a search term R advantageously defined by a user via a graphical interface 190, which is advantageously a graphical interface of a personal computer, such as for example a computer screen.
- This graphical interface 190 allows the graphical display of the documents identified during the search on the self-organized map C and on one or more additional maps (see below for more details).
- a document indexing system 125 processes the documents stored in the database 120 and indexes their content, which is recorded in a third database referred to herein as the "indexing" database.
- This indexing can be advantageously used to allow, in addition to the analog search mode, a textual search mode which will be available in the graphical interface 190.
- the graphical interface 190 is advantageously used to display the self-organized map C and one or more additional maps, enlarge them, move them, show or hide words, show or hide documents, and to search for documents by two search modes, analogical and textual.
- the document reading system 110 represents any device capable of reading textual documents and recording them in a database 120.
- Each row of the database 120 contains, for each document, an ID number and the content of the document in the form of plain text.
- Table 1 shows an example of the typical content of a row of the "raw documents" database 120, which is obtained from the database of 44,512 summaries of the "Internet Movie Database" (IMDb).
- Table 1 a row of the “raw documents” database 120
- the document processing system 130 takes as input the information contained in the “raw documents” database 120 to perform a series of analyses and transformations intended to standardize the content of the documents in a form which will allow its subsequent use.
- This process, called "normalization", is commonly used by anyone familiar with the state of the art in the preprocessing of textual documents.
- the operations typically carried out during normalization are, in a non-exhaustive manner, the aggregation of words, the transformation of uppercase letters into lowercase letters for common nouns, the deletion of punctuation characters, the transformation of linking pronouns, and in general any processing aimed at removing redundant or unnecessary information in the text of documents.
- the normalization will perform, for example, the following transformations:
- Table 2 a row of the “standardized documents” database 140
- the documents are thus standardized and can be, in a following step, used to build the “vocabulary”.
- the vocabulary is a set of words selected from among all the words contained in the documents of the “standardized documents” database 140. Its purpose is to concisely represent all of the textual information of the documents in canonical form. In this sense, the vocabulary constitutes the axes of a multidimensional space whose dimension is equal to the number of words in the vocabulary.
- the vocabulary generation step is known to those skilled in the art as “lemmatization” because it involves generating "lemmas” or vocabulary words.
- the generation of the vocabulary will consist, among other things, in counting all the words which appear in all the documents of the “standardized documents” database 140.
- a usual way of constructing the vocabulary consists in starting from an existing vocabulary and counting the number of words that appear in the standardized documents.
- Another method is to dynamically build vocabulary from standard documents. Indeed, the words are by construction eligible to become vocabulary words thanks to the preprocessing carried out during the processing of documents 130.
- two counts will be carried out: (1) the number of appearances of each word in the set of standardized documents, as well as (2) the number of standardized documents in which this word appears.
- two parameters will be chosen arbitrarily when constructing the vocabulary:
- VOCABULARY_CHOICE_DOCCOUNT_MAX_DOCS the maximum number of documents in which the words appear.
- the "cabulary" database 160 will contain as many lines as there are words of the vocabulary, each of them being represented by 3 values:
- Table 3 extract of six lines from the "vocabulary" database 160
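- a possible implementation of this dynamic vocabulary construction, with the two counts described above, is sketched below; min_total is a hypothetical lower bound, and max_docs_fraction stands in for VOCABULARY_CHOICE_DOCCOUNT_MAX_DOCS expressed as a fraction of the corpus:

```python
from collections import Counter

def build_vocabulary(normalized_docs, min_total=2, max_docs_fraction=0.5):
    """Dynamic vocabulary construction with the two counts described above:
    (1) total appearances of each word, (2) number of documents containing it.
    Both threshold parameters are illustrative assumptions."""
    total = Counter()
    doc_count = Counter()
    for doc in normalized_docs:
        words = doc.split()
        total.update(words)           # count (1): total appearances
        doc_count.update(set(words))  # count (2): documents containing the word
    limit = max_docs_fraction * len(normalized_docs)
    return sorted(w for w in total
                  if total[w] >= min_total and doc_count[w] <= limit)
```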
- the document descriptor vectors V are vectors whose components are digital data which are calculated from the textual data of the documents stored in the “standardized documents” database 140 as well as from the “vocabulary” database 160, which was built upon completion of systems 110, 130 and 150.
- This statistical measure makes it possible to evaluate the importance of a term contained in a document, relative to a collection or a set of documents.
- the weight increases in proportion to the number of occurrences of the word in the document and also varies with the frequency of the word in the document set. It is this method that is used in the preferred embodiment of the invention shown here. This method may however be replaced by any other standard or original method which could provide information more suited to the field of application relating to the documents processed, without departing from the scope of the present invention.
- the “raw” TF frequency of a term simply corresponds to the number of occurrences of this term in the document considered.
- the term “frequency” is a misnomer.
- the term “frequency” will however be used here, because it is regularly used in the technical field of the present invention. It is possible to choose this raw frequency to express the frequency of a term. In this case, the calculation of the raw frequency is expressed by: $TF_{i,d} = f_{i,d}$
- where $f_{i,d}$ represents the raw frequency, i is the word considered and d is the document considered.
- the inverse document frequency IDF is a measure of the importance of the term in the set of documents. In the TF-IDF scheme, it aims to give more weight to the less frequent terms, considered to be more discriminating. In general, determining the inverse frequency IDF involves calculating the inverse of the proportion of documents in the set that contain the term: $IDF_i = \log\left(\frac{|D|}{|\{d : i \in d\}|}\right)$, where $|D|$ is the total number of documents and $|\{d : i \in d\}|$ is the number of documents containing the term i.
- each document will be coded by a descriptor vector V whose number of components corresponds to the number of words in the vocabulary.
- the components of the descriptor vector V of each document result from the calculation of TF-IDF described above.
- Each row of the “descriptor vectors” database 175 will contain the TF-IDF values associated with each document stored in the “raw documents” database 120 and according to the vocabulary stored in the “vocabulary” database 160.
- a last operation consists, advantageously, in a normalization of the matrix of descriptor vectors V contained in the “descriptor vectors” database 175 by applying a so-called “L2” normalization, also called a Euclidean norm.
- the values are normalized so that if they were all squared and added, the total would be 1.
- each document in the “standardized documents” database 140 will be encoded in the “descriptor vectors” database 175 as a descriptor vector V of real values of dimension 4,096, each value resulting from the computation of the TF-IDF of each word of the vocabulary for each document.
- Each row of the “descriptor vectors” database 175 thus represents a descriptor vector V of each document.
- the row in Table 2 for the document with ID 19 will be encoded by the descriptor vector V shown in Table 4. Only the seven non-empty columns are represented, for a vocabulary containing 4,096 words, obtained from the processing of the 44,512 summaries of films extracted from the “Internet Movie Database”.
- Table 4 values of the descriptor vector V for document 19 of Table 2
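- the computation of the TF-IDF descriptor vectors together with the “L2” normalization can be sketched as follows; the IDF variant (logarithm of the ratio of document counts) is one common choice among several:

```python
import numpy as np

def tfidf_matrix(normalized_docs, vocabulary):
    """TF-IDF descriptor vectors with L2 (Euclidean) normalization.
    TF is the raw count f_{i,d}; IDF_i = log(|D| / n_i), with n_i the
    number of documents containing word i."""
    n_docs = len(normalized_docs)
    index = {w: j for j, w in enumerate(vocabulary)}
    tf = np.zeros((n_docs, len(vocabulary)))
    for d, doc in enumerate(normalized_docs):
        for w in doc.split():
            if w in index:
                tf[d, index[w]] += 1.0            # raw frequency f_{i,d}
    n_i = np.maximum((tf > 0).sum(axis=0), 1)     # documents containing each word
    idf = np.log(n_docs / n_i)
    M = tf * idf
    norms = np.linalg.norm(M, axis=1, keepdims=True)   # L2 norm per document
    return M / np.maximum(norms, 1e-12)           # squared row values now sum to 1
```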
- the document map generation and processing system 180 is intended to produce a self-organized map C which groups together all the documents contained in the “standardized documents” database 140 in the form of a map which places the documents whose content is similar at nearby locations on this map. To do this, the data stored in the “descriptor vectors” database 175 are used to feed an automatic classification system.
- the map generation system 180 advantageously uses the so-called “Self-Organizing Maps” (SOM) algorithm, which produces a self-organized map C as illustrated in Figure 2.
- the self-organized map C is composed of a low-dimensional grid of P neurons p.
- in a one-dimensional arrangement, each neuron p has two neighbors.
- the arrangement of the neurons p is done in a rectangular way where each neuron has four neighbors (rectangular topology) or in a hexagonal way where each neuron has six neighbors (hexagonal topology).
- the neurons p are identified by their number and their location on the grid.
- the document descriptor vectors V (v(1), v(2), ..., v(p)) are projected from their initial space, or input space, to the self-organized map C, or output space.
- Each neuron p of the map C is associated with a weight vector W, also called the prototype or referent vector, belonging to the input space.
- let P be the total number of neurons p of the map C.
- the weight vector W of the neuron p, of dimension N, is denoted by: $W_p = (w_{p,1}, w_{p,2}, \dots, w_{p,N})$
- the objective of learning the map is to update the weight vectors W so as to best approximate the distribution of the input vectors, that is to say the descriptor vectors V, while reproducing the self-organization of the neurons p of the map C.
- the training of the map can advantageously be done in sequential mode, also called incremental, or in deferred mode (batch).
- the general process of learning is depicted in Figure 3.
- All the weight vectors W are initialized to random values at step 810.
- Each iteration t of the sequential learning comprises two steps.
- the first step consists in choosing at random a descriptor vector V(t) from the set of descriptor vectors contained in the “descriptor vectors” database 175 (step 820), and in presenting it to the network of neurons p with the aim of determining its winning neuron (step 830).
- the winning neuron, called Best Matching Unit or BMU, of a descriptor vector V(t) is the neuron p whose weight vector W(t) is closest to it within the meaning of a given distance, for example the Euclidean distance. If c is the winning neuron, i.e. the BMU of the descriptor vector V(t), c is determined as follows: $c = \arg\min_{p} \lVert V(t) - W_p(t) \rVert$
- the winning neuron is activated. Its weight vector W(t) is updated to approximate the descriptor vector V(t) presented to the network. This update does not only concern the winning neuron BMU, as in the so-called “winner takes all” competitive learning methods, but also the neighboring neurons, which then see their weight vectors W(t) also fit to the descriptor vector V(t).
- the amplitude of this adjustment 840 is determined by the value of a learning step a (t) and the value of a neighborhood function h (t).
- the parameter α(t) regulates the speed of the learning and is initialized with a large value at the beginning, then decreases with the number of iterations in order to slow down as the learning process progresses.
- the parameter α(t) takes its values between 0 and 1.
- the function h(t) defines the neighborhood membership. It depends both on the location of the neurons on the map and on a certain neighborhood radius.
- the function h(t) takes its values between N/2 and 0, where N represents the number of neurons on the largest side of the map.
- the neighborhood radius is advantageously large enough to update a large number of neurons neighboring the BMU neuron, but this radius gradually narrows to contain only the BMU neuron and its immediate neighbors, or even the BMU neuron only.
- the rule for updating the weight vectors W is as follows: $W_p(t+1) = W_p(t) + \alpha(t)\, h_{c,p}(t)\, [V(t) - W_p(t)]$, where c is the BMU neuron of the input vector V(t) presented to the network at iteration t and h is the neighborhood function which defines the proximity between the neurons c and p.
- the neighborhood function is advantageously the Gaussian function defined below: $h_{c,p}(t) = \exp\left(-\frac{\lVert r_c - r_p \rVert^2}{2\sigma(t)^2}\right)$, where $r_c$ and $r_p$ are respectively the location of the neuron c and of the neuron p on the map, and σ(t) is the radius of the neighborhood at iteration t of the learning process.
- the amplitude of the adjustment is graduated according to the distance from the BMU neuron which reserves the maximum amplitude to itself.
- the unsupervised learning presented above results in a nonlinear projection of the set of V descriptor vectors on the C map.
- Each V descriptor vector is allocated to its winner neuron BMU.
- this projection preserves the topology of the data through the use of the neighborhood function. Two neighboring p neurons on the map will represent nearby V descriptor vectors in the data space.
- each weight vector W is a weighted average of the descriptor vectors $(V_i,\ i \in \{1, \dots, n\})$ when the square of the Euclidean distance is used for the computation of the winning neuron, the corresponding weights being the values of the neighborhood function h(t).
- the rule for updating the weight vectors W is given by: $W_p(t+1) = \frac{\sum_{i=1}^{n} h_{c(V_i),p}(t)\, V_i}{\sum_{i=1}^{n} h_{c(V_i),p}(t)}$, where h is the value of the neighborhood function between the winning neuron $c(V_i)$ of the vector $V_i$ and the neuron p.
- the update of the weight vectors W can be formulated otherwise by using the fact that the descriptor vectors V which have the same winning neuron have the same value for the neighborhood function and belong to the Voronoi region whose center is their winning neuron: $W_p(t+1) = \frac{\sum_{i=1}^{P} n_i\, h_{i,p}(t)\, \bar{V}_i}{\sum_{i=1}^{P} n_i\, h_{i,p}(t)}$, where $n_i$ is the number of observations belonging to the Voronoi region represented by the neuron i and $\bar{V}_i$ is the average of the observations from this same region.
- each weight vector W constitutes the center of gravity of the descriptor vectors V that it represents and we then fall back on the moving-centers algorithm, which guarantees a better approximation of the density function of the observations.
- this algorithm does not present any problems of convergence.
- the self-organized map C can be a two-dimensional or a three-dimensional map.
- the initialization of the weight vectors W of the self-organized map C before the learning process as such can be carried out in several ways.
- a first initialization method consists in assigning an initial weight vector W to each node of the self-organized map C.
- This initial allocation of the weight vectors W can be, for example, a random allocation of a number to each component of the weight vectors, without any input stimulation.
- random refers to equal probability for any of a set of possible outcomes.
- the numerical value of these randomly assigned components can be approximately limited to the lower and upper bound by the corresponding extrema observed in the descriptor vectors, i.e. the vectors V.
- Another method of initializing the weight vectors W includes a systematic variation, for example a linear variation, in the range of each dimension of each weight vector W to approximate the corresponding range observed in the descriptor vectors V.
- in another variant, the weight vectors W are initialized by the values of the vectors ordered along a two-dimensional subspace spanned by the two principal eigenvectors of the descriptor vectors V, obtained by orthogonalization methods well known in the art, for example by the so-called “Gram-Schmidt” orthogonalization.
- the initial values of the components of the weight vectors W are fixed on samples chosen at random from the descriptor vectors V.
- the determination of the BMU neuron of the self-organized map C for each descriptor vector V can be done according to several criteria well known to those skilled in the art. This can, for example, be done on the basis of a distance, for example the minimum Euclidean distance between all the weight vectors W of the self-organized map C and the descriptor vector V.
- Other methods can be employed for the determination of the BMU neuron, such as those using the correlation between vectors, which has the advantage of offering more robustness to the offset between vectors, or the angular difference between vectors, which offers the advantage of emphasizing the mutual length of the vectors as long as the information is carried by these quantities; the Minkowski distance measure, which is a generalization of the Euclidean distance measure and which is advantageous when the vectors carry qualitative data, can also be implemented.
- the descriptor vectors V are stored in the “descriptor vectors” database 175 and the number of neurons p varies according to the number of documents in the set of documents E in order to ensure a distribution as uniform as possible of the documents on the self-organized map C.
- at the end of the learning, a weight matrix M of real numbers is delivered and stored in the “weight matrix” database 310 (see FIG. 4), whose number of rows is equal to the number of components of the document descriptor vectors V of the “normalized documents” database 140 and whose number of columns is equal to the number of neurons p of the self-organized map C.
- This weight matrix M can be advantageously used by the document map generation and processing processor 350 to produce the content of three databases, “heatmap” 320, “wordmap” 330 and “pointmap” 340, which can be used via the graphical interface 190 (see below).
- the purpose of the “pointmap” database 340 is to allow documents to be displayed on a graphical representation of the map C.
- the processing processor 350 will call on the “descriptor vectors” database 175 containing the descriptor vectors V of all the documents of the set of documents E. For each row of the database 175, a distance calculation will be carried out by comparing the descriptor vector V with all the weight vectors W of the neurons p of the map.
- the index, or the number, of the neuron c whose weight vector W has the smallest distance from the descriptor vector V of the document presented is associated with this descriptor vector V.
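- a sketch of this allocation step, producing (document ID, x, y) rows such as those stored in the “pointmap” database 340; the row-major grid layout is an assumption:

```python
import numpy as np

def build_pointmap(V, W, cols):
    """Associate each document with the neuron whose weight vector is at
    the smallest Euclidean distance, and return its grid coordinates.
    Assumes a row-major rectangular grid of width `cols`."""
    rows_out = []
    for doc_id, v in enumerate(V):
        c = int(np.argmin(np.linalg.norm(W - v, axis=1)))   # closest neuron
        rows_out.append((doc_id, c % cols, c // cols))      # (ID, x, y)
    return rows_out
```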
- the number of documents to be displayed on the graphic representation of the map C can be very high and would lead to an illegible display if they were all placed on the same plane.
- the display of documents on the map can be advantageously lightened by providing a zoom system comparable to that which is available for road maps.
- Each index (x, y) contained in the "pointmap" database is enriched by a z value corresponding to a display plane.
- the z value is calculated according to an evaluation function, the free choice of which is advantageously left to the user.
- This may be, for example and without limitation, a manual evaluation or the result of a calculation.
- This function is determined as a parameter of the processor for generating and processing the document map 350.
- the “pointmap” database 340 contains all the documents as well as the coordinates (x, y, z) of the map at which these documents must be displayed.
- the document information of the concrete example used here is available in the "pointmap" database 340 and is shown in Table 5.
- An example of document display is shown in Figure 5.
- Table 5 extract from the "pointmap" database 340
- the aim of the present invention is to make it possible to identify, using an analog search, one or more documents whose content is closest to a search term R.
- a search term R which can advantageously be entered by a user via the graphical interface 190, is transformed into a search vector K which can then be compared to the columns of the weight matrix of the "weight matrix” database 310.
- the transformation process is illustrated in Figure 6.
- Each word of the search term R is read sequentially at step 910 and is then extracted at step 920 to be compared with the list of words which make up the vocabulary 160. If the word read is a word from the vocabulary, the index at which this word is located in the "vocabulary" database 160 is recorded at step 940, and the process of reading and comparing continues until all of the words in the search field have been read; the reading of the words is then terminated at step 950. At the end of this process, a list of indices 960 which correspond to the words of the search term R which have been found in the "vocabulary" database 160 is obtained.
- the value 1 is stored at the location of these indices to form a search vector K whose number of components is equal to the number of identified words.
- the vector K is finally normalized by advantageously using an “L2” type normalization.
- a distance is then calculated between the values which are recorded at these indices and the values recorded in the "weight matrix" database 310 which are found at the same indices 960.
- a distance is calculated between the search vector K and all the weight vectors W of the self-organized map C.
- the neuron(s) pbest of the self-organized map C which respond the most, i.e. those for which the calculated distance is the smallest, are identified on the map.
- the identifiers of the documents which are attached to these neurons are extracted using the "pointmap" database 340 and bookmarks are advantageously displayed on a graphic representation of a document map CD which can be superimposed on the self-organized map C, as shown in Figure 18.
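- putting the search steps together, a hedged sketch of the analogical search (search vector K built from the vocabulary indices, “L2”-normalized, then compared with all the weight vectors W) could read:

```python
import numpy as np

def analog_search(R, vocabulary, W, doc_neurons, n_best=1):
    """Sketch of the analogical search. doc_neurons is a list of
    (document ID, neuron index) pairs, i.e. the allocation of step d."""
    index = {w: j for j, w in enumerate(vocabulary)}
    K = np.zeros(W.shape[1])
    for word in R.split():
        if word in index:
            K[index[word]] = 1.0              # value 1 at each found vocabulary index
    K /= max(np.linalg.norm(K), 1e-12)        # "L2" normalization of K
    dists = np.linalg.norm(W - K, axis=1)     # distance to every weight vector W
    pbest = set(np.argsort(dists)[:n_best])   # the neuron(s) that respond the most
    return [doc_id for doc_id, p in doc_neurons if p in pbest]
```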
- in a variant, the list of documents identified on the basis of an analogical search and the search term R can be made available in a way other than on a document map CD.
- a simple list of IDs of identified documents could represent the result of the analog search.
- these documents are always identified on the basis of the distance between the search vector K and the weight vectors W of the self-organized map.
- the weight vector W whose distance D from the search vector K is minimal is first identified. As explained above, each weight vector W corresponds to a neuron p of the self-organized map C.
- each document of the document set E has been allocated to a neuron on map C. This therefore makes it possible, by knowing which weight vector W is closest to the search term R, to determine which document is allocated to the corresponding neuron and therefore which document has the content closest to the search term R.
- the SOM algorithm has the effect of grouping documents that are close in the sense of a distance measure onto the same neurons. Neurons are encoded in a matrix as vectors of real data. These neurons are ordered by the algorithm such that documents that are close in data space are as close as possible on the map C. This is one of the most important properties of this algorithm.
- each point of the distance map CH, also called a “heatmap”, is associated with a value $dd_{i,j}$ calculated as follows: $dd_{i,j} = \sum_{k,l} d\left(W_{i,j},\ W_{i+k,j+l}\right)$, with $k, l \in \{-1, 0, +1\}$ and d the Euclidean distance measure.
- the “heatmap” database 320 contains the coordinates of each point (i, j) of the distance map CH as well as the value $dd_{i,j}$ calculated for this point.
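- the value dd for every neuron of the distance map CH can be computed, for example, as follows (border neurons simply sum over their existing neighbours):

```python
import numpy as np

def heatmap(W, rows, cols):
    """Distance map CH: dd[i, j] is the sum of Euclidean distances between
    the weight vector of neuron (i, j) and those of its direct neighbours."""
    Wg = W.reshape(rows, cols, -1)
    dd = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            for k in (-1, 0, 1):
                for l in (-1, 0, 1):
                    if (k, l) == (0, 0):
                        continue
                    ii, jj = i + k, j + l
                    if 0 <= ii < rows and 0 <= jj < cols:
                        dd[i, j] += np.linalg.norm(Wg[i, j] - Wg[ii, jj])
    return dd
```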
- the color scale for representing these values can extend from red to green, where red represents the highest value and green the smallest distance.
- An example of a distance map CH with a grid of 10,000 neurons is shown in Figure 7. In this figure, the color code is represented by gray levels.
- the representation of the documents on a map can be contextualized by placing on the word map CW the most significant words near the neurons of the map C corresponding to the documents that contain these words.
- the word map CW can advantageously be superimposed on the self-organized map C in the graphical interface 190 in order to give additional information to the user.
- positioning the words of a document on the word map CW at the location of the corresponding neuron of the map C is possible, but all the words are thus placed at the same place, namely the position of the neuron on the map C.
- the graphic representation of the word map CW then becomes useless when the number of words becomes high, because they are all superimposed at the same location.
- One of the original features provided by the present invention is to offer a new representation of all the documents processed in the form of a map. The most significant words will be placed continuously on the word map CW according to the method presented here. Let us recall first of all that the neurons are ordered according to a mono-, two- or three-dimensional relation according to the targeted application.
- the processing processor 350 will identify for each neuron the indices of the corresponding weight vector W whose component exceeds a predefined threshold. It will then extract from the “vocabulary” database 160 the words which are located at the same indices in order to relate them to the neuron under consideration.
- the principle is illustrated in Figure 8, where the point (2,3) of the word map CW will be attached to the words "James", "Spy" and "Bridge". Moreover, for each word the value of the corresponding component of the weight vector is saved.
- This attachment process is performed for all the points of the word map CW, as shown in Figure 9a.
- for each word, a list of the points of the word map CW to which it is attached is established.
- An example of such a list can be found in Fig 9b.
- the processing processor 350 will calculate a continuous index of the locations by multiplying the value of the coordinates of the neurons in the list by the value of the component of the weight vector W (third column of FIG. 9b).
- a list of real values which can be used to calculate the position of the word on the map is then available. For this purpose, a barycenter calculation will be used.
- the sum of the values of the component of the weight vector W for each chosen word is calculated: $S_m = \sum_{(k,l)} w_{(k,l)}^{(m)}$, with (k, l) designating the indices of the neurons for which $w_{(k,l)}^{(m)}$ exceeds a predetermined threshold and m designating the index of the weight vector component for the chosen word. The continuous coordinates of the word are then the barycenter of the coordinates of the selected neurons, weighted by these components.
- the value z is calculated according to an evaluation function, the free choice of which is left to the user. It can be, for example and without limitation, a manual evaluation or the result of a calculation.
- This function is determined as a parameter of the processor for generating and processing the document map 350.
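- the continuous placement of one vocabulary word by barycenter, including a z value, can be sketched as below; taking the maximum component as the evaluation function for z is only one illustrative choice, since the patent leaves this function to the user:

```python
import numpy as np

def place_word(W, grid, m, threshold):
    """Continuous coordinates of vocabulary word m on the word map CW:
    barycenter of the neurons whose m-th weight component exceeds the
    threshold, weighted by that component. grid is a (P, 2) array of
    neuron coordinates."""
    w_m = W[:, m]
    keep = w_m > threshold
    if not keep.any():
        return None
    s = w_m[keep].sum()                       # sum of the selected components
    xy = (grid[keep] * w_m[keep, None]).sum(axis=0) / s   # barycenter
    z = w_m[keep].max()                       # one possible evaluation function for z
    return xy[0], xy[1], z
```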
- the “wordmap” database 330 contains all the words of the vocabulary as well as the coordinates (x, y, z) of the word map CW at which these words should be displayed.
- An extract from the "wordmap” database 330 is shown in Table 6.
- Table 6 extract from the "wordmap" database 330 for the word "revenge"
- the word "revenge” is identified at 21 different locations on the CW word map which correspond to as many different contexts.
- this same word has a higher importance (z value) at position (62.75, 96.49). This will allow it to be placed on a higher display plane of the graphical interface 190 and it will therefore be highlighted more prominently there.
- the present invention advantageously provides a graphical interface 190 to allow a graphical representation of the self-organized map C, the word map CW, the distance map CH as well as the document map CD.
- An embodiment of this graphical interface 190 is shown in Figure 13.
- Display area 485 of the graphical interface 190 is intended to provide a graphical representation of these maps. The data required for this display is stored in the databases 310, 320, 330 and 340.
- Button 470 allows the user to zoom in with the + symbol or zoom out with the - symbol on the map. This has the effect of enlarging or reducing the area allocated to each neuron to allow a more detailed display of its content.
- the 480 button is used to display or hide the locations of the neurons in the form of a grid.
- Button 490 allows the user to return to the original map display.
- the button 460 allows the user to show or hide the word map CW.
- the button 450 allows the user to show or hide the document map CD, in which the documents are represented by pins.
- the button 440 allows the user to show or hide the distance map CH.
- Input field 400 allows the user to enter a search term R.
- the search is carried out according to an “analog” or “text” type (see below).
- Information area 430 gives the user information on the number of documents used to construct the map, the number of points displayed in the view chosen by the user and the size of the vocabulary.
- Drop-down list 420 allows the user to choose a document database from the set of available databases.
- Display area 495 allows the user to view an entire selected document on the map.
- when the user clicks on a document pin, a summary “popup” window 475 is displayed.
- when the user left-clicks on the “Show all” link displayed in the summary popup window, the entire document is displayed in display area 495.
- the graphical interface 190 is configured so that initially the self-organizing map C is displayed, with the distance map CH, the word map CW and the document map CD superimposed, as shown in Fig. 14.
- the user can use the various controls available to explore the content of the display, carry out searches and refine the results obtained.
- the user has, for example, two buttons 470 for zooming in or out on part of the maps. When he clicks on the zoom (in or out), the event is picked up by the interface and transmitted to the generation and processing processor 350, which determines the zone currently displayed and queries the “wordmap” database 330 and the “pointmap” database 340 for the words and documents located in the identified area.
- the processor selects the words and documents according to the z value associated with them and returns the list to the interface, which is in charge of the display.
- the user also advantageously has button 480 to display or hide the location of the neurons on the map. This display makes it possible to identify which documents are attached to each neuron.
- the displayed grid size depends on the zoom level of the map.
- the Show/hide words button 460 allows the user to display or hide the words on the word map CW, regardless of the current zoom level.
- Each document pointer, represented on the document map CD by a red bookmark, is an item clickable with the right mouse button.
- a display window 475 is then shown to the user, as illustrated in Figure 15. It contains a summary of the selected document as well as a “Show all” link to the full content of the document.
- the user can then click on the closing cross to close each display window. He can click multiple bookmarks to display multiple summary windows, but only the last full document display window can be displayed.
- the analog search mode presented above searches for the search term R entered in input field 400 using a calculation method that relies on the values recorded in the “weight matrix” database 310.
- the neurons which respond the most, that is to say those for which the distance calculated between a search vector K of the search term R and the weight vectors of the self-organizing map C is the smallest, are identified on the map.
- the IDs of the documents attached to these neurons are retrieved using the “pointmap” database 340 and bookmarks are displayed on the document map CD.
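- The analog search step can be sketched as follows; W (the weight vectors flattened to one row per neuron), K (the search vector of the term R) and pointmap (a neuron-to-document-IDs mapping standing in for database 340) are illustrative assumptions.

```python
import numpy as np

def analog_search(W, K, pointmap, n_best=5):
    """Identify the neurons that respond the most to the search vector K
    and collect the IDs of the documents attached to them."""
    distances = np.linalg.norm(W - K, axis=1)   # distance of K to each weight vector
    best = np.argsort(distances)[:n_best]       # neurons with the smallest distance
    doc_ids = [doc for idx in best for doc in pointmap.get(idx, [])]
    return best, doc_ids
```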
- when the user has access to several different data sources, he can choose the one to explore thanks to drop-down list 420, which displays all the available data sources. Selecting a data source resets the map display and takes into account the parameters specific to this data source, in particular the number of neurons, the volume and the total number of documents.
- the preferred embodiment of the present invention provides in addition to the analog search mode a text search mode.
- the latter employs a document indexing and text search system which creates an index database of the document contents and uses it to allow a textual search within those contents, advantageously through the graphical interface 190.
- the indexing and text search system can be based on any document indexing solution.
- the document indexing and text search system 125 builds, after the indexing phase, an index database 220 which can be used by the user interface of the document map 190 shown in Figure 13.
- the search field 400 is used to enter a search term to search for in all the indexed documents, provided that the user has chosen the textual search using the button 410 of the interface 190, positioned in "search type: text" mode.
- the text entered in this field is transmitted to the search system via the programming interface (API) 230, which passes it on to the indexing and search processor.
- the indexing and search processor returns, via the API, the list 240 of the identifiers of the documents in which all the searched words are present.
- This list of identifiers is transmitted to the document map generation and processing processor 180, which crosses it with the data from the “pointmap” database 340 in order to determine the position (x, y, z) of each document and place it on the document map CD.
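- The crossing of the identifier list with the “pointmap” data amounts to a simple lookup; the dictionary name positions is a stand-in for the (x, y, z) records of database 340:

```python
def place_documents(doc_ids, positions):
    """Return the (x, y, z) position of each found document so it can be
    placed on the document map CD; unknown IDs are simply skipped."""
    return {doc_id: positions[doc_id] for doc_id in doc_ids if doc_id in positions}
```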
- when the search engine 125 delivers an exact result for the searched words, the latter are advantageously highlighted in green; if the result is merely close, the words are highlighted in blue.
- the weight matrix M is also advantageously used to define and display the dominant themes of all the documents E.
- each row of the matrix M is interpreted as a point in a space of dimension r and it is assumed that these dimensions are themselves generated by a subspace of lower dimension.
- the idea is to consider the documents as random mixtures on underlying themes ("latent" themes) where each theme is characterized by a distribution on the words. This amounts to saying that each word is contained in a context which we will call “dominant theme”.
- the calculation of the dominant themes is based, according to the present invention, on an arbitrary choice of the number of themes q which will always be less than the number of neurons p of the map.
- the strong assumption on which the method is based is that the themes are independent in the statistical sense. Posed thus, the problem amounts to carrying out an independent component analysis (ICA) of the weight matrix M of the map.
- the matrix M is broken down into two matrices B and S, where S is a matrix such that each row is a group of words representing a theme to be identified.
- the matrices B and S are the result of the ICA, as shown in Figure 19.
- each row of S is a dominant theme, and each theme is composed of a list of words each assigned a weighting; one such row is shown in Table 7.
- Table 7: a row of the matrix S of the values generated by the ICA on the IMDB base.
- the row of this table highlights a theme whose most important word is “music” and whose associated words are “band, hop, hip, song, musician, musical, concert, dancer, dance, rock”.
- the same method of interpretation is to be used for the other rows of the table. Thanks to an adaptation of the graphical interface 190, it is possible to give the user an immediate visualization of the dominant themes which emerge from the analysis of his documents.
- the generation of the dominant themes supposes that the generation of the matrix M has been completed.
- the dominant theme matrix computation process performed by the dominant theme generation processor 700 is shown in Figure 20.
- the dominant themes processor 700 takes as input the weight matrix M.
- the first step 710 consists in reading the data from the “weight matrix” database 310.
- Step 720 consists in reading the value entered in the “number of dominant topics” field of the graphical interface 190 (see Figure 21). These sets of values are supplied as input to the calculation phase 730 of the “themes” matrix, carried out by an independent component analysis method.
- the matrix is then stored in the database 360.
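- As a hedged illustration of phase 730, the decomposition can be written with scikit-learn's FastICA (one of the algorithms cited below); M is the weight matrix read from database 310 and q the number of dominant themes entered in the interface:

```python
from sklearn.decomposition import FastICA

def compute_themes(M, q):
    """Decompose M (neurons x vocabulary) so that M ≈ B @ S, where each
    row of S weights the whole vocabulary for one dominant theme."""
    ica = FastICA(n_components=q, random_state=0)
    B = ica.fit_transform(M)   # (p, q): neuron loadings on the q themes
    S = ica.mixing_.T          # (q, r): one dominant theme per row
    return B, S
```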
- the independent component analysis can be computed by different algorithms, in particular the Hérault-Jutten algorithm (Hérault, Jutten, & Ans, Proceedings of the Xe colloque GRETSI, 2, pp. 1017-1022, 1985), JADE (Cardoso & Souloumiac, IEE Proceedings-F, 140(6), 362-370, 1993), Fast-ICA (Hyvärinen, Karhunen, & Oja, Independent Component Analysis, John Wiley and Sons, 2001) or Infomax (Linsker, IEEE Computer, 21, 105-117, 1988).
- the Hérault-Jutten algorithm, strongly inspired by a neuromimetic approach, reproduces the separation of sources observed in the nerve fibers which convey speed and position.
- This Robbins-Monro-type algorithm iteratively searches for common zeros of two nonlinear functions. Its main advantage is the simplicity of the iterative processing, but the implementation is not suited to large problems like the one addressed in this invention.
- the JADE algorithm (Joint Approximate Diagonalization of Eigen-matrices)
- the JADE algorithm is based on a definition of independence seen as the cancellation of all moments and cumulants at all orders. This amounts to canceling all the non-diagonal elements of a tensor of cumulants of order N which is an N-dimensional matrix containing all the crossed cumulants of order N.
- Infomax is based on the principle which stipulates that the implementation of a model of the cognitive capacities of mammals by means of a network of artificial neurons must be such that the rate of information transmitted from one layer of neurons to the next is maximal. Nadal and Parga showed that, under certain conditions, this principle is equivalent to the principle of redundancy reduction (Nadal & Parga, Network: Computation in Neural Systems, 5, 565-581, 1994), which states that the goal of the sensory systems of mammals is to efficiently encode stimuli (visual, sound, etc.). The main drawback of Infomax is that the runtime is difficult to predict.
- Fast-ICA is based on the estimation of independent components by means of a “non-gaussianity” measure. Its main drawback is its sensitivity to initial conditions, offset by its high speed of execution. In the context of the present invention, the Fast-ICA method is preferred, without this choice being limiting. It is important to note that any ad-hoc algorithm which implements the principle of independent component analysis efficiently can be used.
- the detailed description of the Fast-ICA algorithm is given by Hyvärinen, Karhunen, & Oja (2001).
- A row taken from this matrix is shown in Table 7. This row highlights the 20 words that have the highest values generated by the ICA. The list of these 20 words is therefore called the dominant theme, and the word with the greatest value is used as the title of the dominant theme. Table 7 thus gives the dominant theme title: music.
- the 20 words of the theme are: “music, band, hop, hip, song, musician, musical, concert, dancer, dance, rock, scene, teenage, actor, video, record, night, culture, industry, michael, pop”.
- Table 8 shows an extract of the matrix containing only the words sorted in descending order of the values calculated by the ICA. This matrix is finally stored in the database 360 shown in Figure 20. Table 8: list of dominant themes
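- Extracting the sorted word lists of Table 8 from S is then a matter of ranking each row; vocabulary is again the assumed word list, and S the NumPy array produced by the ICA sketch above:

```python
def theme_word_lists(S, vocabulary, n_words=20):
    """For each row of S, keep the n_words words with the highest values;
    the first word serves as the title of the dominant theme."""
    themes = []
    for row in S:
        top = row.argsort()[::-1][:n_words]   # indices of the highest values
        words = [vocabulary[i] for i in top]
        themes.append({"title": words[0], "words": words})
    return themes
```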
- the present invention describes a method for displaying the dominant themes as regions of the map, as shown in Figure 21.
- each dominant theme covers a region of the map and the title of the associated theme is displayed in the center of the region concerned.
- Field 412 groups together data relating to the use of dominant themes.
- Field 411 allows the user to choose the number of dominant themes to display.
- the present invention also relates to the method used for the construction of the regions on the map which correspond to the dominant themes. This method is presented in Figure 22.
- a vector over the whole vocabulary is constructed in step 510 for each dominant theme, with values 0 and 1 which depend on the absence or presence of the theme's descriptor words at the corresponding indices.
- the vector corresponding to the dominant theme “war” will be coded as shown in Table 9. Only 7 non-empty columns are represented, for a vocabulary containing 4,096 words obtained from the processing of the 44,512 film abstracts extracted from the Internet Movie Database.
- Table 9: values of the descriptor calculated for the chosen example and the corresponding words in the vocabulary. A calculation is launched at step 520 to evaluate the Euclidean distance between this vector and all the weight vectors of the neurons of the map contained in the database 310. This produces a table containing these distances with the corresponding neuron index, as shown in Table 10.
- the distance vector is then normalized in step 530 with the L2 norm to produce values that will be comparable from one dominant theme to another.
- Each distance is then compared to the threshold set in step 540 to identify the neurons that respond the most to the words of the dominant theme chosen.
- the list of indices of these neurons is retrieved in step 550 and these neurons are surrounded on the map with a closed polygon in step 560.
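- Steps 510 to 550 can be sketched as follows (the polygon drawing of step 560 belongs to the display layer); theme_words, vocabulary, the flattened weight matrix W and the threshold are assumptions mirroring the description above:

```python
import numpy as np

def responding_neurons(theme_words, vocabulary, W, threshold):
    """Identify the neurons that respond the most to a dominant theme."""
    v = np.isin(vocabulary, theme_words).astype(float)  # step 510: 0/1 descriptor vector
    d = np.linalg.norm(W - v, axis=1)                   # step 520: Euclidean distances
    d = d / np.linalg.norm(d)                           # step 530: L2 normalization
    return np.flatnonzero(d < threshold)                # steps 540-550: indices kept
```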
- a dominant theme can be visualized on the map as a region bounded by a polygon.
- the title of the dominant theme is advantageously placed in the center of this polygon.
- Each button 413 of interface 190 corresponding to each theme can be clicked to show or hide the dominant theme as a polygon on map 485 and the title 414 of the corresponding dominant theme.
- field 415, containing the words associated with the dominant theme, is displayed. The number of words displayed is chosen by a parameter; it is at most equal to the number of descriptor words and is set to 20 by default.
- Each theme title 414 displayed on the map can be changed by the user to correspond to a more appropriate denomination.
- the user must double click on the title of the dominant theme or long click for a tablet version in order to activate the editing function of the title of the dominant theme.
- the user can then edit the title of the theme.
- the change has no effect on other words related to the dominant theme.
- Changing the theme title also changes the title of button 413 of the corresponding theme.
- the theme title change is saved in the user's context; it is not propagated to the other users of the solution.
- the present method is implemented by using a computer program to perform operations on aspects of the present invention, which can be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Python, C++ or the like, and conventional procedural programming languages such as the “C” programming language or similar programming languages.
- Program code can be run entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
- the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection can be established to an external computer (for example, through the Internet using an Internet service provider).
- the computer running the program consists of at least a standard processor (CPU) with at least 30 gigabytes of RAM and a hard disk with a minimum capacity of 1 terabyte. It can also comprise a processor able to execute several threads simultaneously (multi-threaded). Finally, hardware acceleration cards can be added, such as GPUs (Graphics Processing Units), TPUs (Tensor Processing Units) and, in general, any hardware acceleration device available on the market, such as the RTX 2060, RTX 2070 or GTX 1070.