WO1999005614A1 - Information mining tool - Google Patents

Information mining tool

Publication number
WO1999005614A1
Authority
WO
WIPO (PCT)
Prior art keywords
topics
topic
information
documents
mining
Prior art date
Application number
PCT/IB1998/001123
Other languages
English (en)
Inventor
Louis Gay
Olivier Massiot
Original Assignee
Datops S.A.
Priority date
1997-07-23
Filing date
1998-07-23
Publication date
1999-02-04
Application filed by Datops S.A.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing

Definitions

  • The present invention relates to an information mining technology which enhances the intelligence with which information can be analysed, so that it can be better delivered to users.
  • The content of a set of collected documents is usually presented, in particular on the web, by listing the titles and possibly the summaries or opening sentences of these documents.
  • Such a listing does not allow a user to clearly grasp the informational content of the set of collected documents.
  • The invention proposes a global system for:
  • The Pull of Information: using user-programmable agents, which search through public and private information sources and retrieve relevant documents at a user-determined interval.
  • The Mining of Information: using complete technologies for the processing of language as text, as well as sophisticated signal and trend analysis, the invention analyses the retrieved documents, clusters them based on content, and matches them to each user's unique information profile. The information is prioritised for users based on relevancy of content, association with topics of interest, urgency and changeability.
  • The Push of Information: once information is analysed and processed to match unique user needs, it is delivered to the user in a variety of ways, including HTML page, mail, pager and individual user reports.
  • The invention collects and analyses unstructured, qualitative information rather than structured data. As a result, it can be used to analyse and profile information ranging from
  • The invention proposed can effectively "mine" information in its most natural form: a document.
  • The analysis is not only qualitative, based on processing of language, but also quantitative, particularly for the determination of trends in information evolution.
  • The invention enables users to customise their information topics, based on their true areas of interest.
  • The system actively monitors the users' work with the information, noting which information is "consumed" or not. Using this information, the system constantly tunes and updates the user's profile to provide more and more relevant information.
  • Because the users' profiles are kept permanently accurate, they can enter as parameters into each phase of information processing: to filter information (in the Pull), to process information and determine indicators according to users' needs (in the Mining), and to deliver information according to its relevancy to the user's topics of interest (in the Push).
  • The analysis mechanisms used by the invention make it possible to provide information displays which graphically depict to a user the information's relevancy, its proximity to other relevant topics, the intensity of the information and the trends surrounding the information, and which enable users to dynamically change their views of the information. "Summaries" of the information content and direct access to original documents are available for each topic graphically displayed.
  • The mining tool proposed by the invention realises a Corporate Intelligent Channel.
  • The invention proposes an information mining tool comprising mining means for processing documents stored in a database in order to extract the topics to which these documents relate, and means for determining parameters which relate to the evolution over time of said topics.
  • The mining tool comprises means to survey the time-related evolution of a topic.
  • This makes it possible to detect discontinuities of evolution, even for topics corresponding to small signals.
  • The invention also proposes an information mining tool comprising mining means for processing documents stored in a database in order to extract the topics to which these documents relate, and means to determine parameters characterising the relationship between topics, such as the average topological distance between the words corresponding to two topics or the time-related cross-similarity of two topics.
  • The mining means comprise means to detect correlation between topics according to their time-related evolution.
  • The invention proposes an information mining tool which comprises push means which deliver to the user information relating to the topics, said push means comprising means to display on the screen of the user a map of the topics, said topics being presented in said map in the form of nodes joined by links, the length of such a link between two topics corresponding to the value of a parameter characterising the relationship between said topics.
  • The push means comprise means to colour said nodes and links by using a colour code characterising the evolution over time of the topics and of their relationship parameters.
  • Such a presentation offers fast reading capabilities to the user.
  • The invention also proposes an information mining tool comprising push means which deliver a set of topics and of corresponding documents in view of a particular query of the user and/or in view of a profiling file in which a list of topics of interest for the user is stored.
  • The analysis of the document base is recursive and takes into account the queries and fields of interest of the user, even when their evolution is not explicitly formulated.
  • Figure 1 and figure 2 are schematic drawings illustrating the architecture of the system;
  • Figure 3 illustrates an example of a topics map to be displayed on the screen of the user.
  • A system according to the invention processes documents which may be collected from various sources. These documents may be picked up from specific media such as specific data bank servers or specific files, or can be paper documents converted to electronic form.
  • The documents are stored with corresponding metadata to constitute a text database named «textual corpus», which can be processed by the information mining server illustrated in figure 2.
  • Metadata is used to characterise the data for several purposes, including query processing, browsing and retrieval. Metadata may take different forms. It is required that metadata have the following properties: Effective (if the metadata says that one piece of information relates to another, there is a high «probability» that it is relevant), Concise (much smaller than the text it describes), Generated automatically (no human intervention required).
  • The information mining processing comprises two main tasks, hereinafter named acquisition and restitution respectively.
  • The textual corpus is processed to determine an index base which comprises a file of the topics representative of the informational content of the stored documents, as well as characteristics of these topics (hereinafter referred to as tags) and relationships which may exist between the topics.
  • The index database also comprises characteristics of documents (also called tags) and an indexation file corresponding to a full-text indexation of the documents.
  • The acquisition processing also uses a profiling base in which all the information relating to the profiles of the users is stored.
  • The index base and the textual corpus are processed through an information retrieval processing, and the informational content of the documents can be displayed on the screen of the users in the form of a schematic mapping of the topics.
  • One aim of the acquisition processing is to add new tag values to documents, to extract related topics and relationships, and also to add tags to those topics and relationships.
  • This processing occurs on a working image in memory of the selected documents; the documents themselves remain stored, unchanged, in the textual corpus during the whole processing.
  • Such a conversion processing is for example of the type presented in :
  • A lemma processing (lemmatisation) is performed.
  • The texts can be processed to transform verbs into their infinitive form, remove detectable spelling errors, and detect ambiguous, polysemic and homonymic words and, in such cases, modify these words to remove any ambiguity.
  • The textual documents are then structurally analysed.
  • 3-1 In a first step of this structural analysis, the main language of each document is determined.
  • A dictionary base is used which lists the words most representative of each language. For example, a text containing a large number of words such as «le», «la», «les» will be labelled as being mainly a French text.
  • The language determined is the one for which the highest number of dictionary-base words appear in the processed text.
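The language-determination step described above (count, for each language, how many of its representative dictionary words appear in the text, then keep the language with the highest count) can be sketched as follows. This is only an illustrative sketch: the contents of `DICTIONARY_BASE` and the function name `detect_main_language` are assumptions made for the example, not part of the patent text.

```python
import re
from collections import Counter

# Hypothetical dictionary base: a few highly representative words per language.
DICTIONARY_BASE = {
    "French": {"le", "la", "les", "des", "est", "et"},
    "English": {"the", "of", "and", "is", "to", "in"},
}


def detect_main_language(text, dictionary_base=DICTIONARY_BASE):
    """Return the language whose representative words occur most often in text.

    Returns None when no dictionary word is found at all.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for language, vocabulary in dictionary_base.items():
        counts[language] = sum(1 for word in words if word in vocabulary)
    language, hits = counts.most_common(1)[0]
    return language if hits > 0 else None
```

A real dictionary base would contain many more words per language; the tie-breaking rule (first language listed wins) is also an arbitrary choice of this sketch.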
  • A parameter corresponding to an estimation of the structural complexity of the text is calculated.
  • The parameter thus provided makes it possible to infer the kind of text to which the processed text belongs.
  • Such a determination is for example based on a neural network, using a multilayer perceptron trained with a gradient backpropagation algorithm.
  • The topology of the network is 41 input neurons, 10 output neurons and 15 to 20 neurons in an intermediate layer.
  • The inputs of the input neurons are numerical characteristics calculated on the text, for example, for an HTML text, its number of images and their average length, the number and percentage of external links, the density of the text, etc.
  • The output neurons correspond to evaluations of the text complexity.
  • The rate of success is around 95 % and varies with the nature of the corpus.
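To make the stated topology concrete, here is a minimal sketch of the forward pass of a multilayer perceptron with 41 inputs, one intermediate layer of 15 neurons and 10 outputs. The sigmoid activation, the random initial weights and all function names are assumptions of this sketch; the text does not specify them, and the backpropagation training itself is deliberately left out.

```python
import math
import random


def make_layer(n_inputs, n_outputs, rng):
    """Create small random weights and zero biases for one dense layer."""
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_outputs)
    ]
    biases = [0.0] * n_outputs
    return weights, biases


def forward(layer, inputs):
    """Apply one dense layer followed by a sigmoid activation (an assumption)."""
    weights, biases = layer
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + bias)))
        for row, bias in zip(weights, biases)
    ]


def build_network(n_in=41, n_hidden=15, n_out=10, seed=0):
    """Build an untrained network: input -> intermediate layer -> output."""
    rng = random.Random(seed)
    return [make_layer(n_in, n_hidden, rng), make_layer(n_hidden, n_out, rng)]


def predict(network, features):
    """Propagate a feature vector through every layer in turn."""
    activations = features
    for layer in network:
        activations = forward(layer, activations)
    return activations
```

In use, `features` would hold the 41 numerical characteristics computed on the text (number of images, percentage of external links, and so on), and each of the 10 outputs would be read as one evaluation of the text complexity.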
  • A detection of the domain of the text is performed. For example, it is detected whether the text analysed is a scientific text, a technical text or a business text.
  • The processing is performed by artificial neural networks, for example with a multilayer perceptron neural network using a backpropagation learning algorithm and a topology with the same numbers of input, output and intermediate neurons as for the determination of the structural complexity estimation parameter, the inputs of the input neurons being the same.
  • The output neurons correspond to the various types of documents expected.
  • In a fourth step, the text structures are detected.
  • In this analysis, a structural syntactic surface analysis is performed, which detects in the text the series of alphanumeric characters which correspond to titles, sentences or paragraphs (pattern-matching processing).
  • The texts are processed to segment their content into sentences, including where the punctuation is ambiguous. This segmenting is a non-trivial task, due to the ambiguity of many punctuation marks.
  • The algorithm used is for example of the type described in: PALMER D., «Tokenisation and Sentence Segmentation», The
  • This tokenising processing consists first in a statistical indexation of the words and in particular in a calculation of the frequency of occurrence of words in the text.
  • The words are classified into hollow words (common-language words with no correlation with the topics of the texts), which are randomly distributed throughout the content of the file, and meaningful words, which are not uniformly distributed in the file and mainly appear in some texts of the file.
  • The method uses the fact that a term is "diluted" over several domains or "concentrated" in only one: the number of occurrences of a term is expressed relative to its number of occurrences in the domain where it appears most, when one wants to decrease the weight of the hollow words, or relative to the sum over the domains where it appears less, when one wants to decrease the weight of the concepts.
  • The hollow words are determined by selecting, for each document, the words whose rate of occurrence is above a given threshold. A count corresponding to such a word is incremented each time this rate is above said threshold.
  • The words selected as hollow words are those whose count is above a given selection threshold.
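The two-threshold selection of hollow words described above can be sketched as follows. The threshold values, the representation of documents as token lists and the name `find_hollow_words` are illustrative assumptions.

```python
from collections import Counter


def find_hollow_words(documents, rate_threshold=0.05, selection_threshold=2):
    """Flag words that are frequent in many documents.

    documents: list of token lists, one list per document.
    A word scores one point for every document in which its rate of
    occurrence exceeds rate_threshold; words whose score exceeds
    selection_threshold are returned as hollow words.
    """
    exceed_counts = Counter()
    for tokens in documents:
        if not tokens:
            continue
        for word, occurrences in Counter(tokens).items():
            if occurrences / len(tokens) > rate_threshold:
                exceed_counts[word] += 1
    return {
        word for word, count in exceed_counts.items() if count > selection_threshold
    }
```

For example, with three short documents in which "the" makes up half of the tokens each time, only "the" is flagged when `rate_threshold=0.3` and `selection_threshold=2`.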
  • The file is processed to determine the topics.
  • The determined topics are stored in the index base (index data base 14 of Fig. 2).
  • Some of the stored tags 12 can correspond to the statistical information calculated in the previous steps of the processing, but now computed for each topic (over all documents related to a topic).
  • Tags can be classification tags describing the average type of activity (business, scientific, etc.) or the type of language,
  • these classification tags being determined through a neural network of the same type as the one used for the determination of the domain.
  • A first trend parameter corresponds to the number of documents in which the topic appears. It is hereinafter referred to as the volumetric trend, which corresponds to the rough volume of published documents. It does not really reflect the intensity of the expressed opinion, since it takes no account of the relevancy of the documents to the theme or of the number of sources or authors that expressed or retransmitted information. Volumetric trends can for example be compared for different periods of time in the case of sets of documents collected from the same source with the same query agent.
  • A second trend parameter is the information intensity, which corresponds to the ratio of another parameter, called the global pertinency, to the number of documents in the file.
  • The pertinency is a parameter determined for a given topic and a given text; it corresponds to the number of occurrences in said text of the words corresponding to said topic, each word being weighted according to its relevance to said topic.
  • The global pertinency of a topic corresponds to the sum of the pertinencies for this topic of all the documents of the file.
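The volumetric trend, pertinency, global pertinency and information intensity defined above can be sketched as follows. Representing a topic as a mapping from words to weights (`topic_weights`) and documents as token lists are assumptions of this sketch.

```python
def pertinency(tokens, topic_weights):
    """Weighted number of occurrences of the topic's words in one text."""
    return sum(topic_weights.get(token, 0.0) for token in tokens)


def global_pertinency(documents, topic_weights):
    """Sum of the pertinencies of all documents of the file for one topic."""
    return sum(pertinency(tokens, topic_weights) for tokens in documents)


def volumetric_trend(documents, topic_weights):
    """Number of documents in which at least one word of the topic appears."""
    return sum(
        1 for tokens in documents if any(token in topic_weights for token in tokens)
    )


def information_intensity(documents, topic_weights):
    """Global pertinency divided by the number of documents in the file."""
    if not documents:
        return 0.0
    return global_pertinency(documents, topic_weights) / len(documents)
```

With a hypothetical topic `{"mad": 1.0, "cow": 0.5}` and the three documents `["mad", "cow", "cow"]`, `["cow", "price"]` and `["weather"]`, the pertinencies are 2.0, 0.5 and 0.0, so the global pertinency is 2.5 and the volumetric trend is 2.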
  • The Pertinency can be: - Calculated from the number of occurrences of the sought terms, as seen
  • The Surface of publication can then be defined as the number of authors divided by the information volume.
  • The Information Intensity can be corrected by multiplying it by the following parameter: Information Volume * (1 / log (Number of Volume))
  • A third trend parameter which is advantageously used is a signal value.
  • A reference query can be «cows» whereas the specific query can be «mad cows».
  • The signal value may be determined as the difference between the value of the time derivative of the volumetric trend for the given query and the average value over time of said derivative.
  • A fourth parameter which can then be used is the ratio between said signal and the reference or average volumetric derivative parameter. By regularly calculating the value of this ratio parameter, one can detect the time at which the evolution of the information propagation breaks and therefore the time at which a topic becomes unexpectedly important.
  • 5-4 Having determined these tag parameters, the texts are then processed to determine parameters characterising the relationship between topics.
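The signal value and the ratio parameter described above can be sketched on discrete time series as follows. The use of a simple difference between successive periods as the time derivative, the choice of the reference derivative as denominator and the break threshold are assumptions of this sketch.

```python
def time_derivative(series):
    """Discrete derivative: change between each pair of successive periods."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def signal_values(volumes):
    """Derivative of the volumetric trend minus its average over time."""
    derivative = time_derivative(volumes)
    if not derivative:
        return []
    average = sum(derivative) / len(derivative)
    return [value - average for value in derivative]


def signal_ratios(specific_volumes, reference_volumes):
    """Ratio of the signal of the specific query to the reference derivative.

    Periods where the reference derivative is zero yield None.
    """
    signal = signal_values(specific_volumes)
    reference = time_derivative(reference_volumes)
    return [s / r if r != 0 else None for s, r in zip(signal, reference)]


def detect_breaks(ratios, threshold):
    """Indices of the periods in which the ratio exceeds the threshold."""
    return [
        index
        for index, ratio in enumerate(ratios)
        if ratio is not None and ratio > threshold
    ]
```

For instance, volumes `[0, 1, 2, 9]` for the specific query give derivatives `[1, 1, 7]` and signal values `[-2, -2, 4]`; against a reference growing steadily by 2 per period, the ratio jumps to 2 in the last interval, which a threshold of 1 reports as a break.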
  • Topological distance is the likelihood of two words appearing in the same window of discourse: a phrase, a sentence, a paragraph. This distance is inversely related to their semantic distance, that is, directly related to their semantic similarity. Observing the relative frequency of their joint occurrence in such windows is part of the estimation of the relative similarity of any pair of words.
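As a rough illustration of the joint-occurrence measure described above, the following sketch estimates how often two words share a window. The normalisation (windows containing both words divided by windows containing at least one of them) is an assumption of this sketch, since the text does not give the exact estimator.

```python
def cooccurrence_likelihood(windows, word_a, word_b):
    """Relative frequency of joint occurrence of two words in the same window.

    windows: list of token lists, each one a phrase, sentence or paragraph.
    Returns the share of windows containing both words among the windows
    containing at least one of them (0.0 if neither word ever appears).
    """
    both = 0
    either = 0
    for window in windows:
        tokens = set(window)
        has_a = word_a in tokens
        has_b = word_b in tokens
        if has_a or has_b:
            either += 1
        if has_a and has_b:
            both += 1
    return both / either if either else 0.0
```

The window size is the main tuning choice: sentence-level windows give a stricter measure than paragraph-level ones.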
  • x and y being information intensity or volumetric parameter of the two topics.
  • A highly positive auto-correlation value means that a correlation is established between the elements of X and those of Y temporally shifted.
  • A negative coefficient implies an anti-correlation. Such a test is relatively difficult to interpret.
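The idea of correlating one topic's series with a temporally shifted version of another's can be sketched with a Pearson correlation computed at a given lag. The Pearson coefficient is a stand-in chosen for this sketch; the exact coefficient used is not given in the text, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        return 0.0  # a constant series carries no correlation information
    return covariance / math.sqrt(variance_x * variance_y)


def lagged_correlation(xs, ys, lag=0):
    """Correlate xs[t] with ys[t + lag]; a negative lag shifts xs instead."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    if lag < 0:
        return lagged_correlation(ys, xs, -lag)
    return pearson(xs[: len(xs) - lag], ys[lag:])


def best_lag(xs, ys, max_lag):
    """Lag between 0 and max_lag giving the highest correlation."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(xs, ys, lag))
```

Here `xs` and `ys` would be the information intensity or volumetric series of two topics; a coefficient close to 1 at lag 1 suggests that the second topic follows the first with a delay of one period, and a coefficient close to -1 indicates anti-correlation.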
  • A signal does not consist of one single periodicity, and in general this periodicity is not even constant in amplitude and/or time.
  • Mutual Information makes no assumption about the distribution of the measured series, and is therefore the most attractive measure at hand.
  • Mutual Information is a concept conceived by Claude Shannon (1949).
  • Mutual Information attempts to measure in bits the amount of information that can be inferred about one series of symbols from another. A derivation of this concept is used. In general, given two series x and y with indexes i and j respectively, the average mutual information I(x,y) can be calculated as:
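For orientation, the standard Shannon definition of average mutual information is I(x,y) = sum over i and j of p(x_i, y_j) * log2( p(x_i, y_j) / (p(x_i) * p(y_j)) ). The sketch below estimates it from observed frequencies. It implements this textbook definition only; the text says that a derivation of the concept is used, and that variant is not reproduced here. The `to_symbols` encoding of a numeric trend series is a further assumption.

```python
import math
from collections import Counter


def to_symbols(series):
    """Encode a numeric series as 'up', 'down' or 'flat' moves between periods."""
    symbols = []
    for earlier, later in zip(series, series[1:]):
        if later > earlier:
            symbols.append("up")
        elif later < earlier:
            symbols.append("down")
        else:
            symbols.append("flat")
    return symbols


def average_mutual_information(xs, ys):
    """Standard Shannon mutual information, in bits, of two symbol series."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("series must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        total += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return total
```

Two series that always move together and alternate between two equally likely symbols share 1 bit; two independent series share 0 bits.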
  • The determination of the clusters (sets of topics which share neighbour relations) can be based on the informational distance calculated as above.
  • The construction of the map exploits the informational distance to represent the neighbour nodes related to the central node of a map.
  • The queries can be factual and/or boolean, the information retrieval then being performed on the topics file and on the indexation file of the index base.
  • The selected documents can be classified by order of pertinence regarding the formulated query. They can also be highlighted according to the values of trend parameters which denote an abnormal evolution of the topic.
  • The system displays on the screen of the user the topics which appear in the selected documents having a pertinence above said pertinence level.
  • The pertinence level of a topic is determined by using tag parameters.
  • The system can display a map in which the topics appear in the form of nodes distributed on the screen (see figure 3).
  • Nodes are represented with links in the case of topics having a high cross-correlation parameter or a topological or informational distance above a given threshold.
  • The lengths of the links correspond or tend to correspond to the topological or informational distances between said topics.
  • The system optimises the distribution of the nodes in order to minimise the difference between the distances from one node to another on the screen and the distances to which the links should correspond.
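One possible way to make link lengths tend towards given topic distances is an iterative spring-style relaxation that reduces the squared difference between displayed and target distances. The text does not name an optimisation method, so the algorithm, the circular starting layout, the step size and all names below are assumptions of this sketch.

```python
import math


def initial_positions(nodes):
    """Place the nodes evenly on a unit circle as a deterministic start."""
    count = len(nodes)
    return {
        node: [math.cos(2 * math.pi * k / count), math.sin(2 * math.pi * k / count)]
        for k, node in enumerate(nodes)
    }


def stress(positions, targets):
    """Sum of squared differences between displayed and target link lengths."""
    return sum(
        (math.dist(positions[a], positions[b]) - target) ** 2
        for (a, b), target in targets.items()
    )


def layout(nodes, targets, steps=2000, rate=0.05):
    """Move linked nodes until link lengths approach their target distances.

    targets maps a pair of nodes (a, b) to the desired length of their link.
    """
    positions = initial_positions(nodes)
    for _ in range(steps):
        for (a, b), target in targets.items():
            dx = positions[b][0] - positions[a][0]
            dy = positions[b][1] - positions[a][1]
            distance = math.hypot(dx, dy) or 1e-9
            # Positive shift pulls the two ends together, negative pushes apart.
            shift = rate * (distance - target) / distance
            positions[a][0] += shift * dx
            positions[a][1] += shift * dy
            positions[b][0] -= shift * dx
            positions[b][1] -= shift * dy
    return positions
```

When the target distances cannot all be satisfied in two dimensions, the relaxation settles on a compromise, which matches the wording that link lengths "correspond or tend to correspond" to the distances.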
  • The nodes of these two topics can be merged into a single node. Optionally, these links and nodes are coloured to take into account their evolution in the last period of time.
  • A node is coloured red if one of its trend parameters is increasing sharply (for example if a break of propagation is detected). It is coloured blue if its trend parameter has been decreasing in the last period.
  • The links are coloured to take into account the evolution of the informational distance or, more specifically, of the cross-similarity parameter to which they correspond. Further, when the user clicks on the node of a topic to select it, the process gives him a list of the documents concerned by this topic, ranked by pertinency.
  • The system can display tags of the topics corresponding to this query and can also determine new tags which are specific to the query. For example, the determined trends can be displayed to the user in the form of graphs giving their value over time.
  • The system includes a processing by which it takes into account the behaviour of the user.
  • The system memorises a profile of the user in which topics of interest to him are stored.
  • A profiling can comprise a structural part and a personal part, as well as an implicit part.
  • The structural part comprises topics which are of interest for the environment of the user (for example, topics concerning his company). It is divided into inalienable topics, which in any case appear in the mapping display of the user, and dynamic topics, which are selected by the user himself.
  • The personal part comprises topics which relate to the user and not to his environment (for example his own fields of interest in the company). It is also divided into inalienable topics and dynamic topics.
  • The implicit part of the profiling comprises topics which, over time, appear to be cross-correlated with topics of the structural or personal part of the profiling.
  • New topics created by the user himself are defined through the formulation of a new query. Such a new topic will be characterised in the profiling file by the words and expressions corresponding to the query.
  • These new topics can also be selected by the user in the file of topics of the index base.
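The profile structure described above (a structural part and a personal part, each split into inalienable and dynamic topics, plus an implicit part fed by cross-correlation) can be sketched as a small data structure. The class names, the correlation threshold and the `update_implicit` rule are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ProfilePart:
    """One part of a profile, split into inalienable and dynamic topics."""

    inalienable: set = field(default_factory=set)
    dynamic: set = field(default_factory=set)

    def topics(self):
        return self.inalienable | self.dynamic


@dataclass
class UserProfile:
    """Profile with structural, personal and implicit parts."""

    structural: ProfilePart = field(default_factory=ProfilePart)
    personal: ProfilePart = field(default_factory=ProfilePart)
    implicit: set = field(default_factory=set)

    def explicit_topics(self):
        return self.structural.topics() | self.personal.topics()

    def update_implicit(self, correlations, threshold=0.8):
        """Add topics strongly correlated with an explicit topic.

        correlations maps a pair of topics to a cross-correlation value.
        """
        explicit = self.explicit_topics()
        for (a, b), value in correlations.items():
            if value < threshold:
                continue
            if a in explicit and b not in explicit:
                self.implicit.add(b)
            elif b in explicit and a not in explicit:
                self.implicit.add(a)

    def displayed_topics(self):
        """All topics to show on the user's map."""
        return self.explicit_topics() | self.implicit
```

For example, a profile holding the inalienable structural topic "company" and the dynamic personal topic "mad cows" gains the implicit topic "beef prices" once that topic shows a correlation of 0.9 with "mad cows".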

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an information mining tool comprising mining means for processing documents stored in a database in order to extract the topics to which these documents relate and to determine the parameters relating to the evolution of said topics over time.
PCT/IB1998/001123 1997-07-23 1998-07-23 Information mining tool WO1999005614A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US5354697P 1997-07-23 1997-07-23
US60/053,546 1997-07-23

Publications (1)

Publication Number Publication Date
WO1999005614A1 (fr) 1999-02-04

Family

ID=21985021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB1998/001123 WO1999005614A1 (fr) 1997-07-23 1998-07-23 Information mining tool

Country Status (1)

Country Link
WO (1) WO1999005614A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001006389A2 (fr) * 1999-07-17 2001-01-25 Incedo Ag Network server for providing an information page and method for providing a web page
WO2001022280A2 (fr) * 1999-09-20 2001-03-29 Clearforest Ltd. Determining trends using text mining
FR2813413A1 (fr) * 2000-08-30 2002-03-01 Datops Sa Method for processing a set of text documents stored in a database that evolves over time
EP1233349A2 (fr) * 2001-02-20 2002-08-21 Hitachi, Ltd. Data display method and apparatus for use in text analysis
GB2368432B (en) * 1999-08-06 2004-05-19 Univ Columbia System and method for language extraction and encoding
WO2005041058A1 (fr) * 2003-10-22 2005-05-06 Qsr International Limited System and method for qualitative data analysis
WO2007143899A1 (fr) * 2006-05-22 2007-12-21 Kaihao Zhao System and method for intelligent retrieval and processing of information

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHEN H ET AL: "INTERNET CATEGORIZATION AND SEARCH: A SELF-ORGANIZING APPROACH", JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, vol. 7, no. 1, March 1996 (1996-03-01), pages 88 - 102, XP000619822 *
GINSBERG A: "A UNIFIED APPROACH TO AUTOMATIC INDEXING AND INFORMATION RETRIEVAL", IEEE EXPERT, vol. 8, no. 5, 1 October 1993 (1993-10-01), pages 46 - 56, XP000413472 *
LE MONDE INFORMATIQUE, no. 705, 17 January 1997 (1997-01-17), http://www.lmi.fr/705/705p22.html, pages 1 - 11, XP002082836 *
WONG J W T ET AL: "ACTION: AUTOMATIC CLASSIFICATION FOR FULL-TEXT DOCUMENTS", SIGIR FORUM, vol. 30, no. 1, 21 March 1996 (1996-03-21), pages 26 - 41, XP000699962 *

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): IL JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase