EP1910918A2 - Method and system for automatically extracting data from web sites - Google Patents

Method and system for automatically extracting data from web sites

Info

Publication number
EP1910918A2
Authority
EP
European Patent Office
Prior art keywords
clustering
pages
experts
cluster
hints
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP06787271A
Other languages
German (de)
English (en)
Inventor
Bora C. Gazen
Steven N. Minton
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fetch Technologies Inc
Original Assignee
Fetch Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fetch Technologies Inc

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • the present invention relates generally to a network data extraction system, and in particular to a system for automatically extracting data from semi-structured web sites.
  • data may be automatically extracted from semi-structured web sites.
  • Unsupervised learning may be used to analyze web sites and discover their structure.
  • a method of this invention utilizes a set of heterogeneous "experts," each expert being capable of identifying certain types of generic structure. Each expert represents its discoveries as "hints." Based on these hints, the system clusters the pages and text segments and identifies semi-structured data that can be extracted. To identify a good clustering, the probability of clusterings, given the set of hints, is evaluated.
  • BRIEF DESCRIPTION OF DRAWING
  • FIG. 1 is an example of a relational model of a web site in accordance with the invention
  • FIG. 2 shows an example of the manner in which the invention solves the site-extraction problem in accordance with this invention
  • FIG. 3 shows data extracted from an exemplary web page in relational form in accordance with the invention
  • FIG. 4 is an example of pseudocode for a leader-follower algorithm according to the invention.
  • FIG. 5 is an example of pseudocode for clustering tokens on a web page
  • FIG. 6 is an example of how a Bayesian network would function according to the invention for purposes of clustering
  • FIG. 7 shows a graph partitioning with a related table of probabilistic edge weights in accordance with the invention.
  • FIG. 8 shows examples of the use of templates to cluster text-segments in accordance with the invention.
  • FIG. 9 shows the use of HTML patterns to cluster text segments according to the invention.
  • FIG. 10 shows the use of page layout to cluster text segments in accordance with the invention.
  • unsupervised learning may be used to analyze the structure of a web site and its associated pages.
  • One objective is to extract and structure the data on the web site so that it can be transformed into relational form. For instance, for an e-commerce retail web site it is desirable to be able to automatically create a relational database of products.
  • Other web sites of interest may include news web sites, classified ads, electronic journals, and the like.
  • the "wrapper induction" system may only be shown examples of a single page type, and moreover, a human marks up each example page.
  • the "site extraction problem” is in some sense a natural problem since web sites are generally well-structured so that humans can easily understand and navigate through a web site.
  • One possible approach for extracting data is to identify pages that share the same grammar, and then use the grammar to extract the data.
  • grammar induction techniques may be able to automatically learn grammar for those pages. Unfortunately, this is a chicken and egg problem; without knowing anything in advance about the data on the pages, it is difficult to automatically identify pages that have the same grammar.
  • embodiments of the invention exploit the situation in which many different types of structures exist within a web site.
  • This includes the graph structure of the links of a web site, the URL naming scheme, the content on the pages, the HTML structures within page types, and the like.
  • a set of "experts" have been developed to analyze the links and pages on a web site and to recognize different types of structure.
  • the system may be directed to cluster the pages and the data within pages, so that it can create a relational structure over the data. If desired, a human can then identify which columns in the resulting relational table should be extracted.
  • Clustering is a natural approach to unsupervised structure discovery.
  • Existing approaches to clustering typically define a similarity or distance metric on the space of samples.
  • the clustering problem is to find a partitioning of the samples such that a global function defined over this metric is maximized (or minimized). For example, if the samples lie in an n-dimensional space, the distance between samples might be defined as the Euclidean distance and the function to minimize could be chosen to be average distance between samples in a cluster for a given number of clusters.
  • the partitioning that minimizes the criterion function would then represent the underlying structure, grouping together samples that lie close to each other within the n-dimensional space.
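The criterion described above can be made concrete with a minimal sketch. It assumes samples are numeric tuples and that the criterion is the mean Euclidean distance over all pairs of samples sharing a cluster; the function name and the sample data are illustrative and do not come from the patent.

```python
import math
from itertools import combinations

def average_intra_cluster_distance(clusters):
    """Mean Euclidean distance over all pairs of samples that share a cluster."""
    distances = [
        math.dist(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    ]
    return sum(distances) / len(distances) if distances else 0.0

# Two candidate partitionings of the same four 2-D samples (hypothetical data).
tight = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
mixed = [[(0.0, 0.0), (5.0, 5.0)], [(0.0, 1.0), (5.0, 6.0)]]

# The partitioning with the smaller criterion value groups nearby samples.
best = min([tight, mixed], key=average_intra_cluster_distance)
```

For a fixed number of clusters, the partitioning with the lower value is preferred; here `tight` wins because each of its clusters contains two neighboring points.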
  • a purpose of the invention is to be able to combine many different types of knowledge to solve the site-extraction problem.
  • Web pages can be compared by analyzing their URLs, their text content, and their page layout, among other dimensions, and it is desired to make use of all of these types of data in the clustering process.
  • software "experts" utilizing a variety of types of heuristic knowledge have been successfully combined to solve problems in domains ranging from crossword puzzles (N. M. Shazeer, et al., Solving crossword puzzles as probabilistic constraint satisfaction, AAAI/IAAI, pages 156-162 (1999)) to Bayesian network structure discovery (M. Richardson et al., Learning with knowledge from multiple experts, T. Fawcett and N. Mishra, editors, ICML, pages 624-631, (2003)).
  • it is not necessarily simple to adapt these techniques to clustering.
  • in constrained clustering, it is desired to pass background knowledge to the solver, and not from only one heuristic expert but from many heterogeneous heuristic experts.
  • the challenge in combining heterogeneous experts is expressing different types of knowledge in a common language so that they can be combined effectively.
  • the URL structure might represent a hierarchical organization of pages whereas the page layout might reveal some flat clusters. Combining such different types of knowledge is challenging.
  • experts may be constructed so that they output their observations using a common representation. In this scheme, the experts produce "hints" indicating, for example, that two items should be in the same cluster.
  • the clustering process may then implement a probabilistic model of the hint-generation process to rate alternative clusterings.
  • a start could be defining a relational database table with the current weather for all the cities, one row per city.
  • a script may be written to generate an HTML page for each row of the table.
  • the weather site may involve other page types as well. For instance, to help users navigate to the city-weather pages, pages may be included for each state, with each state page containing links to the city-weather pages. This can be done by creating a new page type for states. To do so, a new table is created that holds state information, such as state name and abbreviation, one row per state, and also another table that relates the records in the state and city tables, that is, listing the cities in each state. Using these tables a script would generate the corresponding HTML page for each state. If it is also desired to display a list of neighboring states on each state page, another table would be added to the database and script modified accordingly.
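As a rough illustration of how such a site could be generated from relational data, the sketch below renders one page per city row and one page per state row. All table contents, field names, and file names are hypothetical, and the state-to-city relation is simplified to a `state` column on each city row rather than a separate relation table.

```python
from html import escape

# Hypothetical relational data behind the weather site.
states = [
    {"abbr": "CA", "name": "California"},
    {"abbr": "PA", "name": "Pennsylvania"},
]
cities = [
    {"id": "lax", "name": "Los Angeles", "state": "CA", "weather": "Sunny"},
    {"id": "pit", "name": "Pittsburgh", "state": "PA", "weather": "Cloudy"},
]

def render_city_page(city):
    """One city-weather page per row of the city table."""
    return "<html><body><h1>{}</h1><p>Current weather: {}</p></body></html>".format(
        escape(city["name"]), escape(city["weather"])
    )

def render_state_page(state, all_cities):
    """One state page per row of the state table, linking to its city pages."""
    links = "".join(
        '<li><a href="{}.html">{}</a></li>'.format(escape(c["id"]), escape(c["name"]))
        for c in all_cities
        if c["state"] == state["abbr"]
    )
    return "<html><body><h1>{}</h1><ul>{}</ul></body></html>".format(
        escape(state["name"]), links
    )

# Map of file name to generated HTML for every page on the site.
pages = {c["id"] + ".html": render_city_page(c) for c in cities}
pages.update(
    {s["abbr"].lower() + ".html": render_state_page(s, cities) for s in states}
)
```

The site-extraction problem is the inverse of this script: given only the generated pages, recover the tables.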
  • FIG. 1 shows a hypothetical weather web site and the underlying relational data, from the homepage down to the city-weather pages.
  • Relational learning approaches seem to offer a possible methodology.
  • some existing relational learning methods, such as the probabilistic relational model (PRM) technique, start with data that is already in relational format and attempt to find an underlying model that would generate the data.
  • a set of relations is not needed to start. Instead, the inventors have found that it is possible in many situations to discover both the relational data and the model by analyzing the HTML pages.
  • the site-extraction problem can be viewed as two clustering problems: the problem of clustering pages according to their page-type and the problem of clustering text segments so that segments from the same relational column are grouped together.
  • a sub-goal of this approach is to discover the page and text-segment clusters.
  • An example of this approach is shown in FIG. 2. Consequently, the site-extraction problem is solved in three main steps: discovering low-level structure with heterogeneous experts; clustering pages and text segments to find a consistent global structure; and finding the relational form of the data from page and text-segment clusters.
  • a preliminary step is spidering the web site from which the data is going to be extracted.
  • the starting input is the set of HTML pages on the site (including links on each page), which is obtained by spidering the site.
  • each expert focuses on a particular type of structure and works independently from all the other experts. Thus, each expert does the following two tasks: analyze the pages and the links with respect to a particular type of structure and output hints to indicate the similarities and dissimilarities between items (that is, pages or text-segments). For example, examining the URL patterns on a web-site gives clues about groups of pages that may contain the same type of data.
  • the goal here is to define a generic clustering approach where clusters are determined by a joint decision of multiple experts. This is a more difficult problem than multi-expert classification.
  • each expert can vote on the label of the sample, and then the votes can be combined (for example, simply by counting) to determine the final label.
  • the same approach does not quite work in the clustering problem: If each expert generates a clustering, then there is no obvious way to combine the decisions of the experts.
  • the basic clustering framework itself already provides one mechanism in which expert knowledge is combined. If two samples, A and B, are considered in isolation, the experts (or features) may have no indication that they are similar (that is, in the same cluster). If the same elements are considered in the context of a clustering problem then the neighborhoods of A and B may give clues that in fact these samples are similar. This observation becomes even more important if, in fact, three different experts discover the neighborhood of A, the neighborhood of B, and the similarity of the two neighborhoods. In effect, disjoint experts have been combined to reach a conclusion that could not have been reached using the individual experts.
  • the next step is to produce the corresponding tables (that is, the relational view of the data).
  • For each page cluster there is a set of tables.
  • Each column in each table is given by a token cluster.
  • One of the tables associated with a page-cluster is the "base table" for that page-cluster. Filling in the base table is straightforward.
  • Clusters containing a single data item per page become columns of the base table. For instance, for the weather site, the column containing the "current weather" for each city would be placed in the base table for cities (that is, where there is a row for each city).
  • Clusters containing list items are placed in separate "list tables.”
  • producing list tables requires that list boundaries be determined because there may be more than one list per page-type.
  • the data of interest is assumed to be in at most one list. With this assumption, there is only one list table to produce and the columns of this list table are given by the clusters that have not been used in filling the base table.
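The table-building step can be sketched as follows. This is an illustrative reading, not the patented implementation: it assumes each token cluster is given as a mapping from page to the list of values found on that page, treats "a single data item per page" as exactly one value on every page, and aligns list columns by position. The cluster names and data are hypothetical.

```python
def build_tables(clusters, pages):
    """Split token clusters into base-table columns and a single list table."""
    # Clusters with exactly one value on every page become base-table columns.
    base_cols = {
        name: values
        for name, values in clusters.items()
        if all(len(values.get(page, [])) == 1 for page in pages)
    }
    # Everything else is assumed to belong to the one list on the page type.
    list_cols = {n: v for n, v in clusters.items() if n not in base_cols}

    base_table = [
        {"page": page, **{name: values[page][0] for name, values in base_cols.items()}}
        for page in pages
    ]

    list_table = []
    for page in pages:
        n_rows = max((len(v.get(page, [])) for v in list_cols.values()), default=0)
        for i in range(n_rows):
            row = {"page": page}
            for name, values in list_cols.items():
                items = values.get(page, [])
                row[name] = items[i] if i < len(items) else None
            list_table.append(row)
    return base_table, list_table

# Hypothetical clusters for the state pages of the weather site.
pages = ["ca", "pa"]
clusters = {
    "state_name": {"ca": ["California"], "pa": ["Pennsylvania"]},
    "city": {"ca": ["Los Angeles", "San Diego"], "pa": ["Pittsburgh"]},
}
base_table, list_table = build_tables(clusters, pages)
```

The `page` key ties each list row back to its base-table row, which is what lets the two tables be joined later.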
  • FIG. 3 shows the data extracted from an exemplary page in relational form. In this figure, only a few of the many columns of the extracted table are shown, but note how each field, such as product name or price, is placed in a separate column. Also note that the second column contains HTML code, which is unlikely to be of interest to the user. Currently, the system has no understanding of the type of data contained in each column. For the experiments described in the next section, a human was relied upon to pick out the columns of interest. Alternatively, automated content-targeting can be used, employing, for example, the technique presented by K. Lerman et al.
  • the input is the set of HTML pages on the web site (including links on each page), which is obtained by spidering the web site, or using any other suitable data gathering technique.
  • the pages may then be tokenized into individual components (that is, strings, URLs, numbers, etc.) based on the analysis performed by a set of "experts" which examine the pages for URLs, lists, template structures, and the like.
  • an initial focus may be to find the individual columns of the tables.
  • a clustering approach is used in which an attempt is made to arrange token sequences on the HTML pages into clusters, so that eventually each cluster contains the data in a column of one of the underlying tables. Once this is achieved, it is straightforward to identify the rows of the table, producing the complete tables from the clusters.
  • As the HTML pages are processed (the tokens are clustered), it is also convenient to cluster the pages themselves in order to identify page types. Accordingly, the following representation may be used.
  • a page-cluster may contain a set of pages, and is also the parent of a set of token-clusters. All of the tokens of the pages in a page-cluster may be clustered into the child token-clusters of that page-cluster. For example, suppose work is being done on the weather site which contains a home page listing all the states, state pages listing all the cities in that state, and weather condition pages that display the current conditions in any particular city. When page-clusters are found for this web site, three clusters are expected to be found: one for weather-condition pages, one for the state pages, and one containing only the home page.
  • the token-clusters for the weather-condition page-cluster might include a cluster for city names, another for low temperatures, and another for high temperatures.
  • One approach to discovering, generating, or otherwise finding page-clusters is to apply a distance metric based on surface-structure, such as viewing a page as a bag-of-words and measuring the similarity between the document vectors.
  • This type of approach may not always correctly cluster web pages from a single web site because the true similarity of pages is typically discovered after some understanding is obtained of the deeper structure of the page. For example, determining that a web page with a short (or empty) list is similar to one with a long list may require that the first one discovered have a similar context surrounding the list, even though these pages may not appear similar in words, length, and so on.
  • the system may implement multiple experts to generate "hints" that describe local structural similarities between pairs of pages (or between pairs of tokens).
  • the hints may then be used to cluster the pages and tokens.
  • Our approach utilizes a heterogeneous set of experts such that for any given web site, the discoveries of at least some of the experts will make the solution obvious, or at least nearly so. This is based on observations that individual experts are successful in finding relevant structure some of the time, but not all of the time.
  • the hints provide a common language for experts to express their discoveries.
  • two types of hints are used: page-level and token-level.
  • a page-level hint may be defined as a pair of page references indicating that the referred pages should be in the same cluster. For example, if the input contains page_1 with URL "weather/current_cond/lax.html" and page_2 with URL "weather/current_cond/pit.html," a URL-pattern expert might generate (page_1, page_2) as a page-level hint.
  • a token-level hint is a pair of token sequences; the hint indicates the tokens of the two sequences should be in the same token-clusters.
  • a list expert might generate (“New Jersey,” “New Mexico”) as a token-level hint, among many other similar hints, after examining a page which contains a list of states.
  • possible experts include URL patterns, list structure, templates, and layout, among others. Each of these experts will now be described.
  • page-hints are generated for pairs of pages whose URLs are similar. This expert is helpful for identifying pages that should go into the same page-cluster. For example, on the weather site, if all the state pages had the constant "USstate" in their URL, it would be helpful. However, on some web sites the expert can be over-specific (for example, if the state page URLs had no commonality) or over-general (if state and city pages all had similar URLs).
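A URL-pattern expert of this kind might look like the sketch below. The text does not specify how URL similarity is measured, so the rule used here (two URLs are similar when they share the same directory path) is an assumption made for illustration, as are the function name and the example URLs other than the two quoted earlier.

```python
import posixpath
from itertools import combinations

def url_pattern_hints(urls):
    """Emit a page-level hint for every pair of URLs in the same directory.

    Assumption: sharing a directory path stands in for "similar URLs".
    """
    hints = []
    for a, b in combinations(sorted(urls), 2):
        if posixpath.dirname(a) == posixpath.dirname(b):
            hints.append((a, b))
    return hints

urls = [
    "weather/current_cond/lax.html",
    "weather/current_cond/pit.html",
    "weather/states/ca.html",  # hypothetical state page URL
]
hints = url_pattern_hints(urls)
```

With this rule the two city-weather pages are paired and the state page is left alone. The over-specific and over-general failure modes described above correspond to sites where directory paths split a page type apart or lump several page types together.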
  • the template expert may be used to search for or otherwise identify token sequences that are common across pages. Token-hints are generated for these sequences and the sequences in-between them. This expert is effective for identifying simple template structure shared by multiple pages. However, this expert may be misled when it is run on a set of pages that contain one or more pages that are not generated by the same grammar as other pages.
  • a probabilistic approach is employed that provides a flexible framework for combining multiple hints in a principled way.
  • a generative probabilistic model is employed that allows assignment of a probability to hints (both token hints and page hints) given a clustering. This in turn allows a search for clusterings that maximize the probability of observing the set of hints.
  • a probability is assigned by assuming that choosing any token-cluster is equally likely ($1/count_{tc}$, where $count_{tc}$ is the number of token-clusters) and also that picking any pair of tokens within $c_t$ is equally likely ($1/C(|c_t|, 2)$).
  • If a page hint $(p_0, p_1)$ is satisfied by the clustering, that is, $p_0$ and $p_1$ are in the same page-cluster $c_p$, then it is assigned a probability by assuming that picking any page-cluster is equally likely ($1/count_{pc}$, where $count_{pc}$ is the number of page-clusters) and also that choosing any pair of pages from $c_p$ is equally likely ($1/C(|c_p|, 2)$). Otherwise, it is assigned a small probability.
  • This probabilistic model permits assignment of a probability to hints given a clustering, which in turn permits a search for clusterings that maximize the probability of observing the set of hints.
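The hint model can be sketched in a few lines. The sketch works in log space and handles page hints only; the value of the "small probability" for unsatisfied hints is not given in the text, so `EPSILON` is an assumed placeholder, and the function names and sample pages are hypothetical.

```python
import math

EPSILON = 1e-6  # assumed value for the "small probability" of an unsatisfied hint

def hint_log_probability(hint, clusters):
    """log P(hint | clustering) for a hint (a, b) with a != b."""
    a, b = hint
    for cluster in clusters:
        if a in cluster and b in cluster:
            # 1/count_pc for picking the cluster, 1/C(|c_p|, 2) for picking the pair.
            return -math.log(len(clusters)) - math.log(math.comb(len(cluster), 2))
    return math.log(EPSILON)

def clustering_log_likelihood(hints, clusters):
    """Log probability of observing all hints, treated as independent."""
    return sum(hint_log_probability(h, clusters) for h in hints)

hints = [("p1", "p2"), ("p3", "p4")]
by_type = [{"p1", "p2"}, {"p3", "p4"}]
all_together = [{"p1", "p2", "p3", "p4"}]
singletons = [{"p1"}, {"p2"}, {"p3"}, {"p4"}]

ranked = sorted(
    [by_type, all_together, singletons],
    key=lambda c: clustering_log_likelihood(hints, c),
    reverse=True,
)
```

In this example each hint has probability 1/2 under `by_type`, 1/6 under `all_together`, and `EPSILON` under `singletons`, so the clustering that matches the hints ranks first and the single large cluster is penalized.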
  • One embodiment implements a greedy clustering approach, a specific example of which is based on the leader-follower algorithm presented by R. O. Duda et al. in "Pattern Classification,” Wiley-Interscience Publication (2000).
  • FIG. 4 provides an example of pseudocode for such a leader-follower algorithm. Adding a page to a page cluster involves clustering the tokens of the page. This operation may be accomplished using, for example, the pseudocode depicted in FIG. 5.
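The pseudocode of FIG. 4 is not reproduced in this text, so the following is only a loose sketch of a greedy, leader-follower style pass under stated assumptions: pages are visited once in order, each page is either added to an existing page-cluster or starts a new one, and the choice is made by maximizing the hint likelihood with an assumed `EPSILON` for unsatisfied hints. Token clustering (FIG. 5) is omitted, and all names are hypothetical.

```python
import math

EPSILON = 1e-6  # assumed "small probability" for an unsatisfied hint

def log_likelihood(hints, clusters):
    """Log probability of the hints under the page-hint model."""
    total = 0.0
    for a, b in hints:
        shared = next((c for c in clusters if a in c and b in c), None)
        if shared is None:
            total += math.log(EPSILON)
        else:
            total += -math.log(len(clusters)) - math.log(math.comb(len(shared), 2))
    return total

def greedy_cluster(pages, hints):
    """Place each page where it maximizes the likelihood of the hints."""
    clusters = []
    for page in pages:
        # Candidate clusterings: add the page to each existing cluster ...
        candidates = [
            [c | {page} if j == i else c for j, c in enumerate(clusters)]
            for i in range(len(clusters))
        ]
        # ... or start a new cluster with this page as its leader.
        candidates.append(clusters + [{page}])
        clusters = max(candidates, key=lambda cand: log_likelihood(hints, cand))
    return clusters

result = greedy_cluster(
    ["p1", "p2", "p3", "p4"],
    [("p1", "p2"), ("p3", "p4")],
)
```

Because the pass is greedy and order-dependent, it is not guaranteed to find the most probable clustering; it trades optimality for a single linear sweep over the pages.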
  • One advantage of the probabilistic model is that it prevents the system from simply clustering all the pages together.
  • the probabilistic model addresses this by assigning smaller probabilities to hints in larger clusters.
  • the output of the AutoFeed software is compared to the output of web wrappers that have been created using the AgentBuilder system of Fetch Technologies (a supervised wrapper induction system) and manually validated for correctness.
  • the output of a wrapper when applied to a set of pages, can be represented as a table whose columns correspond to the fields extracted by the wrapper. These columns are referred to as “target columns” (or, equivalently, “target fields”) and the extracted data values in each column are referred to as the "target data.”
  • the evaluation proceeds as follows. For each target column produced by the wrapper, the AutoFeed column is found that contains the most target values. If there is a tie, the column with the fewest total values is chosen. Then the retrieved and relevant count RR (the number of target values in this AutoFeed column) is calculated. The total number of values in the AutoFeed column (Ret) and the total number of target values (Rel) are reported. Precision is defined as RR/Ret and recall as RR/Rel.
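The evaluation procedure can be expressed as a short function. This sketch assumes that values within a column are distinct, so that set intersection counts the target values correctly; the function name and sample data are illustrative.

```python
def evaluate_column(target_values, autofeed_columns):
    """Match one target column to its best AutoFeed column and score it."""
    target = set(target_values)
    # Most target values first; ties broken by fewest total values.
    best = max(
        autofeed_columns,
        key=lambda col: (len(target & set(col)), -len(col)),
    )
    rr = len(target & set(best))   # retrieved and relevant
    ret = len(best)                # values in the chosen AutoFeed column
    rel = len(target)              # target values
    precision = rr / ret if ret else 0.0
    recall = rr / rel if rel else 0.0
    return {"RR": rr, "Ret": ret, "Rel": rel, "precision": precision, "recall": recall}

# Hypothetical data: two columns tie on three target values; the shorter one wins.
scores = evaluate_column(
    ["A", "B", "C", "D"],
    [["A", "B", "C", "x"], ["A", "B", "C", "x", "y"], ["D"]],
)
```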
  • Table 1 is a summary of AutoFeed results for e-commerce sites, journals, and job-listings.
  • the target fields for this experiment were product name, manufacturer, model number, item number (SKU), and price, all of which are generic across product types.
  • Table 2 provides data relating to the extraction from e-commerce sites.
  • the AutoFeed system correctly retrieved all the target values on DMTCS and missed only one article on JMLR.
  • AutoFeed retrieved approximately 90% of the values for JAIR.
  • the missed values were on pages that are clustered separately from the main cluster of detail pages.
  • On the EJC site there are no individual pages for articles, but all the information is still available from the table-of-contents pages (thus the large difference between the number of pages and the articles).
  • AutoFeed returned all the target values for the author and title fields, but also included some spurious values.
  • the results show 0 retrieved values because AutoFeed returned a longer field containing multiple links to different formats of the article.
  • an expert expresses its discovery by adding to the collection of hints a binary hint that indicates that two samples (either pages or text segments) are in the same cluster. For a given pair, the absence of a hint can mean either that the expert cannot make a decision about the pair or that it discovered that the pair should be in separate clusters. This ambiguity is a shortcoming of the first implementation and prevents experts from indicating that items should not be in the same cluster.
  • constraints are defined in the form of "must-link” or “cannot-link” pairs.
  • a must-link pair indicates that the pair of samples must be in the same cluster, a cannot-link indicates the opposite.
  • the "must-link” and “cannot-link” paradigm works well when the constraints are coming from an authoritative source, such as a human user, but not so well when they are generated using heuristics that can make errors.
  • the constraint language is extended so that constraints are assigned confidence scores. This allows an expert to output hints with varying levels of confidence. For example, if a particular type of structure indicates that two items may be similar but are not necessarily similar, the expert can express this by assigning a relatively lower level of confidence to the corresponding hint.
  • the clustering problem is represented as a Bayesian belief network. This approach is similar to the use of Bayesian networks in multi-sensor fusion (K. Toyama et al., Bayesian modality fusion: Probabilistic integration of multiple vision algorithms for head tracking, Proceedings of ACCV '00, Fourth Asian Conference on Computer Vision (2000), and Z. W. Kim et al., Expandable bayesian networks for 3d object description from multiple views and multiple mode inputs, IEEE Trans. Pattern Anal. Mach. Intell., 25(6):769-774 (2003)).
  • in multi-sensor fusion, the problem is determining the unknown state of the world from noisy and incomplete data from many sensors.
  • in clustering, the task is determining the unknown clustering from evidence collected by experts. The tasks are similar in that both sensors and experts give partial information about the hidden state. For example, a sensor might give a 2D image of a 3D scene and an expert can find clusters in only a subset of the samples.
  • the Bayesian network is structured based on the assumption that for a given state of the world, the data from the sensors is independent. This leads to the following network structure:
  • the variables representing the unknown state of the world are root nodes.
  • the variables representing the observed states of the sensors are descendants of these variables.
  • the unknown clustering of the samples replaces the unknown state of the world.
  • an additional layer of nodes is added to the network. This extra layer contains a node for every pair of samples in the problem. Each such "InSameCluster" node represents whether the pair of samples is in the same cluster or not. For a given clustering, the value of such a node is determined, and each node is conditionally independent of the others given the clustering node.
  • the experts in the clustering problem correspond to the sensors in the multi-sensor fusion problem, but sensors typically provide specific evidence whereas experts provide virtual evidence.
  • virtual evidence nodes also called dummy nodes by Pearl (J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, (1988)) are used without specifying the values they take.
  • Each such "VirtualEvidence" node is a child of an InSameCluster node and each InSameCluster node has as many child VirtualEvidence nodes as there are experts.
  • This network structure represents the notion that each expert will observe with some probability whether two samples are in the same cluster or not. The evidence can then be propagated from each virtual evidence node to its parent InSameCluster node. An example of this is shown in Fig. 6.
  • FIG. 6 shows the Bayesian network for clustering three samples a, b, and c, with evidence from two experts E1 and E2.
  • the domain of the root node C contains the five partitionings: {abc}, {a, bc}, {b, ac}, {c, ab}, and {a, b, c}.
  • the three nodes labeled L with a pair subscript (one for each of the pairs ab, ac, and bc) are the pairwise InSameCluster nodes.
  • the leaf nodes of the network are the VirtualEvidence nodes.
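The domain of the root clustering node is the set of all partitionings of the samples. The recursive generator below is a standard way to enumerate set partitions and is included only to show how quickly that domain grows; it is not taken from the patent.

```python
def set_partitions(items):
    """Yield every partitioning of items as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place the first item into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + partition

# Domain size of the root node for 3, 5, and 10 samples.
domain_sizes = {n: sum(1 for _ in set_partitions(range(n))) for n in (3, 5, 10)}
```

Three samples give the five partitionings listed above, while ten samples already give 115,975, which is why exhaustive evaluation of the root node quickly becomes impractical.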
  • Table 5 shows sample tables for the network in Figure 6.
  • the first is the probability table of the clustering node. For this node, all clusterings are assumed to be equally likely.
  • the third is the set of conditional probability tables of the VirtualEvidence nodes. A simple training approach is used to determine these conditional probability values.
  • These values represent the confidence an expert has on the hypothesis that two samples are in the same cluster. In general, it is desired that these probability values be dynamically computed.
  • the expert should examine a pair of samples before assigning a confidence score to the hypothesis that the samples are in the same cluster. Rather than having each expert directly compute these confidence scores, the process is divided into two steps. First, the expert assigns a similarity value to a given pair. In the second step, the similarity value is mapped into a confidence score. For example, an expert that computes edit-distance can assign the edit-distance as a similarity value and then this number can be mapped into a confidence score. The goal of the training process is to learn the mapping used in the second step.
  • the fact that the Bayesian network allows experts to assign confidence scores to evidence is especially useful for experts that naturally compute a similarity score between sample pairs. For example, an expert based on string edit-distance naturally computes a similarity score, and intuitively, this score should be related to the likelihood of the pair of samples being in the same cluster.
  • the Bayesian network allows experts to easily pass the expert-specific similarity scores into the network as probability values.
  • Another benefit of this model is that an expert can indicate that a sample pair should not be clustered together as easily as it can indicate that they should be clustered together.
  • a confidence score of 0.5 in an InSameCluster node shows that the expert has no evidence about whether the pair should be clustered in the same cluster or in different clusters. Scores less than 0.5 indicate that items should be in separate clusters and more so as the score approaches 0. Similarly, scores greater than 0.5 indicate that items should be in the same cluster and more so as the score approaches 1.
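One way to see how several confidence scores could combine at a single InSameCluster node is the odds-based sketch below. It is an assumption-laden simplification rather than the patented propagation: each expert's score s is treated as virtual evidence with likelihood ratio s / (1 - s), experts are taken to be conditionally independent, the node is considered in isolation, and the prior is assumed to be 0.5.

```python
def combine_confidences(scores, prior=0.5):
    """Posterior belief that a pair is in the same cluster.

    Assumption: each score s contributes a likelihood ratio s / (1 - s).
    """
    odds = prior / (1.0 - prior)
    for s in scores:
        s = min(max(s, 1e-9), 1.0 - 1e-9)  # keep the ratio finite at 0 and 1
        odds *= s / (1.0 - s)
    return odds / (1.0 + odds)

no_evidence = combine_confidences([0.5, 0.5])
one_opinion = combine_confidences([0.8, 0.5])
agreement = combine_confidences([0.8, 0.8])
conflict = combine_confidences([0.8, 0.2])
```

Under these assumptions a score of 0.5 leaves the belief unchanged, two agreeing experts reinforce each other, and an expert voting 0.2 exactly cancels one voting 0.8, which matches the interpretation of scores around 0.5 given above.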
  • the process is the same. Experts provide evidence to the network and the effects of the observed evidence are propagated so that ultimately the conditional distribution of the clustering variable is determined. If only one clustering needs to be chosen, then the most-likely one can be picked among all the clusterings as the solution.
  • the hierarchical structure of the Bayesian network for clustering leads to the propagation algorithm which follows. First, virtual evidence from the experts is collected. Next, for each pair of samples, the belief in the corresponding InSameCluster node is calculated by propagating belief from all the VirtualEvidence nodes. Finally, the belief in the root clustering node is computed by propagation from all the InSameCluster nodes. In practice, the final step of propagation is computationally intractable, as it involves assigning a probability to all the values in the domain of the root clustering node. This domain includes all possible clusterings of the samples, so its size is exponential in the number of samples.
  • the Bayesian network is still useful for finding the probability of a given clustering.
  • P(C c
  • the goal is to find the most likely value of the clustering variable after all the evidence has been propagated through the Bayesian network, but standard propagation algorithms are not adequate for this network. This is because propagation in a Bayesian network normally involves computing the probability distribution of a small number of variables, each of which has a small domain. In this case, even though interest is in the probability distribution of only one variable, the domain of this variable is exponentially large. The standard algorithms simply assign probabilities to all the values in this exponentially large domain resulting in an exponentially expensive computation.
  • the process of picking the most likely clustering can be viewed as a search problem.
  • the standard propagation algorithms do complete searches through an exponentially large search space in which each search state represents a clustering.
  • the algorithms visit the search states in arbitrary order and assign each a probability value.
  • the state that represents the clustering with the highest probability is the goal state.
  • other search techniques, especially those that don't traverse the complete search space, become alternatives to the standard propagation algorithms.
  • a greedy agglomerative clustering algorithm is used for the work presented here.
  • the greedy agglomerative clustering works in the following way.
  • the initial clustering consists of singleton clusters, one per sample.
  • a pair of clusters is merged to create a larger cluster until there is only one cluster left. This pair is chosen by considering all pairs of clusters and evaluating the set of edges connecting one sample from one cluster to another sample in the other cluster.
  • the score assigned to the pair of clusters is the product of the probabilities associated with these edges.
  • the pair with the highest score is merged at each step and the process is repeated.
  • As the greedy algorithm considers each clustering, it also calculates that clustering's probability by evaluating it within the Bayesian network. At the end of the merging process, the algorithm picks the clustering with the highest probability as the solution.
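The greedy merging procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the pairwise probabilities are given directly as a dictionary `w` keyed by unordered sample pairs, and the probability of a clustering is taken to be the unnormalized product of `w` for pairs kept together and `1 - w` for pairs split apart. The function names and the example weights are hypothetical.

```python
from itertools import combinations
from math import prod


def clustering_score(clusters, w):
    """Unnormalized probability: w for pairs kept together, 1 - w for pairs split."""
    label = {x: i for i, cluster in enumerate(clusters) for x in cluster}
    return prod(
        w[frozenset((x, y))] if label[x] == label[y] else 1.0 - w[frozenset((x, y))]
        for x, y in combinations(sorted(label), 2)
    )


def greedy_agglomerative(samples, w):
    # Start from singleton clusters, one per sample.
    clusters = [frozenset([s]) for s in samples]
    best, best_score = list(clusters), clustering_score(clusters, w)
    while len(clusters) > 1:
        # Score each pair of clusters by the product of the probabilities
        # on the edges that connect them, and merge the best pair.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: prod(w[frozenset((x, y))] for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        # Evaluate every clustering visited and remember the most probable one.
        score = clustering_score(clusters, w)
        if score > best_score:
            best, best_score = list(clusters), score
    return best, best_score


# Hypothetical edge weights for four samples a, b, c, d.
w = {frozenset(pair): 0.1 for pair in combinations("abcd", 2)}
w[frozenset("ab")] = 0.9
w[frozenset("cd")] = 0.8
best, best_score = greedy_agglomerative("abcd", w)
```

With these weights the search visits four clusterings (from four singletons down to one cluster) and keeps the one that groups a with b and c with d.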
  • Another way to view the search is as a graph-partitioning problem.
  • the nodes of the graph represent the samples and the weights $w_{xy}$ on the edges $e_{xy}$ represent the probability of the two ends of the edge being in the same cluster.
  • Finding the most likely clustering is equivalent to finding a partitioning of the graph such that the product of $\mathrm{score}(e_{xy})$ is maximized, where $\mathrm{score}(e_{xy})$ is $w_{xy}$ if nodes x and y are in the same partition and $1 - w_{xy}$ if they are in separate partitions.
  • This problem is similar to optimal graph-partitioning with weighted edges and is in general intractable (N. Bansal et al., Correlation clustering. Machine Learning, 56(1-3):89-113 (2004)).
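For a graph small enough to enumerate, the partitioning view can be checked by brute force. The sketch below lists every partitioning of four nodes, scores each one with the product of edge scores, and normalizes the results. The edge weights are hypothetical and are not the values from Fig. 7; the helper names are illustrative.

```python
from itertools import combinations
from math import prod


def partitions(items):
    """Yield every partitioning of items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing block in turn...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ...or into a new block of its own.
        yield [[first]] + smaller


def score(partition, w):
    """Product of w for edges inside a block and 1 - w for edges across blocks."""
    label = {x: i for i, block in enumerate(partition) for x in block}
    return prod(
        w[frozenset((x, y))] if label[x] == label[y] else 1.0 - w[frozenset((x, y))]
        for x, y in combinations(sorted(label), 2)
    )


def normalized_probabilities(nodes, w):
    scored = [(p, score(p, w)) for p in partitions(list(nodes))]
    total = sum(s for _, s in scored)
    return [(p, s / total) for p, s in scored]


# Hypothetical weights: a-b and c-d are likely pairs, all other pairs unlikely.
w = {frozenset(pair): 0.1 for pair in combinations("abcd", 2)}
w[frozenset("ab")] = 0.9
w[frozenset("cd")] = 0.8
table = normalized_probabilities("abcd", w)
most_likely = max(table, key=lambda entry: entry[1])[0]
```

Enumeration is only practical for a handful of nodes, which is consistent with the intractability noted above.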
  • the table in Fig. 7 shows the normalized probabilities of all the partitionings of the graph for the edge weights shown. For example, the unnormalized probability for the partitioning {ab}{cd} is computed with
  • any particular type of structure such as URLs, links, page layout, HTML structure or content, may or may not contain useful information about the page-type.
  • a successful approach has to be able to consider many types of structure as it clusters pages.
  • the basic approach here to page-clustering is to build experts for finding different types of structure. Each expert focuses on a particular structure and passes its discoveries as evidence into the Bayesian network. The clustering that has the highest probability, given the evidence, is then looked for.
  • Pages p1 and p2 list products from two different categories. Their URLs contain the category path (for example, Books/Nonfiction/Science/Computers). Pages p3 and p4 show detailed information about two products. Their URLs are identical except for the product id.
  • the URL expert might determine that pages p3 and p4 are likely to be in the same cluster because their URLs are so similar, but might not be able to find any evidence about whether or not any other pair is in the same cluster.
  • the page-layout expert might determine that p1 and p2 are likely to be in the same cluster, because each contains many text segments that are indented exactly the same amount, but not find any evidence about the other pairs.
  • pages p1 and p2 can be clustered together and pages p3 and p4 can be clustered together, but whether the cluster of p1 and p2 should be merged with the cluster of p3 and p4 still cannot be determined.
  • the content expert, on the other hand, might find that neither p1 nor p2 can be in the same cluster with p3 or p4 by computing a similarity metric such as cosine similarity between document vectors. If all the evidence is now combined, it can be determined with confidence that the best clustering is {p1, p2}, {p3,
  • In this implementation, we use the following experts: URL expert; template expert; page layout expert; table structure expert; and sibling pages expert.
  • the URL of a page is usually a good indicator of its page-type. Two pages that are of the same type will normally have similar URLs.
  • the URL expert computes the similarity of the URLs of two pages based on the length of the longest common subsequence of characters.
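A URL similarity based on the longest common subsequence of characters might look like the sketch below. The normalization by the longer URL's length is an assumption made here to obtain a score between 0 and 1; the source only states that the length of the longest common subsequence is used. The example URLs are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def url_similarity(u1: str, u2: str) -> float:
    """LCS length divided by the longer URL's length (assumed normalization)."""
    longest = max(len(u1), len(u2))
    return lcs_length(u1, u2) / longest if longest else 1.0


# Hypothetical URLs: two product pages and one category page.
product_1 = "http://shop.example/product?id=1001"
product_2 = "http://shop.example/product?id=2002"
category = "http://shop.example/Books/Nonfiction/Science/Computers"
same_type = url_similarity(product_1, product_2)
different_type = url_similarity(product_1, category)
```

In this example the two product URLs score much higher than the product and category pair, which is the kind of evidence the expert would pass on.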
  • Pages that contain the same type of data are usually generated by filling an HTML template with data values.
  • the template expert determines the similarity of two pages by comparing the longest common sequence of tokens to the length of the pages. The longer the sequence, the more likely the pages are to be in the same cluster.
  • the page layout expert analyzes the visual appearance of vertical columns on the page. To do this, it builds a histogram of the counts of HTML elements that are positioned at each x coordinate on the screen. The similarity of these histograms is a good indicator that the pages are of the same page-type.
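The histogram comparison can be sketched as follows, assuming the x coordinates of the rendered HTML elements have already been obtained from some layout engine (that step is not shown). Cosine similarity is used here as one plausible way to compare histograms; the source does not say which measure is used.

```python
from collections import Counter
from math import sqrt


def layout_histogram(x_coords):
    """Count how many HTML elements start at each x coordinate."""
    return Counter(x_coords)


def histogram_similarity(h1, h2):
    """Cosine similarity between two layout histograms (assumed measure)."""
    dot = sum(h1[x] * h2[x] for x in h1.keys() & h2.keys())
    norm = sqrt(sum(v * v for v in h1.values())) * sqrt(sum(v * v for v in h2.values()))
    return dot / norm if norm else 0.0


# Hypothetical x coordinates for three rendered pages.
page_a = layout_histogram([10, 10, 10, 120, 120, 300])
page_b = layout_histogram([10, 10, 120, 120, 120, 300])
page_c = layout_histogram([45, 45, 45, 45, 500])
similar = histogram_similarity(page_a, page_b)
dissimilar = histogram_similarity(page_a, page_c)
```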
  • the table-structure expert uses a heuristic to detect the similarity of such pages.
  • the heuristic is based on the observation that some HTML structures (for example, <table>, <ul>, etc.) are commonly used to represent lists of items. This expert first finds these HTML structures, then removes all but the first few rows from each such structure and finally compares the remaining text of the two input pages. If the texts are similar, this is a good indicator that the pages are of the same type.
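A minimal sketch of this heuristic, using Python's standard `html.parser` module, is shown below. The set of list-like tags, the number of rows kept, and the use of `difflib.SequenceMatcher` to compare the remaining text are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

LIST_TAGS = {"table", "ul", "ol"}  # structures assumed to hold lists of items
ROW_TAGS = {"tr", "li"}


class TruncatedText(HTMLParser):
    """Collect page text, keeping only the first few rows of each list structure."""

    def __init__(self, keep_rows=2):
        super().__init__()
        self.keep_rows = keep_rows
        self.row_counts = []  # one row counter per currently open list structure
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS:
            self.row_counts.append(0)
        elif tag in ROW_TAGS and self.row_counts:
            self.row_counts[-1] += 1

    def handle_endtag(self, tag):
        if tag in LIST_TAGS and self.row_counts:
            self.row_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        # Drop text that sits in a row beyond the first keep_rows rows.
        if text and all(n <= self.keep_rows for n in self.row_counts):
            self.chunks.append(text)


def truncated_text(html, keep_rows=2):
    parser = TruncatedText(keep_rows)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def table_structure_similarity(html_a, html_b, keep_rows=2):
    return SequenceMatcher(
        None, truncated_text(html_a, keep_rows), truncated_text(html_b, keep_rows)
    ).ratio()


# Two hypothetical list pages with very different numbers of items.
page_a = "<h1>Results</h1><ul>" + "".join(f"<li>Item A{i}</li>" for i in range(3)) + "</ul><p>Footer</p>"
page_b = "<h1>Results</h1><ul>" + "".join(f"<li>Item B{i}</li>" for i in range(10)) + "</ul><p>Footer</p>"
similarity = table_structure_similarity(page_a, page_b)
```

Truncating the lists keeps a page with three items and a page with ten items from looking different merely because of their length.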
  • the sibling pages expert relies on the observation that if a page contains a list of URLs, then the pages pointed to by these URLs, that is, the sibling pages, are likely to be of the same page-type. Thus, the expert first finds lists of URLs on individual pages and then generates hints indicating that the pages pointed to by these URLs are in the same page-cluster.
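Hint generation for sibling pages could be sketched as below, assuming the lists of URLs have already been located on each page. The hint representation (an unordered URL pair mapped to a confidence) and the confidence value of 0.9 are hypothetical choices, not values given in the source.

```python
from itertools import combinations


def sibling_hints(url_lists, confidence=0.9):
    """Emit one same-cluster hint for every pair of URLs found in the same list."""
    hints = {}
    for urls in url_lists:
        for u, v in combinations(sorted(set(urls)), 2):
            hints[frozenset((u, v))] = confidence
    return hints


# Hypothetical lists of links found on a category page.
lists_on_page = [
    ["/product?id=1", "/product?id=2", "/product?id=3"],
    ["/help", "/contact"],
]
hints = sibling_hints(lists_on_page)
```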
  • a collection of pages from on-line retail stores is used.
  • sets of pages from these sites were collected by both directed and random spidering, so that the sets include a variety of page-types, but also a page-type that contains a number of pages that give detailed information about a product.
  • the original dataset did not have page-type labels, so all the pages were manually labeled.
  • Clustering web-pages is an important step in unsupervised site-extraction and the results here can be directly used in that application, as described in previous work (K. Lerman et al., Using the structure of web sites for automatic segmentation of tables, SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 119-130 (2004)). Unfortunately, the problem of site-extraction is a relatively new problem and direct comparison with other approaches is difficult. Other clustering work on web-pages focuses on clustering pages returned in response to queries. The main goal of these approaches is to make navigation easier for users by grouping related or similar pages together. Thus, such web-page clustering approaches are not comparable to the approach here, which clusters pages to a finer grain.
  • the text-segment clustering algorithm is applied after the page-clusters have been determined. So the overall approach is as follows: first, find the page clusters; next, for each page cluster, determine the set of text segments; then, cluster the text segments.
  • Other approaches to finding the text-segment clusters are certainly possible.
  • text segments from all pages can be clustered. This allows simultaneous clustering of pages and text-segments and opens up the possibility of co-clustering, where decisions made in one problem can be used to improve decisions in the other.
  • text segments can be clustered while page segments are being clustered as in the first implementation. This allows one-way propagation of decisions from the text-segment clustering problem to the page clustering problem.
  • the first step in clustering text segments is to determine what the segments are.
  • the problem of finding text segments within HTML pages is similar to the problem of finding word or sentence boundaries in natural language text. Such boundaries are best determined if the structure of the data is understood, but to understand the structure of the data requires that the boundaries be found first. So, this type of problem is another what-comes-first problem.
  • the boundary issue is bypassed by utilizing the surface structure of the input. This is similar to assigning sentence boundaries to every period. This type of heuristic will miss some true sentence boundaries, as in "It works!", and generate some false positives, as in "Mr. Smith left."
  • On an HTML page there are two obvious ways of implementing a similar heuristic. One is to find words by assigning boundaries to spaces and punctuation, among others. This tends to generate segments that are usually too short for the purposes here, in that data fields of interest are rarely single words. For example, data fields such as product names, street addresses, article titles, or dates all consist of more than one word.
  • a second alternative is to use the HTML structure. The HTML structure already defines boundaries for segments of interest.
  • segments are text elements and links that appear as attributes of certain HTML tags.
  • text elements sometimes contain extra text in addition to the data of interest. For example, a text element can contain "Price: $19.99" even though the only element of interest would be 19.99.
  • the extra text can be removed through a post-processing step. So, in this implementation, the HTML structure is used to determine the boundaries of text segments.
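Segment extraction along these lines can be sketched with Python's standard `html.parser` module. The choice of `href` and `src` as the link-bearing attributes and the label-stripping rule in `strip_label` are hypothetical; the source does not specify which tags are used or how the post-processing step works.

```python
import re
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect text elements and link attributes as candidate text segments."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Assumed link-bearing attributes.
            if name in ("href", "src") and value:
                self.segments.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)


def extract_segments(html):
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


def strip_label(segment):
    """Hypothetical post-processing: drop a leading 'Label:' and currency sign."""
    return re.sub(r"^[A-Za-z ]+:\s*\$?", "", segment)


html = '<div><a href="/item/42">Blue Widget</a><span>Price: $19.99</span></div>'
segments = extract_segments(html)
cleaned = [strip_label(s) for s in segments]
```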
  • For text-segment clustering, the following types of experts are used: experts based on location of text-segment within a page; experts based on content; and experts based on context.
  • the goal in clustering text-segments is to group segments not only according to their content, but also according to the relational column from which they might have originated. For example, multiple occurrences of the same segment (for example, the same date value) may sometimes come from different relational columns and segments that look very different (for example, book titles) may come from the same relational column.
  • the following experts give clues that help with these types of decisions. They use the location within a page in which the segment is found as a clue to identify similarity.
  • Template slots experts find the common text between a pair of pages.
  • the segments that are in the same slot (the text following the same segment of common text) on each page are likely to be in the same cluster.
  • An example is shown in Fig. 8.
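One way to sketch a template slots expert is to align the segment sequences of two pages and treat the gaps between stretches of common text as slots. The example below uses `difflib.SequenceMatcher` for the alignment, which is an assumption; the segment lists are hypothetical and do not reproduce Fig. 8.

```python
from difflib import SequenceMatcher


def template_slot_hints(segs_a, segs_b):
    """Pair up segments that fall in the same slot between common text."""
    matcher = SequenceMatcher(None, segs_a, segs_b, autojunk=False)
    hints = []
    prev_a = prev_b = 0
    # get_matching_blocks() ends with a zero-length block at the end of both
    # sequences, so the slot after the last common segment is covered too.
    for a, b, size in matcher.get_matching_blocks():
        slot_a, slot_b = segs_a[prev_a:a], segs_b[prev_b:b]
        hints.extend(zip(slot_a, slot_b))
        prev_a, prev_b = a + size, b + size
    return hints


# Hypothetical segments from two pages generated by the same template.
page_a = ["Title:", "Dune", "Author:", "Frank Herbert", "Price:", "$9.99"]
page_b = ["Title:", "Emma", "Author:", "Jane Austen", "Price:", "$7.50"]
hints = template_slot_hints(page_a, page_b)
```

Each returned pair is a hint that the two segments belong in the same cluster, even when their contents look nothing alike.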
  • list slots experts find repeating patterns of HTML structure within each page. Like templates, patterns have slots in which data fields appear. The segments that appear in the same slot are likely to be in the same cluster.
  • a representation of a list slots expert is shown in Fig. 9.
  • Another expert based on the location of a text-segment within a page is the layout expert. This expert finds sets of text segments that have the same x-coordinate when the page is displayed on screen. The segments that are members of each such set are likely to be in the same cluster. Fig. 10 represents an example of a layout expert.
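Grouping segments by x-coordinate can be sketched in a few lines, assuming each segment arrives with the x-coordinate reported by a rendering engine (not shown). The data below is hypothetical and does not reproduce Fig. 10.

```python
from collections import defaultdict


def layout_groups(segments):
    """Group (text, x) pairs by x-coordinate; keep groups with at least two members."""
    by_x = defaultdict(list)
    for text, x in segments:
        by_x[x].append(text)
    return [group for group in by_x.values() if len(group) > 1]


# Hypothetical rendered segments: product names at x=40, prices at x=300.
rendered = [
    ("Blue Widget", 40), ("$19.99", 300),
    ("Red Widget", 40), ("$24.50", 300),
    ("Site map", 15),
]
groups = layout_groups(rendered)
```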
  • Experts based on content look for similarities within the content of text-segments. For example, in a string similarity expert, the basic indicator that two segments are in the same cluster or not is how similar they are in terms of content. If the contents of two segments are similar, then they are likely to be in the same cluster.
  • segments that contain the same type of data are likely to be in the same cluster. For example, if two text segments consist of a street address each, then they are likely to be in the same cluster even if the addresses are not similar at all.
  • the main factor in the time complexity of the present approach is the search.
  • the greedy search takes $O(n^3)$ time, where n is the number of samples, as there are n clusters to merge and each merge takes $n^2$ time to consider all pairs of clusters.
  • the actual running time of this system varied from a few minutes to half an hour on the datasets described here.
  • Unsupervised site-extraction is a challenging task that is becoming more relevant as the amount of data available on the web continues to increase rapidly.
  • the approach to the problem disclosed here includes, but is not limited to, combining multiple heterogeneous experts, each of which is capable of discovering a particular type of structure. Combining experts involves finding a global structure that is consistent with individual substructures found by the experts.
  • clustering provides the global structure.
  • the substructures found by experts are expressed as probabilistic constraints on the sample space.
  • the global structure, clustering in this case, allows heterogeneous experts to be combined in such a way that the collection of experts can discover structures that no one single expert can.
  • the various search techniques may be modified so that they can handle significantly larger numbers of hints and web sites. For example, an incremental approach may be used during which the system iteratively spiders and clusters pages so that it can cut off search for pages of the same page type. This allows the AutoFeed system to handle much larger sites.
  • the various methods and processes described herein may be implemented in a computer-readable medium using, for example, computer software, hardware, or some combination thereof.
  • the embodiments described herein may be performed by a processor, which may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described herein, or a selective combination thereof.
  • the embodiments described herein may be implemented with separate software modules, such as procedures, functions, and the like, each of which performs one or more of the functions and operations described herein.
  • the software codes can be implemented with a software application written in any suitable programming language and may be stored in memory, and executed by a processor.
  • Computer memory may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor using known communication techniques.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to one embodiment of the invention, data can be automatically extracted from semi-structured web sites. Unsupervised learning can be used to analyze web sites and discover their structure. One method of the invention uses a set of heterogeneous "experts", each expert being able to identify certain types of generic structure. Each expert represents its discoveries as "hints". Based on these hints, the system can cluster the pages and text segments and identify semi-structured data that can be extracted. To identify a good clustering of pages, a probabilistic model of the hint-generation process can be used.
EP06787271A 2005-07-15 2006-07-14 Methode et systeme pour extraire automatiquement des donnees a partir de sites web Withdrawn EP1910918A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US69951905P 2005-07-15 2005-07-15
PCT/US2006/027335 WO2007011714A2 (fr) 2005-07-15 2006-07-14 Methode et systeme pour extraire automatiquement des donnees a partir de sites web

Publications (1)

Publication Number Publication Date
EP1910918A2 true EP1910918A2 (fr) 2008-04-16

Family

ID=37669390

Family Applications (1)

Application Number Title Priority Date Filing Date
EP06787271A Withdrawn EP1910918A2 (fr) 2005-07-15 2006-07-14 Methode et systeme pour extraire automatiquement des donnees a partir de sites web

Country Status (3)

Country Link
EP (1) EP1910918A2 (fr)
CA (1) CA2614774A1 (fr)
WO (1) WO2007011714A2 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625719A (zh) * 2020-05-21 2020-09-04 四川九八村信息科技有限公司 一种单采血浆站的宣传渠道拓展系统及方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8983964B2 (en) 2006-03-30 2015-03-17 Geographic Solutions, Inc. System, method and apparatus for consolidating and searching educational opportunities
US20110314001A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Performing query expansion based upon statistical analysis of structured data
US10664530B2 (en) 2014-03-08 2020-05-26 Microsoft Technology Licensing, Llc Control of automated tasks executed over search engine results
CN112215385B (zh) * 2020-03-24 2024-03-19 北京桃花岛信息技术有限公司 一种基于贪婪选择策略的学生困难程度预测方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5525673B2 (ja) * 2000-09-28 2014-06-18 オラクル・インターナショナル・コーポレイション エンタープライズウェブマイニングシステム及び方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2007011714A2 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625719A (zh) * 2020-05-21 2020-09-04 四川九八村信息科技有限公司 一种单采血浆站的宣传渠道拓展系统及方法
CN111625719B (zh) * 2020-05-21 2023-06-13 四川九八村信息科技有限公司 一种单采血浆站的宣传渠道拓展系统及方法

Also Published As

Publication number Publication date
WO2007011714A3 (fr) 2007-10-04
CA2614774A1 (fr) 2007-01-25
WO2007011714A2 (fr) 2007-01-25
WO2007011714A9 (fr) 2007-03-08

Similar Documents

Publication Publication Date Title
US8117203B2 (en) Method and system for automatically extracting data from web sites
US20240135098A1 (en) Interactive concept editing in computer-human interactive learning
Chakrabarti et al. A graph-theoretic approach to webpage segmentation
TWI557664B (zh) Product information publishing method and device
JP5421737B2 (ja) コンピュータ実施方法
US9009134B2 (en) Named entity recognition in query
TWI424325B (zh) 使用有機物件資料模型來組織社群智慧資訊的系統及方法
US20100241639A1 (en) Apparatus and methods for concept-centric information extraction
Osman et al. Graph-based text representation and matching: A review of the state of the art and future challenges
CN114238573B (zh) 基于文本对抗样例的信息推送方法及装置
CN112989208B (zh) 一种信息推荐方法、装置、电子设备及存储介质
Mirończuk The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction
CN114201598B (zh) 文本推荐方法及文本推荐装置
JP2022035314A (ja) 情報処理装置及びプログラム
WO2007011714A2 (fr) Methode et systeme pour extraire automatiquement des donnees a partir de sites web
EP4172811A1 (fr) Système et procédé de détection automatique de zones d'intérêt de page web
CN109213830B (zh) 专业性技术文档的文档检索系统
Gazen et al. Autofeed: an unsupervised learning system for generating webfeeds
Yuan et al. Self-adaptive extracting academic entities from World Wide Web
US20230315744A1 (en) Ranking determination system, ranking determination method, and information storage medium
Mohammadi et al. Web Content Extraction by Weighing the Fundamental Contextual Rules
Dias Reverse engineering static content and dynamic behaviour of e-commerce websites for fun and profit
Aljabary Automatic Generic Web Information Extraction at Scale
Ghecenco et al. Extraction of Attributes and Values From Online Texts
Wu et al. Recommending Relevant Tutorial Fragments for API-Related Natural Language Questions

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20080124

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR

AX Request for extension of the european patent

Extension state: AL BA HR MK RS

RIN1 Information on inventor provided before grant (corrected)

Inventor name: MINTON, STEVEN, N.

Inventor name: GAZEN, BORA, C.

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20100202