WO2007110398A1 - Web search engine with new ranking method - Google Patents

Web search engine with new ranking method

Info

Publication number
WO2007110398A1
WO2007110398A1 PCT/EP2007/052839 EP2007052839W
Authority
WO
WIPO (PCT)
Prior art keywords
ranking
web
click
page
previous
Prior art date
Application number
PCT/EP2007/052839
Other languages
French (fr)
Inventor
Stefano Serra-Capizzano
Original Assignee
Stefano Serra-Capizzano
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stefano Serra-Capizzano
Publication of WO2007110398A1 publication Critical patent/WO2007110398A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

The invention relates to a method for ranking Web pages which is characterized by the fact that each dangling node is provided with a link to itself.

Description

Web search engine with new Ranking Method
Field of invention
The present invention relates to the ranking of Web pages with hierarchical/topological considerations.
State of the Art
Search engines, for instance implemented by GOOGLE, YAHOO, MICROSOFT or ASK JEEVES, are used by millions of users every day.
There are several models and different algorithms which may be used for Ranking Web pages. Most of them are based on the structure of the Web (i.e. the World Wide Web) as a huge graph, i.e. a collection of nodes (Web pages) and links between the nodes, and on the behaviour of the "generic" user. There is a connection from A to B if there is an outgoing link from the Web page A to the Web page B. In other words, from page A one can reach page B within one single "click".
When one looks for the Web pages containing a given word (e.g. "Ferrari"), hundreds and often thousands of pages are involved. This large number of pages to check could be a problem for the user and, in fact, the main feature of those algorithms is that Web pages appear in a precise order, the first being supposed to be "more important" than the second, the second "more important" than the third and so on. Informally speaking, a page is more important if it may be "clickable" from other (important) pages: the same concept appears in the Social Sciences, described and even personally interpreted by the PopArt Master Andy Warhol. In this specific case of the "Ferrari" search, the first page is the HomePage of the prestigious Car Maker Ferrari, the fourth is the Ferrari Club of America and so on.
It is self evident that this idea of making a Ranking of the Web pages in terms of importance is very useful (and has a strong commercial impact), since the generic user will find the desired information more easily, without reading every page containing the wanted information. Moreover, the owner of the "important" pages will have a kind of free advertisement. However, it is widely recognized that these engines sometimes do not produce the desired order. The related pathologies have been well described in Scientific Journals such as Internet Mathematics and in Workshops such as the World Wide Web Conferences on the subject. One example of an artificially constructed pathological Web graph occurs when there is a connection from a page A to a page B, a connection from a page C to page B and a connection from page B to page C, page A being linked from all of the remaining 10^9 pages, while the connections among these 10^9 pages are arbitrary.
The example below follows the classical GOOGLE model, as defined in particular in US 6,285,999:
A matrix G is defined with G(i,j) equal to 1/deg(i) if deg(i)>0 and there is a link from page i to page j, deg(i) being the number of outgoing links from page i; if deg(i)=0, i.e. i is a dangling node, then the entry G(i,j) is set equal to 1/n with n being the size of the Web. The resulting matrix is row stochastic and represents a surfing model for the behaviour on the Web of a generic user. Then the model is completed by adding the teleportation parameter c so that G[c] = cG + (1-c)ev^T, c belonging to [0,1], e being the vector of all ones, v being a nonnegative probability vector. The Ranking of page i is defined as the i-th entry of PageRank(c), where y = PageRank(c) is an n-sized nonnegative probability vector satisfying the relation Py = y with P being the transpose of G[c].
According to the classical approach, in the "ideal case" of c=1, page A has zero Ranking (as do the 10^9 pages in the first row) and the importance is concentrated in B and C (due to the ergodic projector, J. Lasserre, A formula for singular perturbations of Markov Chains, J. Appl. Prob., 31-2 (1994), pp. 829-833). This Ranking is highly counter-intuitive and indeed wrong: if you are a leader of 10^9 people, you are really powerful, no matter whether any of your followers has low Ranking, i.e. he/she is not important. Now suppose that C is deleted (i.e. that Web page is deleted) or that the pages B and C merge into a unique Web page named again B. Then the Ranking of B goes down dramatically and A becomes really the most important. Again, this sharp modification of the Ranking is highly counter-intuitive and indeed wrong. At the very least, one would expect that the cumulative Ranking of B and C before deletion of C and the Ranking of B alone after deletion of C should remain roughly constant (a sort of monotonicity, which is substantially violated by the current model). Strong and macroscopic evidence of the problems in the current model is that, for most of the nodes (what is called the "core" in the literature, see P. Boldi, M. Santini, S. Vigna, PageRank as a function of the damping factor, Proc. 14th International World Wide Web Conference (WWW 2005), Chiba, Japan 2005, ACM, pp. 557-566), the Ranking is zero. Indeed, only the use of values of the teleportation parameter c far away from 1 (e.g. 0.85) partly alleviates the problem (in Latin, "ex falso quod libet" is an expression capturing the fact that from something wrong anything can derive and, by coincidence, also good things).
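For concreteness, the following is a minimal sketch (not part of the original filing) of the classical model just described, written in Python/NumPy for a toy graph; the function names build_google_matrix and pagerank, the dense-matrix representation and the small example graph are illustrative assumptions only, since a Web-scale implementation would use sparse data structures.

```python
import numpy as np

def build_google_matrix(out_links, n, c=0.85, v=None):
    # Row-stochastic surfing matrix of the classical model: G[c] = c*G + (1-c)*e*v^T.
    if v is None:
        v = np.full(n, 1.0 / n)               # uniform personalization vector
    G = np.zeros((n, n))
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:                           # ordinary node: uniform over its outgoing links
            G[i, targets] = 1.0 / len(targets)
        else:                                 # dangling node: classical fix, uniform over the whole Web
            G[i, :] = 1.0 / n
    return c * G + (1.0 - c) * np.outer(np.ones(n), v)

def pagerank(out_links, n, c=0.85, v=None, iters=200):
    # Power method on P = G[c]^T; the 1-norm of the iterates is preserved.
    P = build_google_matrix(out_links, n, c, v).T
    y = np.full(n, 1.0 / n)
    for _ in range(iters):
        y = P @ y
    return y

# Toy version of the pathological graph (A = node 0): A -> B, C -> B, B -> C.
links = {0: [1], 1: [2], 2: [1]}
print(pagerank(links, 3))   # almost all of the mass ends up on B and C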
The basic error is the confusion between the notion of "importance" (PageRanking) and the stochastic model for surfing on the Web. We can identify two critical points:
A first critical point is the wrong treatment of dangling nodes: with the current model, there is no monotonicity, as the example of deletion of node C in the above graph shows.
A second critical point is that the transient effects are not taken into account. A user is on the Web for a finite number of clicks, at every visit. This implies that looking at the stationary vector (as the number of clicks tends to infinity) is just theoretical and far from reality.
Some algorithms have been developed in order to speed up the computation and/or to eliminate the problems cited above. Here we give a more detailed account of the existing background.
As usually recognized, a very precise way of Ranking a document in a given database is to determine its intrinsic importance based on the content of the document, of its parents or of the anchor text. This idea becomes impractical, and its application even impossible for computational reasons, when the size of the database is of the order of millions or billions, as happens in the case of the Web. The alternative approach is to define the Ranking according to extrinsic relations among nodes: this sort of idea is called link-based Ranking. Clearly the GOOGLE model described previously belongs to this class. The computation for c<1 is carried out by using the power method without normalization, since the 1-norm of the iterates is maintained if the initial guess is a nonnegative probability vector. As already reported, the value c=0.85 is chosen. Many techniques for accelerating the computation have been proposed, see e.g. WO 2004/088475 A2. They are based on extrapolation techniques, on block analysis of the matrix P, on the linear system representation and analysis of the PageRank vector, etc., and allow the vector PageRank(c) to be computed efficiently also for c larger than 0.85 (e.g. c=0.9 or 0.99).
Other ways of treating the problem consist 1) in determining the importance in two steps, first on a much denser and much smaller matrix representing the GOOGLE approach, where the Web pages are replaced by Hosts, and then treating the Web pages (see e.g. EP 1 653 380 A1), 2) in identifying several GOOGLE-type matrices related to different features and in considering a linear combination of these matrices (see US 2004/0111412 A1), in order to increase the precision of the Ranking by blending the influence of the different features, or 3) in considering the influence of the neighbouring pages (see EP 1 622 047 A2). In this latter approach the secondary (sub-dominant) eigenvector has to be computed, and this could be computationally critical. However, none of them addresses the model problems emphasized before, namely the wrong identification between the surfing model and a right notion of importance.
Similar link-based Ranking techniques disclosed in US 6,111,202 calculate a (partial) singular value decomposition of a GOOGLE-type matrix and determine the Rank of a given page i as the i-th component of the principal singular vector. The idea behind this is essentially that of Kleinberg, based on the dual concepts of Hub and Authority: in this type of model it is natural to make the Ranking by including features of the query (query-dependent Ranking). Although the idea is of great interest (see e.g. EP 1 596 315 A1, EP 1 643 385 A2, US 2002/0103798 A1), here the main goal is different, because we would like to define a kind of global Rank as in the GOOGLE model, independent of the query, for evaluating the importance of every Web page in terms of the topology of the underlying graph: this would give as a final result a kind of evaluation of a 'fair price' for buying or selling a space on a given Web page.
Description of the Invention
One objective of the invention is to incorporate the transient behaviour in order to differentiate our Ranking from the limit solution to the surfing model. Another objective of the invention is to provide any dangling node with a link to itself, or to itself and its parents, with a given distribution (in order to impose monotonicity, at least in a weak sense). A link to itself models a reload action and hence it could also be used for the other, non-dangling nodes. An improved ranking may be computed starting from the uniform vector u with components all equal to w(0)/n (with n being the size of the Web) and adding all the vectors w(j)P^j u, where P = G[1]^T is the transpose of the modified GOOGLE matrix with parameter c=1, j ranges from 0 to a reasonable number of clicks, and w(j)>0 is the j-th term of a sequence forming a convergent series. Here the modified GOOGLE matrix is that of the old model with c=1, modified by incorporating the new treatment of dangling nodes: moreover, the present invention is not limited to choosing a hyperlink at random within a page with uniform distribution; if statistical data are known about actual usage patterns, then that information can be included, since any arbitrary distribution describing the choice of the hyperlink can be considered.
Here the speed of decay of w(j) to zero, as j tends to infinity, can be used to decide whether to give more or less importance to the stationary limit distribution (the solution to the surfing model) or to the transient behaviour. Indeed, if I should choose a page on which to put the advertisement of a new product, I would prefer a page with high transient Ranking (transient i.e. for j moderate, e.g. at most 10, 15), because many people will have a chance of looking at it, instead of a page with low transient Ranking and high final Ranking (final i.e. as j tends to infinity). In fact no user will wait so long or, if he/she waits on the Web, then he/she will probably be terribly tired and unable to appreciate any commercial suggestion. This can motivate a first concrete proposal of w(0)=w(1)=...=w(k)=(p-1)/(pk), for a reasonably moderate k (e.g. k integer with k in the interval [7,20]), p belonging to [2,10], and the remaining w(j), j>k, such that w(k+1)+w(k+2)+...=1/p. In practice, for j larger than any reasonable number of clicks, dictated e.g. by the "physical resistance" of a generic user, we could set w(j)=0. Furthermore, since the Cesaro sum of the P^j u tends to a stationary distribution (as in the Google model) and this stationary distribution is the limit as the teleportation parameter c tends to 1 of PageRank(c), PageRank(c) being the classical PageRank, instead of the general condition w(k+1)+w(k+2)+...=1/p we can safely choose w(k+1)=1/p, w(m)=0 for every m larger than k+1, and the classical PageRank(1) instead of P^j u, for j=k+1. The choice of PageRank(1) is recommended for stabilizing the computation: indeed the sequence P^j u may fail to converge, while its Cesaro mean converges to the ergodic projector.
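A possible sketch of this transient Ranking follows, assuming that the modified matrix P (parameter c=1, with the new dangling-node treatment) is already available as a NumPy array; the names transient_rank and pagerank1 are illustrative assumptions, and the tail term is optional exactly as described above.

```python
import numpy as np

def transient_rank(P, k=10, p=4, pagerank1=None):
    # y = w(0)u + w(1)Pu + ... + w(k)P^k u, with w(0) = ... = w(k) = (p-1)/(p*k)
    # as in the first concrete proposal; the remaining weight 1/p is either
    # dropped (w(j) = 0 for j > k) or assigned to an approximation of PageRank(1).
    n = P.shape[0]
    u = np.full(n, 1.0 / n)          # uniform starting vector
    w = (p - 1.0) / (p * k)
    y = np.zeros(n)
    x = u.copy()
    for _ in range(k + 1):           # j = 0, ..., k
        y += w * x
        x = P @ x
    if pagerank1 is not None:
        y += (1.0 / p) * pagerank1   # tail term w(k+1) = 1/p times PageRank(1)
    return y
```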
A natural problem at this point is the management of SPAM pages. An interesting idea used in the previous model is based on a careful choice of the personalization vector v (see below): hence, as before, in the previous sum the uniform vector u is replaced by the personalization vector v. A second natural problem is the computation of PageRank(1), intended, by definition, as the limit as the teleportation parameter c tends to 1 of PageRank(c) with a generic personalization vector v.
In fact, from the analysis in S. Serra Capizzano, "Jordan canonical form of the Google matrix: a potential contribution to the PageRank computation", SIAM Journal on Matrix Analysis and Applications, Vol. 27, N. 2 (2005), pp. 305-312, and in R. Horn, S. Serra Capizzano, "A general setting for the parametric Google matrix", to appear in Internet Mathematics, we know that PageRank(c) is an analytic function of c on the complex plane, except for a finite number of points different from 1 outside the open unit disk. Therefore PageRank(1) can be approximated, just by continuity, by PageRank(c) with c close to 1 (0.9, 0.99): there is a lot of work by Golub and coauthors (using Arnoldi), Del Corso, Gulli', Romani (using the linear system representation and preconditioned GMRES), Brezinski and Redivo Zaglia (vector extrapolation based on explicit rational formulae of PageRank(c)), etc. for making such computations fast.
An appropriate choice of the involved parameters, based on experience, is also possible, with special reference to k, p and the weights w(j). Here is a first embodiment: a visit to page A will make A more important if it is longer; following this principle, the value w(j) could be decided as a monotone function of the average time a generic user spends between click j and click j+1 (see below). While the previous model tries to Rank the importance at the limit (the asymptotic stationary distribution, i.e. the solution to the surfing model), the present approach can be seen as a global Ranking, i.e. as a weighted integral of the Ranking over the discrete time decided by clicks on the Web. Of course, as already informally observed, the weights w(j), as in any weighted quadrature formula, decide where to put the attention for giving the final decision on the global Ranking.
Another healthy effect of the integral approach is the stabilization (typical of any Cesaro-like process). Indeed, considering again the above example, with the old model the Rankings of pages B and C oscillate. They exchange the first and second top positions at every j, and the difference between their Rankings is not negligible. Of course, the use of teleportation just alleviates the phenomenon, which is eliminated at the limit, but in practice it remains well visible. The averaging implied by the integral approach substantially reduces this effect, as any Cesaro-like process does: however, it should be noticed that a plain Cesaro approach would again put the emphasis only on the limit behaviour, since its representing matrix would converge to the spectral projector (see again J. Lasserre, A formula for singular perturbations of Markov Chains, J. Appl. Prob., 31-2 (1994), pp. 829-833, or Serra-Capizzano, The PageRanking problem: the model and the analysis, Proc. Dagstuhl Seminar on 'Web Information Retrieval and Linear Algebra Algorithms', Dagstuhl, Germany 2007).
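The oscillation and its damping can be seen on a two-page toy example (a hypothetical illustration, not part of the filing): the iterates P^j u alternate between the two pages, while a running Cesaro-like average settles near (1/2, 1/2).

```python
import numpy as np

# Two-page cycle B <-> C: the surfing iterates oscillate, the running mean does not.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x = np.array([1.0, 0.0])      # all of the mass initially on B
avg = np.zeros(2)
for j in range(1, 21):
    avg += (x - avg) / j      # incremental (Cesaro-like) mean of the iterates
    x = P @ x
print(x, avg)                 # x keeps flipping, avg is approximately [0.5, 0.5]
```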
Furthermore, let us give more details on the determination of the sequence w(j), based on experience. Suppose for a moment that the following information is available on all the visits on the Web for a certain window of observation (one week, for instance). Let click(0) be the cumulative number of first clicks, over all users and over all visits, on the Web (for opening Internet), let click(1) be the cumulative number of second clicks on the Web, and so on. If you are on the Web and you change Web page not by clicking but by writing the address explicitly, then this is counted as a restart, i.e. in the number click(0). Clearly, for any j>0, click(j-1)-click(j) is a nonnegative integer and represents the number of visits to the Web for which j-1 has been the last click before stopping the visit or starting a new one. Moreover, there exists only a finite number of indices j with nonzero click(j), due to the finiteness of the time interval and to the physical resistance of the generic user. Now, over these visits, compute statistics on every t_j, j>0: the time interval between click number j-1 and click j, if click j-1 is not the last click, and the time interval between click number j-1 and the exit, if click j-1 is the last. Let us denote by T(j) the expected value of t_j over our observations. Then, calling \gamma(j)=click(j)*(T(j)-T(j-1)), with T(-1)=0, and s(j) the sum of all \gamma(k) with k>j-1, our integral will be
y = F(P,v,w) = w(0)v + w(1)Pv + ... + w(k)P^k v + w(k+1)PageRank(1)    (**)
with w(j) = \gamma(j)/s(0), j = 0,...,k, and w(k+1) = s(k+1)/s(0).
In this way more influence is given to P^s v if the 'area' w(s) is maximal: w(j) may be viewed as the area of a rectangle whose base is the average time between click j and click j+1 and whose height is the quantity click(j). It is not excluded that the behaviour of such a sequence w(j) can be roughly approximated by a Poisson distribution with a given mean. Along the same lines, the personalization vector v can be described. It should be nonnegative and with unit 1-norm (just a matter of scaling). Moreover, v(j) should be set to zero if page j is recognized as SPAM, and for the other pages the value v(j) has to be proportional to the sum, over the visits having j as their first click, of the visit time.
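A sketch of how the weights and formula (**) could be evaluated from such observations is given below; the helper names click_weights and visibility_rank and the array-based interface are assumptions for illustration, while the click counts, the T(j) values, the matrix P, the personalization vector v and the PageRank(1) approximation are taken as given inputs.

```python
import numpy as np

def click_weights(clicks, T, k):
    # gamma(j) = click(j) * (T(j) - T(j-1)) with T(-1) = 0; s(j) = sum of gamma(m), m >= j;
    # w(j) = gamma(j)/s(0) for j = 0..k and the tail weight w(k+1) = s(k+1)/s(0).
    T = np.concatenate(([0.0], np.asarray(T, dtype=float)))
    gamma = np.asarray(clicks, dtype=float) * np.diff(T)
    s0 = gamma.sum()
    w = gamma[: k + 1] / s0
    w_tail = gamma[k + 1:].sum() / s0
    return w, w_tail

def visibility_rank(P, v, w, w_tail, pagerank1):
    # Formula (**): y = w(0)v + w(1)Pv + ... + w(k)P^k v + w(k+1)PageRank(1).
    y = np.zeros_like(v)
    x = v.copy()
    for wj in w:
        y += wj * x
        x = P @ x
    return y + w_tail * pagerank1
```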
Of course these parameters have to be estimated, but such information may be accessed by the leaders of the Web-search market (such as Google, Microsoft, Yahoo, etc.).
Finally, the latter statement suggests looking at the problem in a time-dependent and nonlinear way, since the Web evolves in time and the expected values of the various time intervals, i.e. T(j)-T(j-1), j=0, 1,..., also depend on the Ranking that we attribute to Web pages. A concrete proposal is the following: if y[t] denotes this new definition of the PageRank according to formula (**), then we define the new PageRank at time t+dt as
y[t+dt] = F(P[t+dt], z, w[t+dt]),   z = m y[t] + (1-m) v[t+dt],   0 <= m <= 1,
where P[t+dt] is the Web matrix at time t+dt, w[t+dt] is the vector of the weights at time t+dt, and where z is defined as a convex combination of v[t+dt] (the personalization vector defined as before, at time t+dt) and y[t], which carries the information on the PageRank at the older time t. The parameters of the convex combination can be interpreted as weights that measure the level of fidelity, which is based on the 'past importance'.
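A sketch of one update step of this time-dependent variant, reusing the hypothetical visibility_rank helper from the sketch above; the fidelity parameter m and all the inputs at time t+dt are assumed to be given.

```python
def update_rank(P_new, v_new, w_new, w_tail_new, pagerank1_new, y_old, m=0.5):
    # z = m*y[t] + (1-m)*v[t+dt]  (convex combination, 0 <= m <= 1),
    # then y[t+dt] = F(P[t+dt], z, w[t+dt]) evaluated via formula (**).
    z = m * y_old + (1.0 - m) * v_new
    return visibility_rank(P_new, z, w_new, w_tail_new, pagerank1_new)
```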
Conclusions and final remarks
In summary, two goals are achieved with the present invention: the actual efficiency (fast computation) is preserved, since the new computation involves at most two vectors which were already computed in the preceding model, and the old pathologies are removed without introducing new ones.
The new Ranking Method according to the invention may be called the VisibilityRank or the CommercialRank, since a query-independent measure is given of the 'fair value' of any Web page for deciding e.g. the cost of putting an advertisement on that page (as in the determination of the cost of renting a space for advertisement in a given place of a given street, square, etc.).
As a final remark, it is worth mentioning that the present invention and its related model could be of interest not only in Web Ranking, but also in political/social sciences, e.g. for Ranking who/what is influential and who/what is not (as an example, one could be interested in answering the following questions: is Bill Clinton's opinion really influential, and at which level? How to Rank immaterial forces such as a religious authority vs material forces such as economic/military powers?), in many aspects of marketing, for Ranking human resources, and for Ranking the importance of a paper and/or of a researcher by looking in scientific databases. Think of MathSciNet for Mathematicians, where a generic node is any paper in the database and a link from A to B is just a bibliographic reference to paper B in paper A. For evaluating the impact (i.e. the Ranking) of a paper, the very same model and the same procedure as described before could be applied to the related graph. For evaluating or Ranking a researcher (a very hot topic nowadays in several countries), it would be enough to modify the graph so that every single node is a researcher and a link from A to B means that researcher A has written at least one paper referring to at least one paper of researcher B: the links have to be weighted, and the related weights will be proportional to the number of such papers and will be properly normalized according to the number of authors in the referring papers of A and in the referred papers of B. The algorithm will be again the same, and again the same idea would work for Ranking researcher groups or Institutions such as Departments, Faculties, Universities (see the hierarchical approach in EP 1 653 380 A1). In addition, it is worth stressing that the described procedures for defining the graph and for computing the Ranking are unchanged in any homogeneous Scientific community.
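As an illustration of the researcher graph just described (a hypothetical sketch; the input format and the function name researcher_graph are assumptions), the edge weight from researcher A to researcher B can be accumulated over the references and normalized by the numbers of authors on both sides, after which the resulting weighted graph can be fed to the same Ranking procedure.

```python
from collections import defaultdict

def researcher_graph(papers):
    # papers: dict paper_id -> {"authors": [names], "refs": [cited paper_ids]}
    # Returns edge weights (A, B) -> weight, proportional to the number of referring
    # papers and normalized by the number of authors of the citing and cited papers.
    weight = defaultdict(float)
    for paper in papers.values():
        for ref in paper["refs"]:
            cited = papers.get(ref)
            if cited is None:
                continue
            share = 1.0 / (len(paper["authors"]) * len(cited["authors"]))
            for a in paper["authors"]:
                for b in cited["authors"]:
                    weight[(a, b)] += share
    return weight   # row-normalise per researcher to obtain the stochastic matrix used before
```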
Of course, for modelling such complex phenomena in a convincing way, it would be advisable to enrich the structure of the graph by adding more information to the nodes and/or to the edges (a meta-graph? ...). However, the essential basic idea for defining and computing the Ranking has to remain virtually the same.

Claims

Claims
1. A method for ranking Web pages characterized by the fact that each dangling node is provided with a link to itself.
2. A method according to claim 1 wherein a link is also provided to the dangling node's parent(s).
3. A method according to claim 1 or 2 wherein the ranking is computed as a nonnegative linear combination, with coefficients w(j), of P^j v, j>=0, v being the right personalization vector and P being a GOOGLE-like matrix with parameter c=1.
4. A method according to any one of the previous claims using the following formula (**):
y = F(P,v,w) = w(0)v + w(1)Pv + ... + w(k)P^k v + w(k+1)PageRank(1)
with w(j) = \gamma(j)/s(0), j = 0,...,k, and w(k+1) = s(k+1)/s(0), and wherein the sum of w(j)P^j v, j>k, is replaced by the unique term w(k+1)y(1), y(1) being the limit as c tends to 1 of the GOOGLE-like PageRank with said matrix P; y(1) being computed by approximation as y(0.99) or by any appropriate numerical technique; y(1) furthermore replacing P^j v, with j>k.
5. A method according to any one of the previous claims comprising a step which includes the determination of the average time interval between two successive clicks j and j+1 and of the average number of visits over all the users at click j: this is used for computing the weights w(j), j>=0, as in the steps before the application of said formula (**).
6. A method according to any one of the previous claims including a step for the determination of the entries v(s), s>=1, of the personalization vector v, according to the number of first visits, i.e. at click j=0.
Use of the method as defined in any one of the previous claims as a time-dependent model for taking into account the rapid change of the Web and for determining the ranking as a compromise between the ranking at the present time t and the ranking in the past.
PCT/EP2007/052839 2006-03-24 2007-03-24 Web search engine with new ranking method WO2007110398A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP2006061030 2006-03-24
EPPCT/EP2006/061030 2006-03-24

Publications (1)

Publication Number Publication Date
WO2007110398A1 true WO2007110398A1 (en) 2007-10-04

Family

ID=38179862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2007/052839 WO2007110398A1 (en) 2006-03-24 2007-03-24 Web search engine with new ranking method

Country Status (1)

Country Link
WO (1) WO2007110398A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050256860A1 (en) * 2004-05-15 2005-11-17 International Business Machines Corporation System and method for ranking nodes in a network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEL CORSO G M ET AL: "Exploiting Web Matrix Permutations to Speedup PageRank Computation", TECHNICAL REPORT, May 2004 (2004-05-01), University of Pisa, Italy, XP002440050, Retrieved from the Internet <URL:http://www.di.unipi.it/~gulli/papers/itt/exploiting_web_matrix_tech_report-iit.pdf> [retrieved on 20070629] *
EIRON N ET AL: "Ranking the Web Frontier", PROCEEDINGS OF THE 13TH CONFERENCE ON WORLD WIDE WEB, 17 May 2004 (2004-05-17) - 22 May 2004 (2004-05-22), New York, NY, USA, pages 309 - 318, XP002440049, Retrieved from the Internet <URL:http://www2004.org/proceedings/docs/1p309.pdf> [retrieved on 20070629] *

Similar Documents

Publication Publication Date Title
Chen et al. Real-time topic-aware influence maximization using preprocessing
Leskovec et al. Empirical comparison of algorithms for network community detection
EP1304627B1 (en) Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects
Tong et al. Random walk with restart: fast solutions and applications
Tong et al. Proximity tracking on time-evolving bipartite graphs
CN100470544C (en) Method, equipment and system for chaiming file
Mitrović et al. Quantitative analysis of bloggers’ collective behavior powered by emotions
CN106886579A (en) Real-time streaming textual hierarchy monitoring method and device
US8639703B2 (en) Dual web graph
Huang et al. A link analysis approach to recommendation under sparse data
Baek et al. Efficiently mining erasable stream patterns for intelligent systems over uncertain data
Zhan et al. Fast incremental pagerank on dynamic networks
Ding et al. User modeling for personalized Web search with self‐organizing map
Cicone et al. Google PageRanking problem: the model and the analysis
Srivastava et al. Discussion on damping factor value in PageRank computation
Ustinovskiy et al. An optimization framework for weighting implicit relevance labels for personalized web search
CN105956012B (en) Database schema abstract method based on figure partition strategy
WO2007110398A1 (en) Web search engine with new ranking method
Ngomo et al. Holistic and scalable ranking of RDF data
Rajput et al. A study and comparative analysis of web personalization techniques
Jain et al. Study and analysis of category based PageRank method
De Knijf FAT-miner: Mining frequent attribute trees
Oshino et al. Time graph pattern mining for Web analysis and information retrieval
Misaghian et al. Resource recommender system based on tag and time for social tagging system
Zhang et al. Using Web clustering for Web communities mining and analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07727313

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07727313

Country of ref document: EP

Kind code of ref document: A1