WO2007110398A1 - Web search engine with new ranking method - Google Patents

Web search engine with new ranking method

Info

Publication number
WO2007110398A1
WO2007110398A1 PCT/EP2007/052839 EP2007052839W
Authority
WO
WIPO (PCT)
Prior art keywords
ranking
web
click
page
previous
Prior art date
Application number
PCT/EP2007/052839
Other languages
French (fr)
Inventor
Stefano Serra-Capizzano
Original Assignee
Stefano Serra-Capizzano
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Stefano Serra-Capizzano
Publication of WO2007110398A1 publication Critical patent/WO2007110398A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

The invention relates to a method for ranking Web pages which is characterized by the fact that each dangling node is provided with a link to itself.

Description

Web search engine with new Ranking Method
Field of invention
The present invention relates to the ranking of Web pages with hierarchical/topological considerations.
State of the Art
Search engines, for instance implemented by GOOGLE, YAHOO, MICROSOFT or ASK JEEVES, are used by millions of users every day.
There are several models and different algorithms which may be used for Ranking Web pages. Most of them are based on the structure of the Web (i.e. the World Wide Web) as a huge graph, i.e. a collection of nodes (Web pages) and links between the nodes, and on the behaviour of the "generic" user. There is a connection from A to B if there is an outgoing link from the Web page A to the Web page B. In other words, from page A one can reach page B within one single "click".
When one looks for the Web pages containing a given word (e.g. "Ferrari"), hundreds and often thousands of pages are involved. This large number of pages to check could be a problem for the user and, in fact, the main feature of those algorithms is that Web pages appear in a precise order, the first being supposed to be "more important" than the second, the second "more important" than the third and so on. Informally speaking, a page is more important if it may be "clickable" from other (important) pages: the same concept appears in the Social Sciences, described and even personally interpreted by the PopArt Master Andy Warhol. In this specific case of the "Ferrari" search, the first page is the HomePage of the prestigious Car Maker Ferrari, the fourth is the Ferrari Club of America and so on.
It is self evident that this idea of making a Ranking of the Web pages in terms of importance is very useful (and has a strong commercial impact), since the generic user will find the desired information more easily, without reading every page containing the wanted information. Moreover, the owner of the "important" pages will have a kind of free advertisement. However, it is widely recognized that these engines sometimes do not produce the desired order. The related pathologies have been well described in Scientific Journals such as Internet Mathematics and in Workshops such as the World Wide Web Conferences on the subject. One example of an artificially constructed pathological Web graph occurs when there is a connection from a page A to a page B, a connection from a page C to page B and a connection from page B to page C, page A being linked from all of the remaining 10^9 pages, while the connections among these 10^9 pages are arbitrary.
The example below follows the classical GOOGLE model, as defined in particular in US 6,285,999:
A matrix G is defined with G(i,j) equal to 1/deg(i) if deg(i)>0 and there is a link from page i to page j, deg(i) being the number of outgoing links from page i; if deg(i)=0, i.e. i is a dangling node, then the entry G(i,j) is set equal to 1/n with n being the size of the Web. The resulting matrix is row stochastic and represents a surfing model for the behaviour on the Web of a generic user. Then the model is completed by adding the teleportation parameter c so that G[c] = cG + (1-c)ev^T, c belonging to [0,1], e being the vector of all ones, v being a nonnegative probability vector. The Ranking of page i is defined as the i-th entry of PageRank(c), where y = PageRank(c) is an n-sized nonnegative probability vector satisfying the relation Py = y with P being the transpose of G[c].
According to the classical approach, in the "ideal case" of c=1, page A has zero Ranking (as do the 10^9 pages in the first row) and the importance is concentrated in B and C (due to the ergodic projector, J. Lasserre, A formula for singular perturbations of Markov Chains, J. Appl. Prob., 31-2 (1994), pp. 829-833). This Ranking is highly counter-intuitive and indeed wrong: if you are a leader of 10^9 people, you are really powerful, no matter whether any of your followers has low Ranking, i.e. he/she is not important. Now suppose that C is deleted (i.e. that Web page is deleted) or that the pages B and C merge into a unique Web page named again B. Then the Ranking of B goes down dramatically and A becomes really the most important. Again, this sharp modification of the Ranking is highly counter-intuitive and indeed wrong. At the very least, one would expect that the cumulative Ranking of B and C before deletion of C and the Ranking of B alone after deletion of C should remain roughly constant (a sort of monotonicity, which is substantially violated by the current model). Strong and macroscopic evidence of the problems in the current model is that, for most of the nodes (what is called the "core" in the literature, see P. Boldi, M. Santini, S. Vigna, PageRank as a function of the damping factor, Proc. 14th International World Wide Web Conference (WWW 2005), Chiba, Japan 2005, ACM, pp. 557-566), the Ranking is zero. Indeed, only the use of values of the teleportation parameter c far away from 1 (e.g. 0.85) partly alleviates the problem (in Latin, "ex falso quod libet" is an expression capturing the fact that from something wrong anything can derive and, by coincidence, also good things).
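For concreteness, the following is a minimal sketch (not part of the original filing) of the classical model just described, written in Python/NumPy for a toy graph; the function names build_google_matrix and pagerank, the dense-matrix representation and the small example graph are illustrative assumptions only, since a Web-scale implementation would use sparse data structures.

```python
import numpy as np

def build_google_matrix(out_links, n, c=0.85, v=None):
    # Row-stochastic surfing matrix of the classical model: G[c] = c*G + (1-c)*e*v^T.
    if v is None:
        v = np.full(n, 1.0 / n)               # uniform personalization vector
    G = np.zeros((n, n))
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:                           # ordinary node: uniform over its outgoing links
            G[i, targets] = 1.0 / len(targets)
        else:                                 # dangling node: classical fix, uniform over the whole Web
            G[i, :] = 1.0 / n
    return c * G + (1.0 - c) * np.outer(np.ones(n), v)

def pagerank(out_links, n, c=0.85, v=None, iters=200):
    # Power method on P = G[c]^T; the 1-norm of the iterates is preserved.
    P = build_google_matrix(out_links, n, c, v).T
    y = np.full(n, 1.0 / n)
    for _ in range(iters):
        y = P @ y
    return y

# Toy version of the pathological graph (A = node 0): A -> B, C -> B, B -> C.
links = {0: [1], 1: [2], 2: [1]}
print(pagerank(links, 3))   # almost all of the mass ends up on B and C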
The basic error is the confusion between the notion of "importance" (PageRanking) and the stochastic model for surfing on the Web. We can identify two critical points:
A first critical point is the wrong treatment of dangling nodes: with the current model, there is no monotonicity, as the example of deletion of node C in the above graph shows.
A second critical point is that the transient effects are not taken into account. A user is on the Web for a finite number of clicks, at every visit. This implies that looking at the stationary vector (as the number of clicks tends to infinity) is just theoretical and far from reality.
Some algorithms have been developed in order to speed up the computation and/or to eliminate the problems cited above. Here we give a more detailed account of the existing background.
As usually recognized, a very precise way of Ranking a document in a given database is to determine its intrinsic importance based on the content of the document, of its parents or of the anchor text. This idea becomes impractical, and its application even impossible for computational reasons, when the size of the database is of the order of millions or billions, as happens in the case of the Web. The alternative approach is to define the Ranking according to extrinsic relations among nodes: this sort of idea is called link-based Ranking. Clearly the GOOGLE model described previously belongs to this class. The computation for c<1 is carried out by using the power method without normalization, since the 1-norm of the iterates is maintained if the initial guess is a nonnegative probability vector. As already reported, the value c=0.85 is chosen. Many techniques for accelerating the computation have been proposed, see e.g. WO 2004/088475 A2. They are based on extrapolation techniques, on block analysis of the matrix P, on the linear system representation and analysis of the PageRank vector, etc., and allow the vector PageRank(c) to be computed efficiently also for c larger than 0.85 (e.g. c=0.9 or 0.99).
Other ways of treating the problem consist 1) in determining the importance in two steps, first on a much denser and much smaller matrix representing the GOOGLE approach, where the Web pages are replaced by Hosts, and then treating the Web pages (see e.g. EP 1 653 380 A1), 2) in identifying several GOOGLE-type matrices related to different features and in considering a linear combination of these matrices (see US 2004/0111412 A1), in order to increase the precision of the Ranking by blending the influence of the different features, or 3) in considering the influence of the neighbouring pages (see EP 1 622 047 A2). In this latter approach the secondary (sub-dominant) eigenvector has to be computed, and this could be computationally critical. However, none of them addresses the model problems emphasized before, namely the wrong identification between the surfing model and a right notion of importance.
Similar link-based Ranking techniques disclosed in US 6,111,202 calculate a (partial) singular value decomposition of a GOOGLE-type matrix and determine the Rank of a given page i as the i-th component of the principal singular vector. The idea behind this is essentially that of Kleinberg, based on the dual concepts of Hub and Authority: in this type of model it is natural to make the Ranking by including features of the query (query-dependent Ranking). Although the idea is of great interest (see e.g. EP 1 596 315 A1, EP 1 643 385 A2, US 2002/0103798 A1), here the main goal is different, because we would like to define a kind of global Rank as in the GOOGLE model, independent of the query, for evaluating the importance of every Web page in terms of the topology of the underlying graph: this would give as a final result a kind of evaluation of a 'fair price' for buying or selling a space on a given Web page.
Description of the Invention
One objective of the invention is to incorporate the transient behaviour in order to differentiate our Ranking from the limit solution to the surfing model. Another objective of the invention is to provide any dangling node with a link to itself, or to itself and its parents, with a given distribution (in order to impose monotonicity, at least in a weak sense). A link to itself models a reload action and hence it could also be used for the other, non-dangling nodes. An improved ranking may be computed starting from the uniform vector u with components all equal to w(0)/n (with n being the size of the Web) and adding all the vectors w(j)P^j u, where P = G[1]^T is the transpose of the modified GOOGLE matrix with parameter c=1, j ranges from 0 to a reasonable number of clicks, and w(j)>0 is the j-th term of a sequence forming a convergent series. Here the modified GOOGLE matrix is that of the old model with c=1, modified by incorporating the new treatment of dangling nodes: moreover, the present invention is not limited to choosing a hyperlink at random within a page with uniform distribution; if statistical data are known about actual usage patterns, then that information can be included, since any arbitrary distribution describing the choice of the hyperlink can be considered.
Here the speed of decay of w(j) to zero, as j tends to infinity, can be used to decide whether to give more or less importance to the stationary limit distribution (the solution to the surfing model) or to the transient behaviour. Indeed, if I should choose a page on which to put the advertisement of a new product, I would prefer a page with high transient Ranking (transient i.e. for j moderate, e.g. at most 10, 15), because many people will have a chance of looking at it, instead of a page with low transient Ranking and high final Ranking (final i.e. as j tends to infinity). In fact no user will wait so long or, if he/she waits on the Web, then he/she will probably be terribly tired and unable to appreciate any commercial suggestion. This can motivate a first concrete proposal of w(0)=w(1)=...=w(k)=(p-1)/(pk), for a reasonably moderate k (e.g. k integer with k in the interval [7,20]), p belonging to [2,10], and the remaining w(j), j>k, such that w(k+1)+w(k+2)+...=1/p. In practice, for j larger than any reasonable number of clicks, dictated e.g. by the "physical resistance" of a generic user, we could set w(j)=0. Furthermore, since the Cesaro sum of the P^j u tends to a stationary distribution (as in the Google model) and this stationary distribution is the limit as the teleportation parameter c tends to 1 of PageRank(c), PageRank(c) being the classical PageRank, instead of the general condition w(k+1)+w(k+2)+...=1/p we can safely choose w(k+1)=1/p, w(m)=0 for every m larger than k+1, and the classical PageRank(1) instead of P^j u, for j=k+1. The choice of PageRank(1) is recommended for stabilizing the computation: indeed the sequence P^j u may fail to converge, while its Cesaro mean converges to the ergodic projector.
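A possible sketch of this transient Ranking follows, assuming that the modified matrix P (parameter c=1, with the new dangling-node treatment) is already available as a NumPy array; the names transient_rank and pagerank1 are illustrative assumptions, and the tail term is optional exactly as described above.

```python
import numpy as np

def transient_rank(P, k=10, p=4, pagerank1=None):
    # y = w(0)u + w(1)Pu + ... + w(k)P^k u, with w(0) = ... = w(k) = (p-1)/(p*k)
    # as in the first concrete proposal; the remaining weight 1/p is either
    # dropped (w(j) = 0 for j > k) or assigned to an approximation of PageRank(1).
    n = P.shape[0]
    u = np.full(n, 1.0 / n)          # uniform starting vector
    w = (p - 1.0) / (p * k)
    y = np.zeros(n)
    x = u.copy()
    for _ in range(k + 1):           # j = 0, ..., k
        y += w * x
        x = P @ x
    if pagerank1 is not None:
        y += (1.0 / p) * pagerank1   # tail term w(k+1) = 1/p times PageRank(1)
    return y
```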
A natural problem at this point is the management of SPAM pages. An interesting idea used in the previous model is based on a careful choice of the personalization vector v (see below): hence, as before, in the previous sum the uniform vector u is replaced by the personalization vector v. A second natural problem is the computation of PageRank(1), intended, by definition, as the limit as the teleportation parameter c tends to 1 of PageRank(c) with a generic personalization vector v.
In fact, from the analysis in S. Serra Capizzano, "Jordan canonical form of the Google matrix: a potential contribution to the PageRank computation", SIAM Journal on Matrix Analysis and Applications, Vol. 27, N. 2 (2005), pp. 305-312, and in R. Horn, S. Serra Capizzano, "A general setting for the parametric Google matrix", to appear in Internet Mathematics, we know that PageRank(c) is an analytic function of c on the complex plane, except for a finite number of points different from 1 outside the open unit disk. Therefore PageRank(1) can be approximated, just by continuity, by PageRank(c) with c close to 1 (0.9, 0.99): there is a lot of work by Golub and coauthors (using Arnoldi), Del Corso, Gulli', Romani (using the linear system representation and preconditioned GMRES), Brezinski and Redivo Zaglia (vector extrapolation based on explicit rational formulae of PageRank(c)), etc. for making such computations fast.
An appropriate choice of the involved parameters, based on experience, is also possible, with special reference to k, p and the weights w(j). Here is a first embodiment: a visit to page A will make A more important if it is longer; following this principle, the value w(j) could be decided as a monotone function of the average time a generic user spends between click j and click j+1 (see below). While the previous model tries to Rank the importance at the limit (the asymptotic stationary distribution, i.e. the solution to the surfing model), the present approach can be seen as a global Ranking, i.e. as a weighted integral of the Ranking over the discrete time decided by clicks on the Web. Of course, as already informally observed, the weights w(j), as in any weighted quadrature formula, decide where to put the attention for giving the final decision on the global Ranking.
Another healthy effect of the integral approach is the stabilization (typical of any Cesaro-like process). Indeed, considering again the above example, with the old model the Rankings of pages B and C oscillate. They exchange the first and second top positions at every j, and the difference between their Rankings is not negligible. Of course, the use of teleportation just alleviates the phenomenon, which is eliminated at the limit, but in practice it remains well visible. The averaging implied by the integral approach substantially reduces this effect, as any Cesaro-like process does: however, it should be noticed that a plain Cesaro approach would again put the emphasis only on the limit behaviour, since its representing matrix would converge to the spectral projector (see again J. Lasserre, A formula for singular perturbations of Markov Chains, J. Appl. Prob., 31-2 (1994), pp. 829-833, or Serra-Capizzano, The PageRanking problem: the model and the analysis, Proc. Dagstuhl Seminar on 'Web Information Retrieval and Linear Algebra Algorithms', Dagstuhl, Germany 2007).
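The oscillation and its damping can be seen on a two-page toy example (a hypothetical illustration, not part of the filing): the iterates P^j u alternate between the two pages, while a running Cesaro-like average settles near (1/2, 1/2).

```python
import numpy as np

# Two-page cycle B <-> C: the surfing iterates oscillate, the running mean does not.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
x = np.array([1.0, 0.0])      # all of the mass initially on B
avg = np.zeros(2)
for j in range(1, 21):
    avg += (x - avg) / j      # incremental (Cesaro-like) mean of the iterates
    x = P @ x
print(x, avg)                 # x keeps flipping, avg is approximately [0.5, 0.5]
```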
Furthermore, let us give more details on the determination of the sequence w(j), based on experience. Suppose for a moment that the following information is available on all the visits on the Web for a certain window of observation (one week, for instance). Let click(0) be the cumulative number of first clicks, over all users and over all visits, on the Web (for opening Internet), let click(1) be the cumulative number of second clicks on the Web, and so on. If you are on the Web and you change Web page not by clicking but by writing the address explicitly, then this is counted as a restart, i.e. in the number click(0). Clearly, for any j>0, click(j-1)-click(j) is a nonnegative integer and represents the number of visits to the Web for which j-1 has been the last click before stopping the visit or starting a new one. Moreover, there exists only a finite number of indices j with nonzero click(j), due to the finiteness of the time interval and to the physical resistance of the generic user. Now, over these visits, compute statistics on every t_j, j>0: the time interval between click number j-1 and click j, if click j-1 is not the last click, and the time interval between click number j-1 and the exit, if click j-1 is the last. Let us denote by T(j) the expected value of t_j over our observations. Then, calling \gamma(j)=click(j)*(T(j)-T(j-1)), with T(-1)=0, and s(j) the sum of all \gamma(k) with k>j-1, our integral will be
y = F(P,v,w) = w(0)v + w(1)Pv + ... + w(k)P^k v + w(k+1)PageRank(1)    (**)
with w(j) = \gamma(j)/s(0), j = 0,...,k, and w(k+1) = s(k+1)/s(0).
In this way more influence is given to P^s v if the 'area' w(s) is maximal: w(j) may be viewed as the area of a rectangle whose base is the average time between click j and click j+1 and whose height is the quantity click(j). It is not excluded that the behaviour of such a sequence w(j) can be roughly approximated by a Poisson distribution with a given mean. Along the same lines, the personalization vector v can be described. It should be nonnegative and with unit 1-norm (just a matter of scaling). Moreover, v(j) should be set to zero if page j is recognized as SPAM, and for the other pages the value v(j) has to be proportional to the sum, over the visits having j as their first click, of the visit time.
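A sketch of how the weights and formula (**) could be evaluated from such observations is given below; the helper names click_weights and visibility_rank and the array-based interface are assumptions for illustration, while the click counts, the T(j) values, the matrix P, the personalization vector v and the PageRank(1) approximation are taken as given inputs.

```python
import numpy as np

def click_weights(clicks, T, k):
    # gamma(j) = click(j) * (T(j) - T(j-1)) with T(-1) = 0; s(j) = sum of gamma(m), m >= j;
    # w(j) = gamma(j)/s(0) for j = 0..k and the tail weight w(k+1) = s(k+1)/s(0).
    T = np.concatenate(([0.0], np.asarray(T, dtype=float)))
    gamma = np.asarray(clicks, dtype=float) * np.diff(T)
    s0 = gamma.sum()
    w = gamma[: k + 1] / s0
    w_tail = gamma[k + 1:].sum() / s0
    return w, w_tail

def visibility_rank(P, v, w, w_tail, pagerank1):
    # Formula (**): y = w(0)v + w(1)Pv + ... + w(k)P^k v + w(k+1)PageRank(1).
    y = np.zeros_like(v)
    x = v.copy()
    for wj in w:
        y += wj * x
        x = P @ x
    return y + w_tail * pagerank1
```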
Of course these parameters have to be estimated, but such information may be accessed by the leaders of the Web-search market (such as Google, Microsoft, Yahoo, etc.).
Finally, the latter statement suggests looking at the problem in a time-dependent and nonlinear way, since the Web evolves in time and the expected values of the various time intervals, i.e. T(j)-T(j-1), j=0, 1,..., also depend on the Ranking that we attribute to Web pages. A concrete proposal is the following: if y[t] denotes this new definition of the PageRank according to formula (**), then we define the new PageRank at time t+dt as
y[t+dt] = F(P[t+dt], z, w[t+dt]),   z = m y[t] + (1-m) v[t+dt],   0 <= m <= 1,
where P[t+dt] is the Web matrix at time t+dt, w[t+dt] is the vector of the weights at time t+dt, and where z is defined as a convex combination of v[t+dt] (the personalization vector defined as before, at time t+dt) and y[t], which carries the information on the PageRank at the older time t. The parameters of the convex combination can be interpreted as weights that measure the level of fidelity, which is based on the 'past importance'.
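A sketch of one update step of this time-dependent variant, reusing the hypothetical visibility_rank helper from the sketch above; the fidelity parameter m and all the inputs at time t+dt are assumed to be given.

```python
def update_rank(P_new, v_new, w_new, w_tail_new, pagerank1_new, y_old, m=0.5):
    # z = m*y[t] + (1-m)*v[t+dt]  (convex combination, 0 <= m <= 1),
    # then y[t+dt] = F(P[t+dt], z, w[t+dt]) evaluated via formula (**).
    z = m * y_old + (1.0 - m) * v_new
    return visibility_rank(P_new, z, w_new, w_tail_new, pagerank1_new)
```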
Conclusions and final remarks
In summary, two goals are achieved with the present invention: the actual efficiency (fast computation) is preserved, since the new computation involves at most two vectors which were already computed in the preceding model, and the old pathologies are removed without introducing new ones.
The new Ranking Method according to the invention may be called the VisibilityRank or the CommercialRank, since a query-independent measure is given of the 'fair value' of any Web page for deciding e.g. the cost of putting an advertisement on that page (as in the determination of the cost of renting a space for advertisement in a given place of a given street, square, etc.).
As a final remark, it is worth mentioning that the present invention and its related model could be of interest not only in Web Ranking, but also in political/social sciences, e.g. for Ranking who/what is influential and who/what is not (as an example, one could be interested in answering the following questions: is Bill Clinton's opinion really influential, and at which level? How to Rank immaterial forces such as a religious authority vs material forces such as economic/military powers?), in many aspects of marketing, for Ranking human resources, and for Ranking the importance of a paper and/or of a researcher by looking in scientific databases. Think of MathSciNet for Mathematicians, where a generic node is any paper in the database and a link from A to B is just a bibliographic reference to paper B in paper A. For evaluating the impact (i.e. the Ranking) of a paper, the very same model and the same procedure as described before could be applied to the related graph. For evaluating or Ranking a researcher (a very hot topic nowadays in several countries), it would be enough to modify the graph so that every single node is a researcher and a link from A to B means that researcher A has written at least one paper referring to at least one paper of researcher B: the links have to be weighted, and the related weights will be proportional to the number of such papers and will be properly normalized according to the number of authors in the referring papers of A and in the referred papers of B. The algorithm will be again the same, and again the same idea would work for Ranking researcher groups or Institutions such as Departments, Faculties, Universities (see the hierarchical approach in EP 1 653 380 A1). In addition, it is worth stressing that the described procedures for defining the graph and for computing the Ranking are unchanged in any homogeneous Scientific community.
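As an illustration of the researcher graph just described (a hypothetical sketch; the input format and the function name researcher_graph are assumptions), the edge weight from researcher A to researcher B can be accumulated over the references and normalized by the numbers of authors on both sides, after which the resulting weighted graph can be fed to the same Ranking procedure.

```python
from collections import defaultdict

def researcher_graph(papers):
    # papers: dict paper_id -> {"authors": [names], "refs": [cited paper_ids]}
    # Returns edge weights (A, B) -> weight, proportional to the number of referring
    # papers and normalized by the number of authors of the citing and cited papers.
    weight = defaultdict(float)
    for paper in papers.values():
        for ref in paper["refs"]:
            cited = papers.get(ref)
            if cited is None:
                continue
            share = 1.0 / (len(paper["authors"]) * len(cited["authors"]))
            for a in paper["authors"]:
                for b in cited["authors"]:
                    weight[(a, b)] += share
    return weight   # row-normalise per researcher to obtain the stochastic matrix used before
```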
Of course, for modelling such complex phenomena in a convincing way, it would be advisable to enrich the structure of the graph by adding more information to the nodes and/or to the edges (a meta-graph? ...). However, the essential basic idea for defining and computing the Ranking has to remain virtually the same.

Claims

Claims
1. A method for ranking Web pages characterized by the fact that each dangling node is provided with a link to itself.
2. A method according to claim 1 wherein a link is also provided to the dangling node's parent(s).
3. A method according to claim 1 or 2 wherein the ranking is computed as a nonnegative linear combination, with coefficients w(j), of P^j v, j>=0, v being the right personalization vector and P being a GOOGLE-like matrix with parameter c=1.
4. A method according to any one of the previous claims using the following formula (**):
y = F(P,v,w) = w(0)v + w(1)Pv + ... + w(k)P^k v + w(k+1)PageRank(1)
with w(j) = \gamma(j)/s(0), j = 0,...,k, and w(k+1) = s(k+1)/s(0), and wherein the sum of w(j)P^j v, j>k, is replaced by the unique term w(k+1)y(1), y(1) being the limit as c tends to 1 of the GOOGLE-like PageRank with said matrix P; y(1) being computed by approximation as y(0.99) or by any appropriate numerical technique; y(1) furthermore replacing P^j v, with j>k.
5. A method according to any one of the previous claims comprising a step which includes the determination of the average time interval between two successive clicks j and j+1 and of the average number of visits over all the users at click j: this is used for computing the weights w(j), j>=0, as in the steps before the application of said formula (**).
6. A method according to any one of the previous claims including a step for the determination of the entries v(s), s>=1, of the personalization vector v, according to the number of first visits, i.e. at click j=0.
Use of the method as defined in any one of the previous claims as a time-dependent model for taking into account the rapid change of the Web and for determining the ranking as a compromise between the ranking at the present time t and the ranking in the past.
PCT/EP2007/052839 2006-03-24 2007-03-24 Web search engine with new ranking method WO2007110398A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP2006061030 2006-03-24
EPPCT/EP2006/061030 2006-03-24

Publications (1)

Publication Number Publication Date
WO2007110398A1 true WO2007110398A1 (en) 2007-10-04

Family

ID=38179862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2007/052839 WO2007110398A1 (en) 2006-03-24 2007-03-24 Web search engine with new ranking method

Country Status (1)

Country Link
WO (1) WO2007110398A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050256860A1 (en) * 2004-05-15 2005-11-17 International Business Machines Corporation System and method for ranking nodes in a network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DEL CORSO G M ET AL: "Exploiting Web Matrix Permutations to Speedup PageRank Computation", TECHNICAL REPORT, May 2004 (2004-05-01), University of Pisa, Italy, XP002440050, Retrieved from the Internet <URL:http://www.di.unipi.it/~gulli/papers/itt/exploiting_web_matrix_tech_report-iit.pdf> [retrieved on 20070629] *
EIRON N ET AL: "Ranking the Web Frontier", PROCEEDINGS OF THE 13TH CONFERENCE ON WORLD WIDE WEB, 17 May 2004 (2004-05-17) - 22 May 2004 (2004-05-22), New York, NY, USA, pages 309 - 318, XP002440049, Retrieved from the Internet <URL:http://www2004.org/proceedings/docs/1p309.pdf> [retrieved on 20070629] *

Similar Documents

Publication Publication Date Title
Chen et al. Real-time topic-aware influence maximization using preprocessing
Leskovec et al. Empirical comparison of algorithms for network community detection
EP1304627B1 (en) Methods, systems, and articles of manufacture for soft hierarchical clustering of co-occurring objects
Tong et al. Random walk with restart: fast solutions and applications
Tong et al. Proximity tracking on time-evolving bipartite graphs
CN100470544C (en) Method, equipment and system for chaiming file
Mitrović et al. Quantitative analysis of bloggers’ collective behavior powered by emotions
CN106886579A (en) Real-time streaming textual hierarchy monitoring method and device
US8639703B2 (en) Dual web graph
Huang et al. A link analysis approach to recommendation under sparse data
Baek et al. Efficiently mining erasable stream patterns for intelligent systems over uncertain data
Zhan et al. Fast incremental pagerank on dynamic networks
Ding et al. User modeling for personalized Web search with self‐organizing map
Cicone et al. Google PageRanking problem: the model and the analysis
Srivastava et al. Discussion on damping factor value in PageRank computation
Ustinovskiy et al. An optimization framework for weighting implicit relevance labels for personalized web search
CN105956012B (en) Database schema abstract method based on figure partition strategy
WO2007110398A1 (en) Web search engine with new ranking method
Ngomo et al. Holistic and scalable ranking of RDF data
Rajput et al. A study and comparative analysis of web personalization techniques
Jain et al. Study and analysis of category based PageRank method
De Knijf FAT-miner: Mining frequent attribute trees
Oshino et al. Time graph pattern mining for Web analysis and information retrieval
Misaghian et al. Resource recommender system based on tag and time for social tagging system
Zhang et al. Using Web clustering for Web communities mining and analysis

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07727313

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07727313

Country of ref document: EP

Kind code of ref document: A1