GB2405709A

GB2405709A - Search engine optimization using automated target market user profiles

Info

Publication number: GB2405709A
Application number: GB0320583A
Authority: GB
Inventors: Peter H Mowforth
Original assignee: TELEIT Ltd
Current assignee: TELEIT Ltd
Priority date: 2003-09-03
Filing date: 2003-09-03
Publication date: 2005-03-09
Also published as: GB0604247D0; GB2419993A; WO2005024661A2; GB0320583D0; WO2005024661A8

Abstract

A means for continuously adjusting a Web site or Web pages in order to achieve optimization with respect to search engines, comprises first, means for automatically clustering relevant pages and sites according to target market segment via enumerated permutations of keywords and phrases and the use of searches for relevant web sites, and second means for customer profiling based on the clustering of the first means and processing of aggregated queries to search engines. The first and second means are used by an evaluation means which provides an automatic numerical ranking of pages and sites, and which provides continuous adjustment and optimization of web sites and web pages.

Description

Automatic Target Market User Profiles

The invention relates to the development and maintenance of a Web site or other information resource where statistical data from the Web and from search engine usage is used for customer profiling and market analysis, and this information is used on an ongoing basis to optimize the positioning of the Web site or information resource.

Background

The World Wide Web (the Web) is a source of information unprecedented in its scope and availability. The decentralized nature of the Web makes searching for information to satisfy a specific requirement challenging. This challenge has been met by the Search Engine. Search engines scan the Web and index Web sites and permit searches to be conducted over these indexes. Since, in many cases, information can only be obtained from search engines, visibility in searches becomes of critical importance for web sites.

Methods for optimizing a site for discovery via search engines are well known. Such methods include market segmentation and competitive analysis based on information from search engines. These methods suffer from the disadvantage of being manual, and considerable effort is needed to update sites. Updating on a frequent basis is necessary to cope with rapid changes in the structure and content of the Web.

Related work

The World Wide Web (the Web), initially developed in 1989 by Tim Berners-Lee at CERN, is a "universe of network-accessible information, the embodiment of human knowledge".

The Web has been turned to commercial purposes since the mid-1990s and an increasing volume of traffic on the Web is devoted to commerce.

Users of the Web, for both commercial and non-commercial use, often need to find specific information from the Web. An increasingly common tool for searching out specific information requirements is the Search Engine. Search Engines provide an automated index of the Web by systematically exploring the Web and recording and indexing the Web sites they visit.

To find information via a search engine a query is submitted, consisting of a set of search terms, and a ranked list of results is returned.

It is in the interest of a Web site to be as highly ranked as possible with respect to relevant queries. The process of tuning a web site in order to maximize its ranking is known as Search Engine Optimization.

Search Engine Optimization is a largely manual process, which can involve some or all of the following steps:

1. Identifying the market served by the web site.

2. Identifying competitors.

3. Selecting an appropriate set of search keywords.

4. Designing the site to maximize its search engine "visibility".

The invention described herein improves on the manual process by automating steps 1, 2 and 3 and helping to automate step 4.

Summary of the Invention

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 shows the overall control flow of the invention.

Figure 2 shows the means for determining overall market segmentation, and a measure of similarity between Web pages.

Figure 3 shows the means for determining customer profiles.

Figure 4 shows the means for determining a numerical evaluation of Web pages.

The principal object of this invention is to provide a methodology for continuous adjustment to a site in order to optimize it with respect to search engines. Accordingly, this invention provides a means for providing a quantitative ranking for a Web site by evaluating the site with respect to a given market segment and competitive environment.

The ranking mechanism, the competitive environment and the market segmentation are produced automatically by appropriate interaction with one or more search engines. The automatic quantitative ranking allows a search procedure to continually optimize the ranking by suggesting small changes which can be continuously evaluated.

The invention rests on the idea that the principal external driver for optimization is the difference between the information and services which are currently available, as indicated by Search Engines, and what the users/customers want, as indicated by searches and the choices made by users as a result of searches.

Market Segmentation

The objective of the competitive analysis is to apply Statistical and Machine Learning techniques to develop clusters of web sites, where each cluster represents a related set of competitors.

The starting point for analysis is a basic set of keywords and phrases relevant to the domain. In addition, web sites which are known to fall in the set can be used, as described below.

Algorithm One

1. The keywords and phrases are used to generate a set of permutations of subsets of the words and phrases. If there are $n$ keywords, then there are $\binom{n}{k}$ subsets of length $k$ and $\binom{n}{k}\,k!$ permutations of these subsets.

2. Each permutation is presented to a series of search engines.

3. Each web page, retrieved from the first N hits on the search engine, is saved, indexed by its associated permutation and ranked from 1 to N.

4. The web pages are processed to remove common words, and the words are stemmed using the Porter algorithm [Porter, 1980].

5. The web pages retrieved in this way are filtered for duplicates and then Latent Semantic Analysis [Deerwester et al., 1990] is used to create a similarity measure between sites. The similarity measure is the Euclidean distance between pages with respect to the first N latent components, where N is 50 in the preferred implementation.

6. The web pages are clustered using k-means clustering [MacQueen, 1965]. The clustering metric is the similarity measure determined by the Latent Semantic Analysis. Each cluster represents a sector of competition for the "product" defined via the keywords and phrases.
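Step 1 of Algorithm One can be sketched as follows. This is a minimal illustration, assuming each permutation is submitted to a search engine as a space-separated query string; the function names and seed keywords are hypothetical. With n keywords there are n!/(n-k)! ordered arrangements of length k, which the final assertion checks for a small case.

```python
from itertools import permutations
from math import comb, factorial


def keyword_permutations(keywords, k):
    """Return every ordered arrangement of k distinct keywords as a query string."""
    return [" ".join(p) for p in permutations(keywords, k)]


def all_queries(keywords, max_len):
    """Enumerate permutations of subsets of every length from 1 to max_len."""
    queries = []
    for k in range(1, max_len + 1):
        queries.extend(keyword_permutations(keywords, k))
    return queries


keywords = ["cheap", "flights", "glasgow"]  # illustrative seed terms only
pairs = keyword_permutations(keywords, 2)
# The count equals C(n, k) * k! for n = 3 and k = 2.
assert len(pairs) == comb(3, 2) * factorial(2)
```

The number of queries grows factorially with the subset length, so in practice the maximum length would probably be kept small.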

Discussion

The use of Latent Semantic Analysis (LSI) on web pages retrieved by searches on permutations of the original keywords and phrases reduces dependence of results on the initial terms and phrases, since all the information in the retrieved pages is used to determine a clustering metric and the effects of polysemy and synonymy are reduced.

The number of dimensions used from the Singular Value Decomposition in LSI is variable. In the preferred implementation of this invention, 50 dimensions are used.
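The distance and clustering of steps 5 and 6 can be sketched as follows, assuming each page has already been reduced to a vector of latent coordinates (the decomposition itself is not shown). This is a plain Lloyd-style k-means in pure Python; seeding the centers with the first k points and the fixed iteration count are assumptions made for brevity, not part of the described method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=20):
    """Cluster points into k groups; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy two-dimensional "latent" vectors; the preferred implementation uses 50.
pages = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centers = k_means(pages, k=2)
```

Each resulting label identifies the sector of competition to which a page is assigned.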

The clusters can be decomposed in various ways, for instance geographically, orthogonal to the original construction via semantic analysis of keywords.

The clustering analysis uses only Web addresses acquired from searches. More sophisticated strategies for clustering involving further search using web-bots to explore links not pursued by the standard search engines can be used where the market is specialized.

Customer Profile

The objective of customer profiling is to determine a set of different customer information requirements, each requirement relating to a particular "customer profile". As with Market Segmentation the starting point is a set of keywords and phrases. In the case of Market Segmentation we are interested in the totality of available information within the range defined by the keywords and phrases. With customer profiling we are interested in clustering search queries, chosen from our set of keywords and phrases, in such a way that each cluster contains queries used by customers in search of a particular class of information resource. This approach is related to those of [Wen et al., 2002] and [Beeferman and Berger, 2000] among others.

The information available consists of triples (Q, U, R) where Q is a query, U is the URL (Uniform Resource Locator) which was selected in response to that query and R is the rank of the selected URL with respect to the query. This is termed "clickthrough" data and is available from search engines.

The value of this data for clustering queries is shown by the following related observations. If two different users search with terms "fly" and "ant" but select the same URL, there is evidence that the search terms are related to a common information requirement.

Similarly if two distinct users search on the same term "ant" and visit different URLs there is some evidence that these two URLs are related. Note that such evidence is statistical; the term "law", for example, might relate to either the legal system or physics.
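The two observations can be illustrated with a small sketch over clickthrough triples. The log entries and URLs below are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical clickthrough log of (query, selected URL, rank of that URL).
log = [
    ("fly", "http://insects.example/guide", 3),
    ("ant", "http://insects.example/guide", 1),
    ("ant", "http://colonies.example/ants", 2),
]


def co_clicked(log):
    """Group queries by the URL they led to, and URLs by the query that found them."""
    queries_by_url = defaultdict(set)
    urls_by_query = defaultdict(set)
    for query, url, _rank in log:
        queries_by_url[url].add(query)
        urls_by_query[query].add(url)
    return queries_by_url, urls_by_query


queries_by_url, urls_by_query = co_clicked(log)
```

Here "fly" and "ant" share a selected URL, which is evidence that the two queries are related, and the query "ant" leads to two URLs, which is evidence that those URLs are related.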

There are three kinds of information available for clustering.

1. The similarity between queries

2. The similarity between URLs

3. The link structure between queries and URLs

Similarity between queries can be defined as the proportion of words or phrases which they have in common. Similarity between URLs is defined in terms of the distance measure described above under Market Segmentation. The link structure between queries and URLs can be described as a bipartite graph. The "white" nodes of the graph are the unique queries, and the "black" nodes are URLs. The similarity between two nodes of the same color is the proportion of links they share compared to their total number of links.
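A minimal sketch of the query similarity and the link similarity follows. Both "proportions" are interpreted here as the size of the intersection divided by the size of the union, which is an assumption since the text does not fix the denominator. The graph contents and function names are hypothetical.

```python
def word_similarity(q1, q2):
    """Proportion of distinct words that two queries have in common."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_similarity(links, node1, node2):
    """Shared links divided by all links of two nodes of the same color."""
    a, b = links.get(node1, set()), links.get(node2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Bipartite graph stored as query -> set of URLs selected for that query.
query_links = {
    "cheap flights": {"u1", "u2"},
    "budget flights": {"u2", "u3"},
    "ant farm": {"u4"},
}
```

The same `link_similarity` function serves for URLs if the mapping is inverted to URL -> set of queries.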

It is also possible to assign a weight to links in this graph, the weight being a function of the ranking of the URL selected from a query presented to a search engine.

There are some natural variations of these similarity measures. For example, search queries which are permutations of each other can be considered equal, and queries can be clustered by containment in a natural manner.

A similarity measure between the queries (white vertices) and a complementary measure between the URLs (black vertices) can each be generated by a weighted combination of the individual similarities described above.
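The variations and the weighted combination might be sketched as follows. The text does not specify the weighting function or the combination weights, so the reciprocal rank weight and the parameter `alpha` are assumptions chosen only for illustration.

```python
def canonical(query):
    """Map queries that are permutations of each other to one form."""
    return " ".join(sorted(query.lower().split()))


def contains(outer, inner):
    """True when every word of `inner` also appears in `outer`."""
    return set(inner.lower().split()) <= set(outer.lower().split())


def rank_weight(rank):
    """Hypothetical link weight as a function of the selected URL's rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return 1.0 / rank


def combined_similarity(word_sim, link_sim, alpha=0.5):
    """Weighted combination of two similarities; alpha is an assumed weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * word_sim + (1.0 - alpha) * link_sim
```

A value of `alpha` near 1 favors the wording of the queries, while a value near 0 favors the choices users actually made.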

The clustering algorithm, which clusters both URLs and queries, proceeds as follows. It is a version of Hierarchical Agglomerative Clustering (HAC) [Ward, Jr., 1963]. It proceeds on the assumption that the number of URLs is very considerably larger than the number of queries.

Algorithm Two

1. The two query nodes with the greatest similarity are merged. Record this merger.

2. The most similar URLs are merged. This is done a reasonably large number of times, since there are many more URLs than queries. Record these mergers.

3. Go to step 1 unless the number of queries has been reduced below a threshold.

The end result of this algorithm is a hierarchical clustering of both queries and URLs. For any query cluster there is an associated set of URL clusters.
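Algorithm Two can be sketched as an alternating merge over the bipartite graph. This is an illustrative sketch only: it uses link similarity alone (shared links over all links), it never merges clusters whose similarity is zero, and the parameter names `query_threshold` and `url_merges_per_round` are assumptions rather than terms from the text.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared links divided by all links of two nodes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbours(edges, node, side):
    """Clusters on the opposite side linked to `node` (side 0 = query, 1 = URL)."""
    return {edge[1 - side] for edge in edges if edge[side] == node}


def merge_most_similar(edges, side):
    """Merge the most similar pair on one side; return (edges, pair or None)."""
    nodes = sorted({edge[side] for edge in edges}, key=sorted)
    scored = [
        (jaccard(neighbours(edges, a, side), neighbours(edges, b, side)), a, b)
        for a, b in combinations(nodes, 2)
    ]
    if not scored:
        return edges, None
    score, a, b = max(scored, key=lambda item: item[0])
    if score == 0.0:
        return edges, None  # assumption: unrelated clusters are never merged
    merged = a | b

    def relabel(node):
        return merged if node in (a, b) else node

    if side == 0:
        new_edges = {(relabel(q), u) for q, u in edges}
    else:
        new_edges = {(q, relabel(u)) for q, u in edges}
    return new_edges, (a, b)


def cluster(log, query_threshold, url_merges_per_round=2):
    """Alternate query and URL merges, recording every merger."""
    edges = {(frozenset([q]), frozenset([u])) for q, u, _rank in log}
    history = []
    while len({q for q, _ in edges}) >= query_threshold:
        edges, pair = merge_most_similar(edges, 0)  # step 1
        if pair is None:
            break
        history.append(("query", pair))
        for _ in range(url_merges_per_round):  # step 2
            edges, pair = merge_most_similar(edges, 1)
            if pair is not None:
                history.append(("url", pair))
    return edges, history
```

The recorded `history` is the hierarchy of mergers, and the final `edges` pair each query cluster with its associated URL clusters.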

Discussion

We make the assumption for any query cluster that each URL cluster corresponds to a distinct market for the query cluster.

Optimization

The goal of optimization is to position an information resource (a Web site, viewed as a set of web pages) so as to maximize its initial value, and to continually evaluate the situation on an ongoing basis so as to maximize its continuing value.

The goal is not to maximize the number of visitors to the site, but to maximize the number of visitors who pay to consume the site's resources. This may mean making a purchase, or just taking the time to read articles and information on the site.

It is difficult to determine the conversion rate of visitors who consume unless one can investigate in detail the behavior of visitors and perform experiments. We assume that it is possible to optimize conversion rate separately from visitor rate once the initial positioning of the site or page has been determined.

The goal of optimization is to generate a numerical measure of the "fitness" of a web page/site. The following assumptions are made:

Pages which are semantically similar with respect to the distance measure described in the section titled "Market Segmentation" and which lie in the same cluster with respect to the Customer Profile, and with similar strengths, will attract a similar number of visitors. This is a base assumption, in that it justifies the use of a numerical measure of Web site utility.

The ratio of number of visitors to cluster size is a direct determinant of the value of a cluster, as larger values imply more visitors for any site in the cluster.

A query cluster which relates to a URL cluster with low average rank is interesting since users have consistently chosen low-ranking URLs from search queries.

These observations can be used to assign a numerical rank to web pages and web sites.

The factors used to determine ranking are:

1. The cluster value (visitors/element).

2. The within-cluster ranking, determined by ranking score on queries within the cluster's associated queries. This ranking is comparative.

3. The relevance of the cluster to the product being sold.
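The three factors might be turned into a single number as sketched below. The text does not state how the factors are combined, so the product used here, the scaling of the within-cluster position, and all function names are assumptions for illustration only.

```python
def cluster_value(visitors, cluster_size):
    """Factor 1: visitors per element of the cluster."""
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return visitors / cluster_size


def within_cluster_score(position, cluster_size):
    """Factor 2 as a score in (0, 1]; position 1 of n scores 1.0 (assumed scaling)."""
    if not 1 <= position <= cluster_size:
        raise ValueError("position must lie between 1 and cluster_size")
    return (cluster_size - position + 1) / cluster_size


def fitness(visitors, cluster_size, position, relevance):
    """Hypothetical combination: the product of the three factors."""
    return (
        cluster_value(visitors, cluster_size)
        * within_cluster_score(position, cluster_size)
        * relevance
    )
```

For example, a page in first position within a cluster of 50 pages that attracts 1000 visitors, with relevance 0.5, scores 20 x 1.0 x 0.5 = 10 under these assumptions.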

We now describe the optimization process. Optimization is typically performed for either a web site or a small group of interrelated pages, for example those describing a particular product.

1. Produce an initial set of keywords and phrases.

2. Produce a market segmentation consisting of a similarity measure between web pages, and a clustering of sites/pages as described in Algorithm One above.

3. Produce a Customer Profile, as described in Algorithm Two above.

4. Create initial site and web pages.

5. Determine the degree of membership of pages in the clusters produced in Step 3.

6. Assign pages to clusters based on a determination of how their numerical ranking can be optimized. This is a manual process, consisting of the following steps.

(a) Estimate the clusters most relevant to the product being sold.

(b) Modify the page in terms of keywords and language to minimize its distance to the cluster center.

7. Make the pages/site live.

8. Monitor the numerical ranking of pages. This is necessary to determine factor 2 above.

9. Monitor the ranking of the site on a continual basis by repeating steps 2, 3 and 8.
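Steps 5 and 6(b) rely on the distance between a page and each cluster center. A minimal sketch follows, assuming pages and centers are vectors in the same latent space. The text does not define "degree of membership", so the normalized inverse distance used here is an assumption, as are the function names.

```python
import math


def memberships(page, centers):
    """Degree of membership per cluster as normalized inverse distance."""
    inverse = [1.0 / (math.dist(page, c) + 1e-9) for c in centers]
    total = sum(inverse)
    return [value / total for value in inverse]


def nearest_cluster(page, centers):
    """Index of the cluster center closest to the page."""
    return min(range(len(centers)), key=lambda i: math.dist(page, centers[i]))


centers = [(0.0, 0.0), (10.0, 0.0)]  # toy cluster centers
page = (1.0, 0.0)                    # toy page vector
degrees = memberships(page, centers)
target = nearest_cluster(page, centers)
```

Editing a page so that its vector moves toward the chosen center raises its degree of membership in that cluster, which is the effect sought in step 6(b).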

References

[Beeferman and Berger, 2000] Doug Beeferman and Adam Berger. Agglomerative clustering of a search engine query log. In Raghu Ramakrishnan, Sal Stolfo, Roberto Bayardo, and Ismail Parsa, editors, Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-00), pages 407-416, N. Y., August 20-23 2000. ACM Press.

[Deerwester et al., 1990] Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.

[MacQueen, 1965] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, Berkeley, CA, 1965. University of California Press.

[Porter, 1980] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.

[Ward, Jr., 1963] J. H. Ward, Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236-244, 1963.

[Wen et al., 2002] Ji-Rong Wen, Jian-Yun Nie, and Hong-Jiang Zhang. Query clustering using user logs. ACM Transactions on Information Systems, 20(1):59-81, January 2002.

Claims

1. The automatic positioning of a web site with respect to a set of search engines and methods for continuous adjustment of the site to optimize positioning.
2. A method, as in Claim 1, for the assignment of a numerical ranking to a web site or sub-site of a web site based on customer profile, market segmentation and searches conducted via a search engine.
3. A method, as in Claim 1, for adjustment of a web site in order to maximize a numerical ranking, as in Claim 2.
4. A method for producing customer profiles, as in Claim 2, based on analysis of queries placed with search engines.
5. A method for market segmentation, as in Claim 2, based on analysis of the web, based on specialized search engines.