CN103559258A

CN103559258A - Webpage ranking method based on cloud computation

Info

Publication number: CN103559258A
Application number: CN201310536603.9A
Authority: CN
Inventors: 向阳; 平宇; 张依杨; 陈佑雄; 张波; 袁书寒
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-11-04
Filing date: 2013-11-04
Publication date: 2014-02-05

Abstract

The invention discloses a web page ranking method based on cloud computing. The method comprises the following steps: parsing the network files crawled by a distributed web crawler to obtain a basic topology information file of the network; computing the PR value offline and storing it in the corresponding document table, whose format takes url as the primary key and contains eight attribute columns, including title, content, type, timestamp, and outlinks; using parallel computing to build an index table of single word to web page importance, whose format is an inverted index containing key and links (the link set, sorted by importance); implementing the offline PageRank algorithm on the MapReduce parallel architecture; and, for online queries, comparing the similarity between the query terms and each web page and giving the final web page ranking in combination with the offline results. The method has the advantages that an offline ranking algorithm is adopted and the MapReduce parallel architecture is fully utilized, so the efficiency of offline ranking is improved; and that combining keyword techniques with PageRank makes the results more accurate.

Description

Web page ranking method based on cloud computing

Technical field

The present invention relates to a distributed web page ranking method, and in particular to a cloud-computing-based web page ranking method for processing big data.

Background technology

With the rapid development of the Internet, the World Wide Web (WWW) has become a huge information space that provides users with valuable information resources. Faced with such a large volume of information, browsing page by page in a browser is very inconvenient, so obtaining the required information from the WWW quickly and accurately has become a vital problem. The emergence of search engines has greatly improved people's ability to gather information. However, existing search engines still have many problems in areas such as search efficiency, information maintenance, duplicated information, and network and site load.

At present, most search engines are centralized in architecture. They fetch pages from the Internet, analyze and process them, and store all index information centrally at one site; users query by accessing that site. These engines usually do not cooperate with one another. Each searches and processes information independently, which causes a great deal of duplicated work and serious waste of bandwidth, and sometimes even network congestion. This architecture is ill-suited to the ever-growing scale of the network, and the industry has put forward strategies for building distributed search engines.

Traditional search engines, that is, general-purpose search engines, can provide users with a large number of search results. However, as the volume of data keeps growing, these engines, in pursuing more returned information, find it hard to maintain the accuracy and relevance of the results, leaving users dissatisfied with what the search engine returns. Page ranking algorithms that balance search engine response time and accuracy have therefore become a research focus.

The limitations of the page ranking algorithms of traditional search engines are mainly as follows:

(1) Massive network data: the amount of network information is large and its coverage is broad, so computing over and storing this data consumes a great deal of time and storage space.

(2) Response time of online ranking: because a keyword can retrieve a large number of web pages, the response time of the search engine is relatively long and does not meet the response time requirement for search engines.

(3) Accuracy of web page ranking: ranking accuracy depends not only on the keywords but also strongly on the quality of the web page itself. Ranking therefore needs to consider both.

(4) Expression of query terms: because users lack domain knowledge and the query interface of the search engine is limited, the user's search intent cannot be captured accurately.

Therefore, how to let users conveniently obtain the information they need from a massive set of search results through web page ranking has become an urgent problem.

Summary of the invention

The technical problem to be solved by the present invention is to provide a cloud-computing-based web page ranking method that improves the efficiency of offline ranking.

To solve the above technical problem, the present invention provides a cloud-computing-based web page ranking method that uses the storage and computing advantages of a cluster to speed up the processing of massive back-end data. The ranking method comprises the following steps:

(1) The network files crawled by a distributed web crawler and stored on the cloud are parsed to obtain the basic topology information file of the network;

(2) The PR value is computed offline and then stored in the corresponding document table, whose format takes url as the primary key and comprises 8 attributes, such as title, content, type, timestamp, and outlinks (the set of outgoing links);

(3) An index table of single word to web page importance is built, also using parallel computing; the index table is built as an inverted index with the format key, links (the link set, sorted by importance);

(4) The offline PageRank algorithm is implemented using the MapReduce parallel framework;

(5) At online query time, the similarity between the query terms and each web page is compared, and the final web page ranking is given in combination with the offline results.
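The two table formats in steps (2) and (3) can be illustrated with a minimal in-memory sketch. It is not the storage layer of the invention: the dictionaries stand in for the document table and the index table, only attributes named in the text are shown, the `pr` field name and the example URLs are hypothetical, and whitespace tokenization is an assumption (Chinese text would need a word segmenter).

```python
from collections import defaultdict

# Hypothetical stand-in for the document table: url is the primary key.
# Only attributes named in the text are shown; "pr" holds the offline PR value.
documents = {
    "http://a.example/": {
        "title": "A",
        "content": "cloud search ranking",
        "outlinks": ["http://b.example/"],
        "pr": 0.6,
    },
    "http://b.example/": {
        "title": "B",
        "content": "cloud storage",
        "outlinks": [],
        "pr": 0.4,
    },
}


def build_index(docs):
    """Build an inverted index: word -> list of urls sorted by descending PR."""
    postings = defaultdict(set)
    for url, row in docs.items():
        # Assumption: whitespace tokenization is enough for this toy data.
        for word in row["content"].lower().split():
            postings[word].add(url)
    return {
        word: sorted(urls, key=lambda u: (-docs[u]["pr"], u))
        for word, urls in postings.items()
    }


index = build_index(documents)
```

Here `index["cloud"]` lists both pages with the higher-PR page first, which mirrors the "links, sorted by importance" column of the index table.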

The parsing of the crawled network files in step (1) comprises the following steps:

1. Crawl the web addresses on the Internet, parse them into the corresponding format by field, and store them in the distributed file system.

2. Normalize the files stored in the distributed file system using parallel processing to obtain the web page topology information files, and store them back in the distributed file system.
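As a rough illustration of the normalization step, the sketch below turns crawled records into a topology mapping from each page to its out-links. The record layout (`url`, `outlinks`), the normalization rules, and the removal of self-links are all assumptions made for the example; the text does not specify them. Each record is processed independently, which is what makes the step easy to run in parallel.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Drop the fragment, lowercase scheme and host, and default the path to '/'."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def to_topology(records):
    """Map each normalized page url to a sorted list of distinct out-links."""
    topology = {}
    for rec in records:
        src = normalize(rec["url"])
        targets = {normalize(t) for t in rec["outlinks"]}
        targets.discard(src)  # assumption: self-links carry no ranking information
        topology[src] = sorted(targets)
    return topology


records = [
    {
        "url": "HTTP://A.example",
        "outlinks": [
            "http://b.example/#top",
            "http://B.example/",
            "http://a.example/",
        ],
    }
]
topology = to_topology(records)
```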

The PageRank value in step (4) is computed as follows:

R'(u) denotes the importance score and is computed as follows:

[Formula image not reproduced in this text]

where R'(u) is the importance score of page u; R'(v) is the score of page v, where page v contains a link pointing to page u; N_v is the size of the set of pages that v links to; c is a constant; and E(u) is a distribution function over u.

Because the records do not depend on one another, this computation can be carried out in parallel using distributed computing.
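The formula image itself is not available here. The sketch below assumes the form used in the original PageRank formulation, R'(u) = c Σ_{v→u} R'(v)/N_v + c E(u), with c chosen in each iteration so that the scores sum to 1. It imitates the map and reduce phases in plain Python rather than on a real MapReduce cluster; the uniform E(u), the value 0.15, the iteration count, and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(graph, rank):
    """Emit (target, contribution) pairs: page v sends R'(v)/N_v to each out-link."""
    for v, outlinks in graph.items():
        for u in outlinks:
            yield u, rank[v] / len(outlinks)


def reduce_phase(pairs, pages, e):
    """Sum contributions per page, add E(u), then scale by c so scores sum to 1."""
    sums = defaultdict(float)
    for u, contribution in pairs:
        sums[u] += contribution
    raw = {u: sums[u] + e[u] for u in pages}
    c = 1.0 / sum(raw.values())
    return {u: c * value for u, value in raw.items()}


def pagerank(graph, e_value=0.15, iterations=30):
    """Iterate map and reduce; every page must appear as a key of graph."""
    pages = sorted(graph)
    rank = {u: 1.0 / len(pages) for u in pages}
    e = {u: e_value / len(pages) for u in pages}  # assumed uniform E(u)
    for _ in range(iterations):
        rank = reduce_phase(map_phase(graph, rank), pages, e)
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each emitted pair depends only on one page's own record, so the map phase can be split across workers freely; only the per-page sums in the reduce phase need grouping by key.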

The similarity in step (5) and the final web page importance are computed as follows:

The similarity is expressed using TF-IDF:

[Formula image not reproduced in this text]

The final web page importance is:

[Formula image not reproduced in this text]

where TF_{i,j} is the number of times term i occurs in web page j, and IDF_i is the inverse document frequency, which is commonly used to describe how distinctive a word is. The final score is a weighted linear combination of TF-IDF and PR.
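Since the formula images are missing, the following is only a sketch under stated assumptions: it uses the common definition TF-IDF = TF_{i,j} · log(N / DF_i), and a single hypothetical weight `alpha` for the linear combination, because the text does not give the actual weights.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of a term in one document, with IDF = log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


def final_score(tfidf, pr, alpha=0.5):
    """Weighted linear combination of TF-IDF and PR; alpha is a hypothetical weight."""
    return alpha * tfidf + (1.0 - alpha) * pr


corpus = [
    ["cloud", "search", "cloud"],
    ["cloud", "storage"],
    ["finance"],
]
similarity = tf_idf("cloud", corpus[0], corpus)
score = final_score(similarity, pr=0.4, alpha=0.25)
```

At query time the PR term is already available from the offline stage, so only the TF-IDF part has to be evaluated per query.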

The concept set is the set of words contained in the document table.

The attribute set is the set of parameters that describe the features of a descriptor.

The instance set is the set of search records associated with a given keyword.

The relation set is the set of keywords and index records.

The attribute set comprises the TF-IDF value, the PageRank value, and the positions at which the keyword occurs.

Compared with the prior art, the present invention has the following advantages:

1. It makes good use of distributed storage;

2. It adopts an improved offline ranking algorithm and makes full use of the MapReduce parallel framework, improving the efficiency of offline ranking;

3. It combines keyword techniques with PageRank, making the results more accurate.

Brief description of the drawings

Fig. 1 is a flowchart of the present invention;

Fig. 2 is a schematic diagram of retrieval results.

Embodiments

The present invention is described in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the present invention provides a cloud-computing-based web page ranking method that uses the storage and computing advantages of a cluster to speed up the processing of massive back-end data. The ranking method comprises the following steps:

(1) The network files crawled by a distributed web crawler and stored on the cloud are parsed to obtain the basic topology information file of the network;

(3) An index table of single word to web page importance is built, also using parallel computing; the index table is built as an inverted index with the format key, links (the link set, sorted by importance);

The PageRank value in step (4) is computed as follows:

R'(u) denotes the importance score and is computed as follows:

[Formula image not reproduced in this text]

The similarity is expressed using TF-IDF:

[Formula image not reproduced in this text]

The final web page importance is:

[Formula image not reproduced in this text]

The concept set is the set of words contained in the document table.

The relation set is the set of keywords and index records.

The web page data in this case is collected automatically from the Internet by a distributed web crawler. The finance cloud search has servers in multiple locations, and its search scope covers part of the websites in Chinese-language regions such as Mainland China, Hong Kong, Taiwan, Macao, and Singapore, as well as in North America and Europe. The search engine covers comprehensive finance-related web page information; the total currently exceeds 2,000,000 web pages and continues to grow every day.

An example is given for each output type. As an illustration, the query "Tongji University" is entered, and the output is presented as shown in Fig. 2.

The retrieval results show that, by using distributed web page ranking and considering both the relevance between the query terms and a web page and the quality of the web page itself, the method obtains good query results.

This completes the presentation of the cloud-computing-based ranking method in this case. The method can serve as a prototype for web page ranking in customized vertical search engines, industry-specific search engines, and other related search engines.

Claims

1. A cloud-computing-based web page ranking method, comprising the following steps:

(2) The PR value is computed offline and then stored in the corresponding document table, whose format takes url as the primary key and comprises 8 attributes, such as title, content, type, timestamp, and outlinks (the set of outgoing links);

(3) An index table of single word to web page importance is built, also using parallel computing; the index table is built as an inverted index with the format key, links (the link set, sorted by importance);

2. The cloud-computing-based web page ranking method according to claim 1, characterized in that the parsing of the crawled network files in step (1) comprises the following steps:

1. Crawl the web addresses on the Internet, parse them into the corresponding format by field, and store them in the distributed file system;

3. The cloud-computing-based web page ranking method according to claim 1, characterized in that the PageRank value in step (4) is computed as follows:

R'(u) denotes the importance score and is computed as follows:

This computation is carried out in parallel using distributed computing.

4. The cloud-computing-based web page ranking method according to claim 1, characterized in that the similarity in step (5) and the final web page importance are computed as follows:

The similarity is expressed using TF-IDF:

The final web page importance is:

[Formula image not reproduced in this text]