CN103559258A - Webpage ranking method based on cloud computation - Google Patents

Webpage ranking method based on cloud computation

Info

Publication number
CN103559258A
Authority
CN
China
Prior art keywords
webpage
web page
importance
offline
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310536603.9A
Other languages
Chinese (zh)
Inventor
向阳
平宇
张依杨
陈佑雄
张波
袁书寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201310536603.9A
Publication of CN103559258A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a webpage ranking method based on cloud computing. The method comprises the following steps: parsing the network files crawled by a distributed web crawler to obtain a basic topology information file of the network; computing the PageRank (PR) value offline and storing it in the corresponding document table, whose format uses url as the primary key and contains eight attribute columns including title, content, type, timestamp and outlinks; using parallel computation to build a word-to-webpage-importance index table, which is an inverted index whose format contains key and links (the link set, sorted by importance); implementing the offline PageRank algorithm on the MapReduce parallel architecture; and, at online query time, comparing the similarity between the query word and each webpage and giving the final webpage ranking in combination with the offline results. The method has the advantages that the offline ranking algorithm makes full use of the MapReduce parallel architecture, which improves offline ranking efficiency, and that combining keyword matching with PageRank makes the results more accurate.

Description

Webpage ranking method based on cloud computing
Technical field
The present invention relates to a distributed webpage ranking method, and in particular to a cloud-computing-based webpage ranking method for processing large volumes of data.
Background
With the rapid development of the Internet, the World Wide Web (WWW) has become a huge information space that provides users with valuable information resources. Faced with so much information, browsing page by page in a browser is very inconvenient, so obtaining the required information from the WWW quickly and accurately has become a vital problem. The emergence of search engines has greatly improved people's ability to gather information. However, existing search engines still have many problems in areas such as search efficiency, information maintenance, duplicated information, and network and website load.
At present, most search engines are centralized in architecture. Pages are fetched from the Internet, analyzed and processed, and all index information is then stored centrally at one site, which users access to run their queries. Such engines usually do not cooperate with one another; each searches and processes information independently, which causes a great deal of duplicated work and wasted bandwidth, and can sometimes even cause network congestion. This architecture has difficulty keeping up with the continual growth of the network, and the industry has proposed a series of strategies for building distributed search engines.
Traditional search engines, that is, general-purpose search engines, can provide users with a large number of search results. However, because the volume of data keeps growing, general-purpose search engines that aim to return more information find it hard to also maintain the accuracy and relevance of the results, so users are dissatisfied with what the search engine returns. Webpage ranking algorithms that balance search engine response time and accuracy have therefore become a focus of research.
The limitations of the webpage ranking algorithms of traditional search engines are as follows:
(1) Massive network data: the quantity of network information is large and its coverage is broad, so computing over and storing this data consumes a great deal of time and storage space.
(2) Response time of online ranking: because a large number of webpages can be retrieved for a keyword, the response time of the search engine is relatively long and does not meet the response time requirements of a search engine.
(3) Accuracy of webpage ranking: the precision of webpage ranking depends not only on the keyword but also, to an important degree, on the quality of the webpage itself. The ranking of webpages therefore needs to take both into account.
(4) Expression of query terms: because users lack domain knowledge and the query interface of a search engine is limited, the user's search intent cannot be captured accurately.
Therefore, how to let users conveniently obtain the information they need from a massive set of search results through webpage ranking has become a problem in urgent need of a solution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a webpage ranking method based on cloud computing that improves the efficiency of offline ranking.
To solve the above technical problem, the invention provides a webpage ranking method based on cloud computing, which uses the storage and computing advantages of a cluster to speed up the processing of massive back-end data. The ranking method comprises the following steps:
(1) parse the network files that were crawled by a distributed web crawler and stored on the cloud, to obtain the basic topology information file of the network;
(2) compute the PR value offline and store it in the corresponding document table, whose format uses url as the primary key and comprises 8 attributes such as title, content, type, timestamp and outlinks (the set of outgoing links);
(3) build the word-to-webpage-importance index table, also using parallel computation; the index table is built as an inverted index whose format contains key and links (the link set, sorted by importance);
(4) implement the offline PageRank algorithm on the MapReduce parallel architecture;
(5) at online query time, compare the similarity between the query word and each webpage, and give the final webpage ranking in combination with the offline results.
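The document table of step (2) and the index table of step (3) can be sketched in memory as follows. This is a minimal illustration, not the patent's storage schema: the text names only five of the eight attributes, so the `pr` field, the `DocumentRow` and `build_index` names, and the whitespace tokenization are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from collections import defaultdict


@dataclass
class DocumentRow:
    """One row of the document table, keyed by url (field set is illustrative)."""
    url: str
    title: str = ""
    content: str = ""
    type: str = "text/html"
    timestamp: int = 0
    outlinks: list = field(default_factory=list)
    pr: float = 0.0  # assumed column: PR value written back by the offline job


def build_index(rows):
    """Build the key -> links index, each link list sorted by descending PR."""
    postings = defaultdict(set)
    for row in rows:
        # Whitespace splitting is a simplification; real text needs a tokenizer.
        for word in set(row.content.lower().split()):
            postings[word].add(row.url)
    pr_of = {row.url: row.pr for row in rows}
    return {
        word: sorted(urls, key=lambda u: (-pr_of[u], u))
        for word, urls in postings.items()
    }


rows = [
    DocumentRow(url="http://a.example/", content="cloud ranking", pr=0.2),
    DocumentRow(url="http://b.example/", content="cloud search", pr=0.5),
]
index = build_index(rows)
print(index["cloud"])
```

Sorting each link set by importance when the index is built means an online query can read the candidates for a word already in PR order.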
The parsing of the crawled network files in step (1) comprises the following steps:
1. Crawl web addresses on the Internet, parse them by field into the corresponding format, and store the result in a distributed file system.
2. Normalize the files stored in the distributed file system using parallel processing, obtain the file of webpage topology information, and store it back in the distributed file system.
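As a rough sketch of how one crawled page might be turned into a topology record, the example below extracts and normalizes outgoing links with the Python standard library. The tab-separated record layout and the names `LinkExtractor` and `topology_line` are hypothetical; the patent does not specify the file format.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def topology_line(url, html):
    """Return one 'url<TAB>outlink,outlink,...' record for a crawled page."""
    parser = LinkExtractor(url)
    parser.feed(html)
    outlinks = sorted(set(parser.links) - {url})  # dedupe, drop self-links
    return url + "\t" + ",".join(outlinks)


page = (
    '<a href="/about">About</a> '
    '<a href="http://b.example/x#top">B</a> '
    '<a href="mailto:x@y.z">mail</a>'
)
print(topology_line("http://a.example/", page))
```

Because each page is processed independently, a function like this could serve as the body of a map task over the crawled files.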
The PageRank value in step (4) is computed as follows:
R'(u) denotes the importance score, computed as follows:
[Formula image: definition of R'(u)]
where R'(u) is the importance score of webpage u; R'(v) is the score of webpage v, where webpage v contains a link pointing to webpage u; N_v denotes the size of the set of webpages that v links to; c is a constant; and E(u) denotes a distribution function over u.
Because the records are independent of one another, this computation can be distributed and carried out in parallel.
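The symbols described here match the PageRank recurrence published by Page and Brin. Assuming the patent's formula follows that published form (an assumption, since the formula image is not available in this text), it would read as below, where B_u, the set of webpages that link to u, is notation introduced only for this sketch:

```latex
R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)
```

Each term of the sum depends only on one linking page v, which is why the contributions can be computed independently and then added up per target page.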
The similarity in step (5) and the final webpage importance are computed as follows:
Similarity is represented by TF-IDF:
[Formula images: TF-IDF similarity]
The final webpage importance is:
[Formula image: final webpage importance score]
where TF_{i,j} denotes the number of times term i occurs in webpage j, and IDF_i denotes the inverse document frequency, which is commonly used to describe how distinctive a word is; the final score is a weighted linear combination of TFIDF and PR.
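Under the usual textbook definitions, the quantities described above could be written as follows. This is a hedged reconstruction rather than the patent's own formulas: N (the total number of webpages), n_i (the number of webpages containing term i), q (the set of query terms) and the weights α and β are all symbols assumed for this sketch.

```latex
\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i, \qquad
\mathrm{IDF}_i = \log \frac{N}{n_i}, \qquad
\mathrm{score}_j = \alpha \sum_{i \in q} \mathrm{TFIDF}_{i,j} + \beta \, \mathrm{PR}_j
```

A rare term has a small n_i and therefore a large IDF_i, so matching it contributes more to the score than matching a common word.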
The concept set is the set of words contained in the document table.
The attribute set is the set of parameters that describe the features of a term.
The instance set is the set of search records that contain certain keywords.
The relation set is the set of keywords and index records.
The attribute set comprises TFIDF, the PageRank value, and the positions at which the keyword occurs.
Compared with the prior art, the present invention has the following advantages:
1. It makes good use of distributed storage;
2. It adopts an improved offline ranking algorithm that takes full advantage of the MapReduce parallel architecture, improving the efficiency of offline ranking;
3. It combines keyword matching with PageRank, which makes the results more accurate.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of retrieval results.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the invention provides a webpage ranking method based on cloud computing, which uses the storage and computing advantages of a cluster to speed up the processing of massive back-end data. The ranking method comprises the following steps:
(1) parse the network files that were crawled by a distributed web crawler and stored on the cloud, to obtain the basic topology information file of the network;
(2) compute the PR value offline and store it in the corresponding document table, whose format uses url as the primary key and comprises 8 attributes such as title, content, type, timestamp and outlinks (the set of outgoing links);
(3) build the word-to-webpage-importance index table, also using parallel computation; the index table is built as an inverted index whose format contains key and links (the link set, sorted by importance);
(4) implement the offline PageRank algorithm on the MapReduce parallel architecture;
(5) at online query time, compare the similarity between the query word and each webpage, and give the final webpage ranking in combination with the offline results.
The parsing of the crawled network files in step (1) comprises the following steps:
1. Crawl web addresses on the Internet, parse them by field into the corresponding format, and store the result in a distributed file system.
2. Normalize the files stored in the distributed file system using parallel processing, obtain the file of webpage topology information, and store it back in the distributed file system.
The PageRank value in step (4) is computed as follows:
R'(u) denotes the importance score, computed as follows:
[Formula image: definition of R'(u)]
where R'(u) is the importance score of webpage u; R'(v) is the score of webpage v, where webpage v contains a link pointing to webpage u; N_v denotes the size of the set of webpages that v links to; c is a constant; and E(u) denotes a distribution function over u.
Because the records are independent of one another, this computation can be distributed and carried out in parallel.
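The parallel structure can be illustrated with a small in-memory imitation of the map, shuffle and reduce phases. This sketch uses the widely used damping-factor variant of PageRank with a uniform E(u), which is an assumption and not necessarily the patent's exact formula; the function names and the damping value `d=0.85` are likewise hypothetical, and a real job would run on a MapReduce cluster rather than in one process.

```python
from collections import defaultdict


def map_page(url, rank, outlinks):
    """Map: emit the page's share of rank to every page it links to."""
    yield url, 0.0  # keep pages that receive no links in the output
    for target in outlinks:
        yield target, rank / len(outlinks)


def reduce_page(contributions, d, e_u):
    """Reduce: combine the contributions that arrive for a single page."""
    return (1.0 - d) * e_u + d * sum(contributions)


def pagerank(graph, d=0.85, iterations=30):
    """Run the map and reduce phases repeatedly over an in-memory graph."""
    n = len(graph)
    ranks = {url: 1.0 / n for url in graph}
    for _ in range(iterations):
        shuffled = defaultdict(list)  # stands in for the shuffle phase
        for url, outlinks in graph.items():
            for key, value in map_page(url, ranks[url], outlinks):
                shuffled[key].append(value)
        ranks = {u: reduce_page(vals, d, 1.0 / n) for u, vals in shuffled.items()}
    return ranks


# Every page here has at least one outlink, so no rank mass is lost.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Each call to `map_page` reads only one record, and each call to `reduce_page` sees only the values for one key, which is the independence between records that makes the computation parallelizable. Pages without outlinks would need extra handling that this sketch omits.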
The similarity in step (5) and the final webpage importance are computed as follows:
Similarity is represented by TF-IDF:
[Formula image: TF-IDF similarity]
The final webpage importance is:
[Formula image: final webpage importance score]
where TF_{i,j} denotes the number of times term i occurs in webpage j, and IDF_i denotes the inverse document frequency, which is commonly used to describe how distinctive a word is; the final score is a weighted linear combination of TFIDF and PR.
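A minimal sketch of the online scoring step is shown below. It assumes the textbook form IDF = log(N / n_i) and hypothetical weights `alpha` and `beta`, none of which are given in the text, and it scores every document instead of first looking up candidates in the index table.

```python
import math
from collections import Counter


def tfidf_scores(query_terms, docs):
    """Return {url: summed TF-IDF of the query terms} over tokenized docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    scores = {}
    for url, tokens in docs.items():
        tf = Counter(tokens)
        scores[url] = sum(
            tf[t] * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t]  # skip terms that appear in no document
        )
    return scores


def final_ranking(query_terms, docs, pr, alpha=0.7, beta=0.3):
    """Sort urls by alpha * TFIDF + beta * PR, highest score first."""
    sim = tfidf_scores(query_terms, docs)
    score = {url: alpha * sim[url] + beta * pr[url] for url in docs}
    return sorted(docs, key=lambda url: (-score[url], url))


docs = {
    "u1": ["tongji", "university", "news"],
    "u2": ["tongji", "tongji", "finance"],
    "u3": ["finance", "news"],
}
pr = {"u1": 0.5, "u2": 0.2, "u3": 0.3}
print(final_ranking(["tongji"], docs, pr))
```

In this toy data, u2 has the lowest PR value but mentions the query term twice, so the weighted combination can still place it first; changing `alpha` and `beta` shifts the balance between relevance and page quality.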
The concept set is the set of words contained in the document table.
The attribute set is the set of parameters that describe the features of a term.
The instance set is the set of search records that contain certain keywords.
The relation set is the set of keywords and index records.
The attribute set comprises TFIDF, the PageRank value, and the positions at which the keyword occurs.
The web data in this case is collected from the Internet automatically by a distributed web crawler. The financial cloud search system deploys servers in multiple locations, and its search scope covers part of the websites in Chinese-language regions such as mainland China, Hong Kong, Taiwan, Macao and Singapore, as well as in North America and Europe. The search engine covers comprehensive finance-related webpage information; the total currently exceeds 2,000,000 webpages and is still growing every day.
An example is provided for each output type. As an illustrative example, the input "Tongji University" is retrieved, and the output results are presented as shown in Fig. 2.
The retrieval results show that this method, which adopts distributed webpage ranking and considers both the relevance between the query word and the webpage and the quality of the webpage itself, obtains good query results.
This completes the description of the cloud-computing-based ranking method in this case. The method can serve as a prototype webpage ranking method for subsequently customized vertical search engines, industry-specific search engines and other related search engines.

Claims (4)

1. A webpage ranking method based on cloud computing, the ranking method comprising the following steps:
(1) parsing the network files that were crawled by a distributed web crawler and stored on the cloud, to obtain the basic topology information file of the network;
(2) computing the PR value offline and storing it in the corresponding document table, whose format uses url as the primary key and comprises 8 attributes such as title, content, type, timestamp and outlinks (the set of outgoing links);
(3) building the word-to-webpage-importance index table, also using parallel computation, the index table being built as an inverted index whose format contains key and links (the link set, sorted by importance);
(4) implementing the offline PageRank algorithm on the MapReduce parallel architecture;
(5) at online query time, comparing the similarity between the query word and each webpage, and giving the final webpage ranking in combination with the offline results.
2. The webpage ranking method based on cloud computing according to claim 1, characterized in that the parsing of the crawled network files in step (1) comprises the following steps:
1. crawling web addresses on the Internet, parsing them by field into the corresponding format, and storing the result in a distributed file system;
2. normalizing the files stored in the distributed file system using parallel processing, obtaining the file of webpage topology information, and storing it back in the distributed file system.
3. The webpage ranking method based on cloud computing according to claim 1, characterized in that the PageRank value in step (4) is computed as follows:
R'(u) denotes the importance score, computed as follows:
[Formula image: definition of R'(u)]
where R'(u) is the importance score of webpage u; R'(v) is the score of webpage v, where webpage v contains a link pointing to webpage u; N_v denotes the size of the set of webpages that v links to; c is a constant; and E(u) denotes a distribution function over u;
this computation is distributed and carried out in parallel.
4. The webpage ranking method based on cloud computing according to claim 1, characterized in that the similarity in step (5) and the final webpage importance are computed as follows:
similarity is represented by TF-IDF:
[Formula images: TF-IDF similarity]
the final webpage importance is:
[Formula image: final webpage importance score]
where TF_{i,j} denotes the number of times term i occurs in webpage j, and IDF_i denotes the inverse document frequency, which is commonly used to describe how distinctive a word is; the final score is a weighted linear combination of TFIDF and PR.
CN201310536603.9A 2013-11-04 2013-11-04 Webpage ranking method based on cloud computation Pending CN103559258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310536603.9A CN103559258A (en) 2013-11-04 2013-11-04 Webpage ranking method based on cloud computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310536603.9A CN103559258A (en) 2013-11-04 2013-11-04 Webpage ranking method based on cloud computation

Publications (1)

Publication Number Publication Date
CN103559258A (en) 2014-02-05

Family

ID=50013505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310536603.9A Pending CN103559258A (en) 2013-11-04 2013-11-04 Webpage ranking method based on cloud computation

Country Status (1)

Country Link
CN (1) CN103559258A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870329A (en) * 2014-03-03 2014-06-18 同济大学 Distributed crawler task scheduling method based on weighted round-robin algorithm
CN105335363A (en) * 2014-05-28 2016-02-17 华为技术有限公司 Object pushing method and system
CN105808779A (en) * 2016-03-30 2016-07-27 北京大学 Picture roaming parallel computing method based on pruning and application
CN105912673A (en) * 2016-04-11 2016-08-31 天津大学 Optimization method for Micro Blog search based on personalized characteristics of user
CN106557483A (en) * 2015-09-25 2017-04-05 阿里巴巴集团控股有限公司 A kind of data processing, data query method and apparatus
CN107943994A (en) * 2017-12-04 2018-04-20 重庆第二师范学院 A kind of Web page sequencing method and system based on transition probability
CN109274750A (en) * 2018-10-07 2019-01-25 杭州安恒信息技术股份有限公司 A method of it is normally accessed based on user after the broken string of cloud platform guarantee website online
CN111353083A (en) * 2018-12-20 2020-06-30 中国科学院计算机网络信息中心 Method and device for sorting web pages through computing cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253971A (en) * 2011-06-14 2011-11-23 南京信息工程大学 PageRank method based on quick similarity
US20120330864A1 (en) * 2011-06-21 2012-12-27 Microsoft Corporation Fast personalized page rank on map reduce

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
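张超 (Zhang Chao): "Research and Implementation of a Distributed Search Engine Based on MapReduce", China Master's Theses Full-text Database *
陈宫 (Chen Gong) et al.: "Research on the PageRank Algorithm Based on MapReduce", Microelectronics & Computer *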

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870329A (en) * 2014-03-03 2014-06-18 同济大学 Distributed crawler task scheduling method based on weighted round-robin algorithm
CN103870329B (en) * 2014-03-03 2017-01-18 同济大学 Distributed crawler task scheduling method based on weighted round-robin algorithm
CN105335363A (en) * 2014-05-28 2016-02-17 华为技术有限公司 Object pushing method and system
CN105335363B (en) * 2014-05-28 2018-12-07 华为技术有限公司 A kind of Object Push method and system
CN106557483A (en) * 2015-09-25 2017-04-05 阿里巴巴集团控股有限公司 A kind of data processing, data query method and apparatus
CN105808779A (en) * 2016-03-30 2016-07-27 北京大学 Picture roaming parallel computing method based on pruning and application
CN105912673A (en) * 2016-04-11 2016-08-31 天津大学 Optimization method for Micro Blog search based on personalized characteristics of user
CN107943994A (en) * 2017-12-04 2018-04-20 重庆第二师范学院 A kind of Web page sequencing method and system based on transition probability
CN109274750A (en) * 2018-10-07 2019-01-25 杭州安恒信息技术股份有限公司 A method of it is normally accessed based on user after the broken string of cloud platform guarantee website online
CN111353083A (en) * 2018-12-20 2020-06-30 中国科学院计算机网络信息中心 Method and device for sorting web pages through computing cluster
CN111353083B (en) * 2018-12-20 2023-04-28 中国科学院计算机网络信息中心 Method and device for ordering web pages through computing clusters

Similar Documents

Publication Publication Date Title
US10261954B2 (en) Optimizing search result snippet selection
CN103559258A (en) Webpage ranking method based on cloud computation
CN105022827B (en) A kind of Web news dynamic aggregation method of domain-oriented theme
Cafarella et al. Structured data on the web
CN103617174A (en) Distributed searching method based on cloud computing
CN102890713B (en) A kind of music recommend method based on user's current geographic position and physical environment
CN111708740A (en) Mass search query log calculation analysis system based on cloud platform
WO2015172567A1 (en) Internet information searching, aggregating and presentation method
CN104199833A (en) Network search term clustering method and device
CN104778208A (en) Method and system for optimally grasping search engine SEO (search engine optimization) website data
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN113239111B (en) Knowledge graph-based network public opinion visual analysis method and system
US11249993B2 (en) Answer facts from structured content
CN103207864A (en) Online novel content similarity comparison method
CN103745006A (en) Internet information searching system and internet information searching method
Cao et al. Searching for truth in a database of statistics
Li [Retracted] Internet Tourism Resource Retrieval Using PageRank Search Ranking Algorithm
Han et al. Design and implementation of elasticsearch for media data
CN105808761A (en) Solr webpage sorting optimization method based on big data
CN109948015B (en) Meta search list result extraction method and system
Jin et al. Tise: A temporal search engine for web contents
Bharamagoudar et al. Literature survey on web mining
Guo et al. AOL4PS: A large-scale data set for personalized search
Li et al. Research of network data mining based on reliability source under big data environment
CN105912584B (en) Data indexing system based on webpage information data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140205