CN102982184A - Crawler algorithm for capturing webpage in online shopping mall - Google Patents

Crawler algorithm for capturing webpage in online shopping mall

Info

Publication number
CN102982184A
Authority
CN
China
Prior art keywords
page
url
formation
depth
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012105718194A
Other languages
Chinese (zh)
Inventor
陈志德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University
Priority to CN2012105718194A
Publication of CN102982184A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a crawler algorithm for capturing a webpage in an online shopping mall, which comprises the following steps: acquiring a page in the online shopping mall according to an initial link and adding a seed set in the page to a url queue; downloading the page according to the initial link, adding a new link to a list queue, and computing the degree of correlation of the page; setting a corresponding link value according to the depth of the page and the degree of correlation between the page and a topic; for the url existing in both the list queue and the url queue, comparing the potential coefficient in the url queue with that in the list queue to update the potential coefficient in the url queue; for the url existing in the list queue but not in the url queue, inserting the url to the url queue according to the potential coefficient; and finally, setting the depth according to the degree of correlation of the current page. The algorithm is favorable for precisely capturing the webpage in the online shopping mall related to the topic, and is rational in design and good in running effect.
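The queue update described in the abstract can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each queue entry is a `(url, coefficient)` pair and that the url queue is kept in descending order of potential coefficient. The function name `merge_into_url_queue` and the sample urls are hypothetical.

```python
def merge_into_url_queue(url_queue, list_queue):
    """Merge (url, coefficient) pairs from list_queue into url_queue.

    Assumption: both queues are lists of (url, coefficient) pairs and the
    url queue is ordered from the highest coefficient to the lowest.
    """
    best = dict(url_queue)
    for url, coeff in list_queue:
        if url in best:
            # url is in both queues: keep the larger coefficient
            best[url] = max(best[url], coeff)
        else:
            # url is only in the list queue: add it with its coefficient
            best[url] = coeff
    # sorted() is stable, so urls with equal coefficients keep arrival order
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


url_queue = [("mall/a", 0.9), ("mall/b", 0.4)]
list_queue = [("mall/b", 0.7), ("mall/c", 0.8), ("mall/a", 0.1)]
merged = merge_into_url_queue(url_queue, list_queue)
print(merged)
```

Rebuilding the whole queue on every merge keeps the sketch short; a priority queue would be a natural alternative for large crawls.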

Description

Crawler algorithm for capturing webpages in an online shopping mall
Technical field
The present invention relates to the technical field of web search, and in particular to a crawler algorithm for capturing webpages in an online shopping mall.
Background art
An online shopping mall is a virtual shop that uses the Internet as its operating carrier, relies on network resources, and uses the various means of e-commerce to carry out the whole process from purchase to sale. It thereby cuts out intermediate links, eliminates transport costs and the price markup of agents, and opens up a large space for development for ordinary consumption and for increased market circulation.
An online shopping mall allows goods to be browsed and bought 24 hours a day, and during working hours customers can contact customer service at any time to resolve difficulties encountered while shopping. It carries a large amount of information, which lets customers learn more and widens their choice. Its customer base is unlimited: anyone in the world can access it over the Internet, free of spatial constraints. Its service is of high quality: it can complete every transaction a conventional store can, and it can also give users more comprehensive product information through multimedia technology. Its costs are low: because storefront expenses are saved, the overall cost is much lower, so the prices shown to consumers can be considerably cheaper than in a traditional storefront. And because goods are picked in bulk and then distributed, the price advantage is fairly pronounced.
Although the online shopping mall has its advantages, it also has disadvantages. Online descriptions of goods can be very misleading, mainly because customers cannot see or touch the goods directly and shop entirely on the strength of the shopkeeper's description. Customers cannot judge whether goods come from regular channels or whether they are genuine. If the goods are counterfeit, the consumer's road to asserting their rights is often very long. Online stores' vetting of sellers is often very limited, which can also allow counterfeit and shoddy goods to proliferate. Over time this brings online stores a large number of negative reviews and harms their long-term development.
Summary of the invention
The object of the present invention is to provide a crawler algorithm for capturing webpages in an online shopping mall. The algorithm helps to capture accurately the webpages in the online shopping mall that are relevant to a topic, is rational in design, and runs well.
To achieve the above object, the technical solution of the present invention is a crawler algorithm for capturing webpages in an online shopping mall, comprising the following steps:
Step 1: Set the width, depth, and total of the crawl. The width is the number of links that may be visited from an irrelevant page, the depth is how much further the crawler may continue to follow links, and the total is the upper limit S on the number of pages visited. Input the initial link.
Step 2: Create the url queue, which stores the initial links to be crawled, and add the seed url set to it. The domain names of some online shopping malls can serve as url entry points from which the seed url set is obtained.
Step 3: If the number of pages visited is less than the upper limit S, or the length of the url queue is non-zero (that is, the url queue is not empty), download the corresponding page according to the initial link; otherwise end.
Step 4: Extract the newly crawled links into the list queue, compute the degree of correlation between the page and the topic, and then save the downloaded page. The list queue stores the links that have been crawled.
Step 5: Check the depth of the page. If the depth is greater than zero, go to step 6; otherwise return to step 3.
Step 6: Determine whether the page is relevant to the topic. If it is, increase the link value of the page's forward links; otherwise decrease the link value of the page's forward links.
Step 7: Determine whether the url is in the list queue. If it is, go to step 8; otherwise return to step 3.
Step 8: Determine whether the url is in the url queue. If it is, compare the correlation coefficient in the url queue with the one in the list queue and replace the coefficient in the url queue with the larger of the two; otherwise insert the url into the url queue according to the size of its correlation coefficient.
Step 9: If the current page is relevant, the depth is depth(page); otherwise the depth is depth(page) - 1, where depth(page) denotes the depth of the current page.
Step 10: Take the next url from the list queue and continue from step 7.
Step 11: The algorithm ends and outputs the topic-relevant webpages.
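The depth and link-value rules of steps 5, 6 and 9 can be sketched as three small functions. This is an illustrative sketch only: the function names are hypothetical, and the fixed adjustment `step=0.5` is an assumption, since the text does not say by how much a link value is increased or decreased.

```python
def may_follow_links(page_depth: int) -> bool:
    """Step 5: links of a page are processed only while its depth is above zero."""
    return page_depth > 0


def adjust_link_value(link_value: float, page_is_relevant: bool,
                      step: float = 0.5) -> float:
    """Step 6: raise the value of a page's forward links if the page is
    relevant to the topic, lower it otherwise.

    Assumption: a fixed step of 0.5; the source gives no amount.
    """
    return link_value + step if page_is_relevant else link_value - step


def depth_for_links(page_depth: int, page_is_relevant: bool) -> int:
    """Step 9: depth(page) for a relevant page, depth(page) - 1 otherwise."""
    return page_depth if page_is_relevant else page_depth - 1


# A relevant page keeps its remaining depth; an irrelevant one uses one level up.
print(depth_for_links(2, True), depth_for_links(2, False))
```

Under these rules a chain of irrelevant pages exhausts the depth budget and is abandoned, while relevant pages keep the crawl going.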
The beneficial effect of the invention is that, from the initial link of an online shopping mall given as input, it can search for and judge the webpages in the mall that are relevant to the topic, and thus capture topic-relevant webpages accurately. In addition, the algorithm is rational in design, works well in use, and can be widely applied to webpage capture in a variety of online shopping malls.
The present invention is described in further detail below with reference to the drawing and a specific embodiment.
Description of drawings
Fig. 1 is the workflow diagram of an embodiment of the invention.
Embodiment
The crawler algorithm of the present invention for capturing webpages in an online shopping mall, as shown in Fig. 1, comprises the following steps:
Step 1: Set the width, depth, and total of the crawl. The width is the number of links that may be visited from an irrelevant page, the depth is how much further the crawler may continue to follow links, and the total is the upper limit S on the number of pages visited. Input the initial link.
Step 2: Create the url queue, which stores the initial links to be crawled, and add the seed url set to it. The domain names of some online shopping malls can serve as url entry points from which the seed url set is obtained.
Step 3: If the number of pages visited is less than the upper limit S, or the length of the url queue is non-zero (that is, the url queue is not empty), download the corresponding page according to the initial link; otherwise end.
Step 4: Extract the newly crawled links into the list queue, compute the degree of correlation between the page and the topic, and then save the downloaded page. The list queue stores the links that have been crawled.
Step 5: Check the depth of the page. If the depth is greater than zero, go to step 6; otherwise return to step 3.
Step 6: Determine whether the page is relevant to the topic. If it is, increase the link value of the page's forward links; otherwise decrease the link value of the page's forward links.
Step 7: Determine whether the url is in the list queue. If it is, go to step 8; otherwise return to step 3.
Step 8: Determine whether the url is in the url queue. If it is, compare the correlation coefficient in the url queue with the one in the list queue and replace the coefficient in the url queue with the larger of the two; otherwise insert the url into the url queue according to the size of its correlation coefficient.
Step 9: If the current page is relevant, the depth is depth(page); otherwise the depth is depth(page) - 1, where depth(page) denotes the depth of the current page.
Step 10: Take the next url from the list queue and continue from step 7.
Step 11: The algorithm ends and outputs the topic-relevant webpages.
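The eleven steps can be put together in a small end-to-end sketch that crawls an in-memory toy site instead of the network. Everything concrete here is an assumption for illustration: the `site` dictionary, the keyword-overlap `relevance` function, the link-value rule (parent coefficient plus the score, or minus 0.5), and the reading of step 3 as "stop when the limit S is reached or the url queue is empty". The width parameter is omitted because the steps do not say where it is applied.

```python
def relevance(text, topic_words):
    """Assumed measure: fraction of topic words that occur in the page text."""
    words = set(text.lower().split())
    return len(words & topic_words) / len(topic_words)


def crawl(site, seeds, topic_words, depth=1, total=10):
    """site maps url -> (page text, list of linked urls); a stand-in for downloads."""
    url_queue = [(url, 1.0, depth) for url in seeds]       # steps 1-2
    visited, relevant_pages = [], []
    while url_queue and len(visited) < total:              # step 3 (assumed reading)
        url, coeff, page_depth = url_queue.pop(0)
        if url in visited or url not in site:
            continue
        text, list_queue = site[url]                       # step 4: page and its links
        visited.append(url)
        score = relevance(text, topic_words)
        is_relevant = score > 0
        if is_relevant:
            relevant_pages.append(url)
        if page_depth <= 0:                                # step 5
            continue
        # step 6: raise or lower the value given to this page's links (assumed amounts)
        link_value = coeff + score if is_relevant else coeff - 0.5
        # step 9: relevant pages keep their depth, irrelevant ones lose one level
        link_depth = page_depth if is_relevant else page_depth - 1
        for link in list_queue:                            # steps 7 and 10
            if link in visited:
                continue
            pos = next((i for i, e in enumerate(url_queue) if e[0] == link), None)
            if pos is None:                                # step 8: new url
                url_queue.append((link, link_value, link_depth))
            elif link_value > url_queue[pos][1]:           # step 8: keep the larger
                old_depth = url_queue[pos][2]
                url_queue[pos] = (link, link_value, max(old_depth, link_depth))
        url_queue.sort(key=lambda e: e[1], reverse=True)   # highest coefficient first
    return relevant_pages, visited                         # step 11


site = {
    "mall/home": ("shoes and phones on sale", ["mall/shoes", "mall/about"]),
    "mall/shoes": ("running shoes for sale", ["mall/shoes/1", "mall/home"]),
    "mall/about": ("company history", ["mall/jobs"]),
    "mall/shoes/1": ("red running shoes", []),
    "mall/jobs": ("open positions", ["mall/jobs/1"]),
    "mall/jobs/1": ("warehouse clerk", []),
}
relevant_pages, visited = crawl(site, ["mall/home"], {"shoes"}, depth=1)
print(relevant_pages)
```

In this toy run the crawler follows the shoe pages first and gives up on the jobs branch once the depth of its irrelevant pages reaches zero, so `mall/jobs/1` is never visited.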
The above is a preferred embodiment of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the scope of protection of the present invention.

Claims (1)

1. A crawler algorithm for capturing webpages in an online shopping mall, characterized in that it comprises the following steps:
Step 1: Set the width, depth, and total of the crawl. The width is the number of links that may be visited from an irrelevant page, the depth is how much further the crawler may continue to follow links, and the total is the upper limit S on the number of pages visited. Input the initial link.
Step 2: Create the url queue, which stores the initial links to be crawled, and add the seed url set to it.
Step 3: If the number of pages visited is less than the upper limit S, or the length of the url queue is non-zero (that is, the url queue is not empty), download the corresponding page according to the initial link; otherwise end.
Step 4: Extract the newly crawled links into the list queue, compute the degree of correlation between the page and the topic, and then save the downloaded page. The list queue stores the links that have been crawled.
Step 5: Check the depth of the page. If the depth is greater than zero, go to step 6; otherwise return to step 3.
Step 6: Determine whether the page is relevant to the topic. If it is, increase the link value of the page's forward links; otherwise decrease the link value of the page's forward links.
Step 7: Determine whether the url is in the list queue. If it is, go to step 8; otherwise return to step 3.
Step 8: Determine whether the url is in the url queue. If it is, compare the correlation coefficient in the url queue with the one in the list queue and replace the coefficient in the url queue with the larger of the two; otherwise insert the url into the url queue according to the size of its correlation coefficient.
Step 9: If the current page is relevant, the depth is depth(page); otherwise the depth is depth(page) - 1, where depth(page) denotes the depth of the current page.
Step 10: Take the next url from the list queue and continue from step 7.
Step 11: The algorithm ends and outputs the topic-relevant webpages.
CN2012105718194A 2012-12-26 2012-12-26 Crawler algorithm for capturing webpage in online shopping mall Pending CN102982184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012105718194A CN102982184A (en) 2012-12-26 2012-12-26 Crawler algorithm for capturing webpage in online shopping mall

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012105718194A CN102982184A (en) 2012-12-26 2012-12-26 Crawler algorithm for capturing webpage in online shopping mall

Publications (1)

Publication Number Publication Date
CN102982184A true CN102982184A (en) 2013-03-20

Family

ID=47856200

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012105718194A Pending CN102982184A (en) 2012-12-26 2012-12-26 Crawler algorithm for capturing webpage in online shopping mall

Country Status (1)

Country Link
CN (1) CN102982184A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205061A (en) * 2014-06-12 2015-12-30 中国银联股份有限公司 Method for acquiring page information of E-commerce website
CN105701167A (en) * 2015-12-31 2016-06-22 北京工业大学 Topic relevance judgement method based on coal mine safety event
CN105740363A (en) * 2016-01-26 2016-07-06 上海晶赞科技发展有限公司 Website target page discovery method and apparatus
CN107066603A (en) * 2017-04-21 2017-08-18 上海耐相智能科技有限公司 A kind of efficient grain public sentiment monitoring system
CN110442769A (en) * 2019-08-05 2019-11-12 深圳乐信软件技术有限公司 Distributed data crawls system, method, apparatus, equipment and storage medium
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device
CN112884072A (en) * 2021-03-22 2021-06-01 南京奥派信息产业股份公司 Commodity data classification processing and comparison technical method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MICHAEL HERSOVICI ET AL.: "The shark-search algorithm. An application: tailored Web site mapping", 《PROCEEDINGS OF THE SEVENTH INTERNATIONAL WORLD WIDE WEB CONFERENCE》, 30 April 1998 (1998-04-30), pages 317 - 326 *
PAUL DE BRA, ET AL.: "Information Retrieval in Distributed Hypertexts", 《PROCEEDINGS OF RIAO'94》, 31 December 1994 (1994-12-31), pages 1 - 11 *
侯震宇: "基于Fish算法的实时搜索系统的实现", 《信息检索技术》, no. 6, 25 November 2002 (2002-11-25) *
宋宇等: "基于改进的Fish-search算法的多媒体检索", 《计算机工程》, vol. 34, no. 11, 5 June 2008 (2008-06-05) *
罗方芳等: "基于改进的fish-search算法的信息检索研究", 《福州大学学报(自然科学版)》, vol. 34, no. 2, 30 April 2006 (2006-04-30) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205061A (en) * 2014-06-12 2015-12-30 中国银联股份有限公司 Method for acquiring page information of E-commerce website
CN105205061B (en) * 2014-06-12 2018-08-10 中国银联股份有限公司 A kind of page info acquisition methods of electric business website
CN105701167A (en) * 2015-12-31 2016-06-22 北京工业大学 Topic relevance judgement method based on coal mine safety event
CN105701167B (en) * 2015-12-31 2019-04-12 北京工业大学 Based on safety of coal mines event topic correlation method of discrimination
CN105740363A (en) * 2016-01-26 2016-07-06 上海晶赞科技发展有限公司 Website target page discovery method and apparatus
CN107066603A (en) * 2017-04-21 2017-08-18 上海耐相智能科技有限公司 A kind of efficient grain public sentiment monitoring system
CN110442769A (en) * 2019-08-05 2019-11-12 深圳乐信软件技术有限公司 Distributed data crawls system, method, apparatus, equipment and storage medium
CN111143649A (en) * 2019-12-09 2020-05-12 杭州迪普科技股份有限公司 Webpage searching method and device
CN112884072A (en) * 2021-03-22 2021-06-01 南京奥派信息产业股份公司 Commodity data classification processing and comparison technical method and system

Similar Documents

Publication Publication Date Title
CN102982184A (en) Crawler algorithm for capturing webpage in online shopping mall
CN103544632B (en) A kind of cyber personalized recommendation method and system
CN105447186A (en) Big data platform based user behavior analysis system
CN106202516A (en) A kind of e-commerce platform merchandise display method according to timing node
CN102073717A (en) Home page recommending method for orienting vertical e-commerce website
TW201239792A (en) Management and storage of distributed bookmarks
CN107730337A (en) Information-pushing method and device
CN102542046A (en) Book recommendation method based on book contents
CN104615721B (en) For the method and system based on return of goods related information Recommendations
CN103744904B (en) A kind of method and device that information is provided
Raff et al. Manufacturers and retailers in the global economy
Gim Evaluating factors influencing consumer satisfaction towards online shopping in Viet Nam
Datta et al. A mobile app search engine
Jaravel The unequal gains from product innovations
CN102789615A (en) Book information correlation recommendation method, server and system
Ranga et al. Search engine marketing-a study of marketing in digital age
CN107845005A (en) webpage generating method and device
US20160232543A1 (en) Predicting Interest for Items Based on Trend Information
Liang Application of big data technology in product selection on cross-border e-commerce platforms
Hamidizadeh et al. Investigating the effect of price image and social media on customers’ intention to purchase
CN110069717A (en) A kind of searching method and device
Balaji et al. A study on problems faced by the consumers and retailers in modern and traditional retail store outlets in India
Launders The transaction graph: requirements capture in semantic enterprise architectures
Walwyn A recommender system for e-retail
Lesley et al. Crossing the line: Direct estimation of cross-border cigarette sales and the effect on tax revenue

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130320