CN102982184A

CN102982184A - Crawler algorithm for capturing webpage in online shopping mall

Info

Publication number: CN102982184A
Application number: CN2012105718194A
Authority: CN
Inventors: 陈志德
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2012-12-26
Filing date: 2012-12-26
Publication date: 2013-03-20

Abstract

The invention relates to a crawler algorithm for capturing webpages in an online shopping mall, comprising the following steps: acquiring a page in the online shopping mall according to an initial link and adding a seed set of urls to a url queue; downloading the page according to the initial link, adding newly found links to a list queue, and computing the degree of correlation between the page and the topic; setting a corresponding link value according to the depth of the page and the degree of correlation between the page and the topic; for a url present in both the list queue and the url queue, comparing its potential coefficient in the url queue with that in the list queue and updating the potential coefficient in the url queue; for a url present in the list queue but not in the url queue, inserting it into the url queue according to its potential coefficient; and finally, setting the depth according to the degree of correlation of the current page. The algorithm helps to precisely capture topic-relevant webpages in the online shopping mall, is rationally designed, and runs well.

Description

Crawler algorithm for crawling online shopping mall webpages

Technical field

The present invention relates to the technical field of web search, and in particular to a crawler algorithm for crawling online shopping mall webpages.

Background art

An online shopping mall is a virtual store that uses the Internet as its operating carrier, relies on network resources, and uses the various means of e-commerce to complete the whole process from buying to selling. It thereby cuts out intermediate links, eliminates transportation costs and the price markup of intermediaries, and opens up huge room for growth in ordinary consumption and market circulation.

An online shopping mall lets customers browse and buy goods 24 hours a day and, during working hours, contact customer service at any time to resolve difficulties met while shopping. It carries a large amount of information, which gives customers a better understanding and a wider choice. Its customer base is unlimited: anyone in the world can access it over the Internet, free of spatial constraints. Its service is of high quality: it can complete every transaction a conventional store can, and it can also give users more comprehensive product information through multimedia technology. Its costs are low: because storefront expenses are saved, the overall cost is much lower, so consumer goods can be priced considerably below traditional storefronts. In addition, because goods are purchased in bulk and then distributed, the price advantage is evident.

Although online shopping malls have these advantages, they also have disadvantages. Product descriptions in online stores can be highly misleading, mainly because customers cannot see or touch the goods directly and must shop entirely on the basis of the shopkeeper's description. Customers cannot judge whether the goods come from legitimate channels or whether they are genuine. If the goods are counterfeit, the consumer's path to redress is often very long. Online stores' vetting of sellers is often severely limited, which also allows counterfeit and shoddy goods to proliferate. Over time this leads to large numbers of negative reviews of online stores and harms their long-term development.

Summary of the invention

The object of the present invention is to provide a crawler algorithm for crawling online shopping mall webpages. The algorithm helps to accurately capture topic-relevant webpages in an online shopping mall, is rationally designed, and runs well.

To achieve the above object, the technical solution of the present invention is a crawler algorithm for crawling online shopping mall webpages, comprising the following steps:

Step 1: set the width, depth, and total of the crawl, where the width is the number of links to irrelevant pages that may be accessed, the depth is how much further the crawler may continue to follow links, and the total is the upper limit S on the number of webpages accessed; input the initial link;

Step 2: create the url queue, which stores the initial links to be crawled, and add the url subset to the url queue; the url subset can be obtained by using the domain names of several online shopping malls as entry urls;

Step 3: if the number of pages accessed is less than the upper limit S, or the length of the url queue is non-zero (that is, the url queue is not empty), download the corresponding page according to the initial link; otherwise, end;

Step 4: extract the newly crawled links into the list queue, compute the degree of correlation between the page and the topic, and then save the downloaded page; the list queue stores the links that have been crawled;

Step 5: check the depth of the page; if the depth is greater than zero, go to step 6; otherwise, return to step 3;

Step 6: determine whether the page is relevant to the topic; if it is, increase the link value of the page's outgoing links; otherwise, decrease the link value of the page's outgoing links;

Step 7: determine whether the url is in the list queue; if it is, go to step 8; otherwise, return to step 3;

Step 8: determine whether the url is in the url queue; if it is, compare its correlation coefficient in the url queue with that in the list queue, and replace the coefficient in the url queue with the larger of the two; otherwise, insert the url into the url queue at the position given by its correlation coefficient;

Step 9: if the current page is relevant, the depth is depth(page); otherwise, the depth is depth(page) - 1, where depth(page) denotes the depth of the current page;

Step 10: take the next url from the list queue and resume from step 7;

Step 11: the algorithm ends and outputs the topic-relevant webpages.
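
The steps above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that the text does not specify: the function names `relevance` and `crawl`, the injected `fetch` callable, the keyword-overlap relevance measure, the `threshold` and `step` parameters, and the initial coefficient of 1.0 for seed urls are all hypothetical. The sketch continues only while both the page limit S has not been reached and the url queue is not empty, since an empty queue leaves nothing to download, and it omits the width parameter of step 1.

```python
import re
from typing import Callable, Dict, Iterable, List, Set, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def relevance(text: str, topic_words: List[str]) -> float:
    """Assumed measure: share of topic words that occur in the page text."""
    if not topic_words:
        return 0.0
    words = set(re.findall(r"\w+", text.lower()))
    return sum(w.lower() in words for w in topic_words) / len(topic_words)


def crawl(seeds: Iterable[str], fetch: Callable[[str], Page],
          topic_words: List[str], depth: int = 2, total: int = 100,
          threshold: float = 0.5, step: float = 0.1) -> List[str]:
    """Return the urls of topic-relevant pages, in the order they were visited."""
    # Steps 1-2: url queue maps url -> (coefficient, remaining depth).
    url_queue: Dict[str, Tuple[float, int]] = {u: (1.0, depth) for u in seeds}
    visited: Set[str] = set()
    relevant: List[str] = []

    # Step 3 (assumption: both conditions must hold to continue).
    while url_queue and len(visited) < total:
        # Crawl the queued url with the highest coefficient first.
        url = max(url_queue, key=lambda u: url_queue[u][0])
        _, remaining = url_queue.pop(url)
        visited.add(url)

        # Step 4: download the page, collect its links, score it.
        text, list_queue = fetch(url)
        score = relevance(text, topic_words)
        is_relevant = score >= threshold
        if is_relevant:
            relevant.append(url)

        # Step 5: a page with no remaining depth is not expanded.
        if remaining <= 0:
            continue

        # Step 6: raise or lower the value given to the outgoing links.
        link_value = score + step if is_relevant else score - step
        # Step 9: relevant pages keep their depth, others lose one level.
        child_depth = remaining if is_relevant else remaining - 1

        # Steps 7, 8, 10: walk the list queue and merge it into the url queue.
        for link in list_queue:
            if link in visited:
                continue
            if link in url_queue:
                old_value, old_depth = url_queue[link]
                if link_value > old_value:  # keep the larger coefficient
                    url_queue[link] = (link_value, max(old_depth, child_depth))
            else:
                url_queue[link] = (link_value, child_depth)

    # Step 11: output the topic-relevant pages.
    return relevant
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP or HTML library: in a test it can be a lookup into an in-memory dictionary, and in practice it would wrap whatever downloader and link extractor are in use.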

The beneficial effect of the invention is that, starting from the input initial link of an online shopping mall, it searches for and identifies webpages relevant to the topic, and thereby achieves accurate crawling of topic-relevant webpages in the online shopping mall. In addition, the algorithm is rationally designed, works well in use, and can be widely applied to webpage crawling in a variety of online shopping malls.

The present invention is described in further detail below with reference to the drawing and a specific embodiment.

Description of drawings

Fig. 1 is the workflow diagram of the embodiment of the invention.

Embodiment

The crawler algorithm of the present invention for crawling online shopping mall webpages, as shown in Fig. 1, comprises the following steps:

Step 8: determine whether the url is in the url queue; if it is, compare its correlation coefficient in the url queue with that in the list queue, and replace the coefficient in the url queue with the larger of the two; otherwise, insert the url into the url queue at the position given by its correlation coefficient;

Step 11: the algorithm ends and outputs the topic-relevant webpages.
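
The queue update of step 8 can be illustrated in isolation. The following is a minimal sketch, assuming the url queue is a Python list of `(coefficient, url)` pairs kept in descending order of coefficient; the function name `merge_into_url_queue` and this representation are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple


def merge_into_url_queue(url_queue: List[Tuple[float, str]],
                         url: str, coeff: float) -> None:
    """Update url_queue in place with a url taken from the list queue.

    url_queue is assumed to be sorted by descending coefficient.
    """
    # Case 1: the url is already queued, so keep the larger coefficient.
    for i, (old_coeff, queued_url) in enumerate(url_queue):
        if queued_url == url:
            if coeff <= old_coeff:
                return            # the queue already holds the larger value
            del url_queue[i]      # drop the stale entry and re-insert below
            break

    # Case 2 (or re-insert): place the url according to its coefficient,
    # after all entries whose coefficient is at least as large.
    pos = 0
    while pos < len(url_queue) and url_queue[pos][0] >= coeff:
        pos += 1
    url_queue.insert(pos, (coeff, url))
```

Keeping the list ordered means the next url to crawl is always the first element, at the cost of a linear scan on every update.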

The above is a preferred embodiment of the present invention. Any change made according to the technical solution of the present invention whose resulting function does not go beyond the scope of that technical solution falls within the protection scope of the present invention.

Claims

1. A crawler algorithm for crawling online shopping mall webpages, characterized in that it comprises the following steps:

Step 2: create the url queue, which stores the initial links to be crawled, and add the url subset to the url queue;

Step 11: the algorithm ends and outputs the topic-relevant webpages.