CN105205061A

CN105205061A - Method for acquiring page information of E-commerce website

Info

Publication number: CN105205061A
Application number: CN201410260218.0A
Authority: CN
Inventors: 冯亮; 尹亚伟; 费志军
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2014-06-12
Filing date: 2014-06-12
Publication date: 2015-12-30
Anticipated expiration: 2034-06-12
Also published as: CN105205061B

Abstract

The invention discloses a method for acquiring page information of an E-commerce website. The E-commerce website comprises a main page, navigation pages and commodity pages. The main page is provided with navigation page URLs. The navigation pages are provided with commodity page URLs. Each navigation page corresponds to commodities of one category. The method includes the steps that the navigation page URLs are extracted from the main page, the commodity page URLs are extracted from the navigation pages, and the commodity page information is downloaded according to the commodity page URLs.

Description

A method for acquiring page information of an e-commerce website

Technical field

The present invention relates to Internet information technology, and in particular to a method for acquiring page information of an e-commerce website.

Background art

The pages of e-commerce websites (such as Taobao and Amazon) contain information such as commodities and user comments. This information can be collected for data analysis, for example for personalized recommendation, commodity marketing analysis, and sentiment analysis.

In the prior art, a web crawler, a program that traverses and downloads network resources, is commonly used to collect website page information. The crawler determines the order in which pages are crawled, the update strategy, and so on according to a page crawling strategy. Mainstream open-source web crawlers such as Nutch, Heritrix, and Crawler4J use a depth-first or breadth-first search strategy, analyzing the links in crawled pages to discover new pages; they also re-traverse all pages of the website at fixed intervals and then update the crawled page content. For e-commerce websites, this scheme cannot crawl page information stably, consumes considerable computing resources, updates the crawled content indiscriminately, and makes the crawling process difficult to control.

Summary of the invention

According to one object of the present invention, a method for acquiring page information of an e-commerce website is disclosed. The e-commerce website comprises a homepage, navigation pages, and commodity pages, wherein the homepage has navigation page URLs, each navigation page has commodity page URLs, and each navigation page corresponds to commodities of one category. The method comprises:

extracting navigation page URLs from the homepage,

extracting commodity page URLs from the navigation pages,

downloading commodity page information according to the commodity page URLs,

wherein the commodity page URLs are downloaded again from the navigation page at predetermined time intervals, the downloaded commodity page URLs being sorted by listing time (rendered as "shelf life" in the machine translation),

wherein the commodity page information is updated by the following process:

comparing, in order, the commodity page URLs of the most recently downloaded navigation page with the historically downloaded commodity page URLs of that navigation page at the corresponding ordinal positions; when the comparison result is not identical, updating the historically downloaded commodity page URL of that navigation page to the commodity page URL of the most recently downloaded navigation page at the corresponding ordinal position, and downloading the commodity information of that commodity page URL of the most recently downloaded navigation page.

In one embodiment, the commodity page information is updated by the following process:

comparing, in order, the commodity page URLs of the most recently downloaded navigation page with the historically downloaded commodity page URLs of that navigation page at the corresponding ordinal positions, and comparing the SimHash feature codes of the commodity pages of the most recently downloaded commodity page URLs with the SimHash feature codes of the commodity pages of the historically downloaded commodity page URLs at the corresponding ordinal positions;

when a commodity page URL of the most recently downloaded navigation page is not identical to the historically downloaded commodity page URL at the corresponding ordinal position, and the difference between the SimHash feature codes of the two corresponding commodity pages is less than a set value, updating the historically downloaded commodity page URL of that navigation page to the commodity page URL of the most recently downloaded navigation page at the corresponding ordinal position, and downloading the commodity information of that commodity page URL of the most recently downloaded navigation page.

In one embodiment, navigation page URLs are extracted from the homepage by means of a navigation page URL pattern, wherein the navigation page URL pattern is expressed as a regular expression.

In one embodiment, extracting commodity page URLs from a navigation page comprises the following process:

extracting, according to a commodity page XPath pattern, the URLs that match the XPath pattern from the navigation page as commodity page URLs, wherein the XPath pattern of the commodity pages of the category corresponding to a navigation page is obtained by the following process:

creating the set of XPaths of the nodes at which all URLs of the navigation page are located,

clustering this XPath set with a KNN clustering algorithm, wherein the XPath distance is used as the distance metric of the KNN clustering algorithm,

selecting the class with the largest number of members as the template area of the commodity page URLs,

generalizing the commodity page URLs in the selected class into a pattern to obtain the XPath pattern of the commodity pages of that category of the navigation page,

wherein the XPath distance dis(p₁, p₂) indicates the degree of difference between XPath p₁ and XPath p₂, and its calculation formula is as follows:

[Formula for the XPath distance; not reproduced in the source text.]

where TL() is a function that decomposes an XPath into a set of ordered pairs, each ordered pair consisting of a tag name in the XPath and the position number of that tag in the XPath.

Advantages of the present invention include one or more of the following. When pages are updated, there is no need to traverse all pages of the website, which avoids meaningless repeated page downloads. The unsupervised template region recognition technique requires no manually labeled training data and is highly automated. Downloads are partitioned by commodity category. Compared with breadth-first and depth-first strategies, the page search order is clear, which makes it easy for the user to follow the crawling progress of the web crawler.

Brief description of the drawings

Various aspects of the present invention will become apparent to those skilled in the art after reading the specific embodiments of the present invention with reference to the accompanying drawings. Those skilled in the art will appreciate that these drawings serve only to illustrate the technical solution of the present invention in conjunction with the embodiments and are not intended to limit the scope of protection of the present invention.

Fig. 1 is a schematic diagram of a method for acquiring page information of an e-commerce website according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of the page organization of an e-commerce website according to an embodiment of the present invention.

Fig. 3 is an example of a history download index according to an embodiment of the present invention.

Fig. 4 is an example of a navigation page according to an embodiment of the present invention.

Detailed description

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of one or more aspects of the present invention. It will be apparent to those skilled in the art, however, that one or more aspects of the present invention may be practiced with a lesser degree of these specific details.

The pages of an e-commerce website are organized hierarchically and generally include a homepage, navigation pages, and commodity pages that contain commodity information. Pages of these three types contain URL (Uniform Resource Locator) links that point, level by level, to pages of the next type. Commodity page information is an important data source for e-commerce applications (such as personalized recommendation and price comparison); the following describes a method for obtaining commodity page information from an e-commerce website.

Fig. 1 is a schematic diagram of a method for acquiring page information of an e-commerce website according to an embodiment of the present invention. Here, the e-commerce website comprises a homepage, navigation pages, and commodity pages, wherein the homepage has navigation page URLs, each navigation page has commodity page URLs, and each navigation page corresponds to commodities of one category.

As shown in Fig. 1, navigation page URLs are extracted from the homepage in step 101, commodity page URLs are extracted from the navigation pages in step 102, and commodity page information is downloaded according to the commodity page URLs in step 103. Fig. 2 is a schematic diagram of the page organization of an e-commerce website according to an embodiment of the present invention.

In one embodiment, the commodity page URLs are downloaded (extracted) again from the navigation page at predetermined time intervals, the downloaded commodity page URLs being sorted by listing time (rendered as "shelf life" in the machine translation). In practice, the sort-by-listing-time function of the navigation page can be used: the navigation page is set to display commodities in order of listing time, so that the members of the commodity page URL list can be stored in the history download index in order of the commodities' listing time.

Fig. 3 is an example of a history download index according to an embodiment of the present invention. As shown in the figure, the data structure of the history download index is a list of triples, each comprising a class name indicating the commodity category, the URL of the corresponding navigation page, and a pointer to a commodity page URL list. The commodity page URL list is a list of pairs, each comprising a commodity page URL and, optionally, the feature code of that commodity page.
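The index described above can be sketched as a small data structure. This is only an illustration: the class names `CommodityEntry` and `CategoryRecord`, the field names, and the example URLs are assumptions, not taken from the patent, and the "pointer" is modeled simply as a nested list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CommodityEntry:
    """One pair in the commodity page URL list (hypothetical name)."""
    url: str
    feature_code: Optional[int] = None  # optional feature code of the page


@dataclass
class CategoryRecord:
    """One triple in the history download index (hypothetical name)."""
    category_name: str
    navigation_url: str
    # Stands in for the pointer to the commodity page URL list.
    commodity_pages: List[CommodityEntry] = field(default_factory=list)


# The index is a list of triples, one per commodity category.
history_index: List[CategoryRecord] = [
    CategoryRecord(
        category_name="books",
        navigation_url="http://www.example.com/products/books",
        commodity_pages=[
            CommodityEntry("http://www.example.com/item/1", 121938123),
            CommodityEntry("http://www.example.com/item/2"),
        ],
    )
]
```

Keeping the URL list ordered by listing time is what allows the position-by-position comparison used in the update process.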

In one embodiment, the commodity page information is updated by the following process:

The commodity page URLs of the most recently downloaded navigation page are compared, in order, with the historically downloaded commodity page URLs of that navigation page at the corresponding ordinal positions. When the comparison result is not identical, the historically downloaded commodity page URL of that navigation page is updated to the commodity page URL of the most recently downloaded navigation page at the corresponding ordinal position, and the commodity information of that commodity page URL is downloaded. For example, if the historically downloaded commodity page URLs of the navigation page are a, b, c, d and the commodity page URLs of the most recently downloaded navigation page are b, a, c, d, then a and b in the historically downloaded commodity page URL list of that navigation page are updated, and the corresponding page information is downloaded.
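The position-by-position comparison can be sketched as follows. The function name `update_by_position` and the injected `download` callback are hypothetical, and the handling of extra positions in the newer list is an assumption the text does not address.

```python
from typing import Callable, List


def update_by_position(
    history: List[str],
    latest: List[str],
    download: Callable[[str], None],
) -> List[int]:
    """Compare two ordered URL lists position by position.

    Where the URLs differ, overwrite the history entry with the latest
    URL and download that page. Returns the positions that changed.
    """
    changed = []
    for i, new_url in enumerate(latest):
        if i >= len(history):
            # Assumption: positions beyond the old list count as new pages.
            history.append(new_url)
            download(new_url)
            changed.append(i)
        elif history[i] != new_url:
            history[i] = new_url
            download(new_url)
            changed.append(i)
    return changed


# The example from the text: history a, b, c, d versus latest b, a, c, d.
downloaded: List[str] = []
history = ["a", "b", "c", "d"]
changed = update_by_position(history, ["b", "a", "c", "d"], downloaded.append)
```

In this run only the first two positions differ, so only those two pages are downloaded again.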

In another embodiment, the commodity page information is updated by the following process:

The commodity page URLs of the most recently downloaded navigation page are compared, in order, with the historically downloaded commodity page URLs of that navigation page at the corresponding ordinal positions, and the SimHash feature codes of the commodity pages of the most recently downloaded commodity page URLs are compared with the SimHash feature codes of the commodity pages of the historically downloaded commodity page URLs at the corresponding ordinal positions. For example, the SimHash value of a page p may equal 121938123.
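For readers unfamiliar with SimHash, a minimal sketch of computing a SimHash code and comparing two codes follows. The patent does not specify tokenization, hash function, bit width, or how the "difference" is measured; the word tokenizer, the MD5-based token hash, the 64-bit width, and the Hamming distance used here are all assumptions. The helper `should_update` encodes the condition stated in the summary (URL differs and the code difference is below a set value).

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash code of the text (assumed 64-bit, word tokens)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def simhash_difference(a: int, b: int) -> int:
    """Hamming distance between two codes (assumed difference measure)."""
    return bin(a ^ b).count("1")


def should_update(old_url: str, new_url: str,
                  old_code: int, new_code: int, threshold: int) -> bool:
    """URL differs and the code difference is below the set value."""
    return old_url != new_url and simhash_difference(old_code, new_code) < threshold
```

Identical page text always yields identical codes, so the difference is zero; near-duplicate pages tend to differ in only a few bits.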

In another embodiment, after the above first download is complete, a web crawler is used to perform incremental download updates of the e-commerce website at regular intervals. This process comprises: downloading the navigation pages of each commodity category one by one, with commodities displayed in order of listing time; then, for each category, crawling the commodity page URLs from the navigation page and comparing them one by one with the most recently added commodity page URL in the commodity page URL list of that category in the history download index. The commodity pages that precede the commodity page whose URL is identical and whose page SimHash feature code difference is less than or equal to a set value are downloaded and updated.

By way of example, the pseudocode of one implementation of the above process (not reproduced in this text) uses the following functions:

The function URL_equal(head, u) indicates whether the URLs corresponding to pages head and u are identical; it returns true if they are identical and false otherwise.

The function SimHash(head, u) gives the difference between the SimHash values of pages head and u; it returns the difference value of the two pages.
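Since the pseudocode itself is not available in this text, the following is only a reconstruction sketch of the incremental update under stated assumptions: both lists are ordered newest first, each entry already carries a SimHash code, the difference is a Hamming distance, and `head` is the most recently added entry in the history list. The names `url_equal`, `simhash_diff`, and `incremental_update` are hypothetical counterparts of the two functions described above.

```python
from typing import Callable, List, Tuple

Entry = Tuple[str, int]  # (commodity page URL, SimHash code)


def url_equal(head: Entry, u: Entry) -> bool:
    """True if the two pages have the same URL."""
    return head[0] == u[0]


def simhash_diff(head: Entry, u: Entry) -> int:
    """Difference of the two pages' SimHash codes (assumed Hamming distance)."""
    return bin(head[1] ^ u[1]).count("1")


def incremental_update(
    history: List[Entry],
    latest: List[Entry],
    download: Callable[[str], None],
    threshold: int,
) -> List[Entry]:
    """Download every page listed before the known head of the history list."""
    head = history[0] if history else None
    new_entries: List[Entry] = []
    for u in latest:
        if head is not None and url_equal(head, u) and simhash_diff(head, u) <= threshold:
            break  # reached the newest page already in the index
        download(u[0])
        new_entries.append(u)
    history[:0] = new_entries  # keep the index ordered newest first
    return new_entries


fetched: List[str] = []
index: List[Entry] = [("c", 1), ("d", 2)]
added = incremental_update(index, [("a", 7), ("b", 8), ("c", 1), ("d", 2)],
                           fetched.append, threshold=3)
```

In this run only the two pages listed ahead of the known head are downloaded, which illustrates why the full site need not be traversed on each update.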

In one embodiment, for step 101, navigation page URLs can be extracted from the homepage by means of a navigation page URL pattern, wherein the navigation page URL pattern is expressed as a regular expression, for example www.example.com/products/\w.
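A minimal sketch of pattern-based extraction with Python's `re` module follows. The concrete pattern is an assumption modeled on the example in the text: the dots are escaped and a `+` quantifier is added so that whole category names match. The function name `extract_navigation_urls` is hypothetical.

```python
import re
from typing import List

# Assumed navigation page URL pattern, adapted from the example in the text.
NAV_URL_PATTERN = re.compile(r"www\.example\.com/products/\w+")


def extract_navigation_urls(homepage_html: str) -> List[str]:
    """Return the distinct pattern matches in order of first appearance."""
    found: List[str] = []
    for match in NAV_URL_PATTERN.findall(homepage_html):
        if match not in found:
            found.append(match)
    return found


html = (
    '<a href="http://www.example.com/products/books">Books</a>'
    '<a href="http://www.example.com/about">About</a>'
    '<a href="http://www.example.com/products/toys">Toys</a>'
    '<a href="http://www.example.com/products/books">Books again</a>'
)
nav_urls = extract_navigation_urls(html)
```

Matching on the raw HTML keeps the sketch short; a real crawler would more likely parse the anchors first and match each `href` value.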

In one example, the homepage address of the e-commerce website is set as the seed address of the web crawler, the pages of the site are traversed by breadth-first search, and the link depth is set between 2 and 5. While traversing the web pages, the URLs in each page are crawled and matched against the navigation page URL pattern, and the URLs that match are added to the navigation page URL list.
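The depth-limited breadth-first traversal can be sketched as below. To stay runnable without network access, link extraction is injected as a `get_links` callable and the site is a small in-memory dictionary; both, along with the function name `collect_navigation_urls` and the toy URL pattern, are assumptions for illustration.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List, Pattern


def collect_navigation_urls(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    nav_pattern: Pattern[str],
    max_depth: int = 3,
) -> List[str]:
    """Breadth-first traversal from the seed, limited to max_depth link hops."""
    nav_urls: List[str] = []
    visited = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link in visited:
                continue
            visited.add(link)
            if nav_pattern.search(link):
                nav_urls.append(link)
            queue.append((link, depth + 1))
    return nav_urls


# Toy site: each page maps to the links it contains.
site: Dict[str, List[str]] = {
    "home": ["nav/books", "about"],
    "about": ["nav/toys"],
    "nav/books": ["item/1"],
    "nav/toys": [],
}
pattern = re.compile(r"^nav/\w+$")
result = collect_navigation_urls("home", lambda u: site.get(u, []), pattern, max_depth=2)
```

With a depth of 1 only the links on the homepage are examined, which is why the text suggests a depth of at least 2.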

In some cases, a navigation page consists of multiple paginated pages. In that case, the pagination links of each navigation page are crawled and added to the navigation page address list. The crawled pages can be sorted by commodity listing time in ascending order. In general, the pagination URLs of the navigation pages of a given website follow a regular pattern. For example, the URL of one navigation page of Amazon China is http://www.amazon.cn/s/ref=sa_menu_office_l3_shard_disc?page=1&rh=n%3A888650051, in which the page=1 parameter indicates the page number. Preferably, different pagination crawling rules can be set for different e-commerce websites in order to collect the pagination URLs of their navigation pages.
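One simple pagination rule is to rewrite the page-number query parameter. The sketch below uses the standard `urllib.parse` module; the function name `paging_urls`, the example URL, and the assumption that the page count is known in advance are all illustrative.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paging_urls(first_page_url: str, page_param: str, page_count: int) -> List[str]:
    """Generate pagination URLs by setting the page-number parameter to 1..page_count."""
    parts = urlsplit(first_page_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    has_param = any(key == page_param for key, _ in query)
    urls: List[str] = []
    for n in range(1, page_count + 1):
        new_query = [
            (key, str(n)) if key == page_param else (key, value)
            for key, value in query
        ]
        if not has_param:
            new_query.append((page_param, str(n)))
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


pages = paging_urls("http://www.example.com/list?page=1&cat=books", "page", 3)
```

A site-specific rule would replace this function, for instance when the page number is part of the path rather than the query string.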

Finally, the corresponding navigation pages are downloaded according to the URL addresses in the navigation page address list.

In yet another embodiment, commodity page URLs are extracted from a navigation page by template area recognition. Fig. 4 is an example of a navigation page according to an embodiment of the present invention. In addition to the URLs of commodities of the specified category, a navigation page also contains other information such as the user's browsing history, best-seller rankings, and personalized recommendations. In the present invention, the region containing the URLs of commodities of the specified category is called the template area, and the regions containing other information are called irrelevant areas. In most cases, the template area occupies the larger part of the page layout, the hyperlink structure of each commodity within it is fairly regular, and the XPaths of the nodes at which the commodity page URLs are located are fairly uniform. XPath is an XML path language; the present invention uses XPath to locate specific tags in a web page (for example, an HTML page).

In this embodiment, the URLs that match a commodity page XPath pattern are extracted from the navigation page as commodity page URLs, wherein the XPath pattern of the commodity pages of the category corresponding to a navigation page is obtained by the following process:

The set of XPaths of the nodes at which all URLs of the navigation page are located is created.

This XPath set is clustered with a KNN clustering algorithm, wherein the XPath distance is used as the distance metric of the KNN clustering algorithm. Here, the XPath distance can be used to identify page regions. The XPath distance represents the degree of difference between two XPaths, and Jaccard similarity can be adopted to measure the degree of path difference.

The class with the largest number of members is selected as the template area of the commodity page URLs.

The commodity page URLs in the selected class are generalized into a pattern to obtain the XPath pattern of the commodity pages of that category of the navigation page. Because the XPaths of the commodity URL nodes in the same region are fairly regular, a generalized XPath pattern can represent them uniformly. An XPath pattern may use the same syntax as XPath; compared with a single XPath, an XPath pattern offers a broader query scope and greater flexibility. In one example, a greedy algorithm is used to obtain the largest common part of the XPath group of the template area, and the varying parts are generalized to generate an XPath pattern that represents the template area.
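The generalization step can be illustrated with a simple position-wise sketch. It is a stand-in, not the patent's greedy algorithm: it assumes all XPaths in the selected class have the same depth, keeps the steps they share, drops positional predicates that vary, and falls back to `*` where tag names differ. The function name `generalize_xpaths` is hypothetical.

```python
from typing import List


def generalize_xpaths(xpaths: List[str]) -> str:
    """Merge equally deep XPaths into one pattern covering all of them."""
    split = [p.strip("/").split("/") for p in xpaths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("this sketch only handles XPaths of equal depth")
    pattern_steps: List[str] = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            pattern_steps.append(steps[0])  # common part: keep unchanged
            continue
        # Varying part: drop the predicate, or use '*' if tag names differ.
        tags = {step.split("[", 1)[0] for step in steps}
        pattern_steps.append(tags.pop() if len(tags) == 1 else "*")
    return "/" + "/".join(pattern_steps)


template = generalize_xpaths([
    "/html/body/div[2]/ul/li[1]/a",
    "/html/body/div[2]/ul/li[2]/a",
    "/html/body/div[2]/ul/li[3]/a",
])
```

The resulting pattern is itself a valid XPath expression, so it can be evaluated against later downloads of the same navigation page to pick out every commodity link in the template area.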

[Formula for the XPath distance; not reproduced in the source text.]

Here, TL() is a function that decomposes an XPath into a set of ordered pairs, each ordered pair consisting of a tag name in the XPath and the position number of that tag in the XPath. For example, if the XPath is "A/B/C/D/E", then TL(A/B/C/D/E) = {<A, 1>, <B, 2>, <C, 3>, <D, 4>, <E, 5>}.
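Because the distance formula is not available in this text, the sketch below assumes a Jaccard-style distance over the TL() pairs, namely one minus the size of the intersection divided by the size of the union, which is consistent with the mention of Jaccard similarity but is not confirmed by the source. The grouping function is a simple threshold-based stand-in for the clustering step, not the KNN clustering algorithm named in the patent; the names `tl`, `xpath_distance`, and `largest_cluster` are hypothetical.

```python
from typing import FrozenSet, List, Tuple


def tl(xpath: str) -> FrozenSet[Tuple[str, int]]:
    """Decompose an XPath into (tag, position) pairs, positions starting at 1."""
    steps = [step for step in xpath.split("/") if step]
    return frozenset((tag, pos) for pos, tag in enumerate(steps, start=1))


def xpath_distance(p1: str, p2: str) -> float:
    """Assumed distance: 1 - |TL(p1) & TL(p2)| / |TL(p1) | TL(p2)|."""
    a, b = tl(p1), tl(p2)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def largest_cluster(xpaths: List[str], threshold: float) -> List[str]:
    """Group each XPath with the first cluster whose seed is close enough,
    then return the biggest group (candidate template area)."""
    clusters: List[List[str]] = []
    for path in xpaths:
        for cluster in clusters:
            if xpath_distance(path, cluster[0]) <= threshold:
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return max(clusters, key=len)


paths = [
    "html/body/ul/li[1]/a",
    "html/body/ul/li[2]/a",
    "html/body/ul/li[3]/a",
    "html/body/div/span/a",
]
template_area = largest_cluster(paths, threshold=0.5)
```

The three list-item links differ from one another in a single step and end up in one group, while the unrelated link forms its own group and is discarded.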

From the description of the above embodiments, those skilled in the art will understand that various changes and substitutions can be made to the specific embodiments of the present invention without departing from its spirit and scope. All such changes and substitutions fall within the scope defined by the claims of the present invention.

Claims

1. A method for acquiring page information of an e-commerce website, characterized in that the e-commerce website comprises a homepage, navigation pages, and commodity pages, wherein the homepage has navigation page URLs, each navigation page has commodity page URLs, and each navigation page corresponds to commodities of one category, the method comprising:

extracting navigation page URLs from the homepage,

extracting commodity page URLs from the navigation pages,

downloading commodity page information according to the commodity page URLs,

wherein the commodity page information is updated by the following process:

2. The method of claim 1, characterized in that the commodity page information is updated by the following process:

3. The method of claim 1, characterized in that

navigation page URLs are extracted from the homepage by means of a navigation page URL pattern, wherein the navigation page URL pattern is expressed as a regular expression.

4. The method of claim 1, characterized in that

extracting commodity page URLs from a navigation page comprises the following process:

creating the set of XPaths of the nodes at which all URLs of the navigation page are located,

[Formula for the XPath distance; not reproduced in the source text.]