CN105205061A - Method for acquiring page information of E-commerce website - Google Patents


Publication number
CN105205061A
Authority
CN
China
Prior art keywords
page
commodity
url
navigation
navigation page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410260218.0A
Other languages
Chinese (zh)
Other versions
CN105205061B (en)
Inventor
冯亮
尹亚伟
费志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention discloses a method for acquiring page information of an E-commerce website. The E-commerce website comprises a main page, navigation pages and commodity pages. The main page is provided with navigation page URLs. The navigation pages are provided with commodity page URLs. Each navigation page corresponds to commodities of one category. The method includes the steps that the navigation page URLs are extracted from the main page, the commodity page URLs are extracted from the navigation pages, and the commodity page information is downloaded according to the commodity page URLs.

Description

A method for acquiring page information of an e-commerce website
Technical field
The present invention relates to Internet information technology, and in particular to a method for acquiring page information of an e-commerce website.
Background art
The pages of an e-commerce website (for example Taobao or Amazon) contain information such as commodities and user comments. This information can be collected for data analysis, for example for personalized recommendation, commodity marketing analysis, and sentiment analysis.
In the prior art, a web crawler, that is, a program that traverses and downloads network resources, is usually used to collect website page information. The crawler follows a page crawling strategy that determines the order in which pages are crawled, the update policy, and so on. Mainstream open-source web crawlers such as Nutch, Heritrix and Crawler4J use a breadth-first or depth-first search strategy and analyze the links in crawled pages to discover new pages; in addition, they re-traverse all pages of the website at fixed intervals and then update the crawled page content. For e-commerce websites, this scheme cannot crawl page information stably, consumes considerable computing resources to update the crawled content, and makes the crawling process difficult to control.
Summary of the invention
According to one object of the present invention, a method for acquiring page information of an e-commerce website is disclosed. The e-commerce website comprises a homepage, navigation pages and commodity pages, wherein the homepage contains navigation page URLs, each navigation page contains commodity page URLs, and each navigation page corresponds to one category of commodities. The method comprises:
extracting navigation page URLs from the homepage,
extracting commodity page URLs from the navigation pages,
downloading commodity page information according to the commodity page URLs,
wherein the commodity page URLs are downloaded again from the navigation pages at predetermined time intervals, the downloaded commodity page URLs being sorted by listing time (the time at which the commodity was put on sale),
wherein the commodity page information is updated by the following process:
comparing, in order, each commodity page URL of the most recently downloaded navigation page with the historically downloaded commodity page URL of that navigation page at the corresponding position; when the two are not identical, updating the historically downloaded commodity page URL to the commodity page URL of the most recently downloaded navigation page at the corresponding position, and downloading the commodity information of that most recently downloaded commodity page URL.
In one embodiment, the commodity page information is updated by the following process:
comparing, in order, each commodity page URL of the most recently downloaded navigation page with the historically downloaded commodity page URL of that navigation page at the corresponding position, and comparing the SimHash feature code of the commodity page of the most recently downloaded commodity page URL with the SimHash feature code of the commodity page of the historically downloaded commodity page URL at the corresponding position;
when the most recently downloaded commodity page URL differs from the historically downloaded commodity page URL at the corresponding position, and the difference between the SimHash feature codes of the two corresponding commodity pages is less than a set value, updating the historically downloaded commodity page URL to the commodity page URL of the most recently downloaded navigation page at the corresponding position, and downloading the commodity information of that most recently downloaded commodity page URL.
In one embodiment, navigation page URLs are extracted from the homepage by means of a navigation page URL pattern, wherein the navigation page URL pattern is expressed as a regular expression.
In one embodiment, extracting commodity page URLs from a navigation page comprises the following process:
extracting, from the navigation page, the URLs that match a commodity page XPath pattern as commodity page URLs, wherein the XPath pattern of the category of commodity pages corresponding to a navigation page is obtained by the following process:
creating the set of XPaths of the nodes in which all URLs of the navigation page are located,
clustering this XPath set with a KNN clustering algorithm, using the XPath distance as the distance metric of the KNN clustering algorithm,
selecting the cluster with the most members as the template area of the commodity page URLs,
generalizing the commodity page URLs in the selected cluster into a pattern, thereby obtaining the XPath pattern of this category of commodity pages of the navigation page,
wherein the XPath distance dis() indicates the degree of difference between XPath p1 and XPath p2, and its calculation formula is as follows:
wherein TL() is a function that decomposes an XPath into a set of ordered pairs, each ordered pair consisting of a tag name in the XPath and the position number of that tag in the XPath.
The advantages of the present invention include one or more of the following. When pages are updated, there is no need to traverse all pages of the website, which avoids pointless repeated downloads. Because the template area is recognized by an unsupervised technique, no manually labeled training data is needed and the degree of automation is high. Downloading is performed block by block according to commodity category. Compared with breadth-first and depth-first strategies, the page crawling order is clear, which makes it easy for the user to follow the crawling progress of the web crawler.
Brief description of the drawings
Various aspects of the present invention will become apparent to those skilled in the art after reading the specific embodiments of the present invention with reference to the accompanying drawings. Those skilled in the art will understand that these drawings serve only to illustrate the technical solution of the present invention in conjunction with the embodiments and are not intended to limit the scope of protection of the present invention.
Fig. 1 is a schematic diagram of the method for acquiring page information of an e-commerce website according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the page organization of an e-commerce website according to an embodiment of the present invention.
Fig. 3 is an example of a history download index according to an embodiment of the present invention.
Fig. 4 is an example of a navigation page according to an embodiment of the present invention.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more aspects of the present invention. It will be apparent to those skilled in the art, however, that one or more aspects of the present invention may be practiced with a lesser degree of these specific details.
The pages of an e-commerce website are organized hierarchically and generally include a homepage, navigation pages, and commodity pages containing commodity information. All three kinds of pages contain URL (uniform resource locator) links, each pointing to pages of the next level. Commodity page information is an important data source for e-commerce applications (for example personalized recommendation and price comparison). A method for acquiring commodity page information from an e-commerce website is described below.
Fig. 1 is a schematic diagram of the method for acquiring page information of an e-commerce website according to an embodiment of the present invention. Here, the e-commerce website comprises a homepage, navigation pages and commodity pages, wherein the homepage contains navigation page URLs, each navigation page contains commodity page URLs, and each navigation page corresponds to one category of commodities.
As shown in Fig. 1, navigation page URLs are extracted from the homepage in step 101, commodity page URLs are extracted from the navigation pages in step 102, and commodity page information is downloaded according to the commodity page URLs in step 103. Fig. 2 is a schematic diagram of the page organization of an e-commerce website according to an embodiment of the present invention.
In one embodiment, the commodity page URLs are downloaded (extracted) again from the navigation pages at predetermined time intervals, the downloaded commodity page URLs being sorted by listing time (the time at which the commodity was put on sale). In practice, the "listing time" sort function of the navigation page can be used to make the navigation page display commodities in order of listing time, so that the members of the commodity page URL list can be stored in the history download index in order of the commodities' listing time.
Fig. 3 is an example of a history download index according to an embodiment of the present invention. As shown in the figure, the data structure of the history download index is a list of triples, each comprising a category name indicating the commodity category, the URL of the corresponding navigation page, and a pointer to a commodity page URL list. The commodity page URL list is a list of pairs, each comprising a commodity page URL and, optionally, the feature code of that commodity page.
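The history download index described above can be sketched in Python as follows. This is only an illustrative layout under stated assumptions: the class names `CommodityEntry` and `CategoryRecord`, the field names, and the example URLs are hypothetical, and the "pointer" of the description is modeled simply as a nested list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CommodityEntry:
    """One pair in the commodity page URL list (names are hypothetical)."""
    url: str
    feature_code: Optional[int] = None  # optional SimHash feature code


@dataclass
class CategoryRecord:
    """One triple in the history download index (names are hypothetical)."""
    category: str                 # category name
    navigation_url: str           # URL of the corresponding navigation page
    commodity_pages: List[CommodityEntry] = field(default_factory=list)


# Example index with a single category; the URLs are placeholders.
history_index = [
    CategoryRecord(
        category="books",
        navigation_url="http://www.example.com/products/books",
        commodity_pages=[
            CommodityEntry("http://www.example.com/item/1", 121938123),
            CommodityEntry("http://www.example.com/item/2"),
        ],
    )
]
```

Keeping the commodity list in listing-time order is what later allows position-by-position comparison against a freshly downloaded list.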
In one embodiment, the commodity page information is updated by the following process:
Each commodity page URL of the most recently downloaded navigation page is compared, in order, with the historically downloaded commodity page URL of that navigation page at the corresponding position. When the two are not identical, the historically downloaded commodity page URL is updated to the commodity page URL of the most recently downloaded navigation page at the corresponding position, and the commodity information of that most recently downloaded commodity page URL is downloaded. For example, if the historically downloaded commodity page URLs of the navigation page are a, b, c, d and the commodity page URLs of the most recently downloaded navigation page are b, a, c, d, then a and b in the historically downloaded commodity page URL list of the navigation page are updated, and the corresponding page information is downloaded.
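The position-by-position comparison in this embodiment can be illustrated with a minimal Python sketch. The function name `positional_update` is hypothetical, and appending entries when the latest list is longer than the history is an assumption, since the description does not cover lists of different length.

```python
def positional_update(history, latest):
    """Compare two listing-time-ordered URL lists position by position.

    Returns the updated history list and the URLs whose pages must be
    downloaded again.
    """
    updated = list(history)
    to_download = []
    for i, new_url in enumerate(latest):
        if i >= len(updated):
            # Assumption: extra entries in the latest list are simply appended.
            updated.append(new_url)
            to_download.append(new_url)
        elif updated[i] != new_url:
            # Different URL at the same position: overwrite and re-download.
            updated[i] = new_url
            to_download.append(new_url)
    return updated, to_download


# The example from the description: history a, b, c, d versus latest b, a, c, d.
updated, to_download = positional_update(["a", "b", "c", "d"],
                                         ["b", "a", "c", "d"])
```

For the example lists, only the first two positions differ, so only those two pages are downloaded again.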
In another embodiment, the commodity page information is updated by the following process:
Each commodity page URL of the most recently downloaded navigation page is compared, in order, with the historically downloaded commodity page URL of that navigation page at the corresponding position, and the SimHash feature code of the commodity page of the most recently downloaded commodity page URL is compared with the SimHash feature code of the commodity page of the historically downloaded commodity page URL at the corresponding position. For example, the SimHash value of a page p may equal 121938123.
When the most recently downloaded commodity page URL differs from the historically downloaded commodity page URL at the corresponding position, and the difference between the SimHash feature codes of the two corresponding commodity pages is less than a set value, the historically downloaded commodity page URL is updated to the commodity page URL of the most recently downloaded navigation page at the corresponding position, and the commodity information of that most recently downloaded commodity page URL is downloaded.
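A SimHash feature code and the comparison condition of this embodiment can be sketched as follows. Several points are assumptions made for illustration: the description does not say how the feature code is computed or how the difference is measured, so this sketch uses a common 64-bit SimHash over whitespace-separated tokens (hashed with MD5) and the Hamming distance; the function names and the threshold value are hypothetical.

```python
import hashlib


def simhash(text, bits=64):
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def should_update(new_url, old_url, new_code, old_code, threshold):
    """Condition stated in the embodiment: URLs differ and the feature-code
    difference is below the set value (threshold is an assumed parameter)."""
    return new_url != old_url and hamming_distance(new_code, old_code) < threshold
```

Near-identical page texts tend to produce fingerprints with a small Hamming distance, which is why a threshold on that distance can stand in for a content comparison.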
In another embodiment, after the above initial download is completed, a web crawler is used to perform incremental download updates of the e-commerce website at regular intervals. This process comprises: downloading the navigation page of each commodity category one by one, with commodities displayed in order of listing time; then, for one category, crawling the commodity page URLs from the navigation page and comparing them one by one with the most recently added commodity page URLs in the commodity page URL list of that category in the history download index; the commodity URLs that precede the commodity page whose URL is identical and whose page SimHash feature code difference is less than or equal to the set value are then downloaded and updated.
As an example, pseudocode for one implementation of the above process uses the following functions:
The function URL_equal(head, u) indicates whether the URLs corresponding to pages head and u are identical; it returns true if they are identical and false otherwise.
The function SimHash(head, u) returns the difference between the SimHash values of pages head and u.
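The incremental update described above can be sketched in Python as follows. This is a reconstruction under stated assumptions rather than the pseudocode of the embodiment: pages are modeled as dictionaries with hypothetical keys `url` and `simhash`, `head` is taken to be the most recently added history entry, the difference is measured as a Hamming distance, and the default threshold is arbitrary.

```python
def url_equal(head, u):
    """True if pages head and u have the same URL."""
    return head["url"] == u["url"]


def simhash_diff(head, u):
    """Difference between the SimHash values of pages head and u
    (measured here as a Hamming distance, which is an assumption)."""
    return bin(head["simhash"] ^ u["simhash"]).count("1")


def incremental_update(history, latest, threshold=3):
    """Return the entries of `latest` that precede the first entry matching
    the most recently added history entry; these are the pages to download.

    Both lists are assumed to be ordered newest first.
    """
    if not history:
        return list(latest)
    head = history[0]
    new_entries = []
    for u in latest:
        if url_equal(head, u) and simhash_diff(head, u) <= threshold:
            break  # reached a page that is already known and unchanged
        new_entries.append(u)
    return new_entries


history = [{"url": "c", "simhash": 0b1111}, {"url": "d", "simhash": 0b0000}]
latest = [
    {"url": "a", "simhash": 0b0001},
    {"url": "b", "simhash": 0b0010},
    {"url": "c", "simhash": 0b1110},
    {"url": "d", "simhash": 0b0000},
]
new_entries = incremental_update(history, latest)
```

Because the lists are sorted by listing time, the scan can stop at the first known page instead of traversing the whole category.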
In one embodiment, for step 101, navigation page URLs can be extracted from the homepage by means of a navigation page URL pattern, wherein the navigation page URL pattern is expressed as a regular expression, for example www.example.com/products/\w.
In one example, the homepage address of the e-commerce website is set as the seed address of the web crawler, the pages of the site are traversed by breadth-first search, and the link depth is set between 2 and 5. While the pages are traversed, the URLs in each page are crawled and matched against the navigation page URL pattern, and the URLs that match are added to the navigation page URL list.
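The pattern-matching step can be illustrated with a short Python sketch. It covers only the matching of links found in one page, not the breadth-first traversal; the pattern `www\.example\.com/products/\w+`, the simple `href` extraction by regular expression, and the function name are assumptions for illustration.

```python
import re

# Hypothetical navigation page URL pattern in regular-expression form.
NAV_PATTERN = re.compile(r"www\.example\.com/products/\w+")
# Naive link extraction; a real crawler would use an HTML parser.
HREF = re.compile(r'href="([^"]+)"')


def extract_navigation_urls(page_html):
    """Return the distinct links in the page that match the navigation pattern,
    in order of first appearance."""
    seen = set()
    result = []
    for url in HREF.findall(page_html):
        if NAV_PATTERN.search(url) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


sample_html = (
    '<a href="http://www.example.com/products/books">Books</a>'
    '<a href="http://www.example.com/help">Help</a>'
    '<a href="http://www.example.com/products/toys">Toys</a>'
    '<a href="http://www.example.com/products/books">Books again</a>'
)
navigation_urls = extract_navigation_urls(sample_html)
```

Only the two category links survive; the help link fails the pattern and the repeated link is dropped as a duplicate.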
In one case, a navigation page consists of multiple paginated pages. In this case the pagination links of each navigation page are crawled and added to the navigation page address list. The crawled paginated pages can be sorted in ascending order of commodity listing time. In general, the pagination URLs of the navigation pages of one website follow a certain regularity. For example, the URL of one navigation page of Amazon China is http://www.amazon.cn/s/ref=sa_menu_office_l3_shard_disc?page=1&rh=n%3A888650051, where the parameter page=1 indicates the page number of the paginated page. Preferably, different pagination crawling rules can be set for different e-commerce websites in order to collect the pagination URLs of their navigation pages.
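A pagination rule of this kind can be sketched in Python by rewriting the page-number parameter of a navigation URL. The function name, the parameter name `page`, the example URL, and the assumption that the number of pages is already known are all hypothetical; a real rule would be written per website.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paging_urls(nav_url, pages, param="page"):
    """Generate the URLs of pages 1..pages by setting the page-number parameter."""
    parts = urlsplit(nav_url)
    # Keep every query parameter except the page number itself.
    other = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    urls = []
    for n in range(1, pages + 1):
        query = urlencode([(param, str(n))] + other)
        urls.append(urlunsplit((parts.scheme, parts.netloc, parts.path,
                                query, parts.fragment)))
    return urls


urls = paging_urls("http://www.example.com/s/list?page=1&rh=n%3A888650051", 3)
```

Each generated URL keeps the category parameter unchanged and differs only in the page number.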
Finally, the corresponding navigation pages are downloaded according to the URLs in the navigation page address list.
In yet another embodiment, commodity page URLs are extracted from a navigation page by template area recognition. Fig. 4 is an example of a navigation page according to an embodiment of the present invention. Besides the URLs of the commodities of the specified category, a navigation page also contains other information such as the user's browsing history, best-seller rankings, and personalized recommendations. In the present invention, the region containing the URLs of the commodities of the specified category is called the template area, and the regions containing other information are called irrelevant areas. In general, the template area occupies a large part of the page layout, the hyperlink structure of each commodity in it is fairly regular, and the XPaths of the nodes containing the commodity page URLs are fairly uniform. XPath is an XML path language; the present invention uses XPath to locate specific tags in a web page (for example, an HTML page).
In this embodiment, the URLs that match a commodity page XPath pattern are extracted from the navigation page as commodity page URLs, wherein the XPath pattern of the category of commodity pages corresponding to a navigation page is obtained by the following process:
Creating the set of XPaths of the nodes in which all URLs of the navigation page are located,
Clustering this XPath set with a KNN clustering algorithm, using the XPath distance as the distance metric of the KNN clustering algorithm. The XPath distance can be used here to identify page regions. The XPath distance represents the degree of difference between two XPaths, and a Jaccard similarity measure can be used to quantify the difference between paths.
Selecting the cluster with the most members as the template area of the commodity page URLs,
Generalizing the commodity page URLs in the selected cluster into a pattern, thereby obtaining the XPath pattern of this category of commodity pages of the navigation page. Because the XPaths of the commodity URL nodes in the same region are fairly regular, a generalized XPath pattern can be used to represent them uniformly. The syntax of an XPath pattern can be the same as that of XPath, while its query scope and flexibility are greater than those of a single XPath. In one example, a greedy algorithm is used to obtain the largest common part of the group of XPaths of the template area, and the varying parts are generalized to generate an XPath pattern that represents the template area.
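The generalization step can be sketched in Python as follows. This is a simplified position-wise variant and not the greedy algorithm of the example: it assumes all XPaths in the cluster have the same depth, keeps the steps that are identical, drops the positional index where only the index varies, and falls back to `*` where the tag names differ. The function name and the sample paths are hypothetical.

```python
def generalize(xpaths):
    """Derive one XPath pattern covering all XPaths of a cluster."""
    split = [p.strip("/").split("/") for p in xpaths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("this sketch assumes XPaths of equal depth")
    pattern = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            # Common part: keep the step unchanged.
            pattern.append(steps[0])
            continue
        # Varying part: compare tag names without their [index] predicates.
        tags = {s.split("[", 1)[0] for s in steps}
        pattern.append(tags.pop() if len(tags) == 1 else "*")
    return "/" + "/".join(pattern)


cluster = [
    "/html/body/div[2]/ul/li[1]/a",
    "/html/body/div[2]/ul/li[2]/a",
    "/html/body/div[2]/ul/li[3]/a",
]
xpath_pattern = generalize(cluster)
```

Dropping the index on the `li` step makes the resulting pattern select the link in every list item of that region, which is the intended widening of the query scope.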
The XPath distance dis() indicates the degree of difference between XPath p1 and XPath p2, and its calculation formula is as follows:
wherein TL() is a function that decomposes an XPath into a set of ordered pairs, each ordered pair consisting of a tag name in the XPath and the position number of that tag in the XPath. For example, if the XPath is "A/B/C/D/E", then TL(A/B/C/D/E) = {<A, 1>, <B, 2>, <C, 3>, <D, 4>, <E, 5>}.
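The distance and the selection of the template area can be sketched in Python under two explicit assumptions. First, the distance is taken to be the Jaccard distance suggested by the Jaccard similarity measure mentioned above, that is dis(p1, p2) = 1 - |TL(p1) ∩ TL(p2)| / |TL(p1) ∪ TL(p2)|; this exact form is an assumption. Second, a simple threshold-based grouping stands in for the KNN clustering step, and the threshold `max_dist` is a hypothetical parameter.

```python
def tl(xpath):
    """Decompose an XPath into a set of (tag name, position number) pairs."""
    steps = [s for s in xpath.split("/") if s]
    return {(tag, i) for i, tag in enumerate(steps, start=1)}


def xpath_distance(p1, p2):
    """Assumed Jaccard distance between the pair sets of two XPaths."""
    a, b = tl(p1), tl(p2)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def group_xpaths(xpaths, max_dist=0.5):
    """Stand-in for the clustering step: put each XPath into the first group
    whose first member is within max_dist, else start a new group."""
    groups = []
    for p in xpaths:
        for g in groups:
            if xpath_distance(p, g[0]) <= max_dist:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups


def template_area(xpaths, max_dist=0.5):
    """Select the group with the most members as the template area."""
    return max(group_xpaths(xpaths, max_dist), key=len)


paths = [
    "/html/body/div/ul/li[1]/a",
    "/html/body/div/ul/li[2]/a",
    "/html/body/div/ul/li[3]/a",
    "/html/body/div/p/a",
]
area = template_area(paths)
```

In the sample, the three list-item links differ from each other in a single step and fall into one group, while the lone paragraph link is far enough away to form its own group.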
From the above description of the embodiments, those skilled in the art will understand that various changes and substitutions can be made to the specific embodiments of the present invention without departing from its spirit and scope. All such changes and substitutions fall within the scope defined by the claims of the present invention.

Claims (4)

1. A method for acquiring page information of an e-commerce website, characterized in that the e-commerce website comprises a homepage, navigation pages and commodity pages, wherein the homepage contains navigation page URLs, each navigation page contains commodity page URLs, and each navigation page corresponds to one category of commodities, the method comprising:
extracting navigation page URLs from the homepage,
extracting commodity page URLs from the navigation pages,
downloading commodity page information according to the commodity page URLs,
wherein the commodity page URLs are downloaded again from the navigation pages at predetermined time intervals, the downloaded commodity page URLs being sorted by listing time,
wherein the commodity page information is updated by the following process:
comparing, in order, each commodity page URL of the most recently downloaded navigation page with the historically downloaded commodity page URL of that navigation page at the corresponding position; when the two are not identical, updating the historically downloaded commodity page URL to the commodity page URL of the most recently downloaded navigation page at the corresponding position, and downloading the commodity information of that most recently downloaded commodity page URL.
2. The method of claim 1, characterized in that the commodity page information is updated by the following process:
comparing, in order, each commodity page URL of the most recently downloaded navigation page with the historically downloaded commodity page URL of that navigation page at the corresponding position, and comparing the SimHash feature code of the commodity page of the most recently downloaded commodity page URL with the SimHash feature code of the commodity page of the historically downloaded commodity page URL at the corresponding position,
when the most recently downloaded commodity page URL differs from the historically downloaded commodity page URL at the corresponding position, and the difference between the SimHash feature codes of the two corresponding commodity pages is less than a set value, updating the historically downloaded commodity page URL to the commodity page URL of the most recently downloaded navigation page at the corresponding position, and downloading the commodity information of that most recently downloaded commodity page URL.
3. The method of claim 1, characterized in that
navigation page URLs are extracted from the homepage by means of a navigation page URL pattern, wherein the navigation page URL pattern is expressed as a regular expression.
4. The method of claim 1, characterized in that
extracting commodity page URLs from a navigation page comprises the following process:
extracting, from the navigation page, the URLs that match a commodity page XPath pattern as commodity page URLs, wherein the XPath pattern of the category of commodity pages corresponding to a navigation page is obtained by the following process:
creating the set of XPaths of the nodes in which all URLs of the navigation page are located,
clustering this XPath set with a KNN clustering algorithm, using the XPath distance as the distance metric of the KNN clustering algorithm,
selecting the cluster with the most members as the template area of the commodity page URLs,
generalizing the commodity page URLs in the selected cluster into a pattern, thereby obtaining the XPath pattern of this category of commodity pages of the navigation page,
wherein the XPath distance dis() indicates the degree of difference between XPath p1 and XPath p2, and its calculation formula is as follows:
wherein TL() is a function that decomposes an XPath into a set of ordered pairs, each ordered pair consisting of a tag name in the XPath and the position number of that tag in the XPath.
CN201410260218.0A 2014-06-12 2014-06-12 Method for acquiring page information of an e-commerce website Active CN105205061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410260218.0A CN105205061B (en) 2014-06-12 2014-06-12 Method for acquiring page information of an e-commerce website

Publications (2)

Publication Number Publication Date
CN105205061A 2015-12-30
CN105205061B 2018-08-10

Family

ID=54952751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410260218.0A Method for acquiring page information of an e-commerce website Active CN105205061B (en) 2014-06-12 2014-06-12

Country Status (1)

Country Link
CN (1) CN105205061B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294535A (en) * 2016-07-19 2017-01-04 百度在线网络技术(北京)有限公司 The recognition methods of website and device
CN106933915A (en) * 2015-12-31 2017-07-07 北京国双科技有限公司 The generation method and device of web page navigation
CN108572900A (en) * 2017-03-09 2018-09-25 北京京东尚科信息技术有限公司 Method, system, electronic equipment and the storage medium of a kind of blank hole position monitoring
CN112232835A (en) * 2020-10-16 2021-01-15 北京明略昭辉科技有限公司 Method, server and terminal equipment for monitoring e-commerce platform product

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120265632A1 (en) * 2011-04-13 2012-10-18 Jatin Patro Online marketplace methods facilitating local commerce
CN102831252A (en) * 2012-09-21 2012-12-19 北京奇虎科技有限公司 Method and device for updating index database and search method and system
CN102867053A (en) * 2012-09-12 2013-01-09 北京奇虎科技有限公司 Method, device and system for collecting effective information web pages in website information
CN102982184A (en) * 2012-12-26 2013-03-20 福建师范大学 Crawler algorithm for capturing webpage in online shopping mall

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杨颂: "面向电子商务网站的增量爬虫设计与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN105205061B (en) 2018-08-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant