CN104504115A - Method and device for extracting POI (Point of Interest) data from webpages - Google Patents

Method and device for extracting POI (Point of Interest) data from webpages

Info

Publication number
CN104504115A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
page
address
web
corresponding
mode
Prior art date
Application number
CN 201410844236
Other languages
Chinese (zh)
Inventor
魏少俊
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/3061 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F17/30699 Filtering based on additional data, e.g. user or group profiles
    • G06F17/30705 Clustering or classification
    • G06F17/30861 Retrieval from the Internet, e.g. browsers
    • G06F17/30864 Retrieval from the Internet, e.g. browsers by querying, e.g. search engines or meta-search engines, crawling techniques, push systems
    • G06F17/30867 Retrieval from the Internet, e.g. browsers by querying, e.g. search engines or meta-search engines, crawling techniques, push systems with filtering and personalisation

Abstract

The invention discloses a method and a device for extracting POI (Point of Interest) data from webpages, and relates to the technical field of POI extraction. The method comprises the following steps: acquiring a plurality of webpages containing POI data; clustering the webpages into address patterns according to the URL (Uniform Resource Locator) of each webpage; sorting the address patterns by the number of webpages corresponding to each address pattern to obtain a ranking; selecting the N address patterns with the most webpages; and extracting the POI data contained in the webpages corresponding to each of the N address patterns. With the method and the device, the large set of webpages that contain POI data can be identified more quickly among hundreds of billions of webpages, and the POI data can be extracted from those webpages more accurately.

Description

Method and device for extracting POI data from webpages

Technical Field

[0001] The present invention relates to the field of Internet technologies, and in particular to a method and a device for extracting POI data from webpages.

Background

[0002] POI is the abbreviation of "Point of Interest". The data of each POI generally includes a name, a category, a longitude, a latitude, and information such as nearby hotels, restaurants and shops, and can serve as a location marker on an electronic map. Map search requires massive amounts of POI data as its search source, and POI data is mainly built up through purchase, partnerships, or in-house construction.

[0003] The Internet contains a large amount of geographic location information that can serve as POI data. For example, a company gives its address and contact details on its home page, and a food website gives the specific location and reservation phone number of each restaurant. These webpages containing POI data provide a rich data source for map search.

[0004] However, these webpages are submerged in a massive number of other webpages: relative to a web that easily runs to hundreds of billions of pages, they account for less than one percent. The prior art nevertheless lacks a reasonably accurate and fast way of extracting POI data.

Summary of the Invention

[0005] In view of the above problems, the present invention is proposed to provide a method for extracting POI data from webpages, and a corresponding device, that overcome the above problems or at least partially solve them.

[0006] According to one aspect of the present invention, a method for extracting POI data from webpages is provided, the method comprising:

[0007] acquiring a plurality of webpages containing POI data;

[0008] clustering the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages;

[0009] sorting the plurality of address patterns based on the number of webpages corresponding to each address pattern, to obtain a ranking of the address patterns;

[0010] selecting the N address patterns with the most webpages, and extracting the POI data contained in the webpages corresponding to each of the N address patterns, where N is an integer not less than 1.

[0011] Optionally, clustering the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages, further comprises:

[0012] extracting a prefix from the URL of each webpage, and clustering the webpages into address patterns according to the extracted prefixes, to obtain a plurality of address patterns corresponding to the webpages.

[0013] Optionally, acquiring a plurality of webpages containing POI data further comprises:

[0014] acquiring the plurality of webpages containing POI data according to the URLs of preset target websites.

[0015] Optionally, before extracting the POI data contained in the webpages corresponding to each of the N address patterns, the method further comprises:

[0016] obtaining the website to which the webpages of each of the N address patterns belong, performing a targeted crawl of that website, and treating the crawled webpages as webpages of the corresponding address pattern.

[0017] Optionally, extracting the POI data contained in the webpages corresponding to each of the N address patterns further comprises:

[0018] building a webpage template for the webpages corresponding to each of the N address patterns, and extracting, according to the webpage templates, the POI data contained in the webpages corresponding to each of the N address patterns.

[0019] Optionally, acquiring the plurality of webpages containing POI data according to the URLs of preset target websites further comprises:

[0020] crawling the webpages of the preset target websites according to their URLs, and filtering the crawled webpages to obtain the webpages.

[0021] Optionally, filtering the crawled webpages to obtain the webpages further comprises:

[0022] obtaining, from the crawled webpages, the webpages that contain a first keyword, removing from them the webpages that also contain a second keyword, and taking the remaining webpages as the webpages.

[0023] Optionally, the second keyword is a preset garbage address string.

[0024] Optionally, extracting the POI data contained in the webpages corresponding to each of the N address patterns further comprises:

[0025] marking the positions of the POI data on the webpages corresponding to each of the N address patterns, so as to extract the POI data from each webpage.

[0026] According to another aspect of the present invention, a device for extracting POI data from webpages is provided, the device comprising:

[0027] a webpage acquirer, adapted to acquire a plurality of webpages containing POI data;

[0028] a pattern clusterer, adapted to cluster the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages;

[0029] a pattern sorter, adapted to sort the plurality of address patterns based on the number of webpages corresponding to each address pattern, to obtain a ranking of the address patterns;

[0030] an information extractor, adapted to select the N address patterns with the most webpages and to extract the POI data contained in the webpages corresponding to each of the N address patterns, where N is an integer not less than 1.

[0031] Optionally, the pattern clusterer is further adapted to extract a prefix from the URL of each webpage and to cluster the webpages into address patterns according to the extracted prefixes, to obtain a plurality of address patterns corresponding to the webpages.

[0032] Optionally, the webpage acquirer is further adapted to acquire the plurality of webpages containing POI data according to the URLs of preset target websites.

[0033] Optionally, the information extractor is further adapted to obtain the website to which the webpages of each of the N address patterns belong, perform a targeted crawl of that website, and treat the crawled webpages as webpages of the corresponding address pattern.

[0034] Optionally, the information extractor is further adapted to build a webpage template for the webpages corresponding to each of the N address patterns, and to extract, according to the webpage templates, the POI data contained in the webpages corresponding to each of the N address patterns.

[0035] Optionally, the webpage acquirer is further adapted to crawl the webpages of the preset target websites according to their URLs, and to filter the crawled webpages to obtain the webpages.

[0036] Optionally, the webpage acquirer is further adapted to obtain, from the crawled webpages, the webpages that contain a first keyword, remove from them the webpages that also contain a second keyword, and take the remaining webpages as the webpages.

[0037] Optionally, the second keyword is a preset garbage address string.

[0038] Optionally, the information extractor is further adapted to mark the positions of the POI data on the webpages corresponding to each of the N address patterns, so as to extract the POI data from each webpage.

[0039] The present invention acquires a plurality of webpages containing POI data, clusters the webpages into address patterns according to the URL of each webpage, sorts the address patterns based on the number of webpages corresponding to each address pattern to obtain a ranking, selects the N address patterns with the most webpages, and extracts the POI data contained in the webpages corresponding to each of the N address patterns. With the present invention, the large set of webpages that contain POI data can be identified more quickly among hundreds of billions of webpages, and POI data can be extracted from webpages more accurately.

Brief Description of the Drawings

[0040] Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

[0041] FIG. 1 is a flowchart of the steps of a method for extracting POI data from webpages according to an embodiment of the present invention;

[0042] FIG. 2 is a flowchart of the steps of a method for extracting POI data from webpages according to an embodiment of the present invention;

[0043] FIG. 3 is a flowchart of the steps of a method for extracting POI data from webpages according to an embodiment of the present invention;

[0044] FIG. 4 is a structural block diagram of a device for extracting POI data from webpages according to an embodiment of the present invention.

Detailed Description

[0045] Specific implementations of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but are not intended to limit its scope.

[0046] FIG. 1 is a flowchart of the steps of a method for extracting POI data from webpages according to an embodiment of the present invention. Referring to FIG. 1, the method comprises:

[0047] S101: acquiring a plurality of webpages containing POI data;

[0048] It should be noted that the webpages may be acquired in various ways. In this embodiment, the plurality of webpages containing POI data are acquired according to the URLs of preset target websites; other ways may of course be used, and this embodiment places no limitation on this.

[0049] S102: clustering the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages;

[0050] It will be understood that the webpages may be clustered into address patterns according to their URLs in various ways. To make the clustering easy to implement, in one case a prefix pattern may be used: a prefix is extracted from the URL of each webpage, and the webpages are clustered into address patterns according to the extracted prefixes, to obtain a plurality of address patterns corresponding to the webpages;

[0051] For example, suppose three webpages containing POI data have the URLs http://www.360.cn/weishi/index.html, http://www.360.cn/haosou/index.html and http://www.360.cn/zhushou/index.html, and so on. URL pattern clustering can then take the common prefix "www.360.cn/" of these URLs as the address pattern for these webpages. Address-pattern clustering can also be performed according to various separator characters.
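The prefix clustering in the example above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say how the prefix is derived, so the sketch assumes the prefix is the URL's host followed by a slash, and the function names `prefix_pattern` and `cluster_by_prefix` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def prefix_pattern(url):
    """Assumed prefix rule: the host part of the URL plus a trailing slash."""
    return urlsplit(url).netloc + "/"


def cluster_by_prefix(urls):
    """Group URLs under the address pattern given by their prefix."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[prefix_pattern(url)].append(url)
    return dict(clusters)


urls = [
    "http://www.360.cn/weishi/index.html",
    "http://www.360.cn/haosou/index.html",
    "http://www.360.cn/zhushou/index.html",
]
clusters = cluster_by_prefix(urls)
```

With the three example URLs, all pages fall under the single address pattern `www.360.cn/`.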

[0052] Of course, other approaches (such as a retained-segment pattern) may also be used for clustering. For example, suppose another four webpages containing POI data have the URLs http://wfy89395688.huangye.360.cn/, http://wfy35467354.huangye.360.cn/, http://wfy12345678.huangye.360.cn/ and http://wfy98765432.huangye.360.cn/. Clustering these webpages into address patterns gives the pattern "wfy********.huangye.360.cn/" (or "***********.huangye.360.cn/", depending on the specific situation), where "*" represents a wildcard and the remaining parts are retained fields.
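A minimal sketch of the retained-segment pattern for the four example URLs is given below. The source does not specify how the variable part of the host is detected; the sketch assumes that every digit in the first host label is variable and replaces it with `*`. The function name `wildcard_pattern` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def wildcard_pattern(url):
    """Mask each digit in the first host label with '*', keep everything else."""
    host = urlsplit(url).netloc
    first_label, _, rest = host.partition(".")
    masked = re.sub(r"\d", "*", first_label)  # assumption: digits are the variable part
    return f"{masked}.{rest}/"


urls = [
    "http://wfy89395688.huangye.360.cn/",
    "http://wfy35467354.huangye.360.cn/",
    "http://wfy12345678.huangye.360.cn/",
    "http://wfy98765432.huangye.360.cn/",
]
patterns = {wildcard_pattern(url) for url in urls}
```

Under this assumption all four URLs collapse to the one pattern `wfy********.huangye.360.cn/`, matching the first form given in the text.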

[0053] This step is not limited to a single clustering approach; several approaches may be combined.

[0054] S103: sorting the plurality of address patterns based on the number of webpages corresponding to each address pattern, to obtain a ranking of the address patterns;

[0055] It should be noted that the address patterns may be sorted by webpage count in various ways. To make it easy to select the N address patterns with the most webpages in step S104, in this embodiment the address patterns may be sorted in descending order of the number of webpages corresponding to each.

[0056] In the two examples above, the address pattern "wfy********.huangye.360.cn/" (or "***********.huangye.360.cn/") corresponds to 4 webpages, while the address pattern "www.360.cn/" corresponds to 3 webpages, so the former ranks higher.
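The descending sort of step S103 can be sketched as below, assuming the clustering result is held as a mapping from address pattern to its list of page URLs. The `rank_patterns` name and the dictionary layout are illustrative choices, not something the source prescribes.

```python
def rank_patterns(clusters):
    """Return (pattern, page_count) pairs sorted by page count, largest first."""
    sizes = {pattern: len(pages) for pattern, pages in clusters.items()}
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


clusters = {
    "www.360.cn/": [
        "http://www.360.cn/weishi/index.html",
        "http://www.360.cn/haosou/index.html",
        "http://www.360.cn/zhushou/index.html",
    ],
    "wfy********.huangye.360.cn/": [
        "http://wfy89395688.huangye.360.cn/",
        "http://wfy35467354.huangye.360.cn/",
        "http://wfy12345678.huangye.360.cn/",
        "http://wfy98765432.huangye.360.cn/",
    ],
}
ranked = rank_patterns(clusters)
```

Here the pattern with 4 pages is placed ahead of the pattern with 3 pages, as in the example.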

[0057] Of course, other orders (such as ascending order) may also be used for sorting, and this embodiment places no limitation on this.

[0058] S104: selecting the N address patterns with the most webpages, and extracting the POI data contained in the webpages corresponding to each of the N address patterns, where N is an integer not less than 1.

[0059] It should be noted that address patterns with more webpages can usually contribute more POI data. Therefore, in this embodiment, the N address patterns with the most webpages are selected, and the POI data contained in the webpages corresponding to each of the N address patterns is extracted;

[0060] In the two examples above, the address pattern with the most webpages (4) is "wfy********.huangye.360.cn/" (or "***********.huangye.360.cn/"), followed by "www.360.cn/". If N is 1, only the address pattern "wfy********.huangye.360.cn/" (or "***********.huangye.360.cn/") is selected; if N is 2, both are selected.
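Selecting the top N patterns does not require a full sort. The sketch below uses `heapq.nlargest` from the Python standard library as one possible way to do it; the `select_top_patterns` name and the cluster mapping are assumptions made for illustration.

```python
import heapq


def select_top_patterns(clusters, n):
    """Return the n address patterns with the most pages, largest first."""
    if n < 1:
        raise ValueError("N must be an integer not less than 1")
    return heapq.nlargest(n, clusters, key=lambda pattern: len(clusters[pattern]))


clusters = {
    "www.360.cn/": [
        "http://www.360.cn/weishi/index.html",
        "http://www.360.cn/haosou/index.html",
        "http://www.360.cn/zhushou/index.html",
    ],
    "wfy********.huangye.360.cn/": [
        "http://wfy89395688.huangye.360.cn/",
        "http://wfy35467354.huangye.360.cn/",
        "http://wfy12345678.huangye.360.cn/",
        "http://wfy98765432.huangye.360.cn/",
    ],
}
top_one = select_top_patterns(clusters, 1)
top_two = select_top_patterns(clusters, 2)
```

With N = 1 only the four-page pattern is returned, and with N = 2 both patterns are returned.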

[0061] Of course, before extracting the POI data contained in the webpages corresponding to each of the N address patterns, the N address patterns may also be analyzed in turn to confirm that at least most of the webpages corresponding to each of them contain POI data.

[0062] Preferably, a webpage template may be built for the webpages corresponding to each of the N address patterns, and the position of the POI data on each webpage may be recognized or marked according to the webpage template, so as to extract the POI data contained in the webpages corresponding to each of the N address patterns.

[0063] Preferably, the positions of the POI data on the webpages corresponding to each of the N address patterns may also be determined by manual marking, so as to extract the POI data from each webpage. Webpages of the same address pattern can have their POI data extracted by the same extraction tool.

[0064] This embodiment clusters the webpages into address patterns according to the URL of each webpage, sorts the address patterns based on the number of webpages corresponding to each address pattern to obtain a ranking, selects the N address patterns with the most webpages, and extracts the POI data contained in the webpages corresponding to each of the N address patterns. With the present invention, the large set of webpages that contain POI data can be identified more quickly among hundreds of billions of webpages, and POI data can be extracted from webpages more accurately.

[0065] In addition, the webpages acquired in step S101 may be incomplete. Because the N address patterns can contribute a large amount of POI data, and in order to fully extract the POI data that may exist on the websites to which the webpages of the N address patterns belong, optionally, before extracting the POI data contained in the webpages corresponding to each of the N address patterns, the method further comprises:

[0066] obtaining the website to which the webpages of each of the N address patterns belong, performing a targeted crawl of that website, and treating the crawled webpages as webpages of the corresponding address pattern.

[0067] FIG. 2 is a flowchart of the steps of a method for extracting POI data from webpages according to an embodiment of the present invention. Referring to FIG. 2, the method comprises:

[0068] S201: acquiring a plurality of webpages containing POI data;

[0069] S202: clustering the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages;

[0070] S203: sorting the plurality of address patterns based on the number of webpages corresponding to each address pattern, to obtain a ranking of the address patterns;

[0071] It will be understood that steps S201 to S203 are the same as steps S101 to S103 of the embodiment shown in FIG. 1 and are not described again here.

[0072] S204: selecting the N address patterns with the most webpages, building a webpage template for the webpages corresponding to each of the N address patterns, and extracting, according to the webpage templates, the POI data contained in the webpages corresponding to each of the N address patterns, where N is an integer not less than 1.

[0073] It should be noted that the webpages corresponding to one address pattern are structurally very similar, so this embodiment extracts the POI data contained in the webpages of an address pattern by building a webpage template. A webpage template may be built in various ways. In this embodiment, for simplicity of implementation, several webpages corresponding to an address pattern may be selected, and operations staff mark the positions of the POI data on the selected webpages (that is, they mark the name, category, longitude, latitude, nearby hotels, restaurants and shops, and other POI fields on each selected webpage). The marked positions form the webpage template for that address pattern, so that POI data can be obtained in bulk from a massive number of webpages through uniform template-based extraction;
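One way to picture a webpage template is as a mapping from a marked page position to a POI field. The sketch below is a simplified illustration, not the template format of the patent: it assumes each marked position can be identified by a CSS class name, handles only flat (non-nested) marked elements, and uses made-up class names and page content.

```python
from html.parser import HTMLParser


class TemplateExtractor(HTMLParser):
    """Collect the text of elements whose class attribute is listed in the template."""

    def __init__(self, template):
        super().__init__()
        self.template = template      # hypothetical format: CSS class -> POI field name
        self.current_field = None
        self.record = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.current_field = self.template.get(css_class)

    def handle_endtag(self, tag):
        self.current_field = None

    def handle_data(self, data):
        if self.current_field and data.strip():
            self.record[self.current_field] = data.strip()


def extract_poi(html, template):
    """Apply one address pattern's template to one page of that pattern."""
    parser = TemplateExtractor(template)
    parser.feed(html)
    parser.close()
    return parser.record


# Positions an operator might have marked on sample pages (illustrative only).
template = {"shop-name": "name", "shop-addr": "address"}
page = '<div><h1 class="shop-name">Example Noodle House</h1><p class="shop-addr">1 Example Road</p></div>'
poi = extract_poi(page, template)
```

Because pages of one address pattern share a structure, the same `template` can be applied to every page in the cluster.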

[0074] Of course, webpage templates may also be built in other ways, and this embodiment places no limitation on this.

[0075] FIG. 3 is a flowchart of the steps of a method for extracting POI data from webpages according to an embodiment of the present invention. Referring to FIG. 3, the method comprises:

[0076] S301: crawling the webpages of the preset target websites according to their URLs, and filtering the crawled webpages to obtain the webpages;

[0077] It should be noted that in this embodiment the webpages of the preset target websites are crawled by a web spider according to the websites' URLs; other crawling methods may of course be used.

[0078] It will be understood that many of the crawled webpages contain no POI data. Running POI extraction on them directly would make both extraction efficiency and extraction accuracy too low, so the crawled webpages need to be filtered to obtain the webpages.

[0079] To improve filtering accuracy, in this embodiment, the webpages containing a first keyword are obtained from the crawled webpages, the webpages that also contain a second keyword are removed from them, and the remaining webpages are taken as the webpages.

[0080] Because POI data is essentially address-related information, the first keyword may be set to "地址" ("address"). However, webpages containing a second keyword (for example, preset garbage address strings such as "网站地址" ("website address") or "下载地址" ("download address")) will also be identified as containing the first keyword. Such webpages do not actually provide POI data and would instead interfere with POI extraction, so they need to be removed, and the webpages remaining after removal are taken as the webpages.
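The two-keyword filter can be sketched as follows. The keyword values come from the text; the `filter_pages` name, the in-memory mapping of URL to page text, and the example.com URLs are assumptions for illustration.

```python
FIRST_KEYWORD = "地址"                      # "address"
GARBAGE_STRINGS = ("网站地址", "下载地址")  # "website address", "download address"


def filter_pages(pages):
    """Keep pages that contain the first keyword and none of the garbage strings."""
    kept = []
    for url, text in pages.items():
        if FIRST_KEYWORD not in text:
            continue  # no address-related text at all
        if any(garbage in text for garbage in GARBAGE_STRINGS):
            continue  # "address" appears only in a non-POI sense
        kept.append(url)
    return kept


pages = {
    "http://example.com/shop": "店铺地址:北京市朝阳区",
    "http://example.com/soft": "下载地址:点击这里",
    "http://example.com/news": "今日新闻",
}
kept = filter_pages(pages)
```

Only the shop page survives: the download page is rejected by the second keyword and the news page never matches the first.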

[0081] To make the removal more efficient, in this embodiment the webpages containing the string "地址" may be obtained first. For each occurrence, the two words immediately preceding "地址" are taken (a "word" being defined as a Chinese character, an English word, or a number), and the number of occurrences of each resulting four-word string (that is, "XX地址") is counted. The strings are then sorted in descending order of occurrence count. Because the top-ranked results are more common, garbage address strings may be identified by going through the list in that order, and the webpages containing those garbage address strings are then removed.
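The counting step can be sketched as below. The tokenization follows the stated definition of a "word" (one Chinese character, one English word, or one number); the regular expression, the `count_address_contexts` name, and the sample texts are assumptions. Deciding which frequent strings are actually garbage is left to a reviewer, as in the text.

```python
import re
from collections import Counter

# One token = one CJK character, one run of ASCII letters, or one run of digits.
# Punctuation and whitespace are skipped by this pattern.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|\d+")


def count_address_contexts(texts):
    """Count each 'XX地址' string, where XX is the two words before '地址'."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN.findall(text)
        for i in range(2, len(tokens) - 1):
            if tokens[i] == "地" and tokens[i + 1] == "址":
                counts["".join(tokens[i - 2:i]) + "地址"] += 1
    return counts


texts = ["下载地址:点击这里", "软件下载地址", "网站地址:example", "公司地址:北京"]
counts = count_address_contexts(texts)
candidates = counts.most_common()  # most frequent strings first
```

In this sample the most frequent string is "下载地址", so it would be the first candidate reviewed as a garbage address string.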

[0082] S302: clustering the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages;

[0083] S303: sorting the plurality of address patterns based on the number of webpages corresponding to each address pattern, to obtain a ranking of the address patterns;

[0084] S304: selecting the N address patterns with the most webpages, and extracting the POI data contained in the webpages corresponding to each of the N address patterns, where N is an integer not less than 1.

[0085] Steps S302 to S304 are the same as steps S102 to S104 of the embodiment shown in FIG. 1 and are not described again here.

[0086] For simplicity of description, the method embodiments are all expressed as a series of combined actions. Those skilled in the art should understand, however, that embodiments of the present invention are not limited by the described order of actions, because according to embodiments of the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by embodiments of the present invention.

[0087] FIG. 4 is a structural block diagram of a device for extracting POI data from webpages according to an embodiment of the present invention. Referring to FIG. 4, the device comprises:

[0088] a webpage acquirer 401, adapted to acquire a plurality of webpages containing POI data;

[0089] a pattern clusterer 402, adapted to cluster the webpages into address patterns according to the URL of each webpage, to obtain a plurality of address patterns corresponding to the webpages;

[0090] a pattern sorter 403, adapted to sort the plurality of address patterns based on the number of webpages corresponding to each address pattern, to obtain a ranking of the address patterns;

[0091] an information extractor 404, adapted to select the N address patterns with the most webpages and to extract the POI data contained in the webpages corresponding to each of the N address patterns, where N is an integer not less than 1.
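How the four components might fit together is sketched below. This is an illustrative composition only: the `PoiExtractionDevice` class, its callable fields, and the stub functions are hypothetical, and the numbers in the comments refer to the components of FIG. 4.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable
from urllib.parse import urlsplit


@dataclass
class PoiExtractionDevice:
    acquire: Callable[[], Iterable[str]]   # stands in for webpage acquirer 401
    pattern_of: Callable[[str], str]       # stands in for pattern clusterer 402
    extract: Callable[[str], dict]         # per-page part of information extractor 404
    n: int = 1                             # number of address patterns to keep

    def run(self):
        clusters = defaultdict(list)
        for url in self.acquire():
            clusters[self.pattern_of(url)].append(url)
        # Role of pattern sorter 403: most pages first.
        ranked = sorted(clusters, key=lambda p: len(clusters[p]), reverse=True)
        # Role of information extractor 404: keep the top N patterns and extract.
        return {p: [self.extract(u) for u in clusters[p]] for p in ranked[: self.n]}


device = PoiExtractionDevice(
    acquire=lambda: [
        "http://www.360.cn/weishi/index.html",
        "http://www.360.cn/haosou/index.html",
        "http://www.360.cn/zhushou/index.html",
        "http://example.com/a",
    ],
    pattern_of=lambda url: urlsplit(url).netloc + "/",
    extract=lambda url: {"source": url},   # placeholder for real POI extraction
    n=1,
)
result = device.run()
```

Passing the components in as callables keeps each one replaceable, for example swapping the prefix rule for a wildcard rule without touching the rest.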

[0092] In an optional embodiment of the present invention, the pattern clusterer is further adapted to extract a prefix from the URL address of each web page and to perform address pattern clustering on the web pages according to the extracted prefixes, so as to obtain a plurality of address patterns respectively corresponding to the web pages.
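
Prefix-based clustering can be sketched as follows. This section does not define what counts as a URL prefix, so the sketch assumes it is the host name plus the directory part of the path; `url_prefix` and `cluster_by_prefix` are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_prefix(url):
    """Return host + directory path as the address pattern of a URL.

    Assumption: the prefix is everything up to and including the last
    "/" of the path, with scheme, query string and fragment ignored.
    """
    parts = urlsplit(url)
    directory = parts.path.rsplit("/", 1)[0]
    return f"{parts.netloc}{directory}/"


def cluster_by_prefix(urls):
    """Group URLs that share the same extracted prefix."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_prefix(url)].append(url)
    return dict(clusters)
```

Under this assumption, two detail pages stored in the same directory of a site share one address pattern, while pages elsewhere on the site form separate patterns.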

[0093] In an optional embodiment of the present invention, the web page acquirer is further adapted to acquire the plurality of web pages containing POI data according to the URL of a preset target website.

[0094] In an optional embodiment of the present invention, the information extractor is further adapted to identify the website to which the web pages respectively corresponding to the N address patterns belong, to perform directed crawling of that website, and to take the crawled web pages as the web pages of the corresponding address pattern.

[0095] In an optional embodiment of the present invention, the information extractor is further adapted to build a web page template for the web pages corresponding to each of the N address patterns, and to extract, according to the web page templates, the POI data contained in the web pages respectively corresponding to the N patterns.
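
Template-based extraction can be sketched under strong assumptions. The source does not say how a web page template is represented; here a template is assumed to be a mapping from POI field names to regular expressions with one capture group, and the HTML class names are invented for the example.

```python
import re

# Hypothetical template for pages of one address pattern: each POI field
# is located by a regular expression with a single capture group.
SHOP_TEMPLATE = {
    "name": re.compile(r'<h1 class="poi-name">(.*?)</h1>', re.S),
    "address": re.compile(r'<span class="poi-address">(.*?)</span>', re.S),
}


def extract_poi(html, template):
    """Apply a page template to one page and return the POI fields found.

    Fields whose expression does not match are reported as None.
    """
    record = {}
    for field, regex in template.items():
        match = regex.search(html)
        record[field] = match.group(1).strip() if match else None
    return record
```

Because all pages of one address pattern are assumed to share a layout, one template per pattern is enough to process every page in that pattern.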

[0096] In an optional embodiment of the present invention, the web page acquirer is further adapted to crawl the web pages of the preset target website according to the URL of the preset target website, and to filter the crawled web pages so as to obtain said web pages.

[0097] In an optional embodiment of the present invention, the web page acquirer is further adapted to obtain, from the crawled web pages, the web pages containing a first keyword, to remove from those web pages any web page that has a second keyword, and to take the remaining web pages as said web pages.

[0098] In an optional embodiment of the present invention, the second keyword is a preset junk address string.
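
The two-stage keyword filter can be sketched as below. The sketch assumes that the first keyword is looked up in the page text and that the junk address string is looked up in the page URL; the source does not state where each keyword is matched, and the page representation (a dict with `url` and `text` keys) is a hypothetical choice.

```python
def filter_pages(pages, first_keywords, junk_strings):
    """Keep pages that contain a first keyword and no junk address string.

    pages: list of dicts with "url" and "text" keys (assumed format).
    first_keywords: strings that indicate POI content, matched in the text.
    junk_strings: preset junk address strings, matched in the URL.
    """
    kept = []
    for page in pages:
        has_first = any(keyword in page["text"] for keyword in first_keywords)
        is_junk = any(junk in page["url"] for junk in junk_strings)
        if has_first and not is_junk:
            kept.append(page)
    return kept
```

For example, with a first keyword such as "address" and a junk string such as "/login", a shop detail page mentioning an address is kept, while a login page on the same site is removed even if its text contains the first keyword.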

[0099] In an optional embodiment of the present invention, the information extractor is further adapted to mark the position of the POI data on the web pages respectively corresponding to the N address patterns, thereby extracting the POI data from each web page.

[0100] Since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.

[0101] It should be noted that the components of the device of the present invention are logically divided according to the functions they are to implement. The present invention is not limited to this division, however, and the components may be re-divided or combined as needed. For example, some components may be combined into a single component, or some components may be further broken down into more sub-components.

[0102] The component embodiments of the present invention may be implemented in hardware, as software modules running on one or more processors, or as a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

[0103] It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not denote any order; these words may be interpreted as names.

[0104] The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention. All equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

[0105] The present invention further discloses A1. A method for extracting POI data from web pages, comprising:

[0106] acquiring a plurality of web pages containing POI data;

[0107] performing address pattern clustering on the web pages according to the URL address of each web page, so as to obtain a plurality of address patterns respectively corresponding to the web pages;

[0108] sorting the plurality of address patterns based on the number of web pages corresponding to each address pattern, so as to obtain a sorting result for the address patterns;

[0109] selecting the N address patterns having the largest numbers of web pages, and extracting the POI data contained in the web pages respectively corresponding to the N address patterns, N being an integer not less than 1.

[0110] A2. The method of A1, wherein performing address pattern clustering on the web pages according to the URL address of each web page, so as to obtain a plurality of address patterns respectively corresponding to the web pages, further comprises:

[0111] extracting a prefix from the URL address of each web page, and performing address pattern clustering on the web pages according to the extracted prefixes, so as to obtain a plurality of address patterns respectively corresponding to the web pages.

[0112] A3. The method of any one of A1 to A2, wherein acquiring a plurality of web pages containing POI data further comprises:

[0113] acquiring the plurality of web pages containing POI data according to the URL of a preset target website.

[0114] A4. The method of any one of A1 to A3, wherein, before extracting the POI data contained in the web pages respectively corresponding to the N address patterns, the method further comprises:

[0115] identifying the website to which the web pages respectively corresponding to the N address patterns belong, performing directed crawling of that website, and taking the crawled web pages as the web pages of the corresponding address pattern.

[0116] A5. The method of any one of A1 to A4, wherein extracting the POI data contained in the web pages respectively corresponding to the N address patterns further comprises:

[0117] building a web page template for the web pages corresponding to each of the N address patterns, and extracting, according to the web page templates, the POI data contained in the web pages respectively corresponding to the N patterns.

[0118] A6. The method of any one of A1 to A5, wherein acquiring the plurality of web pages containing POI data according to the URL of a preset target website further comprises:

[0119] crawling the web pages of the preset target website according to the URL of the preset target website, and filtering the crawled web pages so as to obtain said web pages.

[0120] A7. The method of any one of A1 to A6, wherein filtering the crawled web pages so as to obtain said web pages further comprises:

[0121] obtaining, from the crawled web pages, the web pages containing a first keyword, removing from those web pages any web page that has a second keyword, and taking the remaining web pages as said web pages.

[0122] A8. The method of any one of A1 to A7, wherein the second keyword is a preset junk address string.

[0123] A9. The method of any one of A1 to A8, wherein extracting the POI data contained in the web pages respectively corresponding to the N address patterns further comprises: marking the position of the POI data on the web pages respectively corresponding to the N address patterns, thereby extracting the POI data from each web page.

[0124] The present invention further discloses B10. A device for extracting POI data from web pages, the device comprising:

[0125] a web page acquirer, adapted to acquire a plurality of web pages containing POI data;

[0126] a pattern clusterer, adapted to perform address pattern clustering on the web pages according to the URL address of each web page, so as to obtain a plurality of address patterns respectively corresponding to the web pages;

[0127] a pattern sorter, adapted to sort the plurality of address patterns based on the number of web pages corresponding to each address pattern, so as to obtain a sorting result for the address patterns;

[0128] an information extractor, adapted to select the N address patterns having the largest numbers of web pages and to extract the POI data contained in the web pages respectively corresponding to the N address patterns, N being an integer not less than 1.

[0129] B11. The device of B10, wherein the pattern clusterer is further adapted to extract a prefix from the URL address of each web page and to perform address pattern clustering on the web pages according to the extracted prefixes, so as to obtain a plurality of address patterns respectively corresponding to the web pages.

[0130] B12. The device of any one of B10 to B11, wherein the web page acquirer is further adapted to acquire the plurality of web pages containing POI data according to the URL of a preset target website.

[0131] B13. The device of any one of B10 to B12, wherein the information extractor is further adapted to identify the website to which the web pages respectively corresponding to the N address patterns belong, to perform directed crawling of that website, and to take the crawled web pages as the web pages of the corresponding address pattern.

[0132] B14. The device of any one of B10 to B13, wherein the information extractor is further adapted to build a web page template for the web pages corresponding to each of the N address patterns, and to extract, according to the web page templates, the POI data contained in the web pages respectively corresponding to the N patterns.

[0133] B15. The device of any one of B10 to B14, wherein the web page acquirer is further adapted to crawl the web pages of the preset target website according to the URL of the preset target website, and to filter the crawled web pages so as to obtain said web pages.

[0134] B16. The device of any one of B10 to B15, wherein the web page acquirer is further adapted to obtain, from the crawled web pages, the web pages containing a first keyword, to remove from those web pages any web page that has a second keyword, and to take the remaining web pages as said web pages.

[0135] B17. The device of any one of B10 to B16, wherein the second keyword is a preset junk address string.

[0136] B18. The device of any one of B10 to B17, wherein the information extractor is further adapted to mark the position of the POI data on the web pages respectively corresponding to the N address patterns, thereby extracting the POI data from each web page.

Claims (10)

  1. A method for extracting POI data from web pages, characterized by comprising: acquiring a plurality of web pages containing POI data; performing address pattern clustering on the web pages according to the URL address of each web page, so as to obtain a plurality of address patterns respectively corresponding to the web pages; sorting the plurality of address patterns based on the number of web pages corresponding to each address pattern, so as to obtain a sorting result for the address patterns; selecting the N address patterns having the largest numbers of web pages, and extracting the POI data contained in the web pages respectively corresponding to the N address patterns, N being an integer not less than 1.
  2. The method of claim 1, characterized in that performing address pattern clustering on the web pages according to the URL address of each web page, so as to obtain a plurality of address patterns respectively corresponding to the web pages, further comprises: extracting a prefix from the URL address of each web page, and performing address pattern clustering on the web pages according to the extracted prefixes, so as to obtain a plurality of address patterns respectively corresponding to the web pages.
  3. The method of any one of claims 1 to 2, characterized in that acquiring a plurality of web pages containing POI data further comprises: acquiring the plurality of web pages containing POI data according to the URL of a preset target website.
  4. The method of any one of claims 1 to 3, characterized in that, before extracting the POI data contained in the web pages respectively corresponding to the N address patterns, the method further comprises: identifying the website to which the web pages respectively corresponding to the N address patterns belong, performing directed crawling of that website, and taking the crawled web pages as the web pages of the corresponding address pattern.
  5. The method of any one of claims 1 to 4, characterized in that extracting the POI data contained in the web pages respectively corresponding to the N address patterns further comprises: building a web page template for the web pages corresponding to each of the N address patterns, and extracting, according to the web page templates, the POI data contained in the web pages respectively corresponding to the N patterns.
  6. The method of any one of claims 1 to 5, characterized in that acquiring the plurality of web pages containing POI data according to the URL of a preset target website further comprises: crawling the web pages of the preset target website according to the URL of the preset target website, and filtering the crawled web pages so as to obtain said web pages.
  7. The method of any one of claims 1 to 6, characterized in that filtering the crawled web pages so as to obtain said web pages further comprises: obtaining, from the crawled web pages, the web pages containing a first keyword, removing from those web pages any web page that has a second keyword, and taking the remaining web pages as said web pages.
  8. The method of any one of claims 1 to 7, characterized in that the second keyword is a preset junk address string.
  9. The method of any one of claims 1 to 8, characterized in that extracting the POI data contained in the web pages respectively corresponding to the N address patterns further comprises: marking the position of the POI data on the web pages respectively corresponding to the N address patterns, thereby extracting the POI data from each web page.
  10. A device for extracting POI data from web pages, characterized in that the device comprises: a web page acquirer, adapted to acquire a plurality of web pages containing POI data; a pattern clusterer, adapted to perform address pattern clustering on the web pages according to the URL address of each web page, so as to obtain a plurality of address patterns respectively corresponding to the web pages; a pattern sorter, adapted to sort the plurality of address patterns based on the number of web pages corresponding to each address pattern, so as to obtain a sorting result for the address patterns; an information extractor, adapted to select the N address patterns having the largest numbers of web pages and to extract the POI data contained in the web pages respectively corresponding to the N address patterns, N being an integer not less than 1.
CN 201410844236 2014-12-30 2014-12-30 Method and device for extracting POI (Point of Interest) data from webpages CN104504115A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201410844236 CN104504115A (en) 2014-12-30 2014-12-30 Method and device for extracting POI (Point of Interest) data from webpages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201410844236 CN104504115A (en) 2014-12-30 2014-12-30 Method and device for extracting POI (Point of Interest) data from webpages

Publications (1)

Publication Number Publication Date
CN104504115A (en) 2015-04-08

Family

ID=52945512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201410844236 CN104504115A (en) 2014-12-30 2014-12-30 Method and device for extracting POI (Point of Interest) data from webpages

Country Status (1)

Country Link
CN (1) CN104504115A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1783633A1 (en) * 2005-10-10 2007-05-09 Deutsche Telekom Medien GmbH Search engine for a location related search
CN101452463A (en) * 2007-12-05 2009-06-10 浙江大学 Method and apparatus for directionally grabbing page resource
CN102053979A (en) * 2009-10-27 2011-05-11 华为技术有限公司 Information acquisition method and system
CN102841920A (en) * 2012-06-30 2012-12-26 北京百度网讯科技有限公司 Method and device for extracting webpage frame information

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination