CN104699835A - Method and device used for determining webpages including POI (point of interest) data - Google Patents

Info

Publication number
CN104699835A
Authority
CN
China
Prior art keywords
plurality
poi
poi data
information
web page
Prior art date
Application number
CN201510148638.4A
Other languages
Chinese (zh)
Other versions
CN104699835B (en
Inventor
王智广
魏少俊
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司 filed Critical 北京奇虎科技有限公司
Priority to CN201510148638.4A priority Critical patent/CN104699835B/en
Publication of CN104699835A publication Critical patent/CN104699835A/en
Application granted granted Critical
Publication of CN104699835B publication Critical patent/CN104699835B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Abstract

The invention provides a method and a device for determining that a webpage includes POI (point of interest) data. The method includes: acquiring a plurality of POI data from the Internet; crawling a plurality of webpages that include address information; normalizing the address information in the POI data and the address information included in the webpages into longitude and latitude information, respectively; matching the longitude and latitude information of the POI data against that of the webpages; for POI data and a webpage with identical longitude and latitude information, searching the webpage for the POI name corresponding to the POI data to determine whether the webpage includes that POI name; and, if so, determining that the webpage includes the POI data. The method and the device make it possible to subsequently judge the accuracy of collected POI data from the accuracy of the content recorded on the webpages, and thereby provide a reliable basis for collecting accurate POI data from the Internet on a large scale.

Description

Method and device for determining that a webpage includes point of interest (POI) data

Technical Field

[0001] The present invention relates to the field of computer technology, and in particular to a method and device for determining that a webpage includes point of interest (POI) data.

Background

[0002] In a geographic information system, a POI (Point Of Interest) may be a house, a shop, a mailbox, a bus stop, and so on. POI data includes address information and a POI name.

[0003] Traditional POI data collection requires technicians to use precision surveying instruments to obtain the longitude and latitude of each POI and then record it. This approach is time-consuming and labor-intensive, so the amount of POI data collected is small, and a geographic information system can hardly provide a high level of service based on so little POI data.

[0004] A large amount of POI data exists on the Internet. If webpages containing POI data could be collected from the Internet and the POI data extracted from them for use by geographic information systems, considerable labor and time would be saved. However, the Internet is also full of false POI data. For example, a blog page may contain "原文地址:http://xxx.XXX.xxx/xxx" ("Original address: http://xxx.XXX.xxx/xxx"). Although it contains the word "地址" ("address"), that address is a network address, that is, a URL (Uniform Resource Locator), not the geographic address information found in POI data. As a result, the proportion of false POI data among the collected POI data is high.

Summary of the Invention

[0005] In view of the shortcomings of the prior art, the present invention proposes a method and device for determining that a webpage includes POI data, to solve the prior-art problem that a large amount of false POI data is collected.

[0006] According to one aspect, the present invention provides a method for determining that a webpage includes POI data, comprising:

[0007] acquiring a plurality of POI data from the Internet;

[0008] crawling a plurality of webpages that include address information;

[0009] normalizing the address information in the plurality of POI data and the address information contained in the plurality of webpages into longitude and latitude information, respectively;

[0010] matching, on the basis of identical longitude and latitude information, the longitude and latitude information of the plurality of POI data against the longitude and latitude information of the plurality of webpages;

[0011] for POI data and a webpage having the same longitude and latitude information, searching the webpage for the POI name corresponding to the POI data, to determine whether the webpage includes the POI name of the POI data;

[0012] when the webpage includes the POI name of the POI data, determining that the webpage includes the POI data.

[0013] According to another aspect, the present invention further provides a device for determining that a webpage includes POI data, comprising:

[0014] a POI data acquisition module, configured to acquire a plurality of POI data from the Internet;

[0015] a webpage crawling module, configured to crawl a plurality of webpages that include address information;

[0016] a longitude and latitude normalization module, configured to normalize the address information in the plurality of POI data and the address information contained in the plurality of webpages into longitude and latitude information, respectively;

[0017] a longitude and latitude matching module, configured to match, on the basis of identical longitude and latitude information, the longitude and latitude information of the plurality of POI data against the longitude and latitude information of the plurality of webpages;

[0018] a webpage-includes-POI-name determination module, configured to, for POI data and a webpage having the same longitude and latitude information, search the webpage for the POI name corresponding to the POI data, to determine whether the webpage includes the POI name of the POI data;

[0019] a webpage-includes-POI-data determination module, configured to determine that the webpage includes the POI data when the webpage includes the POI name of the POI data.

[0020] In the technical solution of the present invention, normalizing address information into longitude and latitude information filters out non-geographic address information. Because longitude and latitude are unique, matching results based on longitude and latitude information are far more accurate than existing matching results based on text information, which helps avoid subsequently collecting POI data with false address information. Once the longitude and latitude information of the POI data matches that of a webpage, further determining whether the webpage includes the POI name of the POI data makes it possible to judge accurately whether the POI data is contained in that webpage. This in turn makes it possible to determine the accuracy of the collected POI data from the authority and accuracy of the content recorded on the webpage, and thus provides a reliable guarantee for collecting relatively accurate POI data from the Internet on a large scale.

[0021] Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief Description of the Drawings

[0022] The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

[0023] FIG. 1a is a schematic flowchart of the method for determining that a webpage includes POI data according to an embodiment of the present invention;

[0024] FIG. 1b is a schematic diagram of a webpage including a plurality of POI data according to an embodiment of the present invention;

[0025] FIG. 2 is a schematic block diagram of the internal structure of the device for determining that a webpage includes POI data according to an embodiment of the present invention;

[0026] FIG. 3 is a schematic block diagram of the internal structure of the POI data acquisition module according to an embodiment of the present invention.

Detailed Description

[0027] Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.

[0028] Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The phrase "and/or" as used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.

[0029] Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, will not be interpreted in an idealized or overly formal sense.

[0030] FIG. 1a is a schematic flowchart of the method of the present invention for determining that a webpage includes POI data.

[0031] S101: acquire a plurality of POI data from the Internet; S102: crawl a plurality of webpages that include address information; S103: normalize the address information in the plurality of POI data and the address information contained in the plurality of webpages into longitude and latitude information, respectively; S104: match, on the basis of identical longitude and latitude information, the longitude and latitude information of the plurality of POI data against that of the plurality of webpages; S105: for POI data and a webpage having the same longitude and latitude information, search the webpage for the POI name corresponding to the POI data, to determine whether the webpage includes the POI name of the POI data; S106: when the webpage includes the POI name of the POI data, determine that the webpage includes the POI data.

[0032] In the above method for determining that a webpage includes POI data, normalizing address information into longitude and latitude information filters out address information that is not a geographic location. Because longitude and latitude are unique, matching results based on longitude and latitude information are far more accurate than existing matching results based on text information, which helps avoid subsequently collecting data with false address information. Once the longitude and latitude information of the POI data matches that of a webpage, further determining whether the webpage includes the POI name of the POI data makes it possible to judge accurately whether the POI data is contained in that webpage. This in turn makes it possible to determine the accuracy of the collected POI data from the authority and accuracy of the content recorded on the webpage, and thus provides a reliable guarantee for collecting relatively accurate POI data from the Internet on a large scale.

[0033] The method for determining that a webpage includes POI data, whose flowchart is shown in FIG. 1a, is described in detail below. It includes the following steps:

[0034] S101: Acquire a plurality of POI data from the Internet.

[0035] Specifically, a web crawler program is used to crawl a plurality of webpages that include POI data from the Internet; a plurality of POI data is then extracted from those webpages. POI data includes address information and a POI name; preferably, POI data may further include contact information, a postal code, network tags, and so on.

[0036] The inventors have found that the Internet contains webpages whose content includes one or more POI data, in which the address information of the POI data includes an address keyword such as "地址" ("address"). The page structure features and URL format of these webpages, and the position and format of the POI data within them, are regular. In other words, POI data can be extracted from these webpages quickly by a single uniform method.

[0037] Preferably, a plurality of URLs (Uniform Resource Locators) corresponding to webpages that include an address keyword such as "地址" ("address") may be crawled from the Internet. Pattern clustering is performed on the crawled URLs, and URLs having the same structural features are clustered into the same pattern set.

[0038] More preferably, among the many webpages that include an address keyword, the URLs of all webpages that include only one POI data are obtained. Pattern clustering is performed on all the obtained URLs, and URLs having the same structural features are clustered into the same pattern set.

[0039] For example, among the many webpages that include an address keyword, the webpage with URL http://www.aibang.com/detail/1537772035-1606201508 includes only the POI data "爱普生(中国)有限公司" (Epson (China) Co., Ltd.), and the webpage with URL http://www.aibang.com/detail/152928073-419169481 includes only the POI data "北京王府中西医结合医院" (Beijing Royal Integrative Medicine Hospital). These two URLs have the same structural feature www.aibang.com/detail/*, where * is a wildcard representing arbitrary characters. The two URLs can therefore be clustered into the same pattern set; that is, all URLs in this pattern set share the structural feature www.aibang.com/detail/*.

[0040] More preferably, among the many webpages that include an address keyword, the URLs of all webpages that include a plurality of POI data are obtained. Pattern clustering is performed on all the obtained URLs, and URLs having the same structural features are clustered into the same pattern set.

[0041] For example, the webpage with URL www.dianping.com/topic/s_c_2_120_rl4_x540/p7, shown in FIG. 1b, includes a plurality of POI data with POI titles such as "boy london", "COACH(悠唐折扣店)" (COACH, Youtang outlet store), and "米兰店(三里屯店)" (Milan shop, Sanlitun branch). The webpage with URL www.dianping.com/topic/s_c_2_l20_r14_x540/p6 also includes a plurality of POI data. All URLs whose structural features match www.dianping.com/topic/* are obtained, where * is a wildcard representing arbitrary characters. Pattern clustering is performed on all the obtained URLs, and the URLs in the resulting pattern set all share the structural feature www.dianping.com/topic/*.
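The pattern clustering in paragraphs [0037] to [0041] can be illustrated with a minimal Python sketch. The text does not specify how structural features are computed, so the rule used here (host plus the first path segment, followed by a `*` wildcard) is an assumption chosen only because it reproduces the two example patterns; the names `url_pattern` and `cluster_by_pattern` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str, keep_segments: int = 1) -> str:
    """Reduce a URL to host + leading path segments + '*' (assumed rule)."""
    # urlsplit only recognises the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    prefix = "/".join(segments[:keep_segments])
    return f"{parts.netloc}/{prefix}/*" if prefix else f"{parts.netloc}/*"


def cluster_by_pattern(urls):
    """Group URLs that share the same structural pattern into one set."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


# The four URLs quoted in the description.
urls = [
    "http://www.aibang.com/detail/1537772035-1606201508",
    "http://www.aibang.com/detail/152928073-419169481",
    "www.dianping.com/topic/s_c_2_120_rl4_x540/p7",
    "www.dianping.com/topic/s_c_2_l20_r14_x540/p6",
]
clusters = cluster_by_pattern(urls)
for pattern, members in clusters.items():
    print(pattern, len(members))
```

A real system would likely need a more flexible notion of structure (for example, generalising numeric path segments), but a fixed-prefix rule is enough to show how URLs end up in the same pattern set.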

[0042] From the plurality of pattern sets, those pattern sets containing multiple webpages that include POI data are selected, and a plurality of webpages including POI data are extracted from the selected pattern sets.

[0043] Preferably, extracting a plurality of POI data from the plurality of webpages including POI data may specifically comprise:

[0044] Based on the page structure features of the webpages including POI data that correspond to the URLs belonging to the same pattern set, a POI data extraction template corresponding to that pattern set is generated. Specifically, for each URL belonging to the same pattern set, the POI data extraction template corresponding to the pattern set is generated according to the format and position of the POI data in the webpage corresponding to that URL.

[0045] Based on the POI data extraction template, a plurality of POI data is extracted from the webpages including POI data. Specifically, for each URL in the same pattern set, POI data is extracted from the webpage corresponding to that URL according to the POI data format in the generated extraction template and the positions of the POI data within the webpage.
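As a rough illustration of template-based extraction, the sketch below represents each pattern set's template as a regular expression with named groups. This is an assumption: the description does not say what form a template takes or how it is generated. The pattern key `www.example.com/detail/*`, the HTML markup, and the function `extract_pois` are all invented for the example and do not describe any real site.

```python
import re

# Hypothetical templates, one per pattern set. Each regex encodes the assumed
# format and position of the POI name and address in pages of that set.
TEMPLATES = {
    "www.example.com/detail/*": re.compile(
        r'<h1 class="name">(?P<name>.*?)</h1>.*?'
        r'<span class="addr">地址[::](?P<address>.*?)</span>',
        re.S,
    ),
}


def extract_pois(pattern_key: str, html: str):
    """Apply the template of one pattern set to a page and return POI dicts."""
    template = TEMPLATES[pattern_key]
    return [m.groupdict() for m in template.finditer(html)]


# Invented page content that follows the assumed layout.
html = (
    '<h1 class="name">北京奇虎科技有限公司</h1>'
    '<p><span class="addr">地址:朝阳区酒仙桥路6号</span></p>'
)
pois = extract_pois("www.example.com/detail/*", html)
print(pois)
```

Because every page in a pattern set is assumed to share the same layout, one template can be reused across all URLs in that set, which is the point the description makes about extracting POI data by a uniform method.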

[0046] S102: Crawl a plurality of webpages that include address information.

[0047] Specifically, a web crawler program is used to crawl from the Internet a plurality of webpages that include an address keyword.

[0048] A plurality of text information associated with the address keywords is extracted from the webpages.

[0049] Specifically, for a webpage, the text content of the webpage is extracted and searched for address keywords that may indicate address information, such as "地址" ("address"), "位于" ("located at"), or "坐落于" ("situated at"). The text fragment near the address keyword is extracted and then split according to a set delimiter and fragment length: for example, the fragment is split when its length from the address keyword exceeds a set threshold and/or when a set delimiter (such as a space, comma, or period) appears in it. In the split result, the text between the split point (for example, the delimiter) and the address keyword is taken as the text information associated with the address keyword in that webpage.
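The keyword-and-delimiter procedure of paragraph [0049] can be sketched as follows. The three keywords come from the description; the delimiter set, the 40-character threshold, the choice to read only the text that follows a keyword, and the name `fragments_after_keywords` are assumptions made for illustration.

```python
import re

ADDRESS_KEYWORDS = ("地址", "位于", "坐落于")  # keywords named in the description
DELIMITERS = " ,,。;;\n"                      # assumed delimiter set
MAX_FRAGMENT_LEN = 40                          # assumed length threshold


def fragments_after_keywords(text: str):
    """Return the text fragment that follows each address keyword."""
    results = []
    pattern = "|".join(re.escape(k) for k in ADDRESS_KEYWORDS)
    for match in re.finditer(pattern, text):
        start = match.end()
        # Skip a colon or whitespace directly after the keyword.
        while start < len(text) and text[start] in ":: \u3000":
            start += 1
        end = start
        # Cut at the first delimiter, or once the fragment reaches the limit.
        while (end < len(text) and text[end] not in DELIMITERS
               and end - start < MAX_FRAGMENT_LEN):
            end += 1
        if end > start:
            results.append(text[start:end])
    return results


print(fragments_after_keywords("公司地址:朝阳区酒仙桥路6号,电话12345"))
print(fragments_after_keywords("原文地址:http://xxx.XXX.xxx/xxx"))
```

Note that the second call still returns the URL as a candidate fragment. In the described method such non-geographic candidates are not rejected at this stage; they drop out later because they cannot be normalized into longitude and latitude information.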

[0050] The address information of the corresponding webpage is extracted from the text information.

[0051] Specifically, for each text information extracted from a webpage, address information is extracted from that text information and used as the address information of the webpage.

[0052] S103: Normalize the address information in the plurality of POI data and the address information contained in the plurality of webpages into longitude and latitude information, respectively.

[0053] A geographic information database is obtained in advance. It contains address information for provinces, cities, counties (districts), townships, roads, and so on nationwide, longitude and latitude information, and the correspondence between the address information and the longitude and latitude information. The address information in the database may include multiple written forms of the same geographic address; for example, "朝阳区酒仙桥路6号" (No. 6 Jiuxianqiao Road, Chaoyang District), "北京市朝阳酒仙桥6号" (No. 6 Jiuxianqiao, Chaoyang, Beijing), and "朝阳区酒仙桥6号" (No. 6 Jiuxianqiao, Chaoyang District) all denote the same geographic address.

[0054] Specifically, the address information in the plurality of POI data is normalized into the longitude and latitude information of the respective POI data. For example, for the address information in each POI data, the corresponding longitude and latitude information is looked up in the geographic information database obtained in advance, and the result is taken as the longitude and latitude information of that POI data.

[0055] The address information contained in the plurality of webpages is normalized into the longitude and latitude information of the respective webpages. Preferably, for the address information contained in each webpage, the corresponding longitude and latitude information is looked up in the geographic information database obtained in advance, and the result is taken as the longitude and latitude information of that webpage.
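The lookup-based normalization in paragraphs [0053] to [0055] amounts to mapping every known written form of an address to one coordinate pair. The sketch below models the geographic information database as an in-memory dictionary, which is an assumption; the coordinate values are placeholders for illustration, not surveyed data, and `normalize_address` is a hypothetical name.

```python
# Hypothetical excerpt of a geographic information database: several written
# forms of one address map to the same coordinate pair. The numbers are
# placeholders, not real survey values.
JIUXIANQIAO_6 = (39.9, 116.5)  # (latitude, longitude), illustrative only

GAZETTEER = {
    "朝阳区酒仙桥路6号": JIUXIANQIAO_6,
    "北京市朝阳酒仙桥6号": JIUXIANQIAO_6,
    "朝阳区酒仙桥6号": JIUXIANQIAO_6,
}


def normalize_address(address: str):
    """Return (latitude, longitude) for a known address, otherwise None."""
    return GAZETTEER.get(address.strip())


print(normalize_address("北京市朝阳酒仙桥6号"))
print(normalize_address("http://xxx.XXX.xxx/xxx"))
```

A network address such as the blog URL from the Background section has no entry in the database, so it normalizes to `None`. This is how normalization filters out non-geographic "addresses".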

[0056] S104: Match, on the basis of identical longitude and latitude information, the longitude and latitude information of the plurality of POI data against that of the plurality of webpages.

[0057] Specifically, for each POI data, it is determined whether any webpage has longitude and latitude information identical to that of the POI data. If so, the POI data is determined to match that webpage, that is, the POI data and the webpage have the same longitude and latitude information; otherwise, the POI data is ignored.

[0058] Because longitude and latitude are unique, matching results based on longitude and latitude information are far more accurate than existing matching results based on text information, so more accurate POI data can subsequently be collected from the more accurate matching results. Moreover, matching on longitude and latitude information is equivalent to matching on each of the multiple geographic address forms that correspond to that longitude and latitude, which widens the scope of matching and helps collect more POI data subsequently.
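Step S104 can be sketched as an index lookup: webpages are grouped by coordinate, and each POI data is matched against the group with the same coordinate. The dictionary-based record layout (`coord`, `name`, `url`) and the function name `match_by_coordinates` are assumptions for illustration. Exact tuple equality is used because, in this sketch, both sides are assumed to obtain their coordinates from the same lookup table.

```python
from collections import defaultdict


def match_by_coordinates(pois, pages):
    """Pair each POI with every page that has exactly the same coordinates."""
    pages_by_coord = defaultdict(list)
    for page in pages:
        if page["coord"] is not None:      # pages without coordinates are skipped
            pages_by_coord[page["coord"]].append(page)

    matches = []
    for poi in pois:
        # POIs with no page at the same coordinates are ignored.
        for page in pages_by_coord.get(poi["coord"], []):
            matches.append((poi, page))
    return matches


# Illustrative records; coordinates are placeholders.
pois = [
    {"name": "奇虎360", "coord": (39.9, 116.5)},
    {"name": "unmatched POI", "coord": (31.2, 121.4)},
]
pages = [
    {"url": "http://www.example.com/a", "coord": (39.9, 116.5)},
    {"url": "http://www.example.com/blog", "coord": None},
]
matches = match_by_coordinates(pois, pages)
print([(poi["name"], page["url"]) for poi, page in matches])
```

Indexing the pages first avoids comparing every POI data with every webpage, which matters when both collections are large.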

[0059] S105: For POI data and a webpage having the same longitude and latitude information, search the webpage for the POI name corresponding to the POI data, to determine whether the webpage includes the POI name of the POI data.

[0060] Specifically, for POI data and a webpage having the same longitude and latitude information, all name information is found in the webpage. For each name found, it is determined whether the name matches the POI name in the POI data: if so, the webpage is determined to include the POI name of the POI data; otherwise, the POI data is ignored.

[0061] Preferably, for POI data and a webpage having the same longitude and latitude information, if the name information in the webpage and the POI name in the POI data are not textually identical but in substance denote the same POI, the POI name in the POI data may be deemed to match the name information in the webpage, and the webpage is thereby determined to include the POI name of the POI data.

[0062] For example, for POI data and a webpage having the same longitude and latitude information, if the POI name in the POI data is "奇虎360" (Qihoo 360) and the name information in the webpage is "北京奇虎科技有限公司" (Beijing Qihoo Technology Co., Ltd.), the webpage can be deemed to include the POI name of the POI data.

[0063] Preferably, for POI data and a webpage having the same longitude and latitude information, when the webpage is found to include multiple name information, the text distance between each name and the address information of the webpage is computed. The name with the smallest text distance is determined to be the name information corresponding to the address information in the webpage. The text distance may be the number of characters between the name information and the address information.
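The text-distance rule of paragraph [0063] can be sketched as below, taking the distance to be the number of characters between a name and the address, as the description suggests. Using only the first occurrence of each string in the page text is a simplifying assumption, and `nearest_name` is a hypothetical helper name.

```python
def nearest_name(page_text: str, names, address: str):
    """Return the candidate name with the fewest characters between it and
    the address in the page text, or None if nothing can be located."""
    addr_start = page_text.find(address)
    if addr_start < 0:
        return None
    addr_end = addr_start + len(address)

    best_name, best_distance = None, None
    for name in names:
        start = page_text.find(name)       # first occurrence only (assumption)
        if start < 0:
            continue
        end = start + len(name)
        if end <= addr_start:              # name appears before the address
            distance = addr_start - end
        elif start >= addr_end:            # name appears after the address
            distance = start - addr_end
        else:                              # name and address overlap
            distance = 0
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    return best_name


# Invented page text with two candidate names around one address.
text = "北京奇虎科技有限公司 地址:朝阳区酒仙桥路6号。同一页面还推荐了:爱普生(中国)有限公司"
names = ["爱普生(中国)有限公司", "北京奇虎科技有限公司"]
print(nearest_name(text, names, "朝阳区酒仙桥路6号"))
```

In this invented text the first name sits 4 characters before the address and the second sits 10 characters after it, so the first is chosen as the name that corresponds to the address.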

[0064] For POI data and a webpage having the same longitude and latitude information, the POI name corresponding to the POI data is compared with the name information corresponding to the address information in the webpage. When the two are consistent, the webpage is determined to include the POI name of the POI data.

[0065] Specifically, for POI data and a webpage having the same longitude and latitude information, it is determined whether the POI name corresponding to the POI data is consistent with the name information corresponding to the address information in the webpage: if so, the webpage is determined to include the POI name of the POI data; otherwise, the webpage is determined not to include the POI name of the POI data.

[0066] S106: For POI data and a webpage having the same longitude and latitude information, when the webpage includes the POI name of the POI data, determine that the webpage includes the POI data.

[0067] Specifically, for POI data and a webpage having the same longitude and latitude information, when step S105 above determines that the webpage includes the POI name of the POI data, this step determines that the webpage includes the POI data; more precisely, it determines that the webpage includes the POI name and the address information of the POI.

[0068] Based on the above method for determining that a web page includes point of interest (POI) data, the present invention provides a device for determining that a web page includes POI data. A schematic diagram of the internal structure of the device is shown in Figure 2. The device includes: a POI data acquisition module 201, a web page crawling module 202, a latitude and longitude information normalization module 203, a latitude and longitude information matching module 204, a web page POI name determination module 205, and a web page POI data determination module 206.

[0069] The POI data acquisition module 201 is configured to acquire a plurality of POI data from the Internet.

[0070] The web page crawling module 202 is configured to crawl a plurality of web pages that include address information.

[0071] Specifically, the web page crawling module 202 crawls, from the Internet, a plurality of web pages that include an address keyword; extracts, from the plurality of web pages, a plurality of pieces of text information associated with the address keyword; and extracts the address information of the corresponding web page from the plurality of pieces of text information.
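By way of a hedged illustration of paragraph [0071], the sketch below locates address keywords in page text, takes the text fragment that follows each keyword, and trims that fragment to a candidate address. The keyword list, the 40-character window, the separator characters, and the function names are all assumptions introduced here; the specification does not state how the associated text is delimited.

```python
import re
from typing import List

# Assumed keyword list; the specification only refers to "address keywords".
ADDRESS_KEYWORDS = ("地址", "Address")


def texts_near_keywords(page_text: str, window: int = 40) -> List[str]:
    """Return the text fragment that follows each address keyword occurrence.

    The fixed window is an assumption and may truncate long addresses.
    """
    fragments = []
    for keyword in ADDRESS_KEYWORDS:
        for match in re.finditer(re.escape(keyword), page_text):
            fragments.append(page_text[match.end(): match.end() + window])
    return fragments


def extract_address(fragment: str) -> str:
    """Drop a leading separator and cut at the first line break or clause end."""
    cleaned = fragment.lstrip(" :：\t")
    return re.split(r"[\n。;；]", cleaned, maxsplit=1)[0].strip()


# Hypothetical page text, for illustration only.
page = "Company: Example Corp\nAddress: 1 Example Road, Beijing\nTel: 000"
addresses = [extract_address(f) for f in texts_near_keywords(page)]
```

A production crawler would more plausibly work on the parsed page structure than on raw text; the raw-text form is used here only to keep the sketch self-contained.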

[0072] The latitude and longitude information normalization module 203 is configured to normalize the address information in the plurality of POI data and the address information contained in the plurality of web pages into latitude and longitude information, respectively.

[0073] Specifically, the latitude and longitude information normalization module 203 normalizes the address information in the plurality of POI data into latitude and longitude information of the plurality of POI data. Preferably, for the address information in each POI data, the latitude and longitude information corresponding to that address information is looked up in a pre-acquired geographic information database, and the latitude and longitude information found is determined to be the latitude and longitude information of that POI data. The pre-acquired geographic information database includes address information and latitude and longitude information for provinces, cities, counties (districts), townships, roads, and the like nationwide, as well as the correspondence between the address information and the latitude and longitude information.

[0074] In addition, the latitude and longitude information normalization module 203 normalizes the address information contained in the plurality of web pages into latitude and longitude information of the plurality of web pages. Preferably, for the address information contained in each web page, the latitude and longitude information corresponding to that address information is looked up in the pre-acquired geographic information database, and the latitude and longitude information found is determined to be the latitude and longitude information of that web page.
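The lookup-based normalization of paragraphs [0073] and [0074] can be sketched, under stated assumptions, as a dictionary lookup. The in-memory dictionary `GEO_DB` stands in for the pre-acquired geographic information database, the coordinates in it are placeholders rather than real data, and the whitespace and case folding in `normalize_address` is an assumed cleaning step that the specification does not describe.

```python
from typing import Dict, Optional, Tuple

LatLng = Tuple[float, float]

# Hypothetical stand-in for the pre-acquired geographic information database.
# The entry and its coordinates are placeholders, not real data.
GEO_DB: Dict[str, LatLng] = {
    "1 example road, beijing": (39.9000, 116.4000),
}


def normalize_address(address: str, geo_db: Dict[str, LatLng] = GEO_DB) -> Optional[LatLng]:
    """Map an address string to (latitude, longitude), or None if it is unknown.

    Addresses absent from the database (for example, non-geographic text)
    yield None and can be filtered out by the caller.
    """
    key = " ".join(address.lower().split())  # assumed cleaning: case and whitespace
    return geo_db.get(key)


poi_latlng = normalize_address("1 Example  Road, Beijing")
unknown = normalize_address("call us anytime")
```

Returning `None` for an address that is not in the database corresponds to the filtering of non-geographic address information that the specification attributes to normalization.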

[0075] The latitude and longitude information matching module 204 is configured to perform matching, based on identical latitude and longitude information, between the latitude and longitude information of the plurality of POI data and the latitude and longitude information of the plurality of web pages. Specifically, for each POI data, the latitude and longitude information matching module 204 determines whether, among the web pages, there is a web page whose latitude and longitude information is consistent with that of the POI data; if so, it determines that the POI data matches the web page, that is, that the POI data and the web page have the same latitude and longitude information; otherwise, the POI data is ignored.
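One possible way to realize the matching of paragraph [0075], offered only as a sketch, is to index the web pages by their latitude and longitude and then look up each POI in that index. The record layout (dictionaries with `name`, `url`, and `latlng` keys) and the function name `match_by_coordinates` are assumptions; exact equality of coordinates is used because both sides are assumed to come from the same database lookup.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def match_by_coordinates(pois: List[dict], pages: List[dict]) -> List[Tuple[dict, dict]]:
    """Pair each POI with every web page that has identical (lat, lng).

    POIs with no page at the same coordinates are ignored, as are pages
    whose address could not be normalized (latlng is None).
    """
    pages_by_latlng: Dict[tuple, List[dict]] = defaultdict(list)
    for page in pages:
        if page["latlng"] is not None:
            pages_by_latlng[page["latlng"]].append(page)

    pairs = []
    for poi in pois:
        for page in pages_by_latlng.get(poi["latlng"], []):
            pairs.append((poi, page))
    return pairs


# Hypothetical records, for illustration only.
pois = [{"name": "A", "latlng": (1.0, 2.0)}, {"name": "B", "latlng": (3.0, 4.0)}]
pages = [{"url": "u1", "latlng": (1.0, 2.0)}, {"url": "u2", "latlng": (9.0, 9.0)}]
matched = match_by_coordinates(pois, pages)
```

Indexing the pages first avoids comparing every POI against every page, which matters when both collections are large.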

[0076] The web page POI name determination module 205 is configured to, for POI data and a web page having the same latitude and longitude information, search the web page according to the POI name of the POI data and determine whether the web page includes the POI name of the POI data. Specifically, for POI data and a web page having the same latitude and longitude information, the web page POI name determination module 205 finds all name information in the web page; for each piece of name information found, it determines whether that name information matches the POI name in the POI data: if so, it determines that the web page includes the POI name of the POI data; otherwise, the POI data is ignored.

[0077] The web page POI data determination module 206 is configured to determine, when the web page includes the POI name of the POI data, that the web page includes the POI data.

[0078] Preferably, a schematic diagram of the internal structure of the POI data acquisition module 201 is shown in Figure 3; the module includes a web crawling unit 301 and a POI data extraction unit 302.

[0079] The web crawling unit 301 is configured to crawl, from the Internet, a plurality of web pages that include POI data.

[0080] Specifically, the web crawling unit 301 crawls, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include an address keyword; performs pattern clustering on the plurality of URLs, clustering URLs having the same structural features into the same pattern set; and selects, from the plurality of pattern sets, a pattern set that includes a plurality of web pages including POI data, and extracts the plurality of web pages including POI data from that pattern set.
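The pattern clustering of URLs in paragraph [0080] is not defined in detail in the specification. As one assumed interpretation, the sketch below treats two URLs as having the same structural features when they share a host and their paths differ only in runs of digits; the `{num}` placeholder, the function names, and the example URLs are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to host plus path, with each digit run replaced by a placeholder."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{num}", parts.path)
    return f"{parts.netloc}{path}"


def cluster_urls(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs that reduce to the same pattern into one pattern set."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


# Hypothetical URLs, for illustration only.
pattern_sets = cluster_urls([
    "http://example.com/shop/123.html",
    "http://example.com/shop/456.html",
    "http://example.com/about",
])
```

Under this assumption the two `/shop/` detail pages fall into one pattern set, which is the kind of set from which pages including POI data would then be selected.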

[0081] The POI data extraction unit 302 is configured to extract a plurality of POI data from the plurality of web pages including POI data.

[0082] Specifically, the POI data extraction unit 302 is configured to generate, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs in the same pattern set, a POI data extraction template corresponding to that pattern set; and to extract, based on the POI data extraction template, a plurality of POI data from the plurality of web pages including POI data.

[0083] Preferably, as shown in Figure 2, the device of the present invention for determining that a web page includes POI data further includes a web page name information determination module 207.

[0084] The web page name information determination module 207 is configured to, when it is determined that the web page includes multiple pieces of name information, calculate the text distance between each piece of name information and the address information of the web page, and to determine the name information with the smallest text distance to be the name information corresponding to the address information in the web page.

[0085] In this case, the web page POI name determination module 205 is further configured to compare the POI name of the POI data with the name information corresponding to the address information in the web page and, when the two are consistent, to determine that the web page includes the POI name of the POI data.

[0086] In the technical solutions of the embodiments of the present invention, normalizing address information into latitude and longitude information filters out non-geographic address information, which helps avoid subsequently collecting POI data with false address information. On the basis that the latitude and longitude information of the POI data matches that of the web page, further determining that the web page includes the POI name of the POI data helps avoid subsequently collecting POI data whose latitude and longitude information or POI name cannot be matched; the accuracy of such unmatched POI data tends to be low. As a result, more accurate POI data can subsequently be collected based on the POI data determined to be included in the web pages.

[0087] Those skilled in the art will understand that the present invention includes devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may include known devices in a general-purpose computer. These devices have computer programs stored therein that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (for example, computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (for example, a computer).

[0088] Those skilled in the art will understand that computer program instructions can be used to implement each block in these structural diagrams and/or block diagrams and/or flow diagrams, as well as combinations of blocks in them. Those skilled in the art will understand that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing method for implementation, so that the solutions specified in the block or blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed in the present invention are executed by the processor of the computer or other programmable data processing method.

[0089] Those skilled in the art will understand that the steps, measures, and schemes in the various operations, methods, and processes discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and schemes in the various operations, methods, and processes discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted. Further, steps, measures, and schemes in the prior art that correspond to the various operations, methods, and processes disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.

[0090] The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

[0091] The present invention discloses A1, a method for determining that a web page includes point of interest (POI) data, comprising:

[0092] acquiring a plurality of POI data from the Internet;

[0093] crawling a plurality of web pages that include address information;

[0094] normalizing the address information in the plurality of POI data and the address information contained in the plurality of web pages into latitude and longitude information, respectively;

[0095] performing matching, based on identical latitude and longitude information, between the latitude and longitude information of the plurality of POI data and the latitude and longitude information of the plurality of web pages;

[0096] for POI data and a web page having the same latitude and longitude information, searching the web page according to the POI name of the POI data, and determining whether the web page includes the POI name of the POI data;

[0097] when the web page includes the POI name of the POI data, determining that the web page includes the POI data.

[0098] A2. The method according to claim A1, wherein the step of acquiring a plurality of POI data from the Internet further comprises:

[0099] crawling, from the Internet, a plurality of web pages that include POI data;

[0100] extracting a plurality of POI data from the plurality of web pages including POI data.

[0101] A3. The method according to claim A1 or A2, wherein the step of crawling, from the Internet, a plurality of web pages that include POI data further comprises:

[0102] crawling, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include an address keyword;

[0103] performing pattern clustering on the plurality of URLs, clustering URLs having the same structural features into the same pattern set;

[0104] selecting, from the plurality of pattern sets, a pattern set that includes a plurality of web pages including POI data, and extracting the plurality of web pages including POI data from that pattern set.

[0105] A4. The method according to any one of claims A1-A3, wherein the step of extracting a plurality of POI data from the plurality of web pages including POI data further comprises:

[0106] generating, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs in the same pattern set, a POI data extraction template corresponding to that pattern set;

[0107] extracting, based on the POI data extraction template, a plurality of POI data from the plurality of web pages including POI data.

[0108] A5. The method according to any one of claims A1-A4, wherein the step of crawling a plurality of web pages that include address information further comprises:

[0109] crawling, from the Internet, a plurality of web pages that include an address keyword;

[0110] extracting, from the plurality of web pages, a plurality of pieces of text information associated with the address keyword;

[0111] extracting the address information of the corresponding web page from the plurality of pieces of text information.

[0112] A6. The method according to any one of claims A1-A5, wherein the method further comprises:

[0113] when it is determined that the web page includes multiple pieces of name information, calculating the text distance between each piece of name information and the address information of the web page;

[0114] determining the name information with the smallest text distance to be the name information corresponding to the address information in the web page;

[0115] wherein the step of searching the web page according to the POI name of the POI data and determining whether the web page includes the POI name of the POI data further comprises:

[0116] comparing the POI name of the POI data with the name information corresponding to the address information in the web page;

[0117] when the two are consistent, determining that the web page includes the POI name of the POI data.

[0118] The present invention discloses A7, a device for determining that a web page includes point of interest (POI) data, comprising:

[0119] a POI data acquisition module, configured to acquire a plurality of POI data from the Internet;

[0120] a web page crawling module, configured to crawl a plurality of web pages that include address information;

[0121] a latitude and longitude information normalization module, configured to normalize the address information in the plurality of POI data and the address information contained in the plurality of web pages into latitude and longitude information, respectively;

[0122] a latitude and longitude information matching module, configured to perform matching, based on identical latitude and longitude information, between the latitude and longitude information of the plurality of POI data and the latitude and longitude information of the plurality of web pages;

[0123] a web page POI name determination module, configured to, for POI data and a web page having the same latitude and longitude information, search the web page according to the POI name of the POI data and determine whether the web page includes the POI name of the POI data;

[0124] a web page POI data determination module, configured to determine, when the web page includes the POI name of the POI data, that the web page includes the POI data.

[0125] A8. The device according to claim A7, wherein the POI data acquisition module specifically comprises:

[0126] a web crawling unit, configured to crawl, from the Internet, a plurality of web pages that include POI data;

[0127] a POI data extraction unit, configured to extract a plurality of POI data from the plurality of web pages including POI data.

[0128] A9. The device according to claim A7 or A8, wherein:

[0129] the web crawling unit is specifically configured to crawl, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include an address keyword; to perform pattern clustering on the plurality of URLs, clustering URLs having the same structural features into the same pattern set; and to select, from the plurality of pattern sets, a pattern set that includes a plurality of web pages including POI data, and extract the plurality of web pages including POI data from that pattern set.

[0130] A10. The device according to any one of claims A7-A9, wherein:

[0131] the POI data extraction unit is specifically configured to generate, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs in the same pattern set, a POI data extraction template corresponding to that pattern set; and to extract, based on the POI data extraction template, a plurality of POI data from the plurality of web pages including POI data.

[0132] A11. The device according to any one of claims A7-A10, wherein:

[0133] the web page crawling module is specifically configured to crawl, from the Internet, a plurality of web pages that include an address keyword; to extract, from the plurality of web pages, a plurality of pieces of text information associated with the address keyword; and to extract the address information of the corresponding web page from the plurality of pieces of text information.

[0134] A12. The device according to any one of claims A7-A11, further comprising a web page name information determination module;

[0135] the web page name information determination module is configured to, when it is determined that the web page includes multiple pieces of name information, calculate the text distance between each piece of name information and the address information of the web page, and to determine the name information with the smallest text distance to be the name information corresponding to the address information in the web page; and

[0136] the web page POI name determination module is further configured to compare the POI name of the POI data with the name information corresponding to the address information in the web page and, when the two are consistent, to determine that the web page includes the POI name of the POI data.

Claims (10)

1. A method for determining that a web page includes point of interest (POI) data, characterized by comprising: acquiring a plurality of POI data from the Internet; crawling a plurality of web pages that include address information; normalizing the address information in the plurality of POI data and the address information contained in the plurality of web pages into latitude and longitude information, respectively; performing matching, based on identical latitude and longitude information, between the latitude and longitude information of the plurality of POI data and the latitude and longitude information of the plurality of web pages; for POI data and a web page having the same latitude and longitude information, searching the web page according to the POI name of the POI data, and determining whether the web page includes the POI name of the POI data; and when the web page includes the POI name of the POI data, determining that the web page includes the POI data.
2. The method according to claim 1, characterized in that the step of acquiring a plurality of POI data from the Internet further comprises: crawling, from the Internet, a plurality of web pages that include POI data; and extracting a plurality of POI data from the plurality of web pages including POI data.
3. The method according to any one of claims 1-2, characterized in that the step of crawling, from the Internet, a plurality of web pages that include POI data further comprises: crawling, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include an address keyword; performing pattern clustering on the plurality of URLs, clustering URLs having the same structural features into the same pattern set; and selecting, from the plurality of pattern sets, a pattern set that includes a plurality of web pages including POI data, and extracting the plurality of web pages including POI data from that pattern set.
4. The method according to any one of claims 1-3, characterized in that the step of extracting a plurality of POI data from the plurality of web pages including POI data further comprises: generating, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs in the same pattern set, a POI data extraction template corresponding to that pattern set; and extracting, based on the POI data extraction template, a plurality of POI data from the plurality of web pages including POI data.
5. The method according to any one of claims 1-4, characterized in that the step of crawling a plurality of web pages that include address information further comprises: crawling, from the Internet, a plurality of web pages that include an address keyword; extracting, from the plurality of web pages, a plurality of pieces of text information associated with the address keyword; and extracting the address information of the corresponding web page from the plurality of pieces of text information.
6. The method according to any one of claims 1-5, characterized in that the method further comprises: when it is determined that the web page includes multiple pieces of name information, calculating the text distance between each piece of name information and the address information of the web page; and determining the name information with the smallest text distance to be the name information corresponding to the address information in the web page; wherein the step of searching the web page according to the POI name of the POI data and determining whether the web page includes the POI name of the POI data further comprises: comparing the POI name of the POI data with the name information corresponding to the address information in the web page; and when the two are consistent, determining that the web page includes the POI name of the POI data.
7. A device for determining that a web page includes point of interest (POI) data, characterized by comprising: a POI data acquisition module, configured to acquire a plurality of POI data from the Internet; a web page crawling module, configured to crawl a plurality of web pages that include address information; a latitude and longitude information normalization module, configured to normalize the address information in the plurality of POI data and the address information contained in the plurality of web pages into latitude and longitude information, respectively; a latitude and longitude information matching module, configured to perform matching, based on identical latitude and longitude information, between the latitude and longitude information of the plurality of POI data and the latitude and longitude information of the plurality of web pages; a web page POI name determination module, configured to, for POI data and a web page having the same latitude and longitude information, search the web page according to the POI name of the POI data and determine whether the web page includes the POI name of the POI data; and a web page POI data determination module, configured to determine, when the web page includes the POI name of the POI data, that the web page includes the POI data.
8. The device according to claim 7, characterized in that the POI data acquisition module specifically comprises: a web crawling unit, configured to crawl, from the Internet, a plurality of web pages that include POI data; and a POI data extraction unit, configured to extract a plurality of POI data from the plurality of web pages including POI data.
9. The device according to any one of claims 7-8, characterized in that the web crawling unit is specifically configured to crawl, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include an address keyword; to perform pattern clustering on the plurality of URLs, clustering URLs having the same structural features into the same pattern set; and to select, from the plurality of pattern sets, a pattern set that includes a plurality of web pages including POI data, and extract the plurality of web pages including POI data from that pattern set.
10. The device according to any one of claims 7-9, characterized in that the POI data extraction unit is specifically configured to generate, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs in the same pattern set, a POI data extraction template corresponding to that pattern set; and to extract, based on the POI data extraction template, a plurality of POI data from the plurality of web pages including POI data.
CN201510148638.4A 2015-03-31 2015-03-31 Method and device used for determining webpages including POI (point of interest) data CN104699835B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510148638.4A CN104699835B (en) 2015-03-31 2015-03-31 Method and device used for determining webpages including POI (point of interest) data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510148638.4A CN104699835B (en) 2015-03-31 2015-03-31 Method and device used for determining webpages including POI (point of interest) data
PCT/CN2015/099580 WO2016155386A1 (en) 2015-03-31 2015-12-29 Method and device for determining whether webpage comprises point of interest (poi) data

Publications (2)

Publication Number Publication Date
CN104699835A true CN104699835A (en) 2015-06-10
CN104699835B CN104699835B (en) 2016-09-28

Family

ID=53346955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510148638.4A CN104699835B (en) 2015-03-31 2015-03-31 Method and device used for determining webpages including POI (point of interest) data

Country Status (2)

Country Link
CN (1) CN104699835B (en)
WO (1) WO2016155386A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933171A (en) * 2015-06-30 2015-09-23 百度在线网络技术(北京)有限公司 Method and device for associating data of interest point
CN105138708A (en) * 2015-09-30 2015-12-09 北京奇虎科技有限公司 Method and device for identifying names of points of interest (POI)
CN105159885A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Point-of-interest name identification method and device
CN105160032A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105160031A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Mining method and device for map point of interest (POI) data
CN105243136A (en) * 2015-09-30 2016-01-13 北京奇虎科技有限公司 Method and apparatus for mining point of interest (POI) data in internet
CN105279249A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105279246A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for judging whether webpage contains specified point of interest POI
CN105320752A (en) * 2015-09-30 2016-02-10 北京奇虎科技有限公司 Point of interest data mining method and apparatus
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN105550330A (en) * 2015-12-21 2016-05-04 北京奇虎科技有限公司 Point of interest (POI) information sorting method and system
CN105608112A (en) * 2015-12-10 2016-05-25 北京奇虎科技有限公司 Method and apparatus for measuring quality of map POI data
WO2016155386A1 (en) * 2015-03-31 2016-10-06 北京奇虎科技有限公司 Method and device for determining whether webpage comprises point of interest (poi) data
CN106708952A (en) * 2016-11-25 2017-05-24 北京神州绿盟信息安全科技股份有限公司 Web page clustering method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080040684A1 (en) * 2006-08-14 2008-02-14 Richard Crump Intelligent Pop-Up Window Method and Apparatus
CN102591867A (en) * 2011-01-07 2012-07-18 清华大学 Searching service method based on mobile device position
CN102841920A (en) * 2012-06-30 2012-12-26 北京百度网讯科技有限公司 Method and device for extracting webpage frame information
CN103678629A (en) * 2013-12-19 2014-03-26 北京大学 Search engine method and system sensitive to geographical position

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101963962B (en) * 2009-07-23 2014-02-26 高德软件有限公司 Interest point data association method and device
CN102142003B (en) * 2010-07-30 2013-04-24 华为软件技术有限公司 Method and device for providing point of interest information
CN103514234B (en) * 2012-06-30 2018-10-16 北京百度网讯科技有限公司 Page information extraction method and apparatus
CN104699835B (en) * 2015-03-31 2016-09-28 北京奇虎科技有限公司 Method and device used for determining webpages including POI (point of interest) data

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016155386A1 (en) * 2015-03-31 2016-10-06 北京奇虎科技有限公司 Method and device for determining whether webpage comprises point of interest (poi) data
CN104933171A (en) * 2015-06-30 2015-09-23 百度在线网络技术(北京)有限公司 Method and device for associating data of interest point
CN105138708A (en) * 2015-09-30 2015-12-09 北京奇虎科技有限公司 Method and device for identifying names of points of interest (POI)
CN105160032A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105160031A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Mining method and device for map point of interest (POI) data
CN105243136A (en) * 2015-09-30 2016-01-13 北京奇虎科技有限公司 Method and apparatus for mining point of interest (POI) data in internet
CN105279249A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105279246A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for judging whether webpage contains specified point of interest POI
CN105159885A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Point-of-interest name identification method and device
CN105243136B (en) * 2015-09-30 2019-02-19 北京奇虎科技有限公司 Method and apparatus for mining point of interest (POI) data in internet
CN105320752B (en) * 2015-09-30 2018-12-07 北京奇虎科技有限公司 Point of interest data mining method and apparatus
CN105320752A (en) * 2015-09-30 2016-02-10 北京奇虎科技有限公司 Point of interest data mining method and apparatus
CN105160032B (en) * 2015-09-30 2019-05-31 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105608112A (en) * 2015-12-10 2016-05-25 北京奇虎科技有限公司 Method and apparatus for measuring quality of map POI data
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN105550330A (en) * 2015-12-21 2016-05-04 北京奇虎科技有限公司 Point of interest (POI) information sorting method and system
CN106708952A (en) * 2016-11-25 2017-05-24 北京神州绿盟信息安全科技股份有限公司 Web page clustering method and device

Also Published As

Publication number Publication date
WO2016155386A1 (en) 2016-10-06
CN104699835B (en) 2016-09-28

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model