WO2016155386A1 - Method and device for determining whether webpage comprises point of interest (poi) data - Google Patents

Method and device for determining whether webpage comprises point of interest (poi) data

Info

Publication number
WO2016155386A1
Authority
WO
WIPO (PCT)
Prior art keywords
poi data
poi
webpage
information
name
Prior art date
Application number
PCT/CN2015/099580
Other languages
French (fr)
Chinese (zh)
Inventor
王智广
魏少俊
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司 and 奇智软件(北京)有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Definitions

  • the present invention relates to the field of computer technology, and in particular to a method and apparatus for determining that a web page includes point of interest (POI) data.
  • a POI (Point of Interest)
  • the POI data includes address information and a POI name.
  • the traditional POI data acquisition method requires a technician to use precise surveying and mapping instruments to obtain the latitude and longitude of each POI and then record it. This method is time-consuming and laborious, so only a small amount of POI data is acquired, and it is difficult for a geographic information system to provide a high level of service based on so little POI data.
  • if webpages containing POI data can be collected from the Internet, and the POI data extracted from the collected webpages for use by the geographic information system, labor and time are greatly saved.
  • the Internet is full of fake POI data.
  • for example, the content of a blog page contains “original address: http://xxx.xxx.xxx/xxx”. Although it contains the word “address”, that address is a network address, i.e. a URL (Uniform Resource Locator), not the geographical address information of POI data; thus, the proportion of fake POI data among the collected POI data is high.
  • the present invention addresses the disadvantages of the prior art and provides a method and apparatus for determining that a web page includes POI data, so as to solve the prior-art problem that much of the collected POI data is fake.
  • the present invention provides a method for determining that a webpage page includes POI data of a POI, including:
  • for the POI data and the webpage page having the same latitude and longitude information, searching the webpage page according to the POI name corresponding to the POI data to determine whether the webpage page includes the POI name of the POI data;
  • when the webpage page includes the POI name of the POI data, determining that the webpage page includes the POI data of the POI.
  • the present invention further provides an apparatus for determining that a webpage page includes POI data of a POI, including:
  • a POI data acquisition module configured to acquire multiple POI data from the Internet;
  • a webpage crawling module configured to crawl a plurality of webpage pages including address information;
  • a latitude and longitude information normalization module configured to normalize the address information in the plurality of POI data and the address information included in the plurality of webpage pages into latitude and longitude information, respectively;
  • a latitude and longitude information matching module configured to match the latitude and longitude information of the plurality of POI data with the latitude and longitude information of the plurality of webpage pages on the basis of identical latitude and longitude information;
  • a webpage-includes-POI-name determining module configured to, for the POI data and the webpage page having the same latitude and longitude information, search the webpage page according to the POI name corresponding to the POI data to determine whether the webpage page includes the POI name of the POI data;
  • a webpage-includes-POI-data determining module configured to determine, when the webpage page includes the POI name of the POI data, that the webpage page includes the POI data of the POI.
  • a computer program comprising computer readable code that, when executed on a computing device, causes the computing device to perform the above method for determining that a webpage page includes point of interest (POI) data.
  • a computer readable medium wherein the computer program described above is stored.
  • the address information is normalized into latitude and longitude information, so genuine geographical address information can be filtered from other text. Because latitude and longitude are unique, the accuracy of a matching result based on latitude and longitude information is much higher than that of existing text-based matching, which helps avoid subsequently collecting POI data with fake address information.
  • on the basis of matching the latitude and longitude information of the POI data with the latitude and longitude information of the webpage page, it is further determined whether the webpage page includes the POI name of the POI data, so as to accurately determine whether the POI data is included in that webpage page.
  • this facilitates subsequently determining the accuracy of the collected POI data according to the authority and accuracy of the content recorded on the webpage page, thereby providing a reliable guarantee for collecting large quantities of highly accurate POI data from the Internet.
  • FIG. 1a is a schematic flowchart of a method for determining that a webpage page includes POI data of a POI according to an embodiment of the present invention
  • FIG. 1b is a schematic diagram of a webpage including multiple POI data according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of the internal structure of an apparatus for determining that a webpage page includes point of interest (POI) data according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of the internal structure of a POI data acquisition module according to an embodiment of the present invention
  • FIG. 4 schematically shows a block diagram of a computing device for performing the method according to the invention
  • FIG. 5 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • FIG. 1a is a schematic flowchart of the method of the present invention for determining that a web page includes POI data of a POI. The method includes the following steps:
  • S101: acquire multiple POI data from the Internet;
  • S102: crawl a plurality of webpage pages including address information;
  • S103: normalize the address information in the plurality of POI data and the address information included in the plurality of webpage pages into latitude and longitude information, respectively;
  • S104: match the latitude and longitude information of the plurality of POI data with the latitude and longitude information of the plurality of webpage pages on the basis of identical latitude and longitude information;
  • S105: for the POI data and the webpage page having the same latitude and longitude information, search the webpage page according to the POI name corresponding to the POI data to determine whether the webpage page includes the POI name of the POI data;
  • S106: when the webpage page includes the POI name of the POI data, determine that the webpage page includes the POI data of the POI.
  • the method of the present invention for determining that a webpage page includes POI data normalizes address information into latitude and longitude information, so genuine geographical address information can be filtered from other text. Because latitude and longitude are unique, the accuracy of matching based on latitude and longitude information is much higher than that of existing text-based matching, which helps avoid subsequently collecting POI data with fake address information.
  • on the basis of matching the latitude and longitude information of the POI data with the latitude and longitude information of the webpage page, it is further determined whether the webpage page includes the POI name of the POI data, so as to accurately determine whether the POI data is included in that webpage page.
  • this facilitates subsequently determining the accuracy of the collected POI data according to the authority and accuracy of the content recorded on the webpage page, and thus provides a reliable guarantee for collecting high-accuracy POI data from the Internet in large quantities.
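Steps S103 to S106 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the `Poi` and `Page` records, the `geo_base` dictionary (address string to coordinates) and the function name `pages_including_poi` are hypothetical, S101 and S102 are assumed to have already produced the inputs, and a plain substring search stands in for the name search of S105.

```python
from typing import Dict, List, NamedTuple, Tuple

LatLon = Tuple[float, float]


class Poi(NamedTuple):
    name: str      # POI name from the POI data
    address: str   # address information from the POI data


class Page(NamedTuple):
    url: str
    address: str   # address information extracted from the page (S102)
    text: str      # text content of the page


def pages_including_poi(pois: List[Poi], pages: List[Page],
                        geo_base: Dict[str, LatLon]) -> List[Tuple[Poi, Page]]:
    """Return (poi, page) pairs for which the page is judged to include the POI data."""
    # S103 (pages): normalize page addresses; unknown addresses are dropped.
    pages_by_coord: Dict[LatLon, List[Page]] = {}
    for page in pages:
        coord = geo_base.get(page.address)
        if coord is not None:
            pages_by_coord.setdefault(coord, []).append(page)

    matches = []
    for poi in pois:
        coord = geo_base.get(poi.address)            # S103 (POI data)
        if coord is None:
            continue                                 # address is not geographical
        for page in pages_by_coord.get(coord, []):   # S104: same latitude/longitude
            if poi.name in page.text:                # S105: search for the POI name
                matches.append((poi, page))          # S106: page includes the POI data
    return matches
```

Indexing the pages by coordinate first keeps the matching step a dictionary lookup per POI rather than a comparison against every page.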
  • S101 Acquire multiple POI data from the Internet.
  • a plurality of web pages including POI data are crawled from the Internet using a program of a web crawler; and then a plurality of POI data are extracted from a plurality of web pages including POI data.
  • the POI data includes address information and a POI name; preferably, the POI data may also include a contact, a zip code, a network label, and the like.
  • the inventors of the present invention have found that there are web pages on the Internet in which the content of each page contains one or more POI data, and the address information in the POI data includes an address keyword such as "address"; moreover, the page structure features of these web pages, such as the URL format and the location and format of the POI data within the page, are regular. In other words, POI data can be quickly extracted from these web pages in a unified way.
  • a plurality of URLs (Uniform Resource Locators) corresponding to a plurality of web pages that include address keywords such as “address” can be crawled from the Internet; pattern clustering is then performed on the crawled URLs, so that URLs with the same structural features are clustered into the same pattern set.
  • for web pages that each include only one POI data, the URLs of all such web pages are obtained; pattern clustering is performed on all obtained URLs, and URLs with the same structural features are clustered into the same pattern set.
  • for example, the page with the URL http://www.aibang.com/detail/1537772035-1606201508 includes only the POI data of "Epson (China) Co., Ltd.", and the page with the URL http://www.aibang.com/detail/152928073-419169481 includes only the POI data of “Beijing Wangfu Chinese Medicine and Western Medicine Hospital”.
  • these two URLs have the same structural feature www.aibang.com/detail/*, where * is a wildcard for any characters; therefore, the two URLs can be clustered into the same pattern set; that is, all URLs in the pattern set have the same structural feature www.aibang.com/detail/*.
  • for web pages that each include a plurality of POI data, the URLs of all such web pages are obtained; pattern clustering is performed on all obtained URLs, and URLs with the same structural features are clustered into the same pattern set.
  • for example, the webpage with the URL www.dianping.com/topic/s_c_2_120_r14_x540/p7 includes multiple POI data, with POI names such as "boy london", “COACH” and "Milan shop (Sanlitun)"; the webpage with the URL www.dianping.com/topic/s_c_2_120_r14_x540/p6 also includes multiple POI data. All URLs whose structural features conform to www.dianping.com/topic/* are obtained, where * is a wildcard for any characters; pattern clustering is performed on all obtained URLs, and the URLs in the resulting pattern set share the structural feature www.dianping.com/topic/*.
  • a pattern set including a plurality of web pages including POI data is filtered from a plurality of pattern sets, and a plurality of web pages including POI data are extracted from the pattern set.
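The pattern clustering of URLs can be illustrated with a short Python sketch. The source does not say how the structural feature is computed, so the rule used here (host plus first path segment, followed by `/*`) is an assumption chosen only because it reproduces the two patterns quoted above; the function names `url_pattern` and `cluster_urls` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to an assumed structural feature: host/first-segment/*."""
    # urlsplit only recognizes the host when a scheme or a leading '//' is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return f"{parts.netloc}/*"
    return f"{parts.netloc}/{segments[0]}/*"


def cluster_urls(urls):
    """Group URLs that share the same structural feature into one pattern set."""
    pattern_sets = defaultdict(list)
    for url in urls:
        pattern_sets[url_pattern(url)].append(url)
    return dict(pattern_sets)


if __name__ == "__main__":
    sets = cluster_urls([
        "http://www.aibang.com/detail/1537772035-1606201508",
        "http://www.aibang.com/detail/152928073-419169481",
        "www.dianping.com/topic/s_c_2_120_r14_x540/p7",
        "www.dianping.com/topic/s_c_2_120_r14_x540/p6",
    ])
    for pattern, members in sets.items():
        print(pattern, len(members))
```

A production system would likely need a finer notion of structure (for example, distinguishing numeric from alphabetic segments), which this sketch does not attempt.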
  • extracting the plurality of POI data from the plurality of web pages including the POI data may include:
  • a POI data extraction template corresponding to a pattern set is generated based on the page structure features of the web pages whose URLs belong to that pattern set and which include POI data. Specifically, for each URL belonging to the same pattern set, the POI data extraction template corresponding to the pattern set is generated according to the format and location of the POI data in the webpage corresponding to that URL.
  • based on the POI data extraction template, a plurality of POI data are extracted from the plurality of web pages including POI data. Specifically, for each URL in the same pattern set, the POI data are extracted from the corresponding webpage according to the format and location of POI data recorded in the generated POI data extraction template.
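One way to picture an extraction template is as a regular expression with named groups that is applied to every page of a pattern set. The source does not specify how templates are represented or generated, so the sketch below is purely an assumption: the HTML structure, the class names `shop` and `addr`, and the function `extract_pois` are all hypothetical, and the template is written by hand rather than generated.

```python
import re

# Hypothetical template for one pattern set: each POI appears as a name
# heading followed by an address paragraph. Real pages will differ.
TEMPLATE = re.compile(
    r'<h2 class="shop">(?P<name>[^<]+)</h2>\s*'
    r'<p class="addr">address:\s*(?P<address>[^<]+)</p>'
)


def extract_pois(html: str, template: "re.Pattern[str]" = TEMPLATE):
    """Return (POI name, address information) pairs found by the template."""
    return [
        (m.group("name").strip(), m.group("address").strip())
        for m in template.finditer(html)
    ]
```

Because every page in a pattern set is assumed to share the same layout, the same template can be reused for pages holding one POI or many.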
  • S102 Crawling a plurality of webpage pages including address information.
  • a web crawler program is used to crawl a plurality of webpage pages including address keywords from the Internet.
  • for each webpage page, the text content is extracted, and an address keyword that may indicate address information, such as “address”, “located” or “located in”, is searched for in the text content; a text fragment near the address keyword is then extracted.
  • the text fragment is segmented according to a set separator and fragment length: for example, the fragment is cut where its length from the address keyword exceeds a set threshold, and/or at a separator (such as a space, comma or period). In the segmentation result, the text between the address keyword and the cut point (for example, the separator) is used as the text information associated with the address keyword in the web page.
  • the address information is extracted from the text information as the address information of the web page.
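The keyword search and segmentation described for S102 might look like the following Python sketch. The keyword list comes from the text; everything else is an assumption for illustration: the function name `extract_address_text`, the default threshold of 40 characters, and the separator set. The text also lists a space as a possible separator, which suits Chinese text; it is left out here so that the English sample addresses are not cut at the first word.

```python
import re
from typing import Optional

# Address keywords named in the text; "located in" is tried before "located".
KEYWORD_RE = re.compile(r"address|located in|located", re.IGNORECASE)
# Assumed separators: comma, period, semicolon, newline and full-width forms.
SEPARATOR_RE = re.compile(r"[,.;\n，。；]")


def extract_address_text(text: str, max_len: int = 40) -> Optional[str]:
    """Return the text fragment associated with the first address keyword."""
    keyword = KEYWORD_RE.search(text)
    if keyword is None:
        return None
    # Cut by length: keep at most max_len characters after the keyword.
    fragment = text[keyword.end():keyword.end() + max_len].lstrip(" :：")
    # Cut by separator: keep only the text before the first separator.
    cut = SEPARATOR_RE.search(fragment)
    if cut is not None:
        fragment = fragment[:cut.start()]
    return fragment.strip() or None
```

Note that a fragment such as a URL following "original address:" is still returned at this stage; it is the later normalization into latitude and longitude that discards it.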
  • S103 Normalize the address information in the plurality of POI data and the address information included in the plurality of webpage pages into the latitude and longitude information.
  • a geographic information base is obtained in advance; it includes the address information of provinces, cities, counties (districts), towns and roads in the country, the latitude and longitude information, and the correspondence between the address information and the latitude and longitude information.
  • the address information in the geographic information base may include multiple expression forms of the same geographical address; for example, "6 Jiuxianqiao Road, Chaoyang District", "Beijing Chaoyang Jiuxianqiao No. 6" and "Chaoyang District Jiuxianqiao No. 6" all denote the same geographical address.
  • the address information in the plurality of POI data is respectively normalized into latitude and longitude information of the plurality of POI data.
  • the latitude and longitude information corresponding to the address information is searched from the pre-acquired geographic information database, and the found latitude and longitude information is determined as the latitude and longitude information of the POI data.
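A minimal sketch of this lookup, assuming the geographic information base can be modeled as a dictionary from address expressions to coordinates. The coordinates below are illustrative placeholders, not surveyed values, and `GEO_BASE` and `normalize_address` are hypothetical names; a real geographic information base would be far larger and would likely use fuzzier address matching.

```python
from typing import Dict, Optional, Tuple

LatLon = Tuple[float, float]

# Illustrative placeholder coordinates (not surveyed values). Several
# expression forms of the same geographical address map to one coordinate.
_JIUXIANQIAO_6: LatLon = (39.98, 116.49)
GEO_BASE: Dict[str, LatLon] = {
    "6 jiuxianqiao road, chaoyang district": _JIUXIANQIAO_6,
    "beijing chaoyang jiuxianqiao no. 6": _JIUXIANQIAO_6,
    "chaoyang district jiuxianqiao no. 6": _JIUXIANQIAO_6,
}


def normalize_address(address: str,
                      geo_base: Dict[str, LatLon] = GEO_BASE) -> Optional[LatLon]:
    """Look up latitude/longitude for an address; None if it is not in the base."""
    key = " ".join(address.lower().split())  # fold case and collapse whitespace
    return geo_base.get(key)
```

Two differently worded addresses normalize to the same coordinate, while a network address such as `http://xxx.xxx.xxx/xxx` yields `None` and is therefore filtered out before matching.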
  • S104 Matching the latitude and longitude information of the plurality of POI data with the latitude and longitude information of the plurality of webpage pages based on the same latitude and longitude information.
  • for each POI data, it is determined whether among the webpage pages there is one whose latitude and longitude information is identical to that of the POI data; if so, the POI data is determined to match that webpage page, that is, the POI data and the webpage page have the same latitude and longitude information; otherwise, the POI data is ignored.
  • the accuracy of the matching result based on the latitude and longitude information is much higher than the accuracy of the existing text-based matching result, so that more accurate POI data can be collected according to the more accurate matching result.
  • matching based on latitude and longitude information is equivalent to matching all the address expressions that correspond to that latitude and longitude, which widens the matching range and facilitates the subsequent collection of more POI data.
  • S105 Search for the POI data and the webpage page having the same latitude and longitude information according to the POI name corresponding to the POI data, and determine whether the POI name of the POI data is included in the webpage page.
  • for the POI data and the webpage page having the same latitude and longitude information, all name information is found in the webpage page; for each piece of name information found, it is judged whether it matches the POI name in the POI data: if so, it is determined that the webpage page includes the POI name of the POI data; otherwise, the POI data is ignored.
  • when the POI name in the POI data matches the name information in the webpage page, it is determined that the webpage page includes the POI name of the POI data.
  • for example, if the POI name in the POI data is “Qihu 360” and the name information in the webpage page is “Beijing Qihoo Technology Co., Ltd.”, it can be confirmed that the webpage page includes the POI name of the POI data.
  • when the webpage page includes multiple pieces of name information, the text distance between each piece of name information and the address information of the webpage page is calculated separately.
  • the name information corresponding to the minimum text distance is determined as the name information corresponding to the address information in the web page.
  • the text distance may be the number of characters between the name information and the address information.
  • the POI name corresponding to the POI data is compared with the name information corresponding to the address information in the webpage page. When the comparison is consistent, it is determined that the POI name of the POI data is included in the webpage page.
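The minimum-text-distance rule can be sketched as below, taking the text distance to be the number of characters between the name and the address, as the text suggests. The function names are hypothetical, and only the first occurrence of each string in the page text is considered, which is a simplification.

```python
from typing import Iterable, Optional


def text_distance(text: str, name: str, address: str) -> Optional[int]:
    """Number of characters between name and address in text (first occurrences)."""
    n, a = text.find(name), text.find(address)
    if n == -1 or a == -1:
        return None
    if n <= a:
        return max(0, a - (n + len(name)))   # name precedes the address
    return max(0, n - (a + len(address)))    # name follows the address


def name_for_address(text: str, names: Iterable[str], address: str) -> Optional[str]:
    """Pick the name information with the minimum text distance to the address."""
    best = None
    for name in names:
        d = text_distance(text, name, address)
        if d is not None and (best is None or d < best[0]):
            best = (d, name)
    return best[1] if best is not None else None
```

The selected name is then the one compared against the POI name of the POI data.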
  • the webpage page includes the POI data of the POI.
  • the webpage page is determined to include the POI name and address information of the POI.
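  • FIG. 2 is a schematic diagram of the internal structure of the apparatus of the present invention for determining that a webpage page includes point of interest (POI) data. The apparatus includes: a POI data acquisition module 201, a webpage crawling module 202, a latitude and longitude information normalization module 203, a latitude and longitude information matching module 204, a webpage-includes-POI-name determining module 205, and a webpage-includes-POI-data determining module 206.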
  • the POI data obtaining module 201 is configured to acquire a plurality of POI data from the Internet.
  • the webpage crawling module 202 is configured to crawl a plurality of webpage pages including address information.
  • the webpage crawling module 202 crawls a plurality of webpage pages including address keywords from the Internet; extracts the text information associated with the address keywords in the plurality of webpage pages; and extracts the address information of the corresponding webpage page from that text information.
  • the latitude and longitude information normalization module 203 is configured to normalize the address information in the plurality of POI data and the address information included in the plurality of web page pages into the latitude and longitude information.
  • the latitude and longitude information normalization module 203 normalizes the address information in the plurality of POI data into the latitude and longitude information of the plurality of POI data.
  • the latitude and longitude information corresponding to the address information is searched from the pre-acquired geographic information database, and the found latitude and longitude information is determined as the latitude and longitude information of the POI data.
  • the pre-acquired geographic information database includes address information, latitude and longitude information, and correspondence between the address information and the latitude and longitude information of provinces, cities, counties (districts), towns, and roads in the country.
  • the latitude and longitude information normalization module 203 normalizes the address information included in the plurality of webpage pages into the latitude and longitude information of the plurality of webpage pages.
  • the latitude and longitude information corresponding to the address information is searched from the pre-acquired geographic information database, and the found latitude and longitude information is determined as the latitude and longitude information of the webpage page.
  • the latitude and longitude information matching module 204 is configured to match the latitude and longitude information of the plurality of POI data with the latitude and longitude information of the plurality of webpage pages based on the same latitude and longitude information. Specifically, the latitude and longitude information matching module 204 determines, for each POI data, whether there is a webpage page whose latitude and longitude information is consistent with the latitude and longitude information of the POI data in each webpage page, and if yes, determines that the POI data is related to the webpage page. Matching, that is, determining that the POI data has the same latitude and longitude information as the web page page, otherwise, the POI data is ignored.
  • the webpage-includes-POI-name determining module 205 is configured to, for the POI data and the webpage page having the same latitude and longitude information, search the webpage page according to the POI name corresponding to the POI data to determine whether the webpage page includes the POI name of the POI data.
  • specifically, for the POI data and the webpage page having the same latitude and longitude information, the module 205 finds all name information in the webpage page; for each piece of name information found, it determines whether the name information matches the POI name in the POI data: if so, it determines that the webpage page includes the POI name of the POI data; otherwise, the POI data is ignored.
  • the webpage-includes-POI-data determining module 206 is configured to determine that the webpage page includes the POI data of the POI when the webpage page includes the POI name of the POI data.
  • the schematic diagram of the internal structure of the POI data acquisition module 201 is as shown in FIG. 3, and includes a webpage crawling unit 301 and a POI data extracting unit 302.
  • the webpage crawling unit 301 is configured to crawl a plurality of webpages including POI data from the Internet.
  • the webpage crawling unit 301 crawls a plurality of URLs corresponding to the plurality of webpages including the address keyword from the Internet; performs pattern clustering on the plurality of URLs, and clusters the URLs having the same structural feature into the same pattern set; A pattern set including a plurality of web pages including POI data is filtered from a plurality of pattern sets, and a plurality of web pages including POI data are extracted from the pattern set.
  • the POI data extracting unit 302 is configured to extract a plurality of POI data from a plurality of web pages including POI data.
  • the POI data extracting unit 302 is specifically configured to generate a POI data extraction template corresponding to a pattern set based on the page structure features of the web pages corresponding to the URLs in that pattern set, and to extract a plurality of POI data from the web pages including POI data based on the POI data extraction template.
  • the apparatus of the present invention for determining that a webpage page includes POI data further includes a webpage page name information determining module 207.
  • the webpage page name information determining module 207 is configured to, when the webpage page includes a plurality of name information, separately calculate the text distance between each piece of name information and the address information of the webpage page, and to determine the name information with the minimum text distance as the name information corresponding to the address information in the web page.
  • the webpage-includes-POI-name determining module 205 is further configured to compare the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage page; when the two are consistent, it determines that the webpage page includes the POI name of the POI data.
  • the address information is normalized into latitude and longitude information, so genuine geographical address information can be filtered from other text, which helps avoid subsequently collecting POI data with fake address information.
  • furthermore, whether the webpage page includes the POI name of the POI data is determined only after the latitude and longitude information of the POI data has been matched with that of the webpage page. This helps avoid collecting POI data whose latitude and longitude information or POI name cannot be matched; such unmatched POI data is often less accurate, so more accurate POI data can be collected on the basis of the POI data that a webpage has been determined to include.
  • the present invention includes apparatus related to performing one or more of the operations described herein. These devices may be specially designed and manufactured for the required purposes, or may also include known devices in a general purpose computer. These devices have computer programs stored therein that are selectively activated or reconfigured.
  • Such computer programs may be stored in a device-readable (e.g., computer-readable) medium or in any type of medium suitable for storing electronic instructions and coupled to a bus, including but not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic card or optical card.
  • a readable medium includes any medium that is stored or transmitted by a device (eg, a computer) in a readable form.
  • each block of the block diagrams and/or flow diagrams, and combinations of blocks in the block diagrams and/or flow diagrams, can be implemented by computer program instructions.
  • these computer program instructions can be provided to a processor of a general purpose computer, a special purpose computer, or other programmable data processing apparatus, such that the instructions are executed by the processor of the computer or other programmable data processing apparatus.
  • steps, measures, and solutions in the various operations, methods, and processes that have been discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and schemes of the various operations, methods, and processes that have been discussed in the present invention may be alternated, modified, rearranged, decomposed, combined, or deleted. Further, the steps, measures, and solutions in the prior art having various operations, methods, and processes disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of the apparatus for determining that a web page includes point of interest (POI) data in accordance with an embodiment of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • Figure 4 schematically illustrates a block diagram of a computing device for performing the method in accordance with the present invention.
  • the computing device conventionally includes a processor 410 and a computer program product or computer readable medium in the form of a memory 420.
  • the memory 420 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 420 has a memory space 430 for program code 431 for performing any of the method steps described above.
  • storage space 430 for program code may include various program code 431 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such computer program products are typically portable or fixed storage units as described with reference to FIG. 5.
  • the storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 420 in the computing device of FIG. 4.
  • the program code can, for example, be compressed in an appropriate form.
  • the storage unit comprises computer readable code 431' for performing the steps of the method according to the invention, i.e. code that can be read by a processor such as the processor 410, which, when executed by the computing device, causes the computing device to perform the various steps of the methods described above.
  • the present invention is applicable to computer systems/servers that can operate with numerous other general purpose or special purpose computing system environments or configurations.
  • Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above.
  • the computer system/server can be described in the general context of computer system executable instructions (such as program modules) being executed by a computer system.
  • program modules may include routines, programs, target programs, components, logic, data structures, and the like that perform particular tasks or implement particular abstract data types.
  • the computer system/server can be implemented in a distributed cloud computing environment where tasks are performed by remote processing devices that are linked through a communication network.
  • program modules may be located on a local or remote computing system storage medium including storage devices.

Abstract

A method and device for determining whether a webpage comprises point of interest (POI) data. The method comprises: acquiring multiple pieces of POI data from the Internet (S101); crawling multiple webpages comprising address information (S102); separately normalizing the address information in the multiple pieces of POI data and the address information comprised in the multiple webpages into longitude and latitude information (S103); matching the longitude and latitude information of the multiple pieces of POI data with that of the multiple webpages (S104); for POI data and a webpage having the same longitude and latitude information, searching the webpage for the POI name corresponding to the POI data, so as to determine whether the webpage comprises the POI name of the POI data (S105); and when the webpage comprises the POI name of the POI data, determining that the webpage comprises the POI data (S106). The method and the device make it possible to subsequently assess the accuracy of collected POI data according to the accuracy of the content recorded by the webpages, thereby providing a reliable basis for collecting accurate POI data from the Internet on a large scale.

Description

Method and apparatus for determining that a web page includes point of interest (POI) data
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for determining that a web page includes point of interest (POI) data.
Background
In a geographic information system, a POI (Point Of Interest) may be a house, a shop, a mailbox, a bus stop, and so on. POI data includes address information and a POI name.
Traditional POI data collection requires technicians to use precise surveying and mapping instruments to obtain the latitude and longitude of each POI and then record it. This approach is time-consuming and labor-intensive, so the amount of POI data collected is small, and it is difficult for a geographic information system to provide a high level of service based on so little POI data.
A large amount of POI data exists on the Internet. If web pages containing POI data could be collected from the Internet and the POI data extracted from them for use by a geographic information system, considerable labor and time would be saved. However, the Internet is also full of false POI data. For example, a blog page may contain "原文地址:http://xxx.xxx.xxx/xxx" ("original address: http://xxx.xxx.xxx/xxx"). Although this text contains the word "地址" ("address"), the address is a network address, that is, a URL (Uniform Resource Locator), not the geographic address information of POI data. As a result, the proportion of false POI data in the collected POI data is high.
Summary of the Invention
In view of the shortcomings of the prior art, the present invention provides a method and apparatus for determining that a web page includes point of interest (POI) data, so as to solve the prior-art problem that a large amount of false POI data is collected.
According to one aspect, the present invention provides a method for determining that a web page includes point of interest (POI) data, comprising:
acquiring a plurality of POI data from the Internet;
crawling a plurality of web pages that include address information;
normalizing the address information in the plurality of POI data and the address information included in the plurality of web pages into latitude and longitude information, respectively;
matching the latitude and longitude information of the plurality of POI data against the latitude and longitude information of the plurality of web pages, a match requiring identical latitude and longitude information;
for POI data and a web page having the same latitude and longitude information, searching the web page for the POI name corresponding to the POI data, to determine whether the web page includes the POI name of the POI data;
when the web page includes the POI name of the POI data, determining that the web page includes the POI data.
According to another aspect, the present invention further provides an apparatus for determining that a web page includes point of interest (POI) data, comprising:
a POI data acquisition module, configured to acquire a plurality of POI data from the Internet;
a web page crawling module, configured to crawl a plurality of web pages that include address information;
a latitude and longitude information normalization module, configured to normalize the address information in the plurality of POI data and the address information included in the plurality of web pages into latitude and longitude information, respectively;
a latitude and longitude information matching module, configured to match the latitude and longitude information of the plurality of POI data against the latitude and longitude information of the plurality of web pages, a match requiring identical latitude and longitude information;
a web page POI name determination module, configured to, for POI data and a web page having the same latitude and longitude information, search the web page for the POI name corresponding to the POI data and determine whether the web page includes the POI name of the POI data;
a web page POI data determination module, configured to determine that the web page includes the POI data when the web page includes the POI name of the POI data.
According to yet another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the above method for determining that a web page includes point of interest (POI) data.
According to a further aspect of the present invention, a computer readable medium is provided, in which the above computer program is stored.
In the technical solution of the present invention, normalizing address information into latitude and longitude information filters out non-geographic address information. Because latitude and longitude are unique, matching on latitude and longitude information is far more accurate than existing matching on text information, which helps avoid subsequently collecting POI data with false address information. Once the latitude and longitude information of the POI data matches that of a web page, further determining whether the web page includes the POI name of the POI data makes it possible to judge accurately whether the POI data is contained in that web page. This in turn makes it possible to assess the accuracy of the collected POI data according to the authority and accuracy of the web page's content, and thus provides a reliable basis for collecting highly accurate POI data from the Internet in bulk.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1a is a schematic flowchart of a method for determining that a web page includes point of interest (POI) data according to an embodiment of the present invention;
FIG. 1b is a schematic diagram of a web page including multiple POI data according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of the internal structure of an apparatus for determining that a web page includes point of interest (POI) data according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of the internal structure of a POI data acquisition module according to an embodiment of the present invention;
FIG. 4 schematically shows a block diagram of a computing device for performing the method according to the invention; and
FIG. 5 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
Detailed Description of the Embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are illustrated in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The phrase "and/or" as used herein includes all or any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art, and will not be interpreted in an idealized or overly formal sense unless specifically so defined herein.
FIG. 1a is a schematic flowchart of the method of the present invention for determining that a web page includes point of interest (POI) data.
S101: acquire a plurality of POI data from the Internet;
S102: crawl a plurality of web pages that include address information;
S103: normalize the address information in the plurality of POI data and the address information included in the plurality of web pages into latitude and longitude information, respectively;
S104: match the latitude and longitude information of the plurality of POI data against the latitude and longitude information of the plurality of web pages, a match requiring identical latitude and longitude information;
S105: for POI data and a web page having the same latitude and longitude information, search the web page for the POI name corresponding to the POI data, and determine whether the web page includes the POI name of the POI data;
S106: when the web page includes the POI name of the POI data, determine that the web page includes the POI data.
In the above method of the present invention, normalizing address information into latitude and longitude information filters out address information that is not a geographic location. Because latitude and longitude are unique, matching on latitude and longitude information is far more accurate than existing matching on text information, which helps avoid subsequently collecting data with false address information. Once the latitude and longitude information of the POI data matches that of a web page, further determining whether the web page includes the POI name of the POI data makes it possible to judge accurately whether the POI data is contained in that web page. This in turn makes it possible to assess the accuracy of the collected POI data according to the authority and accuracy of the web page's content, and thus provides a reliable basis for collecting highly accurate POI data from the Internet in bulk.
The method for determining that a web page includes point of interest (POI) data, whose flowchart is shown in FIG. 1a, is described in detail below and includes the following steps:
S101: acquire a plurality of POI data from the Internet.
Specifically, a web-crawler type program is used to crawl multiple web pages that include POI data from the Internet; multiple POI data are then extracted from those web pages. POI data includes address information and a POI name; preferably, POI data may also include contact details, a postal code, network tags, and so on.
The inventors of the present invention have found that the Internet contains web pages whose content includes one or more POI data, where the address information in the POI data includes an address keyword such as "地址" ("address"). Moreover, the page structure features and URL formats of these web pages, as well as the position and format of the POI data within the pages, follow regular patterns. In other words, POI data can be extracted quickly from these pages using one uniform method.
Preferably, the URLs (Uniform Resource Locators) of multiple web pages that include an address keyword such as "地址" ("address") may be crawled from the Internet; pattern clustering is then performed on the crawled URLs, grouping URLs with the same structural features into the same pattern set.
More preferably, among the many web pages that include address keywords, for the web pages that each include only one POI data item, the URLs of all such pages are obtained; pattern clustering is performed on all obtained URLs, grouping URLs with the same structural features into the same pattern set.
For example, among the many web pages that include address keywords, the page at http://www.aibang.com/detail/1537772035-1606201508 includes only the POI data for "爱普生(中国)有限公司" (Epson (China) Co., Ltd.), and the page at http://www.aibang.com/detail/152928073-419169481 includes only the POI data for "北京王府中西医结合医院" (Beijing Wangfu Hospital of Integrated Chinese and Western Medicine). These two URLs share the structural feature www.aibang.com/detail/*, where * is a wildcard standing for any characters. The two URLs can therefore be clustered into the same pattern set; that is, all URLs in that pattern set share the structural feature www.aibang.com/detail/*.
More preferably, among the many web pages that include address keywords, for the web pages that each include multiple POI data, the URLs of all such pages are obtained; pattern clustering is performed on all obtained URLs, grouping URLs with the same structural features into the same pattern set.
For example, the page at www.dianping.com/topic/s_c_2_120_r14_x540/p7, shown in FIG. 1b, includes multiple POI data with POI titles such as "boy london", "COACH(悠唐折扣店)" and "米兰店(三里屯店)"; the page at www.dianping.com/topic/s_c_2_120_r14_x540/p6 also includes multiple POI data. All URLs whose structural feature matches www.dianping.com/topic/* are obtained, where * is a wildcard standing for any characters; pattern clustering is performed on all obtained URLs, and the URLs in the resulting pattern set all share the structural feature www.dianping.com/topic/*.
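The pattern clustering of URLs described above can be sketched as follows. This is a minimal illustration, not the method of the invention: the source does not say how a structural feature is derived, so the sketch assumes that a pattern is the host plus the first path segment followed by a wildcard. The names `url_pattern`, `cluster_by_pattern` and the `depth` parameter are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str, depth: int = 1) -> str:
    """Reduce a URL to 'host/first-segments/*' (an assumed notion of structure)."""
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host when a scheme is present
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.netloc] + segments[:depth]) + "/*"


def cluster_by_pattern(urls, depth: int = 1):
    """Group URLs that share the same pattern into one pattern set."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url, depth)].append(url)
    return dict(clusters)


urls = [
    "http://www.aibang.com/detail/1537772035-1606201508",
    "http://www.aibang.com/detail/152928073-419169481",
    "www.dianping.com/topic/s_c_2_120_r14_x540/p7",
    "www.dianping.com/topic/s_c_2_120_r14_x540/p6",
]
pattern_sets = cluster_by_pattern(urls)
```

With the four example URLs from the text, this yields two pattern sets keyed by `www.aibang.com/detail/*` and `www.dianping.com/topic/*`. A production system would likely need a more careful definition of structure than a fixed path depth.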
From the multiple pattern sets, a pattern set whose URLs correspond to multiple web pages including POI data is selected, and the multiple web pages including POI data are obtained from that pattern set.
Preferably, extracting multiple POI data from the multiple web pages including POI data may specifically include:
generating a POI data extraction template corresponding to a pattern set, based on the page structure features of the web pages including POI data that correspond to the URLs in that pattern set. Specifically, for each URL in the same pattern set, the POI data extraction template corresponding to the pattern set is generated according to the format and position of the POI data in the web page corresponding to that URL.
extracting multiple POI data from the multiple web pages including POI data, based on the POI data extraction template. Specifically, for each URL in the same pattern set, multiple POI data are extracted from the web page corresponding to that URL according to the POI data format in the generated extraction template and the positions of the POI data in the page.
S102: crawl a plurality of web pages that include address information.
Specifically, a web-crawler type program is used to crawl multiple web pages that include address keywords from the Internet.
Multiple pieces of text information associated with the address keywords are extracted from the web pages.
Specifically, for a web page, the text content of the page is extracted, and the text content is searched for address keywords that may introduce address information, such as "地址" ("address"), "位于" ("located at") or "坐落于" ("situated at"). The text fragment near each address keyword is extracted and then split according to set separators and a fragment length: for example, the fragment is split where its distance in text from the address keyword exceeds a set threshold, and/or where a set separator (such as a space, comma or full stop) occurs. In the split result, the text fragment between the split point (for example, the separator) and the address keyword is taken as the text information associated with the address keyword in that web page.
The address information of the corresponding web page is extracted from the pieces of text information.
Specifically, for each piece of text information extracted from a web page, address information is extracted from that text information and taken as the address information of the web page.
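The keyword-based fragment extraction of step S102 can be sketched as below. This is a simplified illustration under stated assumptions: only the text following a keyword is taken, and the separator set, the 40-character threshold and the names `ADDRESS_KEYWORDS`, `SEPARATORS`, `MAX_LEN` and `fragments_after_keywords` are all hypothetical choices, not values given in the source.

```python
import re

ADDRESS_KEYWORDS = ("地址", "位于", "坐落于")  # keywords named in the text
SEPARATORS = " \t\n,,。;;"  # assumed separator set
MAX_LEN = 40  # assumed length threshold, in characters


def fragments_after_keywords(text, keywords=ADDRESS_KEYWORDS, max_len=MAX_LEN):
    """Return the text fragment that follows each address keyword."""
    # Try longer keywords first so that the longest keyword wins at a position.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in ordered)
    fragments = []
    for match in re.finditer(pattern, text):
        start = match.end()
        # Skip a colon (ASCII or full-width) written directly after the keyword.
        while start < len(text) and text[start] in "::":
            start += 1
        end = start
        # Stop at the first separator or once the length threshold is reached.
        while end < len(text) and end - start < max_len and text[end] not in SEPARATORS:
            end += 1
        if end > start:
            fragments.append(text[start:end])
    return fragments


sample = "原文地址:http://xxx.xxx.xxx/xxx 公司地址:朝阳区酒仙桥路6号,电话"
candidates = fragments_after_keywords(sample)
```

On the sample string, both the URL and the street address come back as candidates. The sketch makes no attempt to tell them apart; in the method described here that distinction is left to the latitude and longitude normalization of step S103.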
S103: normalize the address information in the plurality of POI data and the address information included in the plurality of web pages into latitude and longitude information, respectively.
A geographic information base is obtained in advance. It contains address information and latitude and longitude information for the provinces, cities, counties (districts), townships, roads and so on of the whole country, together with the correspondence between address information and latitude and longitude information. The address information in the geographic information base may include multiple forms of expression that denote the same geographic address; for example, "朝阳区酒仙桥路6号", "北京市朝阳酒仙桥6号" and "朝阳区酒仙桥6号" all denote the same geographic address.
Specifically, the address information in each of the plurality of POI data is normalized into the latitude and longitude information of that POI data. For example, for the address information in each POI data, the latitude and longitude information corresponding to that address information is looked up in the pre-obtained geographic information base, and the latitude and longitude information found is determined as the latitude and longitude information of the POI data.
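The lookup-based normalization of step S103 can be sketched with an in-memory dictionary standing in for the geographic information base. The coordinate values below are placeholders chosen for illustration, not surveyed coordinates, and the names `GEO_INDEX` and `normalize` are hypothetical.

```python
# Toy stand-in for the geographic information base: several written forms of
# one address map to the same (latitude, longitude) pair.
# The coordinates are placeholders, not real survey data.
JIUXIANQIAO_6 = (39.0, 116.0)

GEO_INDEX = {
    "朝阳区酒仙桥路6号": JIUXIANQIAO_6,
    "北京市朝阳酒仙桥6号": JIUXIANQIAO_6,
    "朝阳区酒仙桥6号": JIUXIANQIAO_6,
}


def normalize(address, index=GEO_INDEX):
    """Return the (lat, lon) for an address string, or None if it is unknown."""
    return index.get(address.strip())


coord = normalize("朝阳区酒仙桥6号")
not_geographic = normalize("http://xxx.xxx.xxx/xxx")
```

A string that is not in the base, such as a URL, normalizes to `None`, which is how non-geographic "addresses" drop out before matching. A real geographic information base would be far larger and would presumably need approximate rather than exact string lookup.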
S104: match the latitude and longitude information of the plurality of POI data against the latitude and longitude information of the plurality of web pages, a match requiring identical latitude and longitude information.
Specifically, for each POI data, it is determined whether any of the web pages has latitude and longitude information identical to that of the POI data. If so, the POI data is determined to match that web page, that is, the POI data and the web page are determined to have the same latitude and longitude information; otherwise, the POI data is ignored.
Because latitude and longitude are unique, matching on latitude and longitude information is far more accurate than existing matching on text information, so more accurate POI data can subsequently be collected from the more accurate matching results. Moreover, matching on latitude and longitude information is equivalent to matching on each of the multiple address expressions that correspond to that latitude and longitude, which widens the scope of matching and helps collect more POI data subsequently.
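The matching of step S104 can be sketched as an exact join on the normalized coordinates. This is a minimal sketch: the record layout (dictionaries with a `coord` key holding a `(lat, lon)` tuple or `None`) and the name `match_by_coordinates` are assumptions made for illustration.

```python
from collections import defaultdict


def match_by_coordinates(pois, pages):
    """Pair every POI with every page that has identical (lat, lon)."""
    pages_by_coord = defaultdict(list)
    for page in pages:
        if page["coord"] is not None:  # pages without a geographic address are skipped
            pages_by_coord[page["coord"]].append(page)

    matches = []
    for poi in pois:
        # POIs with no page at the same coordinates are simply ignored.
        for page in pages_by_coord.get(poi["coord"], []):
            matches.append((poi, page))
    return matches


pois = [
    {"name": "A", "coord": (1.0, 2.0)},
    {"name": "B", "coord": (3.0, 4.0)},
    {"name": "C", "coord": None},
]
pages = [
    {"url": "u1", "coord": (1.0, 2.0)},
    {"url": "u2", "coord": None},
]
pairs = match_by_coordinates(pois, pages)
```

Exact tuple equality is workable here only because both sides obtain their coordinates from the same lookup, so equal addresses yield identical values. Indexing the pages by coordinate first avoids comparing every POI with every page.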
S105: for POI data and a web page having the same latitude and longitude information, search the web page for the POI name corresponding to the POI data, and determine whether the web page includes the POI name of the POI data.
Specifically, for POI data and a web page having the same latitude and longitude information, all name information is found in the web page. For each piece of name information found, it is determined whether it matches the POI name in the POI data: if so, it is determined that the web page includes the POI name of the POI data; otherwise, the POI data is ignored.
Preferably, for POI data and a web page having the same latitude and longitude information, if the name information in the web page and the POI name in the POI data are not textually identical but in substance denote the same POI, the POI name in the POI data may be regarded as matching the name information in the web page, and it is thereby determined that the web page includes the POI name of the POI data.
For example, for POI data and a web page having the same latitude and longitude information, if the POI name in the POI data is "奇虎360" (Qihoo 360) and the name information in the web page is "北京奇虎科技有限公司" (Beijing Qihoo Technology Co., Ltd.), the web page may be regarded as including the POI name of the POI data.
Preferably, for POI data and a web page having the same latitude and longitude information, when the web page is found to include multiple pieces of name information, the text distance between each piece of name information and the address information of the web page is calculated. The name information with the smallest text distance is determined as the name information corresponding to the address information in the web page. The text distance may be the number of characters between the name information and the address information.
For POI data and a web page having the same latitude and longitude information, the POI name corresponding to the POI data is compared with the name information corresponding to the address information in the web page. When the two are consistent, it is determined that the web page includes the POI name of the POI data.
Specifically, for POI data and a web page having the same latitude and longitude information, it is determined whether the POI name corresponding to the POI data is consistent with the name information corresponding to the address information in the web page: if so, it is determined that the web page includes the POI name of the POI data; otherwise, it is determined that the web page does not include the POI name of the POI data.
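The text-distance selection and name comparison of step S105 can be sketched as follows. The sketch assumes the candidate names and the address string have already been located in the page text, measures distance as the number of characters between the first occurrence of each, and compares names by exact equality, which is stricter than the "in substance the same POI" matching described above. The function names are hypothetical.

```python
def text_distance(text, name, address):
    """Number of characters between a name and the address, or None if absent."""
    n = text.find(name)
    a = text.find(address)
    if n < 0 or a < 0:
        return None
    if n < a:
        return max(a - (n + len(name)), 0)  # name appears before the address
    return max(n - (a + len(address)), 0)  # name appears after the address


def name_for_address(text, names, address):
    """Pick the candidate name closest to the address in the page text."""
    scored = []
    for name in names:
        distance = text_distance(text, name, address)
        if distance is not None:
            scored.append((distance, name))
    return min(scored)[1] if scored else None


def page_contains_poi_name(text, names, address, poi_name):
    """True when the name closest to the address equals the POI name."""
    return name_for_address(text, names, address) == poi_name


page_text = "北京奇虎科技有限公司 地址:朝阳区酒仙桥路6号 附近还有某某餐厅"
names = ["北京奇虎科技有限公司", "某某餐厅"]
address = "朝阳区酒仙桥路6号"
closest = name_for_address(page_text, names, address)
```

In the sample text the company name sits closer to the address than the restaurant name, so it is selected as the name corresponding to the address. Under exact equality, a POI named "奇虎360" would not match this page; recognizing that the two names denote the same POI would need an alias table or similar, which the source does not specify.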
S106:对于具有相同经纬度信息的POI数据及网页页面,当该网页页面中包括该POI数据的POI名称时,确定该网页页面包括该兴趣点POI数据。S106: For the POI data and the webpage page having the same latitude and longitude information, when the webpage page includes the POI name of the POI data, it is determined that the webpage page includes the POI data of the POI.
具体地,对于具有相同经纬度信息的POI数据及网页页面,当在上述步骤S105中确定出该网页页面中包括该POI数据的POI名称时,在本步骤中确定该网页页面包括该兴趣点POI数据,具体地确定该网页页面包括该兴趣点POI名称和地址信息。Specifically, for the POI data and the webpage page having the same latitude and longitude information, when it is determined in the foregoing step S105 that the POI name of the POI data is included in the webpage page, it is determined in the step that the webpage page includes the POI data of the POI. Specifically, the webpage page is determined to include the POI name and address information of the POI.
Based on the above method for determining that a webpage includes point of interest (POI) data, the present invention provides a device for determining that a webpage includes POI data. A schematic diagram of the internal structure of the device is shown in FIG. 2. The device includes: a POI data acquisition module 201, a webpage crawling module 202, a latitude and longitude information normalization module 203, a latitude and longitude information matching module 204, a webpage POI name determining module 205, and a webpage POI data determining module 206.
The POI data acquisition module 201 is configured to acquire multiple POI data items from the Internet.
The webpage crawling module 202 is configured to crawl multiple webpages that include address information.
Specifically, the webpage crawling module 202 crawls multiple webpages that include address keywords from the Internet, extracts from those webpages multiple pieces of text information associated with the address keywords, and extracts the address information of each corresponding webpage from that text information.
The latitude and longitude information normalization module 203 is configured to normalize the address information in the multiple POI data items and the address information included in the multiple webpages into latitude and longitude information, respectively.
Specifically, the latitude and longitude information normalization module 203 normalizes the address information in the multiple POI data items into latitude and longitude information of those POI data items. Preferably, for the address information in each POI data item, the module looks up the corresponding latitude and longitude information in a pre-acquired geographic information database and determines the latitude and longitude information found to be that of the POI data item. The pre-acquired geographic information database includes address information and latitude and longitude information for provinces, cities, counties (districts), towns, roads, and so on nationwide, together with the correspondence between the address information and the latitude and longitude information.
Likewise, the latitude and longitude information normalization module 203 normalizes the address information included in the multiple webpages into latitude and longitude information of those webpages. Preferably, for the address information included in each webpage, the module looks up the corresponding latitude and longitude information in the pre-acquired geographic information database and determines the latitude and longitude information found to be that of the webpage.
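The lookup performed by module 203 can be sketched as a dictionary query against a gazetteer. The `GAZETTEER` table, its single entry, and the whitespace and case normalization of the key are illustrative assumptions; the text only states that a pre-acquired geographic information database maps address information to latitude and longitude information.

```python
from typing import Dict, Optional, Tuple

LatLon = Tuple[float, float]

# Hypothetical stand-in for the pre-acquired geographic information database.
GAZETTEER: Dict[str, LatLon] = {
    "1 example road, example district": (39.9042, 116.4074),
}


def normalize_address(address: str, gazetteer: Dict[str, LatLon]) -> Optional[LatLon]:
    """Map an address string to (latitude, longitude), or None if unknown."""
    # Assumption: keys are stored lower-cased with single spaces.
    key = " ".join(address.split()).casefold()
    # Text that is not a real geographic address has no entry and yields None,
    # so it drops out before the matching stage.
    return gazetteer.get(key)
```

Because POI addresses and webpage addresses pass through the same function, two records that name the same place end up with identical coordinate values, which is what the later matching step relies on.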
The latitude and longitude information matching module 204 is configured to match the latitude and longitude information of the multiple POI data items against the latitude and longitude information of the multiple webpages on the basis of identical latitude and longitude information. Specifically, for each POI data item, the module 204 determines whether any webpage has latitude and longitude information consistent with that of the POI data item. If so, it determines that the POI data item matches that webpage, that is, that the POI data item and the webpage have the same latitude and longitude information; otherwise, it ignores the POI data item.
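One way to picture the matching performed by module 204 is an index of webpages keyed by coordinates, queried once per POI data item. The function name, the identifier-to-coordinate dictionaries, and the use of exact tuple equality are assumptions made for this sketch; exact equality is only reasonable here because both sides are assumed to come from the same lookup table.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LatLon = Tuple[float, float]


def match_by_latlon(
    poi_coords: Dict[str, LatLon],
    page_coords: Dict[str, LatLon],
) -> List[Tuple[str, str]]:
    """Return (poi_id, page_id) pairs that share the same coordinates."""
    # Index webpages by coordinate so each POI needs a single lookup.
    pages_by_coord: Dict[LatLon, List[str]] = defaultdict(list)
    for page_id, coord in page_coords.items():
        pages_by_coord[coord].append(page_id)

    matches: List[Tuple[str, str]] = []
    for poi_id, coord in poi_coords.items():
        # A POI with no webpage at its coordinates is simply skipped (ignored).
        for page_id in pages_by_coord.get(coord, []):
            matches.append((poi_id, page_id))
    return matches
```

Only the pairs returned here would go on to the name check carried out by module 205.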
The webpage POI name determining module 205 is configured to, for POI data and a webpage that have the same latitude and longitude information, search the webpage according to the POI name corresponding to the POI data and determine whether the webpage includes the POI name of the POI data. Specifically, for POI data and a webpage that have the same latitude and longitude information, the module 205 finds all name information in the webpage and, for each piece of name information found, determines whether it matches the POI name in the POI data. If so, it determines that the webpage includes the POI name of the POI data; otherwise, it ignores the POI data.
The webpage POI data determining module 206 is configured to determine that the webpage includes the point of interest (POI) data when the webpage includes the POI name of the POI data.
Preferably, a schematic diagram of the internal structure of the POI data acquisition module 201 is shown in FIG. 3. The module includes a web page crawling unit 301 and a POI data extraction unit 302.
The web page crawling unit 301 is configured to crawl multiple web pages that include POI data from the Internet.
Specifically, the web page crawling unit 301 crawls from the Internet multiple URLs corresponding to multiple web pages that include address keywords; performs pattern clustering on the URLs, clustering URLs that have the same structural features into the same pattern set; selects, from the resulting pattern sets, a pattern set that contains multiple web pages including POI data; and extracts the multiple web pages including POI data from that pattern set.
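The pattern clustering of URLs can be illustrated with a simple structural signature. The text does not say which structural features are used, so the rule below (keep the host and path segments, replace every run of digits with a `{num}` placeholder) is purely an assumption, as are the function names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to a structural signature (assumed rule: mask digits)."""
    parts = urlsplit(url)
    segments = [
        re.sub(r"\d+", "{num}", segment)
        for segment in parts.path.split("/")
        if segment
    ]
    return parts.netloc + "/" + "/".join(segments)


def cluster_urls(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs whose structural signatures are identical."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)
```

With this rule, detail pages such as `http://example.com/shop/123` and `http://example.com/shop/456` fall into one pattern set, which could then be screened for whether its pages carry POI data.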
The POI data extraction unit 302 is configured to extract multiple POI data items from the multiple web pages that include POI data.
Specifically, the POI data extraction unit 302 generates a POI data extraction template for a pattern set, based on the page structure features of the web pages including POI data that correspond to the URLs belonging to that pattern set, and then uses the POI data extraction template to extract multiple POI data items from the multiple web pages that include POI data.
Preferably, as shown in FIG. 2, the device of the present invention for determining that a webpage includes point of interest (POI) data further includes a webpage name information determining module 207.
The webpage name information determining module 207 is configured to, when the webpage is determined to include multiple pieces of name information, calculate the text distance between each piece of name information and the address information of the webpage, and to determine the name information with the smallest text distance to be the name information corresponding to the address information in the webpage.
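The selection made by module 207 can be sketched as follows. The text does not define "text distance"; this sketch assumes it is the absolute difference between the character offsets at which the name and the address first occur in the page text. The function name and the handling of missing strings are likewise assumptions.

```python
from typing import Iterable, Optional


def closest_name(page_text: str, names: Iterable[str], address: str) -> Optional[str]:
    """Pick the candidate name located nearest to the address in the page text."""
    address_pos = page_text.find(address)
    if address_pos < 0:
        return None  # the address does not occur in this text

    best_name: Optional[str] = None
    best_distance: Optional[int] = None
    for name in names:
        name_pos = page_text.find(name)
        if name_pos < 0:
            continue  # skip candidates that do not occur in the text
        # Assumed text distance: gap between first occurrences, in characters.
        distance = abs(name_pos - address_pos)
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    return best_name
```

The name returned here plays the role of the "name information corresponding to the address information" that module 205 then compares with the POI name.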
In this case, the webpage POI name determining module 205 is further configured to compare the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage and, when the two are consistent, to determine that the webpage includes the POI name of the POI data.
In the technical solution of the embodiments of the present invention, normalizing address information into latitude and longitude information filters out non-geographic address information, which helps avoid subsequently collecting POI data whose address information is false. On the basis that the latitude and longitude information of the POI data matches that of a webpage, the solution further determines whether the webpage includes the POI name of the POI data. This helps avoid subsequently collecting POI data whose latitude and longitude information or POI name cannot be matched, and such unmatched POI data tends to be less accurate. More accurate POI data can therefore be collected later on the basis of the POI data determined to be included in webpages.
Those skilled in the art will appreciate that the present invention covers devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in a general-purpose computer. These devices have computer programs stored in them that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (for example, computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium in which a device (for example, a computer) stores or transmits information in a readable form.
Those skilled in the art will appreciate that computer program instructions can implement each block of these structure diagrams, block diagrams, and/or flow diagrams, as well as combinations of blocks in them. Those skilled in the art will also appreciate that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the processor executes the schemes specified in the block or blocks of the structure diagrams, block diagrams, and/or flow diagrams disclosed in the present invention.
Those skilled in the art will appreciate that the steps, measures, and schemes in the various operations, methods, and processes discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and schemes in the various operations, methods, and processes discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted. Further, steps, measures, and schemes in the prior art that correspond to those in the various operations, methods, and processes disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.
The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements should also be regarded as falling within the scope of protection of the present invention.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art should understand that, in practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the components of the device for determining that a webpage includes point of interest (POI) data according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program or a computer program product) for performing part or all of the method described here. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 4 schematically shows a block diagram of a computing device for performing the method according to the present invention. The computing device conventionally includes a processor 410 and a computer program product or computer-readable medium in the form of a memory 420. The memory 420 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. The memory 420 has a storage space 430 for program code 431 for performing any of the method steps described above. For example, the storage space 430 for program code may include individual program codes 431, each implementing one of the steps of the above method. The program code can be read from, or written to, one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards, or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 5. The storage unit may have storage segments, storage space, and the like arranged similarly to the memory 420 in the computing device of FIG. 4. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 431', that is, code that can be read by a processor such as 410, for performing the steps of the method according to the present invention. When run by a computing device, this code causes the computing device to perform the steps of the method described above.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
In addition, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the present invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With regard to the scope of the present invention, the disclosure is illustrative rather than restrictive, and the scope of the present invention is defined by the appended claims.
The present invention may be applied to computer systems/servers, which are operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with a computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.
A computer system/server may be described in the general context of computer-system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. A computer system/server may be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media, including storage devices.
Reference herein to "one embodiment", "an embodiment", or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. In addition, note that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.

Claims (14)

  1. A method for determining that a webpage includes point of interest (POI) data, characterized by comprising:
    acquiring a plurality of POI data items from the Internet;
    crawling a plurality of webpages that include address information;
    normalizing the address information in the plurality of POI data items and the address information included in the plurality of webpages into latitude and longitude information, respectively;
    matching the latitude and longitude information of the plurality of POI data items against the latitude and longitude information of the plurality of webpages on the basis of identical latitude and longitude information;
    for POI data and a webpage that have the same latitude and longitude information, searching the webpage according to the POI name corresponding to the POI data, and determining whether the webpage includes the POI name of the POI data;
    when the webpage includes the POI name of the POI data, determining that the webpage includes the point of interest (POI) data.
  2. The method according to claim 1, characterized in that the step of acquiring a plurality of POI data items from the Internet further comprises:
    crawling a plurality of web pages that include POI data from the Internet;
    extracting a plurality of POI data items from the plurality of web pages that include POI data.
  3. The method according to any one of claims 1-2, characterized in that the step of crawling a plurality of web pages that include POI data from the Internet further comprises:
    crawling, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include address keywords;
    performing pattern clustering on the plurality of URLs, clustering URLs that have the same structural features into the same pattern set;
    selecting, from a plurality of pattern sets, a pattern set that contains a plurality of web pages including POI data, and extracting the plurality of web pages including POI data from that pattern set.
  4. The method according to any one of claims 1-3, characterized in that the step of extracting a plurality of POI data items from the plurality of web pages that include POI data further comprises:
    generating a POI data extraction template corresponding to a pattern set, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs belonging to that pattern set;
    extracting a plurality of POI data items from the plurality of web pages that include POI data, based on the POI data extraction template.
  5. The method according to any one of claims 1-4, characterized in that the step of crawling a plurality of webpages that include address information further comprises:
    crawling a plurality of webpages that include address keywords from the Internet;
    extracting, from the plurality of webpages, a plurality of pieces of text information associated with the address keywords;
    extracting the address information of each corresponding webpage from the plurality of pieces of text information.
  6. The method according to any one of claims 1-5, characterized in that the method further comprises:
    when the webpage is determined to include a plurality of pieces of name information, calculating the text distance between each of the plurality of pieces of name information and the address information of the webpage;
    determining the name information with the smallest text distance to be the name information corresponding to the address information in the webpage;
    wherein the step of searching the webpage according to the POI name corresponding to the POI data and determining whether the webpage includes the POI name of the POI data further comprises:
    comparing the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage;
    when the comparison is consistent, determining that the webpage includes the POI name of the POI data.
  7. A device for determining that a webpage includes point of interest (POI) data, characterized by comprising:
    a POI data acquisition module, configured to acquire a plurality of POI data items from the Internet;
    a webpage crawling module, configured to crawl a plurality of webpages that include address information;
    a latitude and longitude information normalization module, configured to normalize the address information in the plurality of POI data items and the address information included in the plurality of webpages into latitude and longitude information, respectively;
    a latitude and longitude information matching module, configured to match the latitude and longitude information of the plurality of POI data items against the latitude and longitude information of the plurality of webpages on the basis of identical latitude and longitude information;
    a webpage POI name determining module, configured to, for POI data and a webpage that have the same latitude and longitude information, search the webpage according to the POI name corresponding to the POI data and determine whether the webpage includes the POI name of the POI data;
    a webpage POI data determining module, configured to determine that the webpage includes the point of interest (POI) data when the webpage includes the POI name of the POI data.
  8. The device according to claim 7, characterized in that the POI data acquisition module specifically comprises:
    a web page crawling unit, configured to crawl a plurality of web pages that include POI data from the Internet;
    a POI data extraction unit, configured to extract a plurality of POI data items from the plurality of web pages that include POI data.
  9. The device according to any one of claims 7-8, characterized in that
    the web page crawling unit is specifically configured to: crawl, from the Internet, a plurality of URLs corresponding to a plurality of web pages that include address keywords; perform pattern clustering on the plurality of URLs, clustering URLs that have the same structural features into the same pattern set; select, from a plurality of pattern sets, a pattern set that contains a plurality of web pages including POI data; and extract the plurality of web pages including POI data from that pattern set.
  10. The device according to any one of claims 7-9, characterized in that
    the POI data extraction unit is specifically configured to: generate a POI data extraction template corresponding to a pattern set, based on the page structure features of the plurality of web pages including POI data that correspond to the plurality of URLs belonging to that pattern set; and extract a plurality of POI data items from the plurality of web pages that include POI data, based on the POI data extraction template.
  11. The device according to any one of claims 7-10, characterized in that
    the webpage crawling module is specifically configured to: crawl a plurality of webpages that include address keywords from the Internet; extract, from the plurality of webpages, a plurality of pieces of text information associated with the address keywords; and extract the address information of each corresponding webpage from the plurality of pieces of text information.
  12. The device according to any one of claims 7-10, characterized by further comprising: a webpage name information determining module;
    the webpage name information determining module is configured to, when the webpage is determined to include a plurality of pieces of name information, calculate the text distance between each of the plurality of pieces of name information and the address information of the webpage, and to determine the name information with the smallest text distance to be the name information corresponding to the address information in the webpage; and
    the webpage POI name determining module is further configured to compare the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage and, when the comparison is consistent, to determine that the webpage includes the POI name of the POI data.
  13. A computer program, comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for determining that a webpage includes point of interest (POI) data according to any one of claims 1-6.
  14. A computer-readable medium in which the computer program according to claim 13 is stored.
PCT/CN2015/099580 2015-03-31 2015-12-29 Method and device for determining whether webpage comprises point of interest (poi) data WO2016155386A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510148638.4 2015-03-31
CN201510148638.4A CN104699835B (en) 2015-03-31 2015-03-31 For determining that Webpage includes the method and device of point of interest POI data

Publications (1)

Publication Number Publication Date
WO2016155386A1 (en) 2016-10-06

Family

ID=53346955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/099580 WO2016155386A1 (en) 2015-03-31 2015-12-29 Method and device for determining whether webpage comprises point of interest (poi) data

Country Status (2)

Country Link
CN (1) CN104699835B (en)
WO (1) WO2016155386A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699835B (en) * 2015-03-31 2016-09-28 北京奇虎科技有限公司 For determining that Webpage includes the method and device of point of interest POI data
CN104933171B (en) * 2015-06-30 2019-06-18 百度在线网络技术(北京)有限公司 Interest point data association method and device
CN105117425B (en) * 2015-07-31 2022-03-08 北京奇虎科技有限公司 Method and device for selecting point of interest (POI) data
CN105279246A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for judging whether webpage contains specified point of interest POI
CN105279249B (en) * 2015-09-30 2019-06-21 北京奇虎科技有限公司 Method and device for determining the confidence level of point of interest data in a website
CN105160031A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Mining method and device for map point of interest (POI) data
CN105159885A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Point-of-interest name identification method and device
CN105320752B (en) * 2015-09-30 2018-12-07 北京奇虎科技有限公司 Method and device for mining point of interest data
CN105138708A (en) * 2015-09-30 2015-12-09 北京奇虎科技有限公司 Method and device for identifying names of points of interest (POI)
CN105160032B (en) * 2015-09-30 2019-05-31 北京奇虎科技有限公司 Method and device for determining the confidence level of point of interest data in a website
CN105243136B (en) * 2015-09-30 2019-02-19 北京奇虎科技有限公司 Method and apparatus for mining point of interest (POI) data on the Internet
CN105608112A (en) * 2015-12-10 2016-05-25 北京奇虎科技有限公司 Method and apparatus for measuring quality of map POI data
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN105550330B (en) * 2015-12-21 2020-09-11 北京奇虎科技有限公司 Method and system for ordering POI (Point of interest) information
CN106708952B (en) * 2016-11-25 2019-11-19 北京神州绿盟信息安全科技股份有限公司 Webpage clustering method and device
CN112000495B (en) * 2020-10-27 2021-02-12 博泰车联网(南京)有限公司 Method, electronic device and storage medium for point of interest information management

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080040684A1 (en) * 2006-08-14 2008-02-14 Richard Crump Intelligent Pop-Up Window Method and Apparatus
CN101963962A (en) * 2009-07-23 2011-02-02 高德软件有限公司 Interest point data association method and device
CN102142003A (en) * 2010-07-30 2011-08-03 华为软件技术有限公司 Method and device for providing point of interest information
CN103514234A (en) * 2012-06-30 2014-01-15 北京百度网讯科技有限公司 Method and device for extracting page information
CN104699835A (en) * 2015-03-31 2015-06-10 北京奇虎科技有限公司 Method and device used for determining webpages including POI (point of interest) data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591867B (en) * 2011-01-07 2015-05-27 清华大学 Searching service method based on mobile device position
CN102841920B (en) * 2012-06-30 2017-05-10 北京百度网讯科技有限公司 Method and device for extracting webpage frame information
CN103678629B (en) * 2013-12-19 2016-09-28 北京大学 Geographic-location-sensitive search engine method and system

Also Published As

Publication number Publication date
CN104699835B (en) 2016-09-28
CN104699835A (en) 2015-06-10

Similar Documents

Publication Publication Date Title
WO2016155386A1 (en) Method and device for determining whether webpage comprises point of interest (poi) data
US11698261B2 (en) Method, apparatus, computer device and storage medium for determining POI alias
Schulz et al. A multi-indicator approach for geolocalization of tweets
Lieberman et al. STEWARD: architecture of a spatio-textual search engine
US11526769B2 (en) Encoding knowledge graph entries with searchable geotemporal values for evaluating transitive geotemporal proximity of entity mentions
CN103049575A (en) Topic-adaptive academic conference searching system
CN102549571A (en) Landmarks from digital photo collections
CN108304423A (en) Information identification method and device
CN105069076A (en) Method and apparatus for determining address information in home page of official website
WO2020052312A1 (en) Positioning method and apparatus, electronic device, and readable storage medium
CN103514234A (en) Method and device for extracting page information
CN104899243A (en) Method and apparatus for detecting accuracy of POI (Point of Interest) data
CN102646124A (en) Method for automatically identifying address information
US8954438B1 (en) Structured metadata extraction
Srivastava et al. A geocoding framework powered by delivery data
Intagorn et al. Learning boundaries of vague places from noisy annotations
WO2016107352A1 (en) System and method for determining poi name and for determining validity of poi information
CN111460054B (en) Address data processing method and device, equipment and storage medium
CN105117425B (en) Method and device for selecting point of interest (POI) data
CN106095808B (en) Method and apparatus for restoring MDB file fragments
US20150269268A1 (en) Search server and search method
KR20090085135A (en) Aggregation syndication platform
CN108595453B (en) URL (Uniform resource locator) identifier mapping obtaining method and device
De Rouck et al. Georeferencing Wikipedia pages using language models from Flickr
Awamura et al. Location name disambiguation exploiting spatial proximity and temporal consistency

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15887335

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15887335

Country of ref document: EP

Kind code of ref document: A1