CN104699835A - Method and device used for determining webpages including POI (point of interest) data - Google Patents


Info

Publication number
CN104699835A
Authority
CN
China
Prior art keywords
webpage
poi data
poi
latitude
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510148638.4A
Other languages
Chinese (zh)
Other versions
CN104699835B (en)
Inventor
王智广
魏少俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201510148638.4A priority Critical patent/CN104699835B/en
Publication of CN104699835A publication Critical patent/CN104699835A/en
Priority to PCT/CN2015/099580 priority patent/WO2016155386A1/en
Application granted granted Critical
Publication of CN104699835B publication Critical patent/CN104699835B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Abstract

The invention provides a method and a device for determining webpages that contain POI (point of interest) data. The method includes: acquiring multiple POI data items from the Internet; crawling multiple webpages that contain address information; normalizing the address information in the POI data and the address information contained in the webpages into longitude and latitude information, respectively; matching the longitude and latitude information of the POI data against that of the webpages; for a POI data item and a webpage with identical longitude and latitude information, searching the webpage for the POI name corresponding to the POI data item to determine whether the webpage contains that POI name; and, if so, determining that the webpage contains the POI data. The method and the device make it possible to subsequently judge the accuracy of collected POI data from the accuracy of the content recorded by the webpages, and thereby provide a reliable basis for collecting accurate POI data from the Internet on a large scale.

Description

Method and device for determining that a webpage contains point of interest (POI) data
Technical field
The present invention relates to the field of computer technology and, in particular, to a method and a device for determining that a webpage contains point of interest (POI) data.
Background art
In a geographic information system, a POI (Point of Interest) can be a house, a retail shop, a mailbox, a bus station, and so on. POI data comprise address information and a POI name.
Traditional POI data collection requires technicians to obtain the longitude and latitude of each POI with precise surveying and mapping instruments and then to annotate it. This approach is time-consuming and labor-intensive, so the amount of POI data collected is small, and a geographic information system can hardly provide a high level of service on the basis of so little POI data.
The Internet also holds a large amount of POI data. If webpages containing POI data could be collected from the Internet and the POI data extracted from the collected webpages for use by a geographic information system, a great deal of manpower and time would be saved. However, the Internet is flooded with false POI data. For example, the content of a blog page may include "original text address: http://xxx.xxx.xxx/xxx"; although it contains the word "address", this address is a network address, that is, a URL (Uniform Resource Locator), and not the geographic address information of POI data. As a result, the proportion of false POI data among the collected POI data is relatively high.
Summary of the invention
To address the shortcomings of the prior art, the present invention proposes a method and a device for determining that a webpage contains point of interest (POI) data, in order to solve the prior-art problem that a large share of the collected POI data is false.
According to one aspect, the present invention provides a method for determining that a webpage contains point of interest (POI) data, comprising:
obtaining multiple POI data items from the Internet;
crawling multiple webpages that contain address information;
normalizing the address information in the multiple POI data items and the address information contained in the multiple webpages into longitude and latitude information, respectively;
matching the longitude and latitude information of the multiple POI data items against the longitude and latitude information of the multiple webpages, on the basis of identical longitude and latitude information;
for a POI data item and a webpage having identical longitude and latitude information, searching the webpage for the POI name corresponding to the POI data item, and determining whether the webpage contains the POI name of the POI data item;
when the webpage contains the POI name of the POI data item, determining that the webpage contains the POI data item.
According to another aspect, the present invention further provides a device for determining that a webpage contains point of interest (POI) data, comprising:
a POI data acquisition module, configured to obtain multiple POI data items from the Internet;
a webpage crawling module, configured to crawl multiple webpages that contain address information;
a longitude and latitude normalization module, configured to normalize the address information in the multiple POI data items and the address information contained in the multiple webpages into longitude and latitude information, respectively;
a longitude and latitude matching module, configured to match the longitude and latitude information of the multiple POI data items against the longitude and latitude information of the multiple webpages, on the basis of identical longitude and latitude information;
a webpage-contains-POI-name determination module, configured to, for a POI data item and a webpage having identical longitude and latitude information, search the webpage for the POI name corresponding to the POI data item and determine whether the webpage contains the POI name of the POI data item;
a webpage-contains-POI-data determination module, configured to determine, when the webpage contains the POI name of the POI data item, that the webpage contains the POI data item.
In the technical solution of the present invention, address information is normalized into longitude and latitude information, which filters out non-geographic address information. Because longitude and latitude are unique, a matching result based on longitude and latitude information is far more accurate than an existing matching result based on text information, which helps avoid subsequently collecting POI data with false address information. Once the longitude and latitude information of a POI data item matches that of a webpage, it is further determined whether the webpage contains the POI name of the POI data item, so that whether the POI data item is contained in that webpage is judged accurately. This makes it possible to subsequently determine the accuracy of the collected POI data from the authority and accuracy of the content of the webpage, and thus provides a reliable basis for collecting large quantities of more accurate POI data from the Internet.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from that description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1a is a schematic flowchart of the method for determining that a webpage contains point of interest (POI) data according to an embodiment of the present invention;
Fig. 1b is a schematic diagram of a webpage containing multiple POI data items according to an embodiment of the present invention;
Fig. 2 is a schematic block diagram of the internal structure of the device for determining that a webpage contains point of interest (POI) data according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of the internal structure of the POI data acquisition module according to an embodiment of the present invention.
Detailed description of embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals denote, throughout, identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and shall not be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may also be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The wording "and/or" used herein includes any one and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as herein, shall not be interpreted in an idealized or overly formal sense.
Fig. 1a is a schematic flowchart of the method for determining that a webpage contains point of interest (POI) data according to the present invention. The flow comprises:
S101: obtain multiple POI data items from the Internet;
S102: crawl multiple webpages that contain address information;
S103: normalize the address information in the multiple POI data items and the address information contained in the multiple webpages into longitude and latitude information, respectively;
S104: match the longitude and latitude information of the multiple POI data items against the longitude and latitude information of the multiple webpages, on the basis of identical longitude and latitude information;
S105: for a POI data item and a webpage having identical longitude and latitude information, search the webpage for the POI name corresponding to the POI data item, and determine whether the webpage contains that POI name;
S106: when the webpage contains the POI name of the POI data item, determine that the webpage contains the POI data item.
In the above method, address information is normalized into longitude and latitude information, which filters out address information that is not a geographic location. Because longitude and latitude are unique, a matching result based on longitude and latitude information is far more accurate than an existing matching result based on text information, which helps avoid subsequently collecting data with false address information. Once the longitude and latitude information of a POI data item matches that of a webpage, it is further determined whether the webpage contains the POI name of the POI data item, so that whether the POI data item is contained in that webpage is judged accurately. This makes it possible to subsequently determine the accuracy of the collected POI data from the authority and accuracy of the content of the webpage, and thus provides a reliable basis for collecting large quantities of more accurate POI data from the Internet.
The method for determining that a webpage contains point of interest (POI) data, whose flow is shown in Fig. 1a, is described in detail below. It comprises the following steps:
S101: obtain multiple POI data from internet.
Specifically, a web crawler program is used to crawl multiple webpages containing POI data from the Internet, and multiple POI data items are then extracted from those webpages. A POI data item comprises address information and a POI name; preferably, it may also comprise contact details, a postal code, webpage tags, and so on.
The inventors found that the Internet contains webpages of the following kind: the content of each such webpage includes one or more POI data items, and the address information in the POI data includes an address keyword such as "address"; moreover, the page structure, the URL format, and the position and format of the POI data within these webpages are regular. In other words, POI data can be extracted from these webpages quickly with a single unified method.
Preferably, the URLs (Uniform Resource Locators) of multiple webpages containing an address keyword such as "address" are crawled from the Internet, and pattern clustering is performed on the crawled URLs, so that URLs with the same structural feature are clustered into the same pattern set.
More preferably, among the many webpages containing an address keyword, the webpages that contain only one POI data item are considered: the URLs of all webpages containing only one POI data item are obtained, and pattern clustering is performed on them, so that URLs with the same structural feature are clustered into the same pattern set.
For example, among the many webpages containing an address keyword, the webpage at http://www.aibang.com/detail/1537772035-1606201508 contains only the POI data item "Epson (China) Co., Ltd.", and the webpage at http://www.aibang.com/detail/152928073-419169481 contains only the POI data item "Beijing Wangfu Integrated Traditional Chinese and Western Medicine Hospital". These two URLs share the structural feature www.aibang.com/detail/*, where * is a wildcard standing for any characters. The two URLs can therefore be clustered into the same pattern set; that is, all URLs in this pattern set have the same structural feature www.aibang.com/detail/*.
More preferably still, among the many webpages containing an address keyword, the webpages that contain multiple POI data items are considered: the URLs of all webpages containing multiple POI data items are obtained, and pattern clustering is performed on them, so that URLs with the same structural feature are clustered into the same pattern set.
For example, the webpage at www.dianping.com/topic/s_c_2_120_r14_x540/p7, shown in Fig. 1b, contains multiple POI data items whose POI names are "boy london", "COACH (Youtang discount store)", "Milan Store (Sanlitun store)", and so on; the webpage at www.dianping.com/topic/s_c_2_120_r14_x540/p6 also contains multiple POI data items. All URLs whose structural feature matches www.dianping.com/topic/* are obtained, where * is a wildcard standing for any characters. Pattern clustering is performed on the obtained URLs, and the URLs in the resulting pattern set all have the structural feature www.dianping.com/topic/*.
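The URL pattern clustering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the description does not say how the "structural feature" is computed, so the sketch assumes it is the host plus the first path segment followed by a wildcard. The function names `url_pattern` and `cluster_urls` and the `keep_segments` parameter are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str, keep_segments: int = 1) -> str:
    """Return host + first `keep_segments` path segments + '*'.

    Assumption: this prefix stands in for the "structural feature" of a URL.
    """
    if "://" not in url:
        url = "http://" + url  # urlsplit only finds the host when a scheme is present
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.netloc] + segments[:keep_segments] + ["*"])


def cluster_urls(urls):
    """Group URLs that share the same pattern into one pattern set."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


urls = [
    "http://www.aibang.com/detail/1537772035-1606201508",
    "http://www.aibang.com/detail/152928073-419169481",
    "www.dianping.com/topic/s_c_2_120_r14_x540/p7",
    "www.dianping.com/topic/s_c_2_120_r14_x540/p6",
]
clusters = cluster_urls(urls)
```

With the four URLs from the examples, `clusters` holds two pattern sets, keyed `www.aibang.com/detail/*` and `www.dianping.com/topic/*`. A production system would likely need a smarter rule for choosing how many path segments to keep.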
From the multiple pattern sets, the pattern sets whose webpages contain POI data are selected, and the multiple webpages containing POI data are extracted from these pattern sets.
Preferably, extracting multiple POI data items from the multiple webpages containing POI data may specifically comprise:
Generating a POI data extraction template for a pattern set on the basis of the page structure features of the webpages containing POI data that correspond to the URLs in that pattern set. Specifically, for each URL in the same pattern set, the POI data extraction template for that pattern set is generated from the format and position of the POI data in the webpage corresponding to the URL.
Extracting multiple POI data items from the multiple webpages containing POI data on the basis of the POI data extraction template. Specifically, for each URL in the same pattern set, multiple POI data items are extracted from the webpage corresponding to the URL according to the format and position of POI data recorded in the generated extraction template.
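A template-driven extraction of this kind might look like the sketch below. The description does not specify how a template is represented, so this example assumes the simplest possible form: one regular expression with named groups per pattern set. The HTML markup, the pattern key `www.example.com/detail/*`, and the names `TEMPLATES` and `extract_pois` are all hypothetical.

```python
import re

# Assumption: one regex "template" per URL pattern set, capturing the
# format and position of a POI name and its address in the page markup.
TEMPLATES = {
    "www.example.com/detail/*": re.compile(
        r'<h1 class="name">(?P<name>[^<]+)</h1>\s*'
        r'<p class="addr">Address:\s*(?P<address>[^<]+)</p>'
    ),
}


def extract_pois(html: str, template: "re.Pattern[str]"):
    """Apply a pattern set's template to one page; return all POI items found."""
    return [
        {"name": m.group("name").strip(), "address": m.group("address").strip()}
        for m in template.finditer(html)
    ]


page = (
    '<h1 class="name">Shop A</h1>\n<p class="addr">Address: 1 Example Road</p>'
    '<h1 class="name">Shop B</h1><p class="addr">Address: 2 Example Road</p>'
)
pois = extract_pois(page, TEMPLATES["www.example.com/detail/*"])
```

Because every page in a pattern set is assumed to share one layout, the same template serves both single-POI and multi-POI pages. Real pages would more plausibly be handled with an HTML parser and structural selectors than with a regular expression.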
S102: crawl the multiple Webpages comprising address information.
Specifically, a web crawler program is used to crawl multiple webpages containing an address keyword from the Internet.
Multiple pieces of text information associated with the address keyword are extracted from the multiple webpages.
Specifically, for a given webpage, the text content of the webpage is extracted, and the text content is searched for address keywords that may introduce address information, such as "address", "located at" or "situated at". The text fragment near each address keyword is extracted and then split according to a preset set of separators and a preset fragment length: for example, the fragment is cut where its distance from the address keyword exceeds a preset threshold and/or where a preset separator (such as a space, comma or full stop) occurs. In the split result, the text between the cut position (for example, the separator) and the address keyword is taken as the text information associated with the address keyword in this webpage.
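The keyword-and-separator extraction above can be sketched as follows. The keyword list, separator set and length threshold are illustrative assumptions (English stand-ins for the keywords in the description; spaces are left out of the separator set because English addresses contain them), and `extract_address_fragments` is a hypothetical name.

```python
import re

ADDRESS_KEYWORDS = ("address", "located at", "situated at")  # assumed keyword list
SEPARATORS = ",.;\n"   # assumed separator set
MAX_FRAGMENT_LEN = 40  # assumed length threshold, in characters


def extract_address_fragments(text, keywords=ADDRESS_KEYWORDS,
                              separators=SEPARATORS, max_len=MAX_FRAGMENT_LEN):
    """Return the text fragment that follows each address keyword.

    A fragment ends at the first separator or after `max_len` characters,
    whichever comes first.
    """
    fragments = []
    for keyword in keywords:
        for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            # Take at most max_len characters after the keyword.
            window = text[match.end():match.end() + max_len].lstrip(" :")
            cut = len(window)
            for i, ch in enumerate(window):
                if ch in separators:
                    cut = i
                    break
            fragment = window[:cut].strip()
            if fragment:
                fragments.append(fragment)
    return fragments


sample = "Our shop. Address: 6 Jiuxianqiao Road, Chaoyang District. Open daily."
fragments = extract_address_fragments(sample)
```

Note that this step still returns a fragment for text such as "original text address: http://xxx.xxx.xxx/xxx"; such non-geographic fragments are only discarded later, when normalization to longitude and latitude fails.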
The address information of each webpage is extracted from the corresponding pieces of text information.
Specifically, for each piece of text information extracted from a webpage, address information is extracted from that text information and taken as the address information of the webpage.
S103: normalize the address information in the multiple POI data items and the address information contained in the multiple webpages into longitude and latitude information, respectively.
A geographic information library is obtained in advance. It contains address information and longitude and latitude information for the provinces, cities, counties (districts), townships, roads and so on of the whole country, together with the correspondence between address information and longitude and latitude information. The address information in the library may include several forms of expression for the same geographic address; for example, "No. 6, Jiuxianqiao Road, Chaoyang District", "No. 6, Jiuxianqiao, Chaoyang, Beijing" and "No. 6, Jiuxianqiao, Chaoyang District" all denote the same geographic address.
Specifically, the address information in the multiple POI data items is normalized into the longitude and latitude information of those POI data items. For example, for the address information in each POI data item, the longitude and latitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the longitude and latitude information of the POI data item.
The address information contained in the multiple webpages is likewise normalized into the longitude and latitude information of those webpages. Preferably, for the address information contained in each webpage, the longitude and latitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the longitude and latitude information of the webpage.
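The normalization step can be sketched as a lookup in a small in-memory table. This is only an illustration: the coordinates are placeholder values rather than surveyed data, a real geographic information library would be far larger and would probably use a geocoding service or fuzzy address parsing, and the names `GEO_LIBRARY`, `_key` and `normalize` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


def _key(address: str) -> str:
    """Canonicalize an address string: lower-case and collapse whitespace."""
    return " ".join(address.lower().split())


# Placeholder coordinates for illustration only, not surveyed values.
_SAME_PLACE: Coord = (39.98, 116.49)
_RAW_LIBRARY: Dict[str, Coord] = {
    "No. 6, Jiuxianqiao Road, Chaoyang District": _SAME_PLACE,
    "No. 6, Jiuxianqiao, Chaoyang, Beijing": _SAME_PLACE,
    "No. 6, Jiuxianqiao, Chaoyang District": _SAME_PLACE,
}
GEO_LIBRARY: Dict[str, Coord] = {_key(k): v for k, v in _RAW_LIBRARY.items()}


def normalize(address: str) -> Optional[Coord]:
    """Map an address expression to coordinates; None if it is not a known place."""
    return GEO_LIBRARY.get(_key(address))


a = normalize("No. 6, Jiuxianqiao Road, Chaoyang District")
b = normalize("no. 6,  jiuxianqiao, chaoyang, beijing")
c = normalize("http://xxx.xxx.xxx/xxx")
```

Here `a` and `b` are the same coordinate pair even though the address strings differ, while `c` is `None`: a URL that merely follows the word "address" has no entry in the library and is filtered out at this step.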
S104: match the longitude and latitude information of the multiple POI data items against the longitude and latitude information of the multiple webpages, on the basis of identical longitude and latitude information.
Specifically, for each POI data item, it is judged whether any of the webpages has longitude and latitude information consistent with that of the POI data item. If so, it is determined that the POI data item matches that webpage, that is, that the POI data item and the webpage have identical longitude and latitude information; otherwise, the POI data item is ignored.
Because longitude and latitude are unique, a matching result based on longitude and latitude information is far more accurate than an existing matching result based on text information, so more accurate POI data can subsequently be collected from the more accurate matching result. Moreover, matching on longitude and latitude information is equivalent to matching on each of the multiple address expressions that correspond to that longitude and latitude, which widens the scope of matching and helps collect more POI data later.
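The coordinate matching of S104 can be sketched as an exact-equality join. The record layout (dictionaries with `name`, `url` and `coord` keys) and the function name `match_by_coordinates` are assumptions made for illustration; the description only requires that the longitude and latitude information be identical.

```python
from collections import defaultdict


def match_by_coordinates(pois, pages):
    """Pair each POI item with every page that has identical coordinates.

    pois:  list of dicts with keys "name" and "coord"
    pages: list of dicts with keys "url" and "coord"
    Items whose address could not be normalized carry coord=None and never match.
    """
    index = defaultdict(list)
    for page in pages:
        if page["coord"] is not None:
            index[page["coord"]].append(page)

    matches = []
    for poi in pois:
        # A POI with no page at the same coordinates is simply ignored.
        for page in index.get(poi["coord"], []):
            matches.append((poi, page))
    return matches


pois = [
    {"name": "Shop A", "coord": (39.98, 116.49)},
    {"name": "Shop B", "coord": (31.23, 121.47)},
    {"name": "Shop C", "coord": None},
]
pages = [
    {"url": "http://example.com/1", "coord": (39.98, 116.49)},
    {"url": "http://example.com/2", "coord": None},
]
matches = match_by_coordinates(pois, pages)
```

Indexing the pages by coordinate first keeps the match linear in the number of records instead of comparing every POI item with every page.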
S105: for a POI data item and a webpage having identical longitude and latitude information, search the webpage for the POI name corresponding to the POI data item, and determine whether the webpage contains that POI name.
Specifically, for a POI data item and a webpage having identical longitude and latitude information, all name information is found in the webpage. For each piece of name information found, it is judged whether it matches the POI name in the POI data item: if so, it is determined that the webpage contains the POI name of the POI data item; otherwise, the POI data item is ignored.
Preferably, for a POI data item and a webpage having identical longitude and latitude information, if the name information in the webpage and the POI name in the POI data item are not literally identical but in fact denote the same POI, the POI name in the POI data item can be regarded as matching the name information in the webpage, and it is thus determined that the webpage contains the POI name of the POI data item.
For example, for a POI data item and a webpage having identical longitude and latitude information, if the POI name in the POI data item is "Qihoo 360" and the name information in the webpage is "Beijing Qihu Technology Co., Ltd.", the webpage can be regarded as containing the POI name of the POI data item.
Preferably, for a POI data item and a webpage having identical longitude and latitude information, when the webpage is found to contain multiple pieces of name information, the text distance between each piece of name information and the address information of the webpage is calculated. The name information with the smallest text distance is determined to be the name information corresponding to the address information in the webpage. Here, the text distance may be the number of characters between the name information and the address information.
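The text-distance rule can be sketched as follows, taking the distance to be the number of characters between a name occurrence and the address occurrence, as the description suggests. The function name `nearest_name` and the sample page text are hypothetical, and the sketch considers only the first occurrence of the address in the page.

```python
def nearest_name(page_text, names, address):
    """Return the name whose occurrence is closest (in characters) to the address.

    Returns None if the address or all of the names are absent from the page.
    """
    addr_start = page_text.find(address)
    if addr_start < 0:
        return None
    addr_end = addr_start + len(address)

    best_name, best_dist = None, None
    for name in names:
        pos = page_text.find(name)
        while pos >= 0:  # examine every occurrence of this name
            end = pos + len(name)
            if end <= addr_start:
                dist = addr_start - end   # name precedes the address
            elif pos >= addr_end:
                dist = pos - addr_end     # name follows the address
            else:
                dist = 0                  # name overlaps the address
            if best_dist is None or dist < best_dist:
                best_name, best_dist = name, dist
            pos = page_text.find(name, pos + 1)
    return best_name


text = ("Shop C. Address: 5 Third Street. Shop B. Address: 9 Other Street. "
        "Shop A. Address: 1 Example Road.")
closest = nearest_name(text, ["Shop A", "Shop B", "Shop C"], "1 Example Road")
```

For this page, `closest` is "Shop A". One caveat of a purely symmetric character count is that a name following an address may be closer than the name the address actually belongs to; an implementation might prefer names that precede the address, which the description does not specify.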
For a POI data item and a webpage having identical longitude and latitude information, the POI name corresponding to the POI data item is compared with the name information that corresponds to the address information in the webpage. When the two are consistent, it is determined that the webpage contains the POI name of the POI data item.
Specifically, for a POI data item and a webpage having identical longitude and latitude information, it is judged whether the POI name corresponding to the POI data item is consistent with the name information that corresponds to the address information in the webpage: if so, it is determined that the webpage contains the POI name of the POI data item; otherwise, it is determined that the webpage does not contain the POI name of the POI data item.
S106: for a POI data item and a webpage having identical longitude and latitude information, when the webpage contains the POI name of the POI data item, determine that the webpage contains the POI data item.
Specifically, for a POI data item and a webpage having identical longitude and latitude information, when it has been determined in step S105 that the webpage contains the POI name of the POI data item, it is determined in this step that the webpage contains the POI data item, that is, that the webpage contains both the POI name and the address information of the POI.
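Steps S105 and S106 together amount to a two-part test, sketched below. The description does not say how two differently written names are recognized as the same POI, so the alias table here is purely an assumption, as are the record layout and the names `ALIASES`, `names_match` and `page_contains_poi`.

```python
# Assumption: a hand-made alias table decides when two different spellings
# denote the same POI; the description does not specify the mechanism.
ALIASES = {
    "qihoo 360": {"beijing qihu technology co., ltd."},
}


def names_match(poi_name, page_name, aliases=ALIASES):
    """True if the two names are equal (ignoring case) or are listed as aliases."""
    a, b = poi_name.strip().lower(), page_name.strip().lower()
    return a == b or b in aliases.get(a, set()) or a in aliases.get(b, set())


def page_contains_poi(poi, page):
    """S105 + S106: same coordinates AND a matching name on the page."""
    if poi["coord"] is None or poi["coord"] != page["coord"]:
        return False
    return any(names_match(poi["name"], name) for name in page["names"])


poi = {"name": "Qihoo 360", "coord": (39.98, 116.49)}
page_ok = {"coord": (39.98, 116.49), "names": ["Beijing Qihu Technology Co., Ltd."]}
page_other_name = {"coord": (39.98, 116.49), "names": ["Some Other Company"]}
page_other_place = {"coord": (31.23, 121.47), "names": ["Qihoo 360"]}
```

Only `page_ok` passes the test: `page_other_name` shares the coordinates but not the name, and `page_other_place` shares the name but not the coordinates.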
On the basis of the above method, the present invention provides a device for determining that a webpage contains point of interest (POI) data. As shown in the schematic block diagram of its internal structure in Fig. 2, the device comprises: a POI data acquisition module 201, a webpage crawling module 202, a longitude and latitude normalization module 203, a longitude and latitude matching module 204, a webpage-contains-POI-name determination module 205, and a webpage-contains-POI-data determination module 206.
The POI data acquisition module 201 is configured to obtain multiple POI data items from the Internet.
The webpage crawling module 202 is configured to crawl multiple webpages that contain address information.
Specifically, the webpage crawling module 202 crawls multiple webpages containing an address keyword from the Internet, extracts the multiple pieces of text information associated with the address keyword in those webpages, and extracts the address information of each webpage from the corresponding text information.
The longitude and latitude normalization module 203 is configured to normalize the address information in the multiple POI data items and the address information contained in the multiple webpages into longitude and latitude information, respectively.
Specifically, the longitude and latitude normalization module 203 normalizes the address information in the multiple POI data items into the longitude and latitude information of those POI data items. Preferably, for the address information in each POI data item, the longitude and latitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the longitude and latitude information of the POI data item. The geographic information library obtained in advance contains address information and longitude and latitude information for the provinces, cities, counties (districts), townships, roads and so on of the whole country, together with the correspondence between address information and longitude and latitude information.
The longitude and latitude normalization module 203 also normalizes the address information contained in the multiple webpages into the longitude and latitude information of those webpages. Preferably, for the address information contained in each webpage, the longitude and latitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the longitude and latitude information of the webpage.
The longitude and latitude matching module 204 is configured to match the longitude and latitude information of the multiple POI data items against the longitude and latitude information of the multiple webpages, on the basis of identical longitude and latitude information. Specifically, for each POI data item, the longitude and latitude matching module 204 judges whether any of the webpages has longitude and latitude information consistent with that of the POI data item. If so, it determines that the POI data item matches that webpage, that is, that the POI data item and the webpage have identical longitude and latitude information; otherwise, it ignores the POI data item.
The webpage-contains-POI-name determination module 205 is configured to, for a POI data item and a webpage having identical longitude and latitude information, search the webpage for the POI name corresponding to the POI data item and determine whether the webpage contains that POI name. Specifically, for a POI data item and a webpage having identical longitude and latitude information, the module 205 finds all name information in the webpage and, for each piece of name information found, judges whether it matches the POI name in the POI data item: if so, it determines that the webpage contains the POI name of the POI data item; otherwise, it ignores the POI data item.
The webpage-contains-POI-data determination module 206 is configured to determine, when the webpage contains the POI name of the POI data item, that the webpage contains the POI data item.
Preferably, the block schematic illustration of the inner structure of POI data acquisition module 201 as shown in Figure 3, comprising: webpage crawls unit 301 and POI data extraction unit 302.
Wherein, webpage crawls unit 301 for crawling multiple webpage comprising POI data from internet.
Particularly, webpage crawls unit 301 from internet, crawls multiple URL corresponding to multiple webpages of comprising address keyword; Carrying out pattern cluster to multiple URL, is that same pattern gathers by the URL cluster with identical architectural feature; Filter out from multiple pattern set and comprise multiple pattern set comprising the webpage of POI data, and extract multiple webpage comprising POI data from this pattern gathers.
POI data extraction unit 302 is for extracting multiple POI data from multiple comprising in the webpage of POI data.
Particularly, POI data extraction unit 302 specifically for based on belong to same pattern gather in the corresponding multiple page structure feature comprising the webpage of POI data of multiple URL, generate and gather corresponding POI data to this pattern and extract template; Extract template based on POI data, extract multiple POI data from multiple comprising the webpage of POI data.
Preferably, as shown in Figure 2, the device of the present invention for determining that a web page contains point of interest (POI) data further comprises an in-page name information determination module 207.
The in-page name information determination module 207 is configured to, when the web page is judged to contain multiple pieces of name information, calculate the text distance between each piece of name information and the address information of the web page, and to take the name information with the smallest text distance as the name information corresponding to the address information in the web page.
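The selection rule used by module 207 can be sketched as follows. The patent does not define "text distance", so this example assumes it is the number of characters separating the first occurrence of a name from the first occurrence of the address in the page text; `text_distance` and `closest_name` are hypothetical names.

```python
from typing import Optional


def text_distance(page_text: str, name: str, address: str) -> Optional[int]:
    """Character gap between the name span and the address span, or None if either is absent."""
    name_pos = page_text.find(name)
    addr_pos = page_text.find(address)
    if name_pos < 0 or addr_pos < 0:
        return None
    name_end = name_pos + len(name)
    addr_end = addr_pos + len(address)
    if name_end <= addr_pos:      # name appears before the address
        return addr_pos - name_end
    if addr_end <= name_pos:      # name appears after the address
        return name_pos - addr_end
    return 0                      # the two spans overlap


def closest_name(page_text: str, names: list[str], address: str) -> Optional[str]:
    """Pick the name with the smallest text distance to the address."""
    scored = []
    for name in names:
        distance = text_distance(page_text, name, address)
        if distance is not None:
            scored.append((distance, name))
    if not scored:
        return None
    return min(scored, key=lambda pair: pair[0])[1]
```

A distance measured in DOM nodes or rendered lines would fit the same selection rule; only `text_distance` would change.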
In this case, the web-page-contains-POI-name determination module 205 is further configured to compare the POI name corresponding to the POI data item with the name information corresponding to the address information in the web page and, when the two are consistent, to determine that the web page contains the POI name of the POI data item.
In the technical solution of the embodiments of the present invention, normalizing address information to latitude and longitude information filters out non-geographic address information, which helps avoid subsequently collecting POI data with false address information. Furthermore, a web page is determined to contain the POI name of a POI data item only after the latitude and longitude information of the POI data item has been matched with the latitude and longitude information of the web page. This helps avoid subsequently collecting POI data whose latitude and longitude information or POI name cannot be matched, and such unmatched POI data is often less accurate. As a result, more accurate POI data can subsequently be collected on the basis of the POI data that the web page is determined to contain.
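The overall flow summarized above (normalize addresses to coordinates, match on coordinates, then confirm by name) can be sketched end to end. The geocoder is passed in as a function because the patent does not name one; the dictionary layouts, the rounding precision, and the plain membership test for names are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Optional

LatLng = tuple[float, float]
Geocoder = Callable[[str], Optional[LatLng]]


def coord_key(latlng: LatLng, places: int = 5) -> LatLng:
    """Round coordinates so that equal locations compare equal (precision is an assumption)."""
    lat, lng = latlng
    return (round(lat, places), round(lng, places))


def pages_containing_pois(pois: list[dict], pages: list[dict], geocode: Geocoder) -> list[tuple[str, str]]:
    """Return (POI name, page URL) pairs for pages judged to contain the POI.

    pois:  [{"name": ..., "address": ...}]
    pages: [{"url": ..., "address": ..., "names": [...]}]
    """
    # Normalize page addresses; addresses that cannot be geocoded are filtered out.
    pages_by_coord: dict[LatLng, list[dict]] = defaultdict(list)
    for page in pages:
        latlng = geocode(page["address"])
        if latlng is not None:
            pages_by_coord[coord_key(latlng)].append(page)

    matches = []
    for poi in pois:
        latlng = geocode(poi["address"])
        if latlng is None:
            continue  # non-geographic address: ignore this POI data item
        # Only pages at the same coordinates are checked for the POI name.
        for page in pages_by_coord.get(coord_key(latlng), []):
            if poi["name"] in page["names"]:
                matches.append((poi["name"], page["url"]))
    return matches
```

Indexing pages by coordinate first means the name comparison runs only on candidate pairs that already agree on location, which mirrors the order of the steps in the description.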
Those skilled in the art will appreciate that the present invention involves devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in general-purpose computers. These devices have computer programs stored in them that are selectively activated or reconfigured. Such a computer program may be stored in a device-readable (for example, computer-readable) storage medium or in any type of medium suitable for storing electronic instructions and coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a computer-readable storage medium includes any medium that stores or transmits information in a form readable by a device (for example, a computer).
Those skilled in the art will appreciate that each block in these structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks in them, can be implemented by computer program instructions. Those skilled in the art will also appreciate that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing means, so that the schemes specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by the processor of the computer or other programmable data processing means.
Those skilled in the art will appreciate that the steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted. Further, steps, measures, and schemes in the prior art that correspond to the various operations, methods, and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.
The above are only some embodiments of the present invention. It should be pointed out that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications shall also fall within the scope of protection of the present invention.
The invention discloses A1, a method for determining that a web page contains point of interest (POI) data, comprising:
obtaining multiple POI data items from the internet;
crawling multiple web pages containing address information;
normalizing the address information in the multiple POI data items and the address information contained in the multiple web pages to latitude and longitude information, respectively;
matching, on the basis of identical latitude and longitude information, the latitude and longitude information of the multiple POI data items against the latitude and longitude information of the multiple web pages;
for a POI data item and a web page having identical latitude and longitude information, searching the web page for the POI name corresponding to the POI data item, and determining whether the web page contains the POI name of the POI data item;
when the web page contains the POI name of the POI data item, determining that the web page contains that POI data item.
A2, the method according to A1, characterized in that the step of obtaining multiple POI data items from the internet further comprises:
crawling multiple webpages containing POI data from the internet;
extracting multiple POI data items from the multiple webpages containing POI data.
A3, the method according to A1 or A2, characterized in that the step of crawling multiple webpages containing POI data from the internet further comprises:
crawling, from the internet, the URLs corresponding to multiple webpages that contain an address keyword;
performing pattern clustering on the URLs, grouping URLs with the same structural features into the same pattern set;
screening out, from the multiple pattern sets, the pattern sets that include webpages containing POI data, and extracting the webpages containing POI data from those pattern sets.
A4, the method according to any one of A1-A3, characterized in that the step of extracting multiple POI data items from the multiple webpages containing POI data further comprises:
generating, based on the page structure features of the webpages containing POI data that correspond to the URLs belonging to one pattern set, a POI data extraction template corresponding to that pattern set;
extracting, based on the POI data extraction template, multiple POI data items from the multiple webpages containing POI data.
A5, the method according to any one of A1-A4, characterized in that the step of crawling multiple web pages containing address information further comprises:
crawling, from the internet, multiple web pages that contain an address keyword;
extracting, from the multiple web pages, multiple pieces of text information associated with the address keyword;
extracting the address information of the corresponding web page from the multiple pieces of text information.
A6, the method according to any one of A1-A5, characterized in that the method further comprises:
when the web page is judged to contain multiple pieces of name information, calculating the text distance between each piece of name information and the address information of the web page;
taking the name information with the smallest text distance as the name information corresponding to the address information in the web page;
wherein the step of searching the web page for the POI name corresponding to the POI data item and determining whether the web page contains the POI name of the POI data item further comprises:
comparing the POI name corresponding to the POI data item with the name information corresponding to the address information in the web page;
when the two are consistent, determining that the web page contains the POI name of the POI data item.
The invention discloses A7, a device for determining that a web page contains point of interest (POI) data, comprising:
a POI data acquisition module, configured to obtain multiple POI data items from the internet;
a web page crawling module, configured to crawl multiple web pages containing address information;
a latitude and longitude information normalization module, configured to normalize the address information in the multiple POI data items and the address information contained in the multiple web pages to latitude and longitude information, respectively;
a latitude and longitude information matching module, configured to match, on the basis of identical latitude and longitude information, the latitude and longitude information of the multiple POI data items against the latitude and longitude information of the multiple web pages;
a web-page-contains-POI-name determination module, configured to, for a POI data item and a web page having identical latitude and longitude information, search the web page for the POI name corresponding to the POI data item and determine whether the web page contains the POI name of the POI data item;
a web-page-contains-POI-data determination module, configured to determine, when the web page contains the POI name of the POI data item, that the web page contains that POI data item.
A8, the device according to A7, characterized in that the POI data acquisition module specifically comprises:
a webpage crawling unit, configured to crawl multiple webpages containing POI data from the internet;
a POI data extraction unit, configured to extract multiple POI data items from the multiple webpages containing POI data.
A9, the device according to A7 or A8, characterized in that
the webpage crawling unit is specifically configured to crawl, from the internet, the URLs corresponding to multiple webpages that contain an address keyword; perform pattern clustering on the URLs, grouping URLs with the same structural features into the same pattern set; and screen out, from the multiple pattern sets, the pattern sets that include webpages containing POI data, extracting the webpages containing POI data from those pattern sets.
A10, the device according to any one of A7-A9, characterized in that
the POI data extraction unit is specifically configured to generate, based on the page structure features of the webpages containing POI data that correspond to the URLs belonging to one pattern set, a POI data extraction template corresponding to that pattern set; and to extract, based on the POI data extraction template, multiple POI data items from the multiple webpages containing POI data.
A11, the device according to any one of A7-A10, characterized in that
the web page crawling module is specifically configured to crawl, from the internet, multiple web pages that contain an address keyword; extract, from the multiple web pages, multiple pieces of text information associated with the address keyword; and extract the address information of the corresponding web page from the multiple pieces of text information.
A12, the device according to any one of A7-A11, characterized by further comprising an in-page name information determination module;
the in-page name information determination module is configured to, when the web page is judged to contain multiple pieces of name information, calculate the text distance between each piece of name information and the address information of the web page, and to take the name information with the smallest text distance as the name information corresponding to the address information in the web page; and
the web-page-contains-POI-name determination module is further configured to compare the POI name corresponding to the POI data item with the name information corresponding to the address information in the web page and, when the two are consistent, to determine that the web page contains the POI name of the POI data item.
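The address extraction described in A5 and A11 (find text associated with an address keyword, then pull the address out of that text) can be sketched as follows. The keyword list, the "keyword, colon, rest of line" layout, and the digit-based plausibility check are all assumptions chosen for illustration; the patent does not specify them.

```python
import re
from typing import Optional

# Hypothetical address keywords; a real list would be tuned to the crawled sites.
ADDRESS_KEYWORDS = ("Address", "地址")

_KEYWORD_PATTERN = re.compile(
    r"(?:%s)\s*[:：]\s*(.+)" % "|".join(re.escape(k) for k in ADDRESS_KEYWORDS)
)


def texts_near_keyword(page_text: str) -> list[str]:
    """Return the text that follows each address keyword, up to the end of its line."""
    return [match.group(1).strip() for match in _KEYWORD_PATTERN.finditer(page_text)]


def extract_address(page_text: str) -> Optional[str]:
    """Pick the first keyword-associated text that looks like a street address."""
    for candidate in texts_near_keyword(page_text):
        if re.search(r"\d", candidate):  # crude check: street addresses usually contain a number
            return candidate
    return None
```

Candidates that fail this check could still be passed to the latitude and longitude normalization step, which would filter out any that do not resolve to a geographic location.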

Claims (10)

1. A method for determining that a web page contains point of interest (POI) data, characterized by comprising:
obtaining multiple POI data items from the internet;
crawling multiple web pages containing address information;
normalizing the address information in the multiple POI data items and the address information contained in the multiple web pages to latitude and longitude information, respectively;
matching, on the basis of identical latitude and longitude information, the latitude and longitude information of the multiple POI data items against the latitude and longitude information of the multiple web pages;
for a POI data item and a web page having identical latitude and longitude information, searching the web page for the POI name corresponding to the POI data item, and determining whether the web page contains the POI name of the POI data item;
when the web page contains the POI name of the POI data item, determining that the web page contains that POI data item.
2. The method according to claim 1, characterized in that the step of obtaining multiple POI data items from the internet further comprises:
crawling multiple webpages containing POI data from the internet;
extracting multiple POI data items from the multiple webpages containing POI data.
3. The method according to any one of claims 1-2, characterized in that the step of crawling multiple webpages containing POI data from the internet further comprises:
crawling, from the internet, the URLs corresponding to multiple webpages that contain an address keyword;
performing pattern clustering on the URLs, grouping URLs with the same structural features into the same pattern set;
screening out, from the multiple pattern sets, the pattern sets that include webpages containing POI data, and extracting the webpages containing POI data from those pattern sets.
4. The method according to any one of claims 1-3, characterized in that the step of extracting multiple POI data items from the multiple webpages containing POI data further comprises:
generating, based on the page structure features of the webpages containing POI data that correspond to the URLs belonging to one pattern set, a POI data extraction template corresponding to that pattern set;
extracting, based on the POI data extraction template, multiple POI data items from the multiple webpages containing POI data.
5. The method according to any one of claims 1-4, characterized in that the step of crawling multiple web pages containing address information further comprises:
crawling, from the internet, multiple web pages that contain an address keyword;
extracting, from the multiple web pages, multiple pieces of text information associated with the address keyword;
extracting the address information of the corresponding web page from the multiple pieces of text information.
6. The method according to any one of claims 1-5, characterized in that the method further comprises:
when the web page is judged to contain multiple pieces of name information, calculating the text distance between each piece of name information and the address information of the web page;
taking the name information with the smallest text distance as the name information corresponding to the address information in the web page;
wherein the step of searching the web page for the POI name corresponding to the POI data item and determining whether the web page contains the POI name of the POI data item further comprises:
comparing the POI name corresponding to the POI data item with the name information corresponding to the address information in the web page;
when the two are consistent, determining that the web page contains the POI name of the POI data item.
7. A device for determining that a web page contains point of interest (POI) data, characterized by comprising:
a POI data acquisition module, configured to obtain multiple POI data items from the internet;
a web page crawling module, configured to crawl multiple web pages containing address information;
a latitude and longitude information normalization module, configured to normalize the address information in the multiple POI data items and the address information contained in the multiple web pages to latitude and longitude information, respectively;
a latitude and longitude information matching module, configured to match, on the basis of identical latitude and longitude information, the latitude and longitude information of the multiple POI data items against the latitude and longitude information of the multiple web pages;
a web-page-contains-POI-name determination module, configured to, for a POI data item and a web page having identical latitude and longitude information, search the web page for the POI name corresponding to the POI data item and determine whether the web page contains the POI name of the POI data item;
a web-page-contains-POI-data determination module, configured to determine, when the web page contains the POI name of the POI data item, that the web page contains that POI data item.
8. The device according to claim 7, characterized in that the POI data acquisition module specifically comprises:
a webpage crawling unit, configured to crawl multiple webpages containing POI data from the internet;
a POI data extraction unit, configured to extract multiple POI data items from the multiple webpages containing POI data.
9. The device according to any one of claims 7-8, characterized in that
the webpage crawling unit is specifically configured to crawl, from the internet, the URLs corresponding to multiple webpages that contain an address keyword; perform pattern clustering on the URLs, grouping URLs with the same structural features into the same pattern set; and screen out, from the multiple pattern sets, the pattern sets that include webpages containing POI data, extracting the webpages containing POI data from those pattern sets.
10. The device according to any one of claims 7-9, characterized in that
the POI data extraction unit is specifically configured to generate, based on the page structure features of the webpages containing POI data that correspond to the URLs belonging to one pattern set, a POI data extraction template corresponding to that pattern set; and to extract, based on the POI data extraction template, multiple POI data items from the multiple webpages containing POI data.
CN201510148638.4A 2015-03-31 2015-03-31 Method and device for determining that a webpage includes point of interest (POI) data Active CN104699835B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510148638.4A CN104699835B (en) 2015-03-31 2015-03-31 Method and device for determining that a webpage includes point of interest (POI) data
PCT/CN2015/099580 WO2016155386A1 (en) 2015-03-31 2015-12-29 Method and device for determining whether webpage comprises point of interest (poi) data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510148638.4A CN104699835B (en) 2015-03-31 2015-03-31 Method and device for determining that a webpage includes point of interest (POI) data

Publications (2)

Publication Number Publication Date
CN104699835A true CN104699835A (en) 2015-06-10
CN104699835B CN104699835B (en) 2016-09-28

Family

ID=53346955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510148638.4A Active CN104699835B (en) 2015-03-31 2015-03-31 Method and device for determining that a webpage includes point of interest (POI) data

Country Status (2)

Country Link
CN (1) CN104699835B (en)
WO (1) WO2016155386A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933171A (en) * 2015-06-30 2015-09-23 百度在线网络技术(北京)有限公司 Method and device for associating data of interest point
CN105117425A (en) * 2015-07-31 2015-12-02 北京奇虎科技有限公司 Method and apparatus for selecting interest point of POI data
CN105138708A (en) * 2015-09-30 2015-12-09 北京奇虎科技有限公司 Method and device for identifying names of points of interest (POI)
CN105160031A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Mining method and device for map point of interest (POI) data
CN105159885A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Point-of-interest name identification method and device
CN105160032A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105243136A (en) * 2015-09-30 2016-01-13 北京奇虎科技有限公司 Method and apparatus for mining point of interest (POI) data in internet
CN105279249A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105279246A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for judging whether webpage contains specified point of interest POI
CN105320752A (en) * 2015-09-30 2016-02-10 北京奇虎科技有限公司 Point of interest data mining method and apparatus
CN105550330A (en) * 2015-12-21 2016-05-04 北京奇虎科技有限公司 Point of interest (POI) information sorting method and system
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN105608112A (en) * 2015-12-10 2016-05-25 北京奇虎科技有限公司 Method and apparatus for measuring quality of map POI data
WO2016155386A1 (en) * 2015-03-31 2016-10-06 北京奇虎科技有限公司 Method and device for determining whether webpage comprises point of interest (poi) data
CN106708952A (en) * 2016-11-25 2017-05-24 北京神州绿盟信息安全科技股份有限公司 Web page clustering method and device
CN112000495B (en) * 2020-10-27 2021-02-12 博泰车联网(南京)有限公司 Method, electronic device and storage medium for point of interest information management

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080040684A1 (en) * 2006-08-14 2008-02-14 Richard Crump Intelligent Pop-Up Window Method and Apparatus
CN102591867A (en) * 2011-01-07 2012-07-18 清华大学 Searching service method based on mobile device position
CN102841920A (en) * 2012-06-30 2012-12-26 北京百度网讯科技有限公司 Method and device for extracting webpage frame information
CN103678629A (en) * 2013-12-19 2014-03-26 北京大学 Search engine method and system sensitive to geographical position

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101963962B (en) * 2009-07-23 2014-02-26 高德软件有限公司 Interest point data association method and device
CN102142003B (en) * 2010-07-30 2013-04-24 华为软件技术有限公司 Method and device for providing point of interest information
CN103514234B (en) * 2012-06-30 2018-10-16 北京百度网讯科技有限公司 A kind of page info extracting method and device
CN104699835B (en) * 2015-03-31 2016-09-28 北京奇虎科技有限公司 For determining that Webpage includes the method and device of point of interest POI data

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016155386A1 (en) * 2015-03-31 2016-10-06 北京奇虎科技有限公司 Method and device for determining whether webpage comprises point of interest (poi) data
CN104933171A (en) * 2015-06-30 2015-09-23 百度在线网络技术(北京)有限公司 Method and device for associating data of interest point
CN105117425A (en) * 2015-07-31 2015-12-02 北京奇虎科技有限公司 Method and apparatus for selecting interest point of POI data
CN105320752B (en) * 2015-09-30 2018-12-07 北京奇虎科技有限公司 A kind of method for digging and device of interest point data
CN105243136B (en) * 2015-09-30 2019-02-19 北京奇虎科技有限公司 A kind of method and apparatus of point of interest POI data in excavation internet
CN105160032A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105243136A (en) * 2015-09-30 2016-01-13 北京奇虎科技有限公司 Method and apparatus for mining point of interest (POI) data in internet
CN105279249A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for determining confidence of point of interest data in website
CN105279246A (en) * 2015-09-30 2016-01-27 北京奇虎科技有限公司 Method and device for judging whether webpage contains specified point of interest POI
CN105320752A (en) * 2015-09-30 2016-02-10 北京奇虎科技有限公司 Point of interest data mining method and apparatus
CN105279249B (en) * 2015-09-30 2019-06-21 北京奇虎科技有限公司 The determination method and device of the confidence level of interest point data in a kind of website
CN105160032B (en) * 2015-09-30 2019-05-31 北京奇虎科技有限公司 The determination method and device of the confidence level of interest point data in a kind of website
CN105159885A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Point-of-interest name identification method and device
CN105160031A (en) * 2015-09-30 2015-12-16 北京奇虎科技有限公司 Mining method and device for map point of interest (POI) data
CN105138708A (en) * 2015-09-30 2015-12-09 北京奇虎科技有限公司 Method and device for identifying names of points of interest (POI)
CN105608112A (en) * 2015-12-10 2016-05-25 北京奇虎科技有限公司 Method and apparatus for measuring quality of map POI data
CN105550169A (en) * 2015-12-11 2016-05-04 北京奇虎科技有限公司 Method and device for identifying point of interest names based on character length
CN105550330A (en) * 2015-12-21 2016-05-04 北京奇虎科技有限公司 Point of interest (POI) information sorting method and system
CN106708952A (en) * 2016-11-25 2017-05-24 北京神州绿盟信息安全科技股份有限公司 Web page clustering method and device
CN106708952B (en) * 2016-11-25 2019-11-19 北京神州绿盟信息安全科技股份有限公司 A kind of Webpage clustering method and device
US11023540B2 (en) 2016-11-25 2021-06-01 NSFOCUS Information Technology Co., Ltd. Web page clustering method and device
CN112000495B (en) * 2020-10-27 2021-02-12 博泰车联网(南京)有限公司 Method, electronic device and storage medium for point of interest information management

Also Published As

Publication number Publication date
WO2016155386A1 (en) 2016-10-06
CN104699835B (en) 2016-09-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220803

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.