CN104699835A - Method and device used for determining webpages including POI (point of interest) data - Google Patents
- Publication number
- CN104699835A, CN201510148638.4A
- Authority
- CN
- China
- Prior art keywords
- webpage
- poi data
- poi
- latitude
- information
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
Abstract
The invention provides a method and a device for determining web pages that include POI (point of interest) data. The method includes: acquiring multiple POI data from the Internet; crawling multiple web pages that include address information; normalizing the address information in the POI data and the address information included in the web pages into latitude and longitude information, respectively; matching the latitude and longitude information of the POI data against that of the web pages; for POI data and a web page with identical latitude and longitude information, searching the web page for the POI name corresponding to the POI data to determine whether the web page includes that POI name; and if so, determining that the web page includes the POI data. The method and the device make it possible to subsequently judge the accuracy of collected POI data from the accuracy of the content recorded by the web pages, and thereby provide a reliable basis for collecting accurate POI data from the Internet on a large scale.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and a device for determining that a web page includes point of interest (POI) data.
Background
In a geographic information system, a POI (Point Of Interest) may be a house, a retail shop, a mailbox, a bus station, and so on. POI data comprises address information and a POI name.
Traditional POI data collection requires technicians to obtain the latitude and longitude of each POI with precise surveying and mapping instruments and then annotate it. This approach is time-consuming and laborious, so the amount of POI data collected is small, and a geographic information system can hardly provide a high level of service from so little POI data.
A large amount of POI data exists on the Internet. If web pages that include POI data could be collected from the Internet and the POI data extracted from them for use by a geographic information system, manpower and time would be saved considerably. However, the Internet is also flooded with false POI data. For example, the content of a blog page may include "original address: http://xxx.xxx.xxx/xxx"; although it contains the word "address", this address is a network address, that is, a URL (Uniform Resource Locator), not the geographic address information found in POI data. As a result, the proportion of false POI data among the collected POI data is high.
Summary of the invention
In view of the shortcomings of the prior art, the present invention proposes a method and a device for determining that a web page includes point of interest (POI) data, in order to solve the prior-art problem that much of the collected POI data is false.
According to one aspect, the present invention provides a method for determining that a web page includes POI data, comprising:
obtaining multiple POI data from the Internet;
crawling multiple web pages that include address information;
normalizing the address information in the multiple POI data and the address information included in the multiple web pages into latitude and longitude information, respectively;
matching the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple web pages, on the basis of identical latitude and longitude information;
for POI data and a web page that have identical latitude and longitude information, searching the web page for the POI name corresponding to the POI data, and determining whether the web page includes the POI name of the POI data;
when the web page includes the POI name of the POI data, determining that the web page includes the POI data.
According to another aspect, the present invention also provides a device for determining that a web page includes POI data, comprising:
a POI data acquisition module, for obtaining multiple POI data from the Internet;
a web page crawling module, for crawling multiple web pages that include address information;
a latitude and longitude normalization module, for normalizing the address information in the multiple POI data and the address information included in the multiple web pages into latitude and longitude information, respectively;
a latitude and longitude matching module, for matching the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple web pages, on the basis of identical latitude and longitude information;
a web-page-includes-POI-name determination module, for searching, for POI data and a web page that have identical latitude and longitude information, the web page for the POI name corresponding to the POI data, and determining whether the web page includes the POI name of the POI data;
a web-page-includes-POI-data determination module, for determining that the web page includes the POI data when the web page includes the POI name of the POI data.
In the technical solution of the present invention, address information is normalized into latitude and longitude information, which filters out non-geographic address information. Because a latitude and longitude pair is unique, a match based on latitude and longitude information is far more accurate than an existing match based on text, which helps avoid subsequently collecting POI data with false address information. Once the latitude and longitude information of the POI data matches that of a web page, the method further determines whether the web page includes the POI name of the POI data, and so judges accurately whether the POI data is included in that web page. This makes it possible to subsequently determine the accuracy of the collected POI data from the authority and accuracy of the content recorded by the web page, and thereby provides a reliable basis for collecting large quantities of more accurate POI data from the Internet.
Additional aspects and advantages of the present invention will be set out in part in the following description; they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1a is a schematic flow chart of the method for determining that a web page includes POI data according to an embodiment of the present invention;
Fig. 1b is a schematic diagram of a web page that includes multiple POI data according to an embodiment of the present invention;
Fig. 2 is a schematic block diagram of the internal structure of the device for determining that a web page includes POI data according to an embodiment of the present invention;
Fig. 3 is a schematic block diagram of the internal structure of the POI data acquisition module according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the" and "said" used herein may also include the plural. It should be further understood that the word "comprising" used in this specification indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The wording "and/or" used herein includes any one of, and all combinations of, one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the field to which the present invention belongs. It should also be understood that terms such as those defined in general-purpose dictionaries are to be understood as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, are not to be interpreted in an idealized or overly formal sense.
Fig. 1a is a schematic flow chart of the method of the present invention for determining that a web page includes POI data. The method comprises:
S101: obtain multiple POI data from the Internet;
S102: crawl multiple web pages that include address information;
S103: normalize the address information in the multiple POI data and the address information included in the multiple web pages into latitude and longitude information, respectively;
S104: match the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple web pages, on the basis of identical latitude and longitude information;
S105: for POI data and a web page that have identical latitude and longitude information, search the web page for the POI name corresponding to the POI data, and determine whether the web page includes the POI name of the POI data;
S106: when the web page includes the POI name of the POI data, determine that the web page includes the POI data.
In the above method, address information is normalized into latitude and longitude information, which filters out address information that is not a geographic location. Because a latitude and longitude pair is unique, a match based on latitude and longitude information is far more accurate than an existing match based on text, which helps avoid subsequently collecting data with false address information. Once the latitude and longitude information of the POI data matches that of a web page, the method further determines whether the web page includes the POI name of the POI data, and so judges accurately whether the POI data is included in that web page. This makes it possible to subsequently determine the accuracy of the collected POI data from the authority and accuracy of the content recorded by the web page, and thereby provides a reliable basis for collecting large quantities of more accurate POI data from the Internet.
The method for determining that a web page includes POI data, whose flow is shown in Fig. 1a, is described in detail below. It comprises the following steps:
S101: Obtain multiple POI data from the Internet.
Specifically, a web-crawler program is used to crawl multiple web pages that include POI data from the Internet, and multiple POI data are then extracted from those web pages. POI data comprises address information and a POI name; preferably, POI data may also include contact details, a postcode, web page tags, and so on.
The inventors found that the Internet contains web pages whose content each includes one or more POI data, in which the address information of the POI data contains an address keyword such as "address". Moreover, the page structure features and URL formats of these web pages, as well as the position and format of the POI data within them, are regular. In other words, POI data can be extracted from these web pages quickly with one unified method.
Preferably, multiple URLs (Uniform Resource Locators) corresponding to multiple web pages that include an address keyword such as "address" may be crawled from the Internet. Pattern clustering is then performed on the crawled URLs, so that URLs with the same structural feature are clustered into the same pattern set.
More preferably, among the many web pages that include an address keyword, the URLs of all web pages that include only one POI data are obtained; pattern clustering is performed on the obtained URLs, and URLs with the same structural feature are clustered into the same pattern set.
For example, among the many web pages that include an address keyword, the web page at http://www.aibang.com/detail/1537772035-1606201508 includes only the POI data "Epson (China) Co., Ltd.", and the web page at http://www.aibang.com/detail/152928073-419169481 includes only the POI data "Beijing Wangfu Hospital of Integrated Traditional Chinese and Western Medicine". These two URLs share the structural feature www.aibang.com/detail/*, where * is a wildcard that stands for any characters. The two URLs can therefore be clustered into the same pattern set; that is, all URLs in this pattern set have the structural feature www.aibang.com/detail/*.
More preferably, among the many web pages that include an address keyword, the URLs of all web pages that include multiple POI data are obtained; pattern clustering is performed on the obtained URLs, and URLs with the same structural feature are clustered into the same pattern set.
For example, the web page at www.dianping.com/topic/s_c_2_120_r14_x540/p7, shown in Fig. 1b, includes multiple POI data whose POI names are "boy london", "COACH (Youtang discount store)", "Milandian (Sanlitun store)", and so on; the web page at www.dianping.com/topic/s_c_2_120_r14_x540/p6 also includes multiple POI data. All URLs whose structural feature matches www.dianping.com/topic/* are obtained, where * is a wildcard that stands for any characters. Pattern clustering is performed on the obtained URLs, and the URLs in the resulting pattern set all have the structural feature www.dianping.com/topic/*.
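The pattern clustering described above can be sketched as follows. This is a minimal illustration, not the patented clustering procedure: it assumes that the "structural feature" of a URL is simply its host plus its first path segment followed by a wildcard, and the names `url_pattern`, `cluster_by_pattern` and the `depth` parameter are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str, depth: int = 1) -> str:
    """Return host + first `depth` path segments + '/*' as the structural feature.

    Assumption: a fixed-depth path prefix is a good enough structural feature.
    """
    # urlsplit only fills in netloc when a scheme or '//' is present,
    # so prepend '//' to scheme-less URLs such as 'www.dianping.com/topic/...'.
    parts = urlsplit(url if "//" in url else "//" + url)
    segments = [s for s in parts.path.split("/") if s]
    return "/".join([parts.netloc] + segments[:depth]) + "/*"


def cluster_by_pattern(urls, depth: int = 1):
    """Group URLs that share the same structural feature into one pattern set."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url, depth)].append(url)
    return dict(clusters)


urls = [
    "http://www.aibang.com/detail/1537772035-1606201508",
    "http://www.aibang.com/detail/152928073-419169481",
    "www.dianping.com/topic/s_c_2_120_r14_x540/p7",
    "www.dianping.com/topic/s_c_2_120_r14_x540/p6",
]
pattern_sets = cluster_by_pattern(urls)
```

With the four example URLs from the text, this yields two pattern sets keyed by `www.aibang.com/detail/*` and `www.dianping.com/topic/*`. A production system would likely need a more robust notion of structure than a fixed path depth.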
From the multiple pattern sets, the pattern sets whose URLs correspond to web pages that include POI data are selected, and the multiple web pages that include POI data are obtained from these pattern sets.
Preferably, extracting multiple POI data from the multiple web pages that include POI data may specifically comprise:
Based on the page structure features of the web pages that correspond to the URLs in the same pattern set, a POI data extraction template corresponding to that pattern set is generated. Specifically, for each URL in the same pattern set, the extraction template for the pattern set is generated according to the format and position of the POI data in the web page corresponding to that URL.
Based on the POI data extraction template, multiple POI data are extracted from the web pages that include POI data. Specifically, for the web page corresponding to each URL in the same pattern set, POI data are extracted from the web page according to the format and position of POI data recorded in the generated extraction template.
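Applying an extraction template can be illustrated with a small sketch. The text does not specify how a template is represented, so the sketch assumes, purely for illustration, that the template for one pattern set has already been generated and is expressed as a regular expression with named groups. The HTML layout, the class names `name` and `addr`, and the function `extract_poi` are all hypothetical.

```python
import re

# Hypothetical template for one pattern set. Assumed page layout:
#   <h2 class="name">NAME</h2> <span class="addr">Address: ADDRESS</span>
TEMPLATE = re.compile(
    r'<h2 class="name">(?P<name>[^<]+)</h2>\s*'
    r'<span class="addr">Address:\s*(?P<address>[^<]+)</span>'
)


def extract_poi(html: str, template: re.Pattern = TEMPLATE):
    """Return every (name, address) record that the template finds in the page."""
    return [
        {"name": m.group("name").strip(), "address": m.group("address").strip()}
        for m in template.finditer(html)
    ]


sample_page = (
    '<h2 class="name">boy london</h2>\n'
    '<span class="addr">Address: Example Road 1</span>\n'
    '<h2 class="name">Some Shop</h2>\n'
    '<span class="addr">Address: Example Road 2</span>\n'
)
records = extract_poi(sample_page)
```

Because every page in a pattern set is assumed to share the same layout, one template can be reused for all URLs in that set. Real pages would normally call for an HTML parser rather than a regular expression.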
S102: Crawl multiple web pages that include address information.
Specifically, a web-crawler program is used to crawl multiple web pages that include an address keyword from the Internet.
Multiple pieces of text information associated with the address keywords are extracted from the multiple web pages.
Specifically, for one web page, the text content of the page is extracted and searched for address keywords that may introduce address information, such as "address", "located at" or "situated at". The text fragment near each address keyword is extracted and then split according to a preset separator and fragment length: for example, the fragment is split where its distance from the address keyword exceeds a set threshold and/or where a set separator (such as a space, comma or full stop) occurs. In the split result, the text between the split point (for example, the separator) and the address keyword is taken as the text information associated with the address keyword in this web page.
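The keyword-driven fragment extraction above might look like the following sketch. The English keywords, the separator set (`|`, `;` and newline) and the 60-character length threshold are assumptions chosen so that the example works on English text; the text itself names spaces, commas and full stops as possible separators. The function name `fragments_near_keywords` is hypothetical.

```python
import re

ADDRESS_KEYWORDS = ("address", "located at", "situated at")  # assumed English stand-ins
SEPARATORS = "|;\n"  # assumed separator set for this example
MAX_LEN = 60         # assumed threshold on the distance from the keyword


def fragments_near_keywords(text, keywords=ADDRESS_KEYWORDS,
                            separators=SEPARATORS, max_len=MAX_LEN):
    """Return the text that follows each address keyword, cut at the first
    separator or at the length threshold, whichever comes first."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    fragments = []
    for match in pattern.finditer(text):
        tail = text[match.end():].lstrip(" :\t")  # drop the colon after the keyword
        window = tail[:max_len]                   # apply the length threshold
        cut = len(window)
        for sep in separators:                    # apply the separator rule
            pos = window.find(sep)
            if pos != -1:
                cut = min(cut, pos)
        fragment = window[:cut].strip()
        if fragment:
            fragments.append(fragment)
    return fragments


page_text = "Contact us | Address: No. 6 Jiuxianqiao Road, Chaoyang District | Tel: 010-0000"
found = fragments_near_keywords(page_text)
```

Note that a blog line such as "original address: http://xxx.xxx.xxx/xxx" would also yield a fragment at this stage; it is only rejected later, when the fragment fails to normalize to a latitude and longitude.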
The address information of the corresponding web page is extracted from the multiple pieces of text information.
Specifically, for each piece of text information extracted from a web page, address information is extracted from that text information and taken as the address information of the web page.
S103: Normalize the address information in the multiple POI data and the address information included in the multiple web pages into latitude and longitude information, respectively.
A geographic information library is obtained in advance. It contains address information and latitude and longitude information for provinces, cities, counties (districts), townships, roads and so on nationwide, together with the correspondence between address information and latitude and longitude information. The address information in the library may include multiple expressions of the same geographic address; for example, "No. 6 Jiuxianqiao Road, Chaoyang District", "No. 6 Jiuxianqiao, Chaoyang, Beijing" and "No. 6 Jiuxianqiao, Chaoyang District" all denote the same geographic address.
Specifically, the address information in the multiple POI data is normalized into the latitude and longitude information of the multiple POI data. For example, for the address information in each POI data, the latitude and longitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the latitude and longitude information of that POI data.
The address information included in the multiple web pages is normalized into the latitude and longitude information of the multiple web pages. Preferably, for the address information included in each web page, the latitude and longitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the latitude and longitude information of that web page.
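A minimal sketch of this normalization step follows, assuming the geographic information library can be modelled as an in-memory dictionary from address strings to coordinates. The coordinate values are placeholders for illustration only, not surveyed positions, and `GEO_LIBRARY` and `to_lat_lon` are hypothetical names.

```python
from typing import Optional, Tuple

LatLon = Tuple[float, float]

# Hypothetical geographic information library. Several expressions of the
# same geographic address map to one coordinate (placeholder values).
GEO_LIBRARY = {
    "No. 6 Jiuxianqiao Road, Chaoyang District": (39.98, 116.49),
    "No. 6 Jiuxianqiao, Chaoyang, Beijing": (39.98, 116.49),
    "No. 6 Jiuxianqiao, Chaoyang District": (39.98, 116.49),
}


def _key(address: str) -> str:
    """Ignore case and whitespace differences when looking up an address."""
    return "".join(address.split()).lower()


_INDEX = {_key(addr): latlon for addr, latlon in GEO_LIBRARY.items()}


def to_lat_lon(address: str) -> Optional[LatLon]:
    """Normalize an address to (latitude, longitude); None if it is not a
    known geographic address (for example, a URL)."""
    return _INDEX.get(_key(address))
```

Anything that is not in the library, such as the network address from the blog example, returns `None` and is thereby filtered out as non-geographic. A real library would need fuzzier lookup than exact string matching.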
S104: Match the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple web pages, on the basis of identical latitude and longitude information.
Specifically, for each POI data, it is judged whether any of the web pages has latitude and longitude information consistent with that of the POI data. If so, the POI data and that web page are determined to match, that is, to have identical latitude and longitude information; otherwise, the POI data is ignored.
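The matching step can be sketched as a dictionary join on the coordinate pair. The record layout (dictionaries with `latlon`, `name` and `url` keys) and the function name `match_by_lat_lon` are assumptions for illustration. Exact equality is safe in this sketch only because both sides are assumed to obtain their coordinates from the same library lookup.

```python
from collections import defaultdict


def match_by_lat_lon(pois, pages):
    """Pair each POI with every web page that has the identical coordinate.

    pois:  iterable of dicts with keys 'name' and 'latlon'
    pages: iterable of dicts with keys 'url' and 'latlon'
    POIs without a matching page are ignored, as are records whose
    address could not be normalized (latlon is None).
    """
    pages_by_coord = defaultdict(list)
    for page in pages:
        if page["latlon"] is not None:
            pages_by_coord[page["latlon"]].append(page)

    pairs = []
    for poi in pois:
        for page in pages_by_coord.get(poi["latlon"], []):
            pairs.append((poi, page))
    return pairs


pois = [
    {"name": "Qihoo 360", "latlon": (39.98, 116.49)},
    {"name": "Unmatched POI", "latlon": (31.23, 121.47)},
    {"name": "Bad address", "latlon": None},
]
pages = [
    {"url": "http://example.com/a", "latlon": (39.98, 116.49)},
    {"url": "http://example.com/b", "latlon": None},
]
matched = match_by_lat_lon(pois, pages)
```

Indexing the pages by coordinate first avoids comparing every POI against every page.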
Because a latitude and longitude pair is unique, a match based on latitude and longitude information is far more accurate than an existing match based on text, so more accurate POI data can subsequently be collected from the more accurate matching result. Moreover, matching on latitude and longitude information is equivalent to matching against each of the multiple address expressions that correspond to that latitude and longitude, which widens the scope of matching and helps collect more POI data later.
S105: For POI data and a web page that have identical latitude and longitude information, search the web page for the POI name corresponding to the POI data, and determine whether the web page includes the POI name of the POI data.
Specifically, for POI data and a web page that have identical latitude and longitude information, all name information is found in the web page. For each piece of name information found, it is judged whether the name information matches the POI name in the POI data: if so, the web page is determined to include the POI name of the POI data; otherwise, the POI data is ignored.
Preferably, for POI data and a web page that have identical latitude and longitude information, if the name information in the web page and the POI name in the POI data are not literally identical but in fact denote the same POI, the POI name in the POI data may be deemed to match the name information in the web page, and the web page is thus determined to include the POI name of the POI data.
For example, for POI data and a web page that have identical latitude and longitude information, if the POI name in the POI data is "Qihoo 360" and the name information in the web page is "Beijing Qihu Technology Co., Ltd.", the web page may be deemed to include the POI name of the POI data.
Preferably, for POI data and a web page that have identical latitude and longitude information, when the web page is judged to include multiple pieces of name information, the text distance between each piece of name information and the address information of the web page is calculated. The name information with the smallest text distance is determined to be the name information that corresponds to the address information in the web page. Here, the text distance may be the number of characters between the name information and the address information.
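The text-distance rule above can be sketched as follows, taking the distance to be the number of characters between a name and the address in the page text, as the text suggests. Locating each string by its first occurrence with `str.find` is a simplifying assumption, and `text_distance` and `nearest_name` are hypothetical names.

```python
from typing import Iterable, Optional


def text_distance(text: str, name: str, address: str) -> Optional[int]:
    """Number of characters between `name` and `address` in `text`,
    using the first occurrence of each; None if either is absent."""
    n, a = text.find(name), text.find(address)
    if n == -1 or a == -1:
        return None
    if n <= a:
        return max(0, a - (n + len(name)))   # name comes before the address
    return max(0, n - (a + len(address)))    # name comes after the address


def nearest_name(text: str, names: Iterable[str], address: str) -> Optional[str]:
    """Pick the candidate name with the smallest text distance to the address."""
    best_name, best_dist = None, None
    for name in names:
        dist = text_distance(text, name, address)
        if dist is not None and (best_dist is None or dist < best_dist):
            best_name, best_dist = name, dist
    return best_name


page_text = ("Qihoo 360, Address: No. 6 Jiuxianqiao Road. "
             "Other tenants in the building include Some Cafe.")
chosen = nearest_name(page_text, ["Some Cafe", "Qihoo 360"], "No. 6 Jiuxianqiao Road")
```

Here "Qihoo 360" is separated from the address by 11 characters, while "Some Cafe" is much farther away, so "Qihoo 360" is taken as the name that belongs to the address.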
For POI data and a web page that have identical latitude and longitude information, the POI name corresponding to the POI data is compared with the name information that corresponds to the address information in the web page. When the two are consistent, the web page is determined to include the POI name of the POI data.
Specifically, for POI data and a web page that have identical latitude and longitude information, it is judged whether the POI name corresponding to the POI data is consistent with the name information that corresponds to the address information in the web page: if so, the web page is determined to include the POI name of the POI data; otherwise, the web page is determined not to include the POI name of the POI data.
S106: For POI data and a web page that have identical latitude and longitude information, when the web page includes the POI name of the POI data, determine that the web page includes the POI data.
Specifically, for POI data and a web page that have identical latitude and longitude information, when step S105 determines that the web page includes the POI name of the POI data, this step determines that the web page includes the POI data, that is, that the web page includes both the POI name and the address information of the POI.
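Steps S105 and S106 together amount to a two-part test, which the following sketch illustrates. The text does not say how two differently written names are recognized as the same POI, so the alias groups used here are a stand-in assumption; the record layout and the names `names_match` and `page_contains_poi` are also hypothetical.

```python
def names_match(poi_name, page_name, alias_groups=()):
    """True if the names are equal (ignoring case and outer whitespace) or
    both belong to one alias group (assumed mechanism, not from the text)."""
    a, b = poi_name.strip().lower(), page_name.strip().lower()
    if a == b:
        return True
    for group in alias_groups:
        lowered = {g.strip().lower() for g in group}
        if a in lowered and b in lowered:
            return True
    return False


def page_contains_poi(poi, page, alias_groups=()):
    """A page includes the POI data when the coordinates are identical (S104)
    and one of the names found on the page matches the POI name (S105/S106)."""
    if poi["latlon"] is None or poi["latlon"] != page["latlon"]:
        return False
    return any(names_match(poi["name"], n, alias_groups) for n in page["names"])


ALIASES = [("Qihoo 360", "Beijing Qihu Technology Co., Ltd.")]
poi = {"name": "Qihoo 360", "latlon": (39.98, 116.49)}
page = {"latlon": (39.98, 116.49), "names": ["Beijing Qihu Technology Co., Ltd."]}
result = page_contains_poi(poi, page, ALIASES)
```

Without the alias group, the same pair is rejected because the names differ literally, which shows why some notion of name equivalence is needed for cases like the "Qihoo 360" example.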
Based on the above method, the present invention also provides a device for determining that a web page includes POI data. The internal structure of the device is shown schematically in Fig. 2 and comprises: a POI data acquisition module 201, a web page crawling module 202, a latitude and longitude normalization module 203, a latitude and longitude matching module 204, a web-page-includes-POI-name determination module 205, and a web-page-includes-POI-data determination module 206.
The POI data acquisition module 201 is used to obtain multiple POI data from the Internet.
The web page crawling module 202 is used to crawl multiple web pages that include address information.
Specifically, the web page crawling module 202 crawls multiple web pages that include an address keyword from the Internet, extracts the multiple pieces of text information associated with the address keywords in those web pages, and extracts the address information of the corresponding web page from the text information.
The latitude and longitude normalization module 203 is used to normalize the address information in the multiple POI data and the address information included in the multiple web pages into latitude and longitude information, respectively.
Specifically, the latitude and longitude normalization module 203 normalizes the address information in the multiple POI data into the latitude and longitude information of the multiple POI data. Preferably, for the address information in each POI data, the latitude and longitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the latitude and longitude information of that POI data. The geographic information library obtained in advance contains address information and latitude and longitude information for provinces, cities, counties (districts), townships, roads and so on nationwide, together with the correspondence between address information and latitude and longitude information.
The latitude and longitude normalization module 203 also normalizes the address information included in the multiple web pages into the latitude and longitude information of the multiple web pages. Preferably, for the address information included in each web page, the latitude and longitude information corresponding to that address information is looked up in the geographic information library obtained in advance, and the result is taken as the latitude and longitude information of that web page.
The latitude and longitude matching module 204 is used to match the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple web pages, on the basis of identical latitude and longitude information. Specifically, for each POI data, the latitude and longitude matching module 204 judges whether any of the web pages has latitude and longitude information consistent with that of the POI data. If so, the POI data and that web page are determined to match, that is, to have identical latitude and longitude information; otherwise, the POI data is ignored.
The web-page-includes-POI-name determination module 205 is used, for POI data and a web page that have identical latitude and longitude information, to search the web page for the POI name corresponding to the POI data and determine whether the web page includes the POI name of the POI data. Specifically, for POI data and a web page that have identical latitude and longitude information, the module 205 finds all name information in the web page; for each piece of name information found, it judges whether the name information matches the POI name in the POI data: if so, the web page is determined to include the POI name of the POI data; otherwise, the POI data is ignored.
The web-page-includes-POI-data determination module 206 is used to determine that the web page includes the POI data when the web page includes the POI name of the POI data.
Preferably, the internal structure of the POI data acquisition module 201 is shown schematically in Fig. 3 and comprises a web page crawling unit 301 and a POI data extraction unit 302.
The web page crawling unit 301 is used to crawl multiple web pages that include POI data from the Internet.
Specifically, the web page crawling unit 301 crawls from the Internet the multiple URLs corresponding to multiple web pages that include an address keyword; performs pattern clustering on the URLs, so that URLs with the same structural feature are clustered into the same pattern set; selects, from the multiple pattern sets, the pattern sets whose URLs correspond to web pages that include POI data; and obtains the multiple web pages that include POI data from these pattern sets.
The POI data extraction unit 302 is used to extract multiple POI data from the multiple web pages that include POI data.
Specifically, the POI data extraction unit 302 generates a POI data extraction template corresponding to a pattern set based on the page structure features of the web pages that correspond to the URLs in that pattern set, and, based on the POI data extraction template, extracts multiple POI data from the multiple web pages that include POI data.
Preferably, as shown in Fig. 2, the device of the present invention for determining that a web page includes POI data further comprises an in-page name information determination module 207.
The in-page name information determination module 207 is used, when the web page is judged to include multiple pieces of name information, to calculate the text distance between each piece of name information and the address information of the web page, and to determine the name information with the smallest text distance to be the name information that corresponds to the address information in the web page.
In this case, the web-page-includes-POI-name determination module 205 is also used to compare the POI name corresponding to the POI data with the name information that corresponds to the address information in the web page, and, when the two are consistent, to determine that the web page includes the POI name of the POI data.
In the technical solution of the embodiments of the present invention, address information is normalized to latitude and longitude information, which filters out non-geographic address information and thus helps avoid subsequently collecting POI data whose address information is false. On the basis that the latitude and longitude information of the POI data matches that of the webpage, it is further determined that the webpage comprises the POI name of the POI data. This helps avoid subsequently collecting POI data whose latitude and longitude information or POI name cannot be matched, and such unmatched POI data is often of low accuracy. As a result, more accurate POI data can subsequently be collected on the basis of the POI data that webpages have been determined to comprise.
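The overall flow (normalize addresses to coordinates, match on identical coordinates, then check the POI name) can be sketched in Python under stated assumptions. The geocoding step is replaced by a small lookup table, since the description does not name a geocoding service; the table contents, the record fields, and the names `normalize_address` and `pages_containing_poi` are all hypothetical.

```python
from collections import defaultdict

# Stand-in geocoder (assumption): a real system would query a geocoding service.
GEOCODE_TABLE = {
    "5 elm road": (39.9042, 116.4074),
    "9 oak street": (31.2304, 121.4737),
}


def normalize_address(address):
    """Return (latitude, longitude), or None for text that is not a known geographic address."""
    return GEOCODE_TABLE.get(address.strip().lower())


def pages_containing_poi(pois, pages):
    """Return (POI name, page URL) pairs whose coordinates are identical and whose page text has the name."""
    pages_by_coord = defaultdict(list)
    for page in pages:
        coord = normalize_address(page["address"])
        if coord is not None:  # non-geographic addresses are filtered out here
            pages_by_coord[coord].append(page)
    matches = []
    for poi in pois:
        coord = normalize_address(poi["address"])
        if coord is None:
            continue
        for page in pages_by_coord.get(coord, []):
            if poi["name"] in page["text"]:  # name lookup within the matched page
                matches.append((poi["name"], page["url"]))
    return matches
```

Indexing the pages by coordinate first means each POI is compared only with pages at the same location, rather than with every crawled page. The sketch matches on exact coordinate equality, which follows the "identical latitude and longitude" wording; whether any tolerance is applied in practice is not stated in the description.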
Those skilled in the art will appreciate that the present invention includes devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in general-purpose computers. These devices have computer programs stored therein that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (e.g., computer-readable) storage medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a computer-readable storage medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
Those skilled in the art will appreciate that each block in these structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks therein, can be implemented by computer program instructions. Those skilled in the art will also appreciate that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing means, so that the solutions specified in one or more blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention are executed by the processor of the computer or other programmable data processing means.
Those skilled in the art will appreciate that the steps, measures, and solutions in the various operations, methods, and flows discussed in the present invention may be alternated, changed, combined, or deleted. Further, other steps, measures, and solutions in the various operations, methods, and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted. Further, steps, measures, and solutions in the prior art that correspond to the various operations, methods, and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined, or deleted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the protection scope of the present invention.
The invention discloses A1, a method for determining that a webpage comprises point of interest (POI) data, comprising:
Obtaining multiple POI data from the internet;
Crawling multiple webpages comprising address information;
Normalizing the address information in the multiple POI data and the address information comprised in the multiple webpages to latitude and longitude information, respectively;
Matching, based on identical latitude and longitude information, the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple webpages;
For POI data and a webpage having identical latitude and longitude information, searching the webpage according to the POI name corresponding to the POI data to determine whether the webpage comprises the POI name of the POI data;
When the webpage comprises the POI name of the POI data, determining that the webpage comprises the POI data.
A2, the method according to claim A1, wherein the step of obtaining multiple POI data from the internet further comprises:
Crawling multiple webpages comprising POI data from the internet;
Extracting multiple POI data from the multiple webpages comprising POI data.
A3, the method according to claim A1 or A2, wherein the step of crawling multiple webpages comprising POI data from the internet further comprises:
Crawling, from the internet, multiple URLs corresponding to multiple webpages comprising address keywords;
Performing pattern clustering on the multiple URLs, clustering URLs having the same structural features into the same pattern set;
Screening out, from the multiple pattern sets, the pattern sets that contain webpages comprising POI data, and extracting the multiple webpages comprising POI data from those pattern sets.
A4, the method according to any one of claims A1-A3, wherein the step of extracting multiple POI data from the multiple webpages comprising POI data further comprises:
Generating a POI data extraction template corresponding to a pattern set, based on the page structure features of the webpages comprising POI data that correspond to the multiple URLs belonging to that pattern set;
Extracting, based on the POI data extraction template, multiple POI data from the multiple webpages comprising POI data.
A5, the method according to any one of claims A1-A4, wherein the step of crawling multiple webpages comprising address information further comprises:
Crawling multiple webpages comprising address keywords from the internet;
Extracting multiple pieces of text information associated with the address keywords in the multiple webpages;
Extracting the address information of the corresponding webpage from the multiple pieces of text information.
A6, the method according to any one of claims A1-A5, wherein the method further comprises:
When the webpage is judged to comprise multiple pieces of name information, calculating the text distance between each piece of name information and the address information of the webpage;
Determining the name information corresponding to the minimum text distance as the name information corresponding to the address information in the webpage;
Wherein the step of searching the webpage according to the POI name corresponding to the POI data to determine whether the webpage comprises the POI name of the POI data further comprises:
Comparing the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage;
When the two are consistent, determining that the webpage comprises the POI name of the POI data.
The invention discloses A7, a device for determining that a webpage comprises point of interest (POI) data, comprising:
A POI data acquisition module, configured to obtain multiple POI data from the internet;
A webpage crawling module, configured to crawl multiple webpages comprising address information;
A latitude and longitude information normalization module, configured to normalize the address information in the multiple POI data and the address information comprised in the multiple webpages to latitude and longitude information, respectively;
A latitude and longitude information matching module, configured to match, based on identical latitude and longitude information, the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple webpages;
A webpage-comprises-POI-name determination module, configured to, for POI data and a webpage having identical latitude and longitude information, search the webpage according to the POI name corresponding to the POI data to determine whether the webpage comprises the POI name of the POI data;
A webpage-comprises-POI-data determination module, configured to determine, when the webpage comprises the POI name of the POI data, that the webpage comprises the POI data.
A8, the device according to claim A7, wherein the POI data acquisition module specifically comprises:
A webpage crawling unit, configured to crawl multiple webpages comprising POI data from the internet;
A POI data extraction unit, configured to extract multiple POI data from the multiple webpages comprising POI data.
A9, the device according to claim A7 or A8, wherein
The webpage crawling unit is specifically configured to crawl, from the internet, multiple URLs corresponding to multiple webpages comprising address keywords; perform pattern clustering on the multiple URLs, clustering URLs having the same structural features into the same pattern set; and screen out, from the multiple pattern sets, the pattern sets that contain webpages comprising POI data, and extract the multiple webpages comprising POI data from those pattern sets.
A10, the device according to any one of claims A7-A9, wherein
The POI data extraction unit is specifically configured to generate a POI data extraction template corresponding to a pattern set, based on the page structure features of the webpages comprising POI data that correspond to the multiple URLs belonging to that pattern set; and to extract, based on the POI data extraction template, multiple POI data from the multiple webpages comprising POI data.
A11, the device according to any one of claims A7-A10, wherein
The webpage crawling module is specifically configured to crawl multiple webpages comprising address keywords from the internet; extract multiple pieces of text information associated with the address keywords in the multiple webpages; and extract the address information of the corresponding webpage from the multiple pieces of text information.
A12, the device according to any one of claims A7-A11, further comprising: an in-webpage name information determination module;
The in-webpage name information determination module is configured to, when the webpage is judged to comprise multiple pieces of name information, calculate the text distance between each piece of name information and the address information of the webpage, and to determine the name information corresponding to the minimum text distance as the name information corresponding to the address information in the webpage; and
The webpage-comprises-POI-name determination module is further configured to compare the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage and, when the two are consistent, to determine that the webpage comprises the POI name of the POI data.
Claims (10)
1. A method for determining that a webpage comprises point of interest (POI) data, characterized by comprising:
Obtaining multiple POI data from the internet;
Crawling multiple webpages comprising address information;
Normalizing the address information in the multiple POI data and the address information comprised in the multiple webpages to latitude and longitude information, respectively;
Matching, based on identical latitude and longitude information, the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple webpages;
For POI data and a webpage having identical latitude and longitude information, searching the webpage according to the POI name corresponding to the POI data to determine whether the webpage comprises the POI name of the POI data;
When the webpage comprises the POI name of the POI data, determining that the webpage comprises the POI data.
2. The method according to claim 1, wherein the step of obtaining multiple POI data from the internet further comprises:
Crawling multiple webpages comprising POI data from the internet;
Extracting multiple POI data from the multiple webpages comprising POI data.
3. The method according to claim 1 or 2, wherein the step of crawling multiple webpages comprising POI data from the internet further comprises:
Crawling, from the internet, multiple URLs corresponding to multiple webpages comprising address keywords;
Performing pattern clustering on the multiple URLs, clustering URLs having the same structural features into the same pattern set;
Screening out, from the multiple pattern sets, the pattern sets that contain webpages comprising POI data, and extracting the multiple webpages comprising POI data from those pattern sets.
4. The method according to any one of claims 1-3, wherein the step of extracting multiple POI data from the multiple webpages comprising POI data further comprises:
Generating a POI data extraction template corresponding to a pattern set, based on the page structure features of the webpages comprising POI data that correspond to the multiple URLs belonging to that pattern set;
Extracting, based on the POI data extraction template, multiple POI data from the multiple webpages comprising POI data.
5. The method according to any one of claims 1-4, wherein the step of crawling multiple webpages comprising address information further comprises:
Crawling multiple webpages comprising address keywords from the internet;
Extracting multiple pieces of text information associated with the address keywords in the multiple webpages;
Extracting the address information of the corresponding webpage from the multiple pieces of text information.
6. The method according to any one of claims 1-5, wherein the method further comprises:
When the webpage is judged to comprise multiple pieces of name information, calculating the text distance between each piece of name information and the address information of the webpage;
Determining the name information corresponding to the minimum text distance as the name information corresponding to the address information in the webpage;
Wherein the step of searching the webpage according to the POI name corresponding to the POI data to determine whether the webpage comprises the POI name of the POI data further comprises:
Comparing the POI name corresponding to the POI data with the name information corresponding to the address information in the webpage;
When the two are consistent, determining that the webpage comprises the POI name of the POI data.
7. A device for determining that a webpage comprises point of interest (POI) data, characterized by comprising:
A POI data acquisition module, configured to obtain multiple POI data from the internet;
A webpage crawling module, configured to crawl multiple webpages comprising address information;
A latitude and longitude information normalization module, configured to normalize the address information in the multiple POI data and the address information comprised in the multiple webpages to latitude and longitude information, respectively;
A latitude and longitude information matching module, configured to match, based on identical latitude and longitude information, the latitude and longitude information of the multiple POI data against the latitude and longitude information of the multiple webpages;
A webpage-comprises-POI-name determination module, configured to, for POI data and a webpage having identical latitude and longitude information, search the webpage according to the POI name corresponding to the POI data to determine whether the webpage comprises the POI name of the POI data;
A webpage-comprises-POI-data determination module, configured to determine, when the webpage comprises the POI name of the POI data, that the webpage comprises the POI data.
8. The device according to claim 7, wherein the POI data acquisition module specifically comprises:
A webpage crawling unit, configured to crawl multiple webpages comprising POI data from the internet;
A POI data extraction unit, configured to extract multiple POI data from the multiple webpages comprising POI data.
9. The device according to claim 7 or 8, wherein
The webpage crawling unit is specifically configured to crawl, from the internet, multiple URLs corresponding to multiple webpages comprising address keywords; perform pattern clustering on the multiple URLs, clustering URLs having the same structural features into the same pattern set; and screen out, from the multiple pattern sets, the pattern sets that contain webpages comprising POI data, and extract the multiple webpages comprising POI data from those pattern sets.
10. The device according to any one of claims 7-9, wherein
The POI data extraction unit is specifically configured to generate a POI data extraction template corresponding to a pattern set, based on the page structure features of the webpages comprising POI data that correspond to the multiple URLs belonging to that pattern set; and to extract, based on the POI data extraction template, multiple POI data from the multiple webpages comprising POI data.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510148638.4A CN104699835B (en) | 2015-03-31 | 2015-03-31 | Method and device for determining that a webpage includes point of interest (POI) data |
PCT/CN2015/099580 WO2016155386A1 (en) | 2015-03-31 | 2015-12-29 | Method and device for determining whether webpage comprises point of interest (poi) data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510148638.4A CN104699835B (en) | 2015-03-31 | 2015-03-31 | Method and device for determining that a webpage includes point of interest (POI) data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104699835A (en) | 2015-06-10 |
CN104699835B (en) | 2016-09-28 |
Family
ID=53346955
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510148638.4A Active CN104699835B (en) | 2015-03-31 | 2015-03-31 | Method and device for determining that a webpage includes point of interest (POI) data |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104699835B (en) |
WO (1) | WO2016155386A1 (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933171A (en) * | 2015-06-30 | 2015-09-23 | 百度在线网络技术(北京)有限公司 | Method and device for associating data of interest point |
CN105117425A (en) * | 2015-07-31 | 2015-12-02 | 北京奇虎科技有限公司 | Method and apparatus for selecting interest point of POI data |
CN105138708A (en) * | 2015-09-30 | 2015-12-09 | 北京奇虎科技有限公司 | Method and device for identifying names of points of interest (POI) |
CN105160031A (en) * | 2015-09-30 | 2015-12-16 | 北京奇虎科技有限公司 | Mining method and device for map point of interest (POI) data |
CN105159885A (en) * | 2015-09-30 | 2015-12-16 | 北京奇虎科技有限公司 | Point-of-interest name identification method and device |
CN105160032A (en) * | 2015-09-30 | 2015-12-16 | 北京奇虎科技有限公司 | Method and device for determining confidence of point of interest data in website |
CN105243136A (en) * | 2015-09-30 | 2016-01-13 | 北京奇虎科技有限公司 | Method and apparatus for mining point of interest (POI) data in internet |
CN105279249A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Method and device for determining confidence of point of interest data in website |
CN105279246A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Method and device for judging whether webpage contains specified point of interest POI |
CN105320752A (en) * | 2015-09-30 | 2016-02-10 | 北京奇虎科技有限公司 | Point of interest data mining method and apparatus |
CN105550330A (en) * | 2015-12-21 | 2016-05-04 | 北京奇虎科技有限公司 | Point of interest (POI) information sorting method and system |
CN105550169A (en) * | 2015-12-11 | 2016-05-04 | 北京奇虎科技有限公司 | Method and device for identifying point of interest names based on character length |
CN105608112A (en) * | 2015-12-10 | 2016-05-25 | 北京奇虎科技有限公司 | Method and apparatus for measuring quality of map POI data |
WO2016155386A1 (en) * | 2015-03-31 | 2016-10-06 | 北京奇虎科技有限公司 | Method and device for determining whether webpage comprises point of interest (poi) data |
CN106708952A (en) * | 2016-11-25 | 2017-05-24 | 北京神州绿盟信息安全科技股份有限公司 | Web page clustering method and device |
CN112000495B (en) * | 2020-10-27 | 2021-02-12 | 博泰车联网(南京)有限公司 | Method, electronic device and storage medium for point of interest information management |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080040684A1 (en) * | 2006-08-14 | 2008-02-14 | Richard Crump | Intelligent Pop-Up Window Method and Apparatus |
CN102591867A (en) * | 2011-01-07 | 2012-07-18 | 清华大学 | Searching service method based on mobile device position |
CN102841920A (en) * | 2012-06-30 | 2012-12-26 | 北京百度网讯科技有限公司 | Method and device for extracting webpage frame information |
CN103678629A (en) * | 2013-12-19 | 2014-03-26 | 北京大学 | Search engine method and system sensitive to geographical position |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101963962B (en) * | 2009-07-23 | 2014-02-26 | 高德软件有限公司 | Interest point data association method and device |
CN102142003B (en) * | 2010-07-30 | 2013-04-24 | 华为软件技术有限公司 | Method and device for providing point of interest information |
CN103514234B (en) * | 2012-06-30 | 2018-10-16 | 北京百度网讯科技有限公司 | A kind of page info extracting method and device |
CN104699835B (en) * | 2015-03-31 | 2016-09-28 | 北京奇虎科技有限公司 | Method and device for determining that a webpage includes point of interest (POI) data |
2015
- 2015-03-31: CN application CN201510148638.4A, patent CN104699835B (status: Active)
- 2015-12-29: WO application PCT/CN2015/099580, publication WO2016155386A1 (status: Application Filing)
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2016155386A1 (en) * | 2015-03-31 | 2016-10-06 | 北京奇虎科技有限公司 | Method and device for determining whether webpage comprises point of interest (poi) data |
CN104933171A (en) * | 2015-06-30 | 2015-09-23 | 百度在线网络技术(北京)有限公司 | Method and device for associating data of interest point |
CN105117425A (en) * | 2015-07-31 | 2015-12-02 | 北京奇虎科技有限公司 | Method and apparatus for selecting interest point of POI data |
CN105320752B (en) * | 2015-09-30 | 2018-12-07 | 北京奇虎科技有限公司 | A kind of method for digging and device of interest point data |
CN105243136B (en) * | 2015-09-30 | 2019-02-19 | 北京奇虎科技有限公司 | A kind of method and apparatus of point of interest POI data in excavation internet |
CN105160032A (en) * | 2015-09-30 | 2015-12-16 | 北京奇虎科技有限公司 | Method and device for determining confidence of point of interest data in website |
CN105243136A (en) * | 2015-09-30 | 2016-01-13 | 北京奇虎科技有限公司 | Method and apparatus for mining point of interest (POI) data in internet |
CN105279249A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Method and device for determining confidence of point of interest data in website |
CN105279246A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Method and device for judging whether webpage contains specified point of interest POI |
CN105320752A (en) * | 2015-09-30 | 2016-02-10 | 北京奇虎科技有限公司 | Point of interest data mining method and apparatus |
CN105279249B (en) * | 2015-09-30 | 2019-06-21 | 北京奇虎科技有限公司 | The determination method and device of the confidence level of interest point data in a kind of website |
CN105160032B (en) * | 2015-09-30 | 2019-05-31 | 北京奇虎科技有限公司 | The determination method and device of the confidence level of interest point data in a kind of website |
CN105159885A (en) * | 2015-09-30 | 2015-12-16 | 北京奇虎科技有限公司 | Point-of-interest name identification method and device |
CN105160031A (en) * | 2015-09-30 | 2015-12-16 | 北京奇虎科技有限公司 | Mining method and device for map point of interest (POI) data |
CN105138708A (en) * | 2015-09-30 | 2015-12-09 | 北京奇虎科技有限公司 | Method and device for identifying names of points of interest (POI) |
CN105608112A (en) * | 2015-12-10 | 2016-05-25 | 北京奇虎科技有限公司 | Method and apparatus for measuring quality of map POI data |
CN105550169A (en) * | 2015-12-11 | 2016-05-04 | 北京奇虎科技有限公司 | Method and device for identifying point of interest names based on character length |
CN105550330A (en) * | 2015-12-21 | 2016-05-04 | 北京奇虎科技有限公司 | Point of interest (POI) information sorting method and system |
CN106708952A (en) * | 2016-11-25 | 2017-05-24 | 北京神州绿盟信息安全科技股份有限公司 | Web page clustering method and device |
CN106708952B (en) * | 2016-11-25 | 2019-11-19 | 北京神州绿盟信息安全科技股份有限公司 | A kind of Webpage clustering method and device |
US11023540B2 (en) | 2016-11-25 | 2021-06-01 | NSFOCUS Information Technology Co., Ltd. | Web page clustering method and device |
CN112000495B (en) * | 2020-10-27 | 2021-02-12 | 博泰车联网(南京)有限公司 | Method, electronic device and storage medium for point of interest information management |
Also Published As
Publication number | Publication date |
---|---|
WO2016155386A1 (en) | 2016-10-06 |
CN104699835B (en) | 2016-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104699835A (en) | Method and device used for determining webpages including POI (point of interest) data | |
CN110020433B (en) | Industrial and commercial high-management name disambiguation method based on enterprise incidence relation | |
US11698261B2 (en) | Method, apparatus, computer device and storage medium for determining POI alias | |
CN103514234B (en) | A kind of page info extracting method and device | |
CN103294781B (en) | A kind of method and apparatus for processing page data | |
CN104899243A (en) | Method and apparatus for detecting accuracy of POI (Point of Interest) data | |
CN101299217B (en) | Method, apparatus and system for processing map information | |
CN105069076A (en) | Method and apparatus for determining address information in home page of official website | |
CN109977287B (en) | Method for judging identity of real estate data of different information sources | |
CN108304423A (en) | A kind of information identifying method and device | |
CN109947881B (en) | POI weight judging method and device, mobile terminal and computer readable storage medium | |
CN103853738A (en) | Identification method for webpage information related region | |
CN102841920A (en) | Method and device for extracting webpage frame information | |
CN109492066B (en) | Method, device, equipment and storage medium for determining branch names of points of interest | |
CN112287566A (en) | Automatic driving scene library generation method and system and electronic equipment | |
CN104317909A (en) | Method and device for verifying data of points of interest | |
CN107463711A (en) | A kind of tag match method and device of data | |
CN102646124A (en) | Method for automatically identifying address information | |
CN101630315B (en) | Quick retrieval method and system | |
CN104537105A (en) | Automatic network physical landmark excavating method based on Web maps | |
CN108984640A (en) | A kind of geography information acquisition methods excavated based on web data | |
CN105528357A (en) | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures | |
CN102654861A (en) | Method and system for calculating webpage extraction accuracy | |
CN105159885A (en) | Point-of-interest name identification method and device | |
CN107577744A (en) | Nonstandard Address automatic matching model, matching process and method for establishing model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220803; Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015; Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park); Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.; Patentee before: Qizhi software (Beijing) Co.,Ltd. |