CN105160031A - Mining method and device for map point of interest (POI) data - Google Patents

Info

Publication number
CN105160031A
Authority
CN
China
Prior art keywords
poi
poi data
data
website
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510642102.8A
Other languages
Chinese (zh)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510642102.8A
Publication of CN105160031A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a mining method and device for map point of interest (POI) data. The method comprises steps of mining POI data in a POI data providing website so as to obtain a POI data set; extracting and mining geographical location data included in one or a plurality of government websites; and verifying the correctness of the POI data in the POI data set according to the geographical location data extracted and mined from one or a plurality of government websites. The mining method combines the characteristic of 'high data accuracy but high mining difficulty' of the government website with the characteristic of 'low data mining difficulty but low data accuracy' of the POI data providing website, takes the POI data mined from the POI data providing website as initial POI data, and takes the geographical location data included in the government website as standard data, thereby realizing high efficiency, high quality and high yield of the mining method for the map POI data, and solving the problem that the POI data mined from the internet includes a lot of dirty data, wrong data and invalid data.

Description

A mining method and device for map point of interest (POI) data
Technical field
The present invention relates to the technical field of data mining, and in particular to a mining method and device for map point of interest (POI) data.
Background
A POI (point of interest) is a specific geographic location that a user is interested in or that is of practical use to the user. In a geographic information system, a POI can be a house, a shop, a mailbox, a bus station, and so on.
Traditional geographic information collection requires surveyors to use precision surveying instruments to obtain the longitude and latitude of each point of interest and then annotate it, which is time-consuming and laborious. Because a large amount of POI data already exists on the Internet, mining that data from the Internet could save considerable manpower and time.
However, POI data on the Internet is highly varied and contains a large amount of dirty, invalid, and erroneous data. To ensure the accuracy of POI data, the POI data mined from the Internet needs further processing.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a mining method for map point of interest (POI) data, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a mining method for map point of interest (POI) data is provided, the method comprising:
Mining POI data from a POI data providing website to obtain a POI data set;
Extracting the geographic location data contained in one or more government websites;
Verifying the correctness of the POI data in the POI data set using the geographic location data extracted from the one or more government websites.
Optionally, each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information;
Extracting the geographic location data contained in the one or more government websites, and verifying the correctness of the POI data in the POI data set using the geographic location data extracted from the one or more government websites, comprises:
First mining address information from the government websites;
For each piece of address information, determining whether the POI data set contains a POI address identical to that address information;
If so, extracting the corresponding name information from the source page of that address information;
Comparing that name information with the POI name corresponding to the POI address identical to that address information; if they are the same, determining that the corresponding POI data is correct; if they differ, determining that the corresponding POI data is wrong.
Optionally, after determining that the corresponding POI data is wrong, the method further comprises:
Replacing the POI name in the corresponding POI data with that name information, and taking the corresponding POI data after replacement as correct POI data.
Optionally, before mining the POI data from the POI data providing website, the method further comprises:
Mining a plurality of web pages that contain POI-related keywords;
Clustering the web pages according to the URL format of the plurality of web pages;
Selecting the clusters that contain more valid POI data as POI data providing websites.
Optionally, mining the POI data from the POI data providing website to obtain the POI data set comprises:
For a POI data providing website, formulating a template for mining POI data according to the structural features of the web pages in that website;
Applying the template to all web pages in that POI data providing website and mining the POI data in that website to obtain the POI data set.
Optionally, before extracting the geographic location data contained in the one or more government websites, the method further comprises: mining government websites whose site suffix is ".gov.cn".
Optionally, mining address information from the government websites comprises:
Creating an address database, the address database comprising address data for provinces, cities, counties (districts), townships, and roads nationwide;
Performing word segmentation on the web page content of the government websites;
For a text segment in a web page, if all of the sub-words obtained by word segmentation hit the address database, mining that text segment as address information.
Optionally, for each piece of address information, determining whether the POI data set contains a POI address identical to that address information comprises:
Resolving the longitude and latitude of that address information;
Resolving the longitude and latitude of the POI addresses in the POI data set;
Comparing the longitude and latitude of that address information with the longitude and latitude of the POI addresses;
If there is a POI address whose longitude and latitude are identical to those of that address information, determining that the POI data set contains a POI address identical to that address information.
Optionally, extracting the corresponding name information from the source page of that address information comprises:
If the source page of that address information contains multiple pieces of name information whose corresponding longitude and latitude are identical, calculating the text distance within the web page between that address information and each piece of name information, and taking the name information with the smallest text distance to that address information as the corresponding name information.
According to another aspect of the present invention, a mining device for map point of interest (POI) data is provided, the device comprising:
A mining unit, adapted to mine POI data from a POI data providing website to obtain a POI data set;
A verification unit, adapted to extract the geographic location data contained in one or more government websites, and to verify the correctness of the POI data in the POI data set using the geographic location data extracted from the one or more government websites.
Optionally, each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information;
The verification unit is adapted to first mine address information from the government websites; for each piece of address information, determine whether the POI data set contains a POI address identical to that address information; if so, extract the corresponding name information from the source page of that address information; and compare that name information with the POI name corresponding to the POI address identical to that address information; if they are the same, determine that the corresponding POI data is correct; if they differ, determine that the corresponding POI data is wrong.
Optionally, the verification unit is further adapted, after determining that the corresponding POI data is wrong, to replace the POI name in the corresponding POI data with that name information and take the corresponding POI data after replacement as correct POI data.
Optionally, the mining unit is further adapted to mine a plurality of web pages that contain POI-related keywords; cluster the web pages according to the URL format of the plurality of web pages; and select the clusters that contain more valid POI data as POI data providing websites.
Optionally, the mining unit is adapted, for a POI data providing website, to formulate a template for mining POI data according to the structural features of the web pages in that website; and to apply the template to all web pages in that POI data providing website and mine the POI data in that website to obtain the POI data set.
Optionally, the verification unit is further adapted, before extracting the geographic location data contained in the one or more government websites, to mine government websites whose site suffix is ".gov.cn".
Optionally, the verification unit is adapted to create an address database comprising address data for provinces, cities, counties (districts), townships, and roads nationwide; to perform word segmentation on the web page content of the government websites; and, for a text segment in a web page, if all of the sub-words obtained by word segmentation hit the address database, to mine that text segment as address information.
Optionally, the verification unit is adapted, for each piece of address information, to resolve the longitude and latitude of that address information; resolve the longitude and latitude of the POI addresses in the POI data set; compare the longitude and latitude of that address information with the longitude and latitude of the POI addresses; and, if there is a POI address whose longitude and latitude are identical to those of that address information, determine that the POI data set contains a POI address identical to that address information.
Optionally, the verification unit is adapted, for a piece of address information whose source page contains multiple pieces of name information whose corresponding longitude and latitude are identical, to calculate the text distance within the web page between that address information and each piece of name information, and to take the name information with the smallest text distance to that address information as the name information corresponding to that address information.
As can be seen from the above, the technical solution provided by the present invention uses geographic location data extracted from one or more government websites to verify the correctness of POI data mined from POI data providing websites, thereby filtering the POI data so that the POI data finally obtained has higher accuracy. The solution combines the characteristic of government websites, "high data accuracy but high mining difficulty", with the characteristic of POI data providing websites, "large data volume and low mining difficulty but low data accuracy". It takes the POI data mined from POI data providing websites as the initial POI data and the geographic location data contained in government websites as the reference data. This makes the mining of map POI data efficient, high in quality, and high in yield, and overcomes the problem in the prior art that POI data mined from the Internet contains a large amount of dirty, erroneous, and invalid data, which is of great significance.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Figure 1A shows a partial schematic diagram of a web page in a government website according to an embodiment of the present invention;
Figure 1B shows a partial schematic diagram of a web page in a government website according to another embodiment of the present invention;
Fig. 2A shows a partial schematic diagram of a web page on a POI data providing website according to an embodiment of the present invention;
Fig. 2B shows a partial schematic diagram of a web page on a POI data providing website according to another embodiment of the present invention;
Fig. 3 shows a flowchart of a mining method for map point of interest (POI) data according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of a mining device for map point of interest (POI) data according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
In general, the information provided by government websites is more authoritative, and the geographic location data they provide is more accurate, but the geographic location data contained in government websites is harder to mine.
For example, Figure 1A shows a partial schematic diagram of a web page in a government website according to an embodiment of the present invention. Specifically, it shows the part of the page http://www.bjedu.gov.cn/publish/portal27/tab1805/info35658.htm on the government website of the Beijing Municipal Education Commission that contains geographic location data. Figure 1B shows a partial schematic diagram of a web page in a government website according to another embodiment of the present invention. Specifically, it shows the part of the page http://www.bjguahao.gov.cn/comm/list-2-3-0-1.html on the government website of the Beijing unified appointment registration platform that contains geographic location data. Comparing Figure 1A and Figure 1B shows that the format and position of geographic location data within a page differ completely from one government website to another and follow no rule. Therefore, no unified method can be used to mine geographic location data directly from government websites as POI data. In other words, because government websites have the characteristic of "high data accuracy but high mining difficulty", a large amount of correct POI data cannot be obtained directly by mining government websites.
The Internet also contains a class of POI data providing websites. For example, some yellow pages websites provide a large amount of company-related POI data, and some search service websites provide a large amount of service-related POI data. Any website that can provide a large amount of POI data about companies, enterprises, restaurants, and the like can serve as a POI data providing website.
For example, Fig. 2A shows a partial schematic diagram of a web page on a POI data providing website according to an embodiment of the present invention. Specifically, it shows part of the page http://www.aibang.com/detail/691925157-431964635 on the POI data providing website "Aibang". Fig. 2B shows a partial schematic diagram of a web page on a POI data providing website according to another embodiment of the present invention. Specifically, it shows part of another page on "Aibang", http://www.aibang.com/detail/1816542021-944045717. Comparing Fig. 2A and Fig. 2B shows that the URL formats of the two pages are similar and that the format and position of the POI data in the pages follow the same rule. Therefore, a unified method can be used to mine a large amount of POI data directly from such POI data providing websites. However, because POI data providing websites are not authoritative, the accuracy of the mined POI data cannot be guaranteed. In other words, because POI data providing websites have the characteristic of "large data volume and low mining difficulty but low data accuracy", the large amount of POI data obtained by mining them cannot be used directly as correct POI data.
Based on the characteristic of government websites, "high data accuracy but high mining difficulty", and the characteristic of POI data providing websites, "large data volume and low mining difficulty but low data accuracy", the present invention combines the two and proposes a mining scheme for map point of interest (POI) data.
Fig. 3 shows a flowchart of a mining method for map point of interest (POI) data according to an embodiment of the present invention. As shown in Fig. 3, the method comprises:
Step S310: mine POI data from a POI data providing website to obtain a POI data set.
In this step, the POI data set contains multiple pieces of POI data.
Step S320: extract the geographic location data contained in one or more government websites.
Step S330: verify the correctness of the POI data in the POI data set using the geographic location data extracted from the one or more government websites.
As can be seen, the method shown in Fig. 3 uses geographic location data extracted from one or more government websites to verify the correctness of POI data mined from POI data providing websites, thereby filtering the POI data so that the POI data finally obtained has higher accuracy. The solution combines the characteristic of government websites, "high data accuracy but high mining difficulty", with the characteristic of POI data providing websites, "large data volume and low mining difficulty but low data accuracy". It takes the POI data mined from POI data providing websites as the initial POI data and the geographic location data contained in government websites as the reference data. This makes the mining of map POI data efficient, high in quality, and high in yield, and overcomes the problem in the prior art that POI data mined from the Internet contains a large amount of dirty, erroneous, and invalid data, which is of great significance.
In one embodiment of the present invention, before step S310 of the method shown in Fig. 3 mines the POI data from the POI data providing website, the method first needs to mine the POI data providing websites themselves, which specifically comprises:
Mining a plurality of web pages that contain POI-related keywords; clustering the web pages according to the URL format of the plurality of web pages; and selecting the clusters that contain more valid POI data as POI data providing websites.
In one embodiment of the present invention, step S310 of the method shown in Fig. 3, mining POI data from a POI data providing website to obtain a POI data set, comprises:
Step S311: for a POI data providing website, formulate a template for mining POI data according to the structural features of the web pages in that website.
Step S312: apply the template to all web pages in that POI data providing website and mine the POI data in that website to obtain the POI data set.
For example, a plurality of web pages containing keywords such as "address" are mined and clustered according to their URL format, and the clusters containing a larger amount of valid POI data are selected. Because the POI data in the web pages of each cluster follows the same format, a template for mining POI data can be formulated adaptively from the structural features of the web pages in each cluster. The template is then applied to all web pages in that cluster to extract all of the POI data, including the POI name, POI address, contact information, and so on.
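The clustering step can be sketched as follows. This is a minimal illustration, not the method of the invention: the patent does not specify how URL formats are compared, so the `url_format` heuristic (replacing digit runs in the path with a placeholder) and both function names are assumptions. Selecting the clusters with more valid POI data and building the extraction template are not shown.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_format(url: str) -> str:
    """Reduce a URL to a coarse format signature (hypothetical heuristic)."""
    parts = urlparse(url)
    # Replace every run of digits in the path with a placeholder so that
    # pages differing only in numeric IDs share one signature.
    path = re.sub(r"\d+", "{N}", parts.path)
    return f"{parts.netloc}{path}"


def cluster_by_url_format(urls):
    """Group URLs whose format signatures are identical."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_format(url)].append(url)
    return dict(clusters)


# The two "Aibang" pages cited in the description fall into one cluster;
# the government page forms a cluster of its own.
pages = [
    "http://www.aibang.com/detail/691925157-431964635",
    "http://www.aibang.com/detail/1816542021-944045717",
    "http://www.bjedu.gov.cn/publish/portal27/tab1805/info35658.htm",
]
clusters = cluster_by_url_format(pages)
```

A digit-run placeholder is only one possible signature; a real implementation might also normalise query strings or compare path depth.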
In one embodiment of the present invention, before step S320 of the method shown in Fig. 3 extracts the geographic location data contained in one or more government websites, the method first needs to mine government websites, that is, to identify which websites on the Internet are government websites.
In general, for a given URL, if the suffix of its site is ".gov.cn", the site can be regarded as a government-type website. For example, the page http://www.bjedu.gov.cn/publish/portal27/tab1805/info35658.htm on the Beijing Municipal Education Commission website belongs to the site www.bjedu.gov.cn, whose suffix is ".gov.cn", so it is a government-type website. Likewise, the page http://www.bjguahao.gov.cn/comm/list-2-3-0-1.html on the Beijing unified appointment registration platform belongs to the site www.bjguahao.gov.cn, whose suffix is also ".gov.cn", so it is also a government-type website.
Therefore, in a specific embodiment, before step S320 extracts the geographic location data contained in one or more government websites, the method shown in Fig. 3 further comprises: mining government websites whose site suffix is ".gov.cn".
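The suffix check can be sketched in a few lines. The function name `is_government_site` is an assumption introduced for illustration; the test itself follows the ".gov.cn" rule stated above.

```python
from urllib.parse import urlparse


def is_government_site(url: str) -> bool:
    """Return True when the URL's host name ends with ".gov.cn"."""
    host = urlparse(url).hostname or ""
    return host.lower().endswith(".gov.cn")
```

Checking the parsed host name rather than the raw URL string avoids false matches on URLs that merely contain ".gov.cn" in their path.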
In one embodiment of the present invention, each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information.
In that case, in steps S320 to S330 of the method shown in Fig. 3, extracting the geographic location data contained in the one or more government websites and verifying the correctness of the POI data in the POI data set using that data comprises:
Step S331: first mine address information from the government websites.
Step S332: for each piece of address information, determine whether the POI data set contains a POI address identical to that address information.
Step S333: if so, extract the corresponding name information from the source page of that address information.
Step S334: compare that name information with the POI name corresponding to the POI address identical to that address information; if they are the same, determine that the corresponding POI data is correct; if they differ, determine that the corresponding POI data is wrong.
Further, after determining that the corresponding POI data is wrong, the method also comprises step S335: replace the POI name in the corresponding POI data with that name information, and take the corresponding POI data after replacement as correct POI data.
For example, mining POI data from a POI data providing website yields the POI data set {POI data 1, POI data 2, POI data 3}, where POI data 1 is (POI name 1, POI address 1), POI data 2 is (POI name 2, POI address 2), and POI data 3 is (POI name 3, POI address 3). That is, the POI data set is {(POI name 1, POI address 1), (POI name 2, POI address 2), (POI name 3, POI address 3)}.
First, multiple pieces of address information are mined from government websites, and for each piece it is determined whether the POI data set contains a POI address identical to it. Take address information 1 as an example. In this example, address information 1 is "No. 1 Wusi Dajie, Dongcheng District, Beijing", and POI address 2 in the POI data set is "No. 1 Wusi Dajie, Dongcheng District", so address information 1 and POI address 2 are determined to be identical. The corresponding name information, referred to as name information 1, is extracted from the source page of address information 1, and POI name 2, which corresponds to POI address 2, is found in the POI data set. Name information 1 is compared with POI name 2: if they are the same, POI data 2 is determined to be correct; if they differ, POI data 2 is determined to be wrong. When POI data 2 is determined to be wrong, POI name 2 is replaced with name information 1, and POI data 2 after replacement is taken as the correct POI data 2.
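Steps S332 to S335 can be sketched as a single loop. This is an illustrative sketch under stated assumptions: the `Poi` class, the `verify_pois` function, and the caller-supplied `same_address` predicate are all hypothetical names, and address matching is passed in as a parameter rather than implemented here.

```python
from dataclasses import dataclass


@dataclass
class Poi:
    name: str
    address: str


def verify_pois(poi_set, gov_records, same_address):
    """Check POI names against government (address, name) records.

    same_address is a caller-supplied predicate (an assumption of this
    sketch). Returns a list of (poi, was_correct) pairs for every POI
    whose address matched a government record.
    """
    results = []
    for gov_address, gov_name in gov_records:
        for poi in poi_set:
            if not same_address(gov_address, poi.address):
                continue  # S332: no identical address, nothing to verify
            was_correct = poi.name == gov_name  # S334: compare names
            if not was_correct:
                poi.name = gov_name  # S335: adopt the government name
            results.append((poi, was_correct))
    return results
```

POI data whose address matches no government record is left untouched, since the description only states what happens when a matching address exists.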
In a specific embodiment, mining address information from government websites in step S331 comprises:
Step S331a: create an address database comprising address data for provinces, cities, counties (districts), townships, and roads nationwide.
Step S331b: perform word segmentation on the web page content of the government websites.
Step S331c: for a text segment in a web page, if all of the sub-words obtained by word segmentation hit the address database, mine that text segment as address information.
For example, word segmentation is performed on the web page content of a government website. For a text segment in a web page such as "No. 8 Xishiku Dajie, Xicheng District, Beijing", all of the sub-words obtained by segmentation, "Beijing", "Xicheng District", and "No. 8 Xishiku Dajie", can be found in the address database, so the text segment is mined as address information.
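Steps S331b and S331c can be sketched with a dictionary-based segmenter. Several things here are assumptions rather than details from the patent: forward maximum matching stands in for the unspecified segmentation method, the house number is matched by a separate pattern instead of being stored in the database together with the road, the three-entry `ADDRESS_DB` is a toy stand-in, and the Chinese string is an assumed rendering of the example address above.

```python
import re

# Hypothetical miniature address database; a real one would list every
# province, city, county (district), township and road.
ADDRESS_DB = {"北京市", "西城区", "西什库大街"}
MAX_WORD_LEN = max(len(word) for word in ADDRESS_DB)
HOUSE_NUMBER = re.compile(r"\d+号")  # assumed pattern for "No. 8"-style parts


def segment(text):
    """Forward maximum matching: return the sub-words, or None on a miss."""
    words, i = [], 0
    while i < len(text):
        number = HOUSE_NUMBER.match(text, i)
        if number:
            words.append(number.group())
            i = number.end()
            continue
        # Try the longest dictionary word first, then shorter ones.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in ADDRESS_DB:
                words.append(candidate)
                i += size
                break
        else:
            return None  # some sub-word is not in the address database
    return words


def is_address(text):
    """A non-empty text segment is an address if every sub-word is a hit."""
    return bool(segment(text))
```

In practice a general-purpose Chinese segmenter would likely replace `segment`, with the database lookup applied to its output.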
In step S332, when determining for each piece of address information whether the POI data set contains a POI address identical to it, the POI addresses in the POI data set and the address information extracted from government websites need to be normalized to the same form and then compared in that normalized form. In a specific embodiment, longitude and latitude are used as the normalized form, and step S332 comprises:
Step S332a: resolve the longitude and latitude of that address information.
Step S332b: resolve the longitude and latitude of the POI addresses in the POI data set.
Step S332c: compare the longitude and latitude of that address information with the longitude and latitude of the POI addresses.
Step S332d: if there is a POI address whose longitude and latitude are identical to those of that address information, determine that the POI data set contains a POI address identical to that address information.
As can be seen, this embodiment turns the comparison between address information and POI addresses into a comparison of longitude and latitude values, with no string similarity comparison involved. This determines more simply, accurately, and efficiently whether the POI data set contains a POI address identical to a given piece of address information.
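The coordinate comparison in steps S332a to S332d can be sketched as follows. The `geocode` callable is hypothetical and supplied by the caller, since the patent does not name a geocoding service. The `tolerance` parameter is also an assumption: the description speaks of identical coordinates, which corresponds to a tolerance of 0, but floating-point results from a geocoder may call for a small margin.

```python
def same_location(a, b, tolerance=1e-4):
    """Compare two (latitude, longitude) pairs within a tolerance in degrees."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def find_matching_pois(gov_address, poi_addresses, geocode, tolerance=1e-4):
    """Return the POI addresses that geocode to the same point as gov_address.

    geocode is a caller-supplied function mapping an address string to a
    (latitude, longitude) pair; it is an assumption of this sketch.
    """
    target = geocode(gov_address)  # S332a
    return [
        address
        for address in poi_addresses
        if same_location(target, geocode(address), tolerance)  # S332b to S332d
    ]
```

Because both sides are reduced to coordinates, "No. 1 Wusi Dajie, Dongcheng District, Beijing" and "No. 1 Wusi Dajie, Dongcheng District" would match as long as the geocoder resolves them to the same point, even though the strings differ.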
In a specific embodiment, if the source page of a piece of address information contains only one piece of name information, that name information can be regarded as corresponding to the address information, and the two form one piece of geographic location data. If the source page contains multiple pieces of name information, then, given that the address information and name information of one piece of geographic location data tend to appear close together in the same web page, the name information corresponding to the address information can be extracted according to text distance. That is, extracting the corresponding name information from the source page of that address information in step S333 comprises: if the source page of that address information contains multiple pieces of name information whose corresponding longitude and latitude are identical, calculating the text distance within the web page between that address information and each piece of name information, and taking the name information with the smallest text distance to that address information as the corresponding name information.
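The nearest-name selection can be sketched as follows. The patent does not define "text distance", so this sketch assumes it is the difference between character offsets in the page text. The function name `nearest_name` is hypothetical, and only the first occurrence of each string is considered.

```python
def nearest_name(page_text, address, candidate_names):
    """Pick the candidate whose position in the page is closest to the address.

    Text distance is taken here as the absolute difference between the
    first character offsets of the two strings (an assumption).
    """
    address_pos = page_text.find(address)
    if address_pos < 0:
        return None  # the address does not occur in this page
    best_name, best_distance = None, None
    for name in candidate_names:
        name_pos = page_text.find(name)
        if name_pos < 0:
            continue  # skip candidates that do not occur in the page
        distance = abs(name_pos - address_pos)
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    return best_name
```

A page that lists several institutions in a table would benefit from measuring distance in the DOM rather than in flat text, but that refinement goes beyond what the description states.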
Fig. 4 shows a schematic diagram of a device for mining map point of interest (POI) data according to an embodiment of the present invention. As shown in Fig. 4, the device 400 for mining map POI data comprises:
A mining unit 410, adapted to mine POI data from POI data providing websites to obtain a POI data set.
A verification unit 420, adapted to extract the geographic location data contained in one or more government websites, and adapted to verify the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites.
As can be seen, the device shown in Fig. 4 verifies the correctness of the POI data mined from POI data providing websites by means of geographic location data extracted from one or more government websites, thereby filtering the POI data so that the POI data finally obtained is more accurate. This scheme combines the characteristic of government websites (high data accuracy, but hard to mine) with that of POI data providing websites (large data volume and low mining difficulty, but low data accuracy): the POI data mined from POI data providing websites serves as the initial POI data, and the geographic location data contained in government websites serves as the standard data. The scheme for mining map POI data thereby achieves high efficiency, high quality, and high yield, and overcomes the problem in the prior art that POI data mined from websites contains a large amount of dirty, erroneous, and invalid data, which is of great significance.
In one embodiment of the invention, the mining unit 410 of the device shown in Fig. 4 is further adapted, before mining the POI data from POI data providing websites, to mine multiple webpages containing POI-related keywords; to cluster the webpages according to their URL formats; and to select the clusters containing more valid POI data as POI data providing websites.
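One way to picture clustering by URL format is sketched below. The source does not say how a URL format is derived; this sketch assumes that URLs sharing a host and path shape, with digit runs replaced by a placeholder, belong to the same cluster. The names `url_pattern`, `cluster_by_url_format`, and `pick_provider_clusters`, and the `min_valid` threshold, are hypothetical, and the count of valid POI records per cluster is assumed to be computed elsewhere.

```python
import re
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlparse


def url_pattern(url: str) -> str:
    """Reduce a URL to host plus path shape, replacing digit runs with {num}."""
    parts = urlparse(url)
    segments = [re.sub(r"\d+", "{num}", seg) for seg in parts.path.split("/") if seg]
    return parts.netloc + "/" + "/".join(segments)


def cluster_by_url_format(urls: List[str]) -> Dict[str, List[str]]:
    """Group URLs that share the same pattern."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


def pick_provider_clusters(clusters: Dict[str, List[str]],
                           valid_counts: Dict[str, int],
                           min_valid: int) -> List[str]:
    """Keep clusters whose count of valid POI records reaches the threshold.

    `valid_counts` (pattern -> number of valid POI records) is assumed input.
    """
    return sorted(p for p in clusters if valid_counts.get(p, 0) >= min_valid)
```

For example, `http://example.com/shop/123.html` and `http://example.com/shop/456.html` fall into the cluster `example.com/shop/{num}.html`, which would be treated as one candidate POI data providing site section.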
In one embodiment of the invention, the mining unit 410 of the device shown in Fig. 4 is adapted, for a given POI data providing website, to formulate a template for mining POI data according to the webpage structure features of that website; and to apply the template to all webpages of that website, mining the POI data the website contains to obtain the POI data set.
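Template-based mining might look like the sketch below. The source does not specify what form a template takes; a regular expression with named groups is assumed here purely for illustration, and the HTML class names `name` and `addr`, the `TEMPLATE` constant, and the functions `extract_poi` and `mine_site` are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Hypothetical template for one site: the POI name sits in an <h1 class="name">
# element and the address in a later <span class="addr"> element.
TEMPLATE = re.compile(
    r'<h1 class="name">(?P<name>[^<]+)</h1>.*?<span class="addr">(?P<address>[^<]+)</span>',
    re.S,
)


def extract_poi(html: str, template: re.Pattern = TEMPLATE) -> Optional[Dict[str, str]]:
    """Apply the site template to one page; return a POI record or None."""
    match = template.search(html)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(),
        "address": match.group("address").strip(),
    }


def mine_site(pages: Dict[str, str], template: re.Pattern = TEMPLATE) -> List[Dict[str, str]]:
    """Apply the template to every page of the site and collect the POI data set."""
    dataset = []
    for url, html in pages.items():
        record = extract_poi(html, template)
        if record is not None:
            record["source_url"] = url
            dataset.append(record)
    return dataset
```

A production system would more likely use an HTML parser with per-site selectors; the regular expression only keeps the sketch self-contained.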
In one embodiment of the invention, the verification unit 420 of the device shown in Fig. 4 is further adapted, before extracting the geographic location data contained in one or more government websites, to mine government websites, namely websites whose suffix is ".gov.cn".
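The suffix check can be sketched with the standard library. The function name `is_gov_cn` is hypothetical, and the sketch assumes that "suffix" refers to the host name rather than the full URL, so that a host such as `gov.cn.example.com` is rejected.

```python
from urllib.parse import urlparse


def is_gov_cn(url: str) -> bool:
    """Return True if the URL's host name ends with the ".gov.cn" suffix.

    The URL must include a scheme (for example "http://"), otherwise
    urlparse does not recognize the host and the function returns False.
    """
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    # Accepting the bare "gov.cn" host as well is an assumption of this sketch.
    return host == "gov.cn" or host.endswith(".gov.cn")
```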
In one embodiment of the invention, each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information. The verification unit 420 of the device shown in Fig. 4 is then adapted to first mine address information from the government websites; for each address information, to judge whether the POI data set contains a POI address identical to this address information; if so, to extract the corresponding name information from the source page of this address information; and to compare this name information with the POI name corresponding to the POI address identical to this address information, determining that the corresponding POI data is correct if they are the same and erroneous if they differ.
Further, the verification unit 420 is also adapted, after determining that the corresponding POI data is erroneous, to replace the POI name in the corresponding POI data with this name information and to take the corresponding POI data after replacement as correct POI data.
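The verify-then-correct flow can be summarized in a few lines. This is a sketch under stated assumptions: each address is represented by an opaque key (for instance its resolved longitude and latitude), so that "identical address" becomes a dictionary lookup, and the function name `verify_pois` and the status labels are hypothetical.

```python
from typing import Dict, List, Tuple


def verify_pois(poi_by_address: Dict[str, str],
                gov_records: List[Tuple[str, str]]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Check POI names against government records and fix mismatches.

    poi_by_address: address key -> POI name from the mined POI data set.
    gov_records: (address key, name) pairs extracted from government sites.
    Returns the corrected mapping and a per-address status.
    """
    corrected = dict(poi_by_address)
    status: Dict[str, str] = {}
    for address, gov_name in gov_records:
        if address not in poi_by_address:
            continue  # no POI address matches this record, nothing to verify
        if poi_by_address[address] == gov_name:
            status[address] = "correct"
        else:
            # The POI data is erroneous: replace its name with the government one.
            corrected[address] = gov_name
            status[address] = "corrected"
    return corrected, status
```

POI records that no government record covers are left unchanged and receive no status, since the source describes verification only for addresses found in both data sets.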
Specifically, the process by which the verification unit 420 mines address information from government websites comprises: creating an address database that contains address data for the provinces, cities, counties (districts), townships, and roads of the whole country; performing word segmentation on the webpage content of the government websites; and, for a text segment in a webpage, mining that text segment as address information if all the sub-words obtained by segmentation hit the address database.
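The address-mining rule can be sketched as follows. The source does not name a word-segmentation algorithm; this sketch substitutes forward maximum matching that uses the address database itself as the vocabulary, so a text segment is accepted exactly when it can be split entirely into database entries. The function names `segment` and `mine_addresses` and the tiny sample database are hypothetical, and house numbers or other tokens outside the database are not handled.

```python
from typing import List, Optional, Set


def segment(text: str, vocab: Set[str]) -> Optional[List[str]]:
    """Greedy forward maximum matching against `vocab`.

    Returns the list of sub-words, or None if some part of the text
    matches no vocabulary entry (that is, a sub-word misses the database).
    """
    max_len = max((len(word) for word in vocab), default=0)
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                words.append(piece)
                i += size
                break
        else:
            return None  # no entry matches at position i
    return words


def mine_addresses(text_segments: List[str], address_db: Set[str]) -> List[str]:
    """Keep the text segments whose sub-words all hit the address database."""
    return [s for s in text_segments if s and segment(s, address_db) is not None]
```

Greedy matching can reject a segment that a different split would accept; a real system would pair a general-purpose segmenter with a full national address database.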
Specifically, the process by which the verification unit 420 judges whether the POI data set contains a POI address identical to this address information comprises: for each address information, resolving the longitude and latitude of this address information; resolving the longitude and latitude of the POI addresses in the POI data set; comparing the longitude and latitude of this address information with those of the POI addresses; and, if a POI address has the same longitude and latitude as this address information, determining that the POI data set contains a POI address identical to this address information.
Specifically, the process by which the verification unit 420 extracts the corresponding name information from the source page of this address information comprises: for a given address information, when the source page of this address information contains multiple pieces of name information whose corresponding longitude and latitude are identical, calculating the text distance in the webpage between this address information and each piece of name information, and taking the name information with the smallest text distance to this address information as the name information corresponding to this address information.
It should be noted that the embodiments of the device shown in Fig. 4 correspond to the embodiments shown in Fig. 1 to Fig. 3 above, which have already been described in detail and are not repeated here.
In summary, the technical solution provided by the present invention verifies the correctness of the POI data mined from POI data providing websites by means of geographic location data extracted from one or more government websites, thereby filtering the POI data so that the POI data finally obtained is more accurate. This scheme combines the characteristic of government websites (high data accuracy, but hard to mine) with that of POI data providing websites (large data volume and low mining difficulty, but low data accuracy): the POI data mined from POI data providing websites serves as the initial POI data, and the geographic location data contained in government websites serves as the standard data. The scheme for mining map POI data thereby achieves high efficiency, high quality, and high yield, and overcomes the problem in the prior art that POI data mined from websites contains a large amount of dirty, erroneous, and invalid data, which is of great significance.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for mining map POI data according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
The invention discloses A1, a method for mining map point of interest (POI) data, wherein the method comprises:
Mining POI data from POI data providing websites to obtain a POI data set;
Extracting the geographic location data contained in one or more government websites;
Verifying the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites.
A2. The method of A1, wherein each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information;
The extracting of the geographic location data contained in one or more government websites and the verifying of the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites comprise:
First mining address information from the government websites;
For each address information, judging whether the POI data set contains a POI address identical to this address information;
If so, extracting the corresponding name information from the source page of this address information;
Comparing this name information with the POI name corresponding to the POI address identical to this address information; if they are the same, determining that the corresponding POI data is correct; if they differ, determining that the corresponding POI data is erroneous.
A3. The method of A2, wherein, after determining that the corresponding POI data is erroneous, the method further comprises:
Replacing the POI name in the corresponding POI data with this name information, and taking the corresponding POI data after replacement as correct POI data.
A4. The method of A1, wherein, before mining the POI data from POI data providing websites, the method further comprises:
Mining multiple webpages containing POI-related keywords;
Clustering the webpages according to their URL formats;
Selecting the clusters containing more valid POI data as POI data providing websites.
A5. The method of A1, wherein mining the POI data from POI data providing websites to obtain the POI data set comprises:
For a given POI data providing website, formulating a template for mining POI data according to the webpage structure features of that website;
Applying the template to all webpages of that website and mining the POI data the website contains to obtain the POI data set.
A6. The method of A1, wherein, before extracting the geographic location data contained in one or more government websites, the method further comprises: mining government websites, namely websites whose suffix is ".gov.cn".
A7. The method of A2, wherein mining address information from the government websites comprises:
Creating an address database that contains address data for the provinces, cities, counties (districts), townships, and roads of the whole country;
Performing word segmentation on the webpage content of the government websites;
For a text segment in a webpage, mining that text segment as address information if all the sub-words obtained by segmentation hit the address database.
A8. The method of A2, wherein, for each address information, judging whether the POI data set contains a POI address identical to this address information comprises:
Resolving the longitude and latitude of this address information;
Resolving the longitude and latitude of the POI addresses in the POI data set;
Comparing the longitude and latitude of this address information with those of the POI addresses;
If a POI address has the same longitude and latitude as this address information, determining that the POI data set contains a POI address identical to this address information.
A9. The method of A2, wherein extracting the corresponding name information from the source page of this address information comprises:
If the source page of this address information contains multiple pieces of name information whose corresponding longitude and latitude are identical, calculating the text distance in the webpage between this address information and each piece of name information, and taking the name information with the smallest text distance to this address information as the corresponding name information.
The invention also discloses B10, a device for mining map point of interest (POI) data, wherein the device comprises:
A mining unit, adapted to mine POI data from POI data providing websites to obtain a POI data set;
A verification unit, adapted to extract the geographic location data contained in one or more government websites, and adapted to verify the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites.
B11. The device of B10, wherein each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information;
The verification unit is adapted to first mine address information from the government websites; for each address information, to judge whether the POI data set contains a POI address identical to this address information; if so, to extract the corresponding name information from the source page of this address information; and to compare this name information with the POI name corresponding to the POI address identical to this address information, determining that the corresponding POI data is correct if they are the same and erroneous if they differ.
B12. The device of B11, wherein
The verification unit is further adapted, after determining that the corresponding POI data is erroneous, to replace the POI name in the corresponding POI data with this name information and to take the corresponding POI data after replacement as correct POI data.
B13. The device of B10, wherein
The mining unit is further adapted to mine multiple webpages containing POI-related keywords; to cluster the webpages according to their URL formats; and to select the clusters containing more valid POI data as POI data providing websites.
B14. The device of B10, wherein
The mining unit is adapted, for a given POI data providing website, to formulate a template for mining POI data according to the webpage structure features of that website; and to apply the template to all webpages of that website, mining the POI data the website contains to obtain the POI data set.
B15. The device of B10, wherein
The verification unit is further adapted, before extracting the geographic location data contained in one or more government websites, to mine government websites, namely websites whose suffix is ".gov.cn".
B16. The device of B11, wherein
The verification unit is adapted to create an address database that contains address data for the provinces, cities, counties (districts), townships, and roads of the whole country; to perform word segmentation on the webpage content of the government websites; and, for a text segment in a webpage, to mine that text segment as address information if all the sub-words obtained by segmentation hit the address database.
B17. The device of B11, wherein
The verification unit is adapted, for each address information, to resolve the longitude and latitude of this address information; to resolve the longitude and latitude of the POI addresses in the POI data set; to compare the longitude and latitude of this address information with those of the POI addresses; and, if a POI address has the same longitude and latitude as this address information, to determine that the POI data set contains a POI address identical to this address information.
B18. The device of B11, wherein
The verification unit is adapted, for a given address information, when the source page of this address information contains multiple pieces of name information whose corresponding longitude and latitude are identical, to calculate the text distance in the webpage between this address information and each piece of name information, and to take the name information with the smallest text distance to this address information as the name information corresponding to this address information.

Claims (10)

1. A method for mining map point of interest (POI) data, wherein the method comprises:
Mining POI data from POI data providing websites to obtain a POI data set;
Extracting the geographic location data contained in one or more government websites;
Verifying the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites.
2. The method of claim 1, wherein each piece of POI data in the POI data set comprises a POI name and a POI address, and each piece of geographic location data comprises address information and name information;
The extracting of the geographic location data contained in one or more government websites and the verifying of the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites comprise:
First mining address information from the government websites;
For each address information, judging whether the POI data set contains a POI address identical to this address information;
If so, extracting the corresponding name information from the source page of this address information;
Comparing this name information with the POI name corresponding to the POI address identical to this address information; if they are the same, determining that the corresponding POI data is correct; if they differ, determining that the corresponding POI data is erroneous.
3. The method of claim 2, wherein, after determining that the corresponding POI data is erroneous, the method further comprises:
Replacing the POI name in the corresponding POI data with this name information, and taking the corresponding POI data after replacement as correct POI data.
4. The method of claim 1, wherein, before mining the POI data from POI data providing websites, the method further comprises:
Mining multiple webpages containing POI-related keywords;
Clustering the webpages according to their URL formats;
Selecting the clusters containing more valid POI data as POI data providing websites.
5. The method of claim 1, wherein mining the POI data from POI data providing websites to obtain the POI data set comprises:
For a given POI data providing website, formulating a template for mining POI data according to the webpage structure features of that website;
Applying the template to all webpages of that website and mining the POI data the website contains to obtain the POI data set.
6. The method of claim 1, wherein, before extracting the geographic location data contained in one or more government websites, the method further comprises: mining government websites, namely websites whose suffix is ".gov.cn".
7. The method of claim 2, wherein mining address information from the government websites comprises:
Creating an address database that contains address data for the provinces, cities, counties (districts), townships, and roads of the whole country;
Performing word segmentation on the webpage content of the government websites;
For a text segment in a webpage, mining that text segment as address information if all the sub-words obtained by segmentation hit the address database.
8. The method of claim 2, wherein, for each address information, judging whether the POI data set contains a POI address identical to this address information comprises:
Resolving the longitude and latitude of this address information;
Resolving the longitude and latitude of the POI addresses in the POI data set;
Comparing the longitude and latitude of this address information with those of the POI addresses;
If a POI address has the same longitude and latitude as this address information, determining that the POI data set contains a POI address identical to this address information.
9. The method of claim 2, wherein extracting the corresponding name information from the source page of this address information comprises:
If the source page of this address information contains multiple pieces of name information whose corresponding longitude and latitude are identical, calculating the text distance in the webpage between this address information and each piece of name information, and taking the name information with the smallest text distance to this address information as the corresponding name information.
10. A device for mining map point of interest (POI) data, wherein the device comprises:
A mining unit, adapted to mine POI data from POI data providing websites to obtain a POI data set;
A verification unit, adapted to extract the geographic location data contained in one or more government websites, and adapted to verify the correctness of the POI data in the POI data set by means of the geographic location data extracted from the one or more government websites.
CN201510642102.8A 2015-09-30 2015-09-30 Mining method and device for map point of interest (POI) data Pending CN105160031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510642102.8A CN105160031A (en) 2015-09-30 2015-09-30 Mining method and device for map point of interest (POI) data

Publications (1)

Publication Number Publication Date
CN105160031A true CN105160031A (en) 2015-12-16

Family

ID=54800887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510642102.8A Pending CN105160031A (en) 2015-09-30 2015-09-30 Mining method and device for map point of interest (POI) data

Country Status (1)

Country Link
CN (1) CN105160031A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608153A (en) * 2015-12-18 2016-05-25 晶赞广告(上海)有限公司 Universal POI information association method
CN105893544A (en) * 2016-03-31 2016-08-24 东南大学 Method for generating urban space big data map on basis of POI commercial form data
CN107368480A (en) * 2016-05-11 2017-11-21 中国移动通信集团辽宁有限公司 A kind of interest point data type of error positioning, repeat recognition methods and device
CN107491537A (en) * 2017-08-23 2017-12-19 北京百度网讯科技有限公司 POI data excavation, information retrieval method, device, equipment and medium
CN107562958A (en) * 2017-09-30 2018-01-09 百度在线网络技术(北京)有限公司 Map interest point failure method for digging, device, server and storage medium
CN107656913A (en) * 2017-09-30 2018-02-02 百度在线网络技术(北京)有限公司 Map point of interest address extraction method, apparatus, server and storage medium
CN108021656A (en) * 2017-12-01 2018-05-11 百度在线网络技术(北京)有限公司 Compare generation method, device, server and the storage medium of coordinate
CN108846111A (en) * 2018-06-22 2018-11-20 阿里巴巴集团控股有限公司 A kind of method and device detecting store location correctness
CN108959609A (en) * 2018-07-16 2018-12-07 阿里巴巴集团控股有限公司 The update method and device of store address
CN110647607A (en) * 2018-12-29 2020-01-03 北京奇虎科技有限公司 POI data verification method and device based on picture identification
CN110647883A (en) * 2018-12-29 2020-01-03 北京奇虎科技有限公司 Method and device for mining point of interest (POI) data
CN111813819A (en) * 2020-07-13 2020-10-23 南通市测绘院有限公司 Space-time big data-based place name and address online matching method
CN111858543A (en) * 2019-04-26 2020-10-30 中国移动通信集团河北有限公司 Quality evaluation method and device of commercial map and computing equipment
CN112016326A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Map area word recognition method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140143171A1 (en) * 2009-11-09 2014-05-22 United Parcel Service Of America, Inc. Enhanced location information for points of interest
CN104216895A (en) * 2013-05-31 2014-12-17 高德软件有限公司 Method and device for generating POI data
CN104699835A (en) * 2015-03-31 2015-06-10 北京奇虎科技有限公司 Method and device used for determining webpages including POI (point of interest) data
CN104866542A (en) * 2015-05-05 2015-08-26 腾讯科技(深圳)有限公司 POI data verification method and device
CN104899243A (en) * 2015-03-31 2015-09-09 北京奇虎科技有限公司 Method and apparatus for detecting accuracy of POI (Point of Interest) data

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608153A (en) * 2015-12-18 2016-05-25 晶赞广告(上海)有限公司 Universal POI information association method
CN105893544B (en) * 2016-03-31 2019-07-12 东南大学 A method of city space big data map is generated based on POI industry situation data
CN105893544A (en) * 2016-03-31 2016-08-24 东南大学 Method for generating urban space big data map on basis of POI commercial form data
CN107368480A (en) * 2016-05-11 2017-11-21 中国移动通信集团辽宁有限公司 A kind of interest point data type of error positioning, repeat recognition methods and device
CN107368480B (en) * 2016-05-11 2021-05-04 中国移动通信集团辽宁有限公司 Method and device for locating and repeatedly identifying error types of point of interest data
CN107491537A (en) * 2017-08-23 2017-12-19 北京百度网讯科技有限公司 POI data excavation, information retrieval method, device, equipment and medium
CN107656913A (en) * 2017-09-30 2018-02-02 百度在线网络技术(北京)有限公司 Map point of interest address extraction method, apparatus, server and storage medium
CN107562958A (en) * 2017-09-30 2018-01-09 百度在线网络技术(北京)有限公司 Map interest point failure method for digging, device, server and storage medium
CN107656913B (en) * 2017-09-30 2021-03-23 百度在线网络技术(北京)有限公司 Map interest point address extraction method, map interest point address extraction device, server and storage medium
CN108021656A (en) * 2017-12-01 2018-05-11 百度在线网络技术(北京)有限公司 Compare generation method, device, server and the storage medium of coordinate
CN108021656B (en) * 2017-12-01 2020-10-13 百度在线网络技术(北京)有限公司 Method and device for generating comparison coordinates, server and storage medium
CN108846111A (en) * 2018-06-22 2018-11-20 阿里巴巴集团控股有限公司 Method and device for detecting correctness of position of shop
CN108846111B (en) * 2018-06-22 2020-04-24 阿里巴巴集团控股有限公司 Method and device for detecting correctness of position of shop
CN108959609A (en) * 2018-07-16 2018-12-07 阿里巴巴集团控股有限公司 Shop address updating method and device
CN108959609B (en) * 2018-07-16 2021-09-21 创新先进技术有限公司 Shop address updating method and device
CN110647607A (en) * 2018-12-29 2020-01-03 北京奇虎科技有限公司 POI data verification method and device based on picture identification
CN110647883A (en) * 2018-12-29 2020-01-03 北京奇虎科技有限公司 Method and device for mining point of interest (POI) data
CN111858543A (en) * 2019-04-26 2020-10-30 中国移动通信集团河北有限公司 Quality evaluation method and device of commercial map and computing equipment
CN111858543B (en) * 2019-04-26 2024-03-19 中国移动通信集团河北有限公司 Quality assessment method and device for commercial map and computing equipment
CN111813819A (en) * 2020-07-13 2020-10-23 南通市测绘院有限公司 Space-time big data-based place name and address online matching method
CN111813819B (en) * 2020-07-13 2022-07-22 南通市测绘院有限公司 Space-time big data-based place name and address online matching method
CN112016326A (en) * 2020-09-25 2020-12-01 北京百度网讯科技有限公司 Map area word recognition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105160031A (en) Mining method and device for map point of interest (POI) data
CN105224660A (en) Method and device for processing map point of interest (POI) data
CN104572955A (en) System and method for determining POI name based on clustering
CN103019879B (en) Method and system for processing browser crash information
CN105183908B (en) Classification method and device for point of interest (POI) data
CN105808609A (en) Discrimination method and equipment of point-of-information data redundancy
CN104699835A (en) Method and device used for determining webpages including POI (point of interest) data
CN105468583A (en) Entity relationship obtaining method and device
Chuang et al. Enabling maps/location searches on mobile devices: Constructing a POI database via focused crawling and information extraction
CN105550169A (en) Method and device for identifying point of interest names based on character length
CN103678509A (en) Method and device for generating webpage template
CA2752119A1 (en) Unique referencing scheme identifier for location
CN105159885A (en) Point-of-interest name identification method and device
CN103678510A (en) Method and device for providing visualized label for webpage
CN105069079B (en) Method and device for screening POI (Point of interest) data
CN105279246A (en) Method and device for judging whether webpage contains specified point of interest POI
CN105159921A (en) Method and apparatus for de-duplicating point-of-interest (POI) data in map
CN105138708A (en) Method and device for identifying names of points of interest (POI)
CN105515882A (en) Website security detection method and website security detection device
CN105279249B (en) Method and device for determining the confidence level of point of interest data in a website
CN105159940A (en) Geographic information mining method, apparatus and server
US20220292253A1 (en) Automated structured data object creation and location integration into multiple location applications
CN105320752A (en) Point of interest data mining method and apparatus
CN105160032B (en) Method and device for determining the confidence level of point of interest data in a website
CN104462519A (en) Search query method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2015-12-16