CN105975477A - Method for automatically constructing place name data sets on basis of network - Google Patents

Publication number
CN105975477A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610214120.0A
Other languages
Chinese (zh)
Other versions
CN105975477B (en)
Inventor
张莹
何慧
马苗苗
王竹晓
刘少文
李超鹏
杜立明
文丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Application filed by North China Electric Power University
Priority to CN201610214120.0A (CN105975477B)
Publication of CN105975477A
Application granted
Publication of CN105975477B
Legal status: Expired - Fee Related

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Abstract

The invention discloses a method for automatically constructing place name data sets on the basis of a network, and belongs to the technical field of computer applications. The method comprises the following steps: 1, extracting geospatial data from the Google database by using the Google search engine API; 2, filtering irrelevant web pages out of the extracted data; 3, importing the output of step 2 and extracting geographic information; and 4, selecting a geocoding tool, converting the extracted address information into geographic coordinates and marking the coordinates on a map. The method makes full use of the strengths of search engines in the data extraction module, retrieving geographic information from web pages with suitable search query keywords. In the web page filtering module, a filtering algorithm excludes useless interfering data. By extracting geographic information effectively and dynamically from unstructured data sources such as web pages, the data achieve high completeness and timeliness at the same time. The method has high practical value.

Description

A method for automatically constructing a place name data set based on the network
Technical field
The invention belongs to the technical field of computer applications, and in particular relates to a method for automatically constructing a place name data set based on the network.
Background art
Place names play an important role in building geographic applications, for example location-based services on mobile devices. These needs call for a technical method that can build place name data sets automatically. The volume of valuable geographic information on the web is already huge and grows daily; how to obtain accurate and timely geographic information from the web to build a place name data set is a current challenge.
Spatial data sources fall into two classes: structured data sources and unstructured data sources. Many researchers have retrieved data from structured sources. They download geographic data files through links (for example DBpedia) or through interactive interfaces (for example LinkedGeoData, Wikimapia and OpenStreetMap; LinkedGeoData provides a SPARQL interface, while Wikimapia and OpenStreetMap each provide a RESTful API). These well-structured sources provide static information but cannot reflect the latest changes. Moreover, the data in sources such as OpenStreetMap and Wikimapia are contributed by individuals and are not verified by an authoritative body. The data in Google Maps, by contrast, are manually verified and have higher accuracy. For the same reason, Google Maps data are updated slowly, because verifying a new place usually takes time.
By contrast, the unstructured nature of web pages makes it easy for geographic information to change in real time, and the latest geographic information is often available on web pages.
Most web-based retrieval of place name information has to deal with ambiguous place names, because place names extracted from web pages are often ambiguous. For example, a place name may also have a non-geographic meaning, since places are often named after objects, people, external features or historical factors. Conversely, many places are named after better-known places, so two different places may share the same name. Some scholars have addressed place name ambiguity with supervised learning techniques, using co-occurrence models derived from Wikipedia.
In summary, current research on building place name data sets lacks a method that can effectively extract dynamically changing geographic information from unstructured data sources while also effectively resolving place name ambiguity.
Summary of the invention
The purpose of the invention is to propose a method for automatically constructing a place name data set based on the network, characterized in that the automatic construction of the place name data set comprises the following steps:
Step 1: use the Google search engine API to extract geospatial data from the Google database;
Step 2: filter irrelevant web pages out of the extracted data;
Step 3: import the output of step 2 and extract geographic information;
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map.
Step 1 specifically comprises the following steps:
Step A1: extract street names from OSM (OpenStreetMap). The OSM data are downloaded as an XML file, which is composed of three primitive data types: nodes, ways and relations. Each primitive carries a series of tags, and each tag is essentially a (k, v) pair. OSM is a leading VGI (Volunteered Geographic Information) project that aims to create a freely editable map platform of the whole world, open to all volunteers. It currently has more than 1.6 million registered users, nearly 30% of whom have made actual contributions to the project;
Step A2: determine the search keywords. A search engine query consists of three parts: street name, city name and business type. The street names are obtained in step A1. For the business types, a set of popular business types is first supplied manually, and missing types are then added based on the results displayed on the map;
Step A3: select a search engine and extract geospatial data from web search engines. This extraction depends on how the search engine works: its operation is divided into collecting information, organizing information, accepting queries and visualization. By search method, search engines are further divided into full-text search, directory indexes and meta search.
Step 2 selects a concrete filtering algorithm according to the specific goal and filters the returned results: the many unwanted items in the search engine results are filtered out and the wanted data are kept. The returned data include a large number of real estate listings, which mainly contain home addresses. Here, a search engine is used to search for a recently sold house listed on any real estate company website, so that the search results contain the URLs of all the real estate companies. The URLs of these websites are then extracted. Next, these real estate company URLs are automatically filtered out of the results of step 1, so that no time or resources are wasted parsing useless geographic information from real estate company websites.
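As an illustration of step A1, the following is a minimal sketch of pulling street names out of an OSM XML extract with the Python standard library. It assumes that streets are the `way` elements carrying both a `highway` tag and a `name` tag, which follows common OSM tagging practice but is not stated in the patent; the function name `extract_street_names` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_street_names(osm_xml: str) -> list[str]:
    """Return sorted, de-duplicated names of ways tagged as highways."""
    root = ET.fromstring(osm_xml)
    names = set()
    for way in root.iter("way"):
        # Every OSM primitive carries <tag k="..." v="..."/> children,
        # i.e. the (k, v) pairs mentioned in step A1.
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        # Assumption: a named way with a "highway" tag is a street.
        if "highway" in tags and tags.get("name"):
            names.add(tags["name"])
    return sorted(names)


# Hypothetical OSM extract with one street and one building.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="0.0" lon="0.0"/>
  <way id="10">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Main Street"/>
  </way>
  <way id="11">
    <tag k="building" v="yes"/>
    <tag k="name" v="City Hall"/>
  </way>
</osm>"""

print(extract_street_names(SAMPLE))
```

The building is skipped because it has no `highway` tag, so only the street name is kept for use in the step A2 queries.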
Step 3 comprises the following steps:
Step C1: during address extraction there are two cases: in the first, the whole address is on a single line; in the second, the address spans multiple lines;
Step C2: in the first case of step C1, check whether a line of the web page begins with a number and contains the city name, and require the length of the line to be below a given threshold; if the line exceeds the threshold, it is very unlikely to hold address information;
Step C3: in the second case of step C1, use the same method as in step C2 to decide whether two lines joined together represent one address: if the first line begins with a number, the second line contains the city name, and the combined length of the two lines is below the given threshold, the two lines are extracted together as an address;
Step C4: determine whether more than one address was extracted. If the web page contains only one address, the site title is taken as the place name, which is very likely to be correct. If the page contains multiple addresses, all of them are extracted and returned as an address list;
Step C5: when step C4 returns an address list, search for each address in the list; among all the returned web pages, if a page contains only one address and that address is identical to the searched address, the title of that website is taken as the place name;
Step C6: finally, obtain the corresponding place name for each address in the address list.
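The line-based rules of steps C1 to C3 can be sketched as follows. This is only an illustration under stated assumptions: the threshold is measured in characters (the embodiment sets it to 100 without naming a unit), city matching is a case-insensitive substring test, and the names `looks_like_address` and `extract_addresses` as well as the sample lines are hypothetical.

```python
import re

MAX_LEN = 100  # threshold value from the embodiment; unit assumed to be characters


def contains_city(text, cities):
    """Case-insensitive substring test for any known city name."""
    lowered = text.lower()
    return any(city.lower() in lowered for city in cities)


def looks_like_address(line, cities, max_len=MAX_LEN):
    """Step C2: starts with a digit, contains a city name, and is short enough."""
    return (
        re.match(r"\d", line) is not None
        and contains_city(line, cities)
        and len(line) < max_len
    )


def extract_addresses(lines, cities, max_len=MAX_LEN):
    """Scan page lines and collect single-line (C2) and two-line (C3) addresses."""
    lines = [line.strip() for line in lines]
    addresses = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if looks_like_address(line, cities, max_len):
            addresses.append(line)
            i += 1
            continue
        if i + 1 < len(lines):
            joined = f"{line}, {lines[i + 1]}"
            # Step C3: number on the first line, city on the second line.
            if (
                re.match(r"\d", line)
                and contains_city(lines[i + 1], cities)
                and len(joined) < max_len
            ):
                addresses.append(joined)
                i += 2
                continue
        i += 1
    return addresses


page_lines = [
    "Welcome to Joe's Diner",
    "12 Main Street, Springfield",
    "Open daily",
    "500 Oak Avenue",
    "Springfield",
    "Call us today",
]
print(extract_addresses(page_lines, ["Springfield"]))
```

A production extractor would need a stricter address grammar; the sketch only mirrors the three conditions the steps describe.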
Step 4 comprises the following steps:
Step D1: upload the data set to a geocoding tool so that the data are displayed in the tool;
Step D2: the geocoding tool automatically detects the location data and presents it in a tab;
Step D3: click the tab, and the corresponding information is displayed;
Step D4: with the location data automatically detected by the geocoding tool in step D2 and presented in a tab, select which data are shown and which are not, or select how they are displayed.
Beneficial effects of the invention: the proposed method for automatically constructing a place name data set based on the network makes full use of the strengths of search engines in the data extraction module, retrieving geographic information from web pages with suitable search keywords. In the web page filtering module, a filtering algorithm excludes useless interfering data. Geographic information is extracted effectively and dynamically from web pages, an unstructured data source, so that the data have both high completeness and good timeliness. This overcomes the shortcoming of most geographic data sets, which come from structured data sources and are therefore fairly complete but poor in timeliness. The method has high practical value.
Brief description of the drawings
Fig. 1 is a schematic diagram of the working principle of the search engine.
Fig. 2 is a schematic diagram of retrieving geographic information from web pages.
Fig. 3 is a schematic diagram of geographic information from the Google search engine marked on a map.
Fig. 4 is a schematic diagram of the extracted address information converted into geographic coordinates and marked on a map.
Detailed description of the embodiments
The invention proposes a method for automatically constructing a place name data set based on the network, which is explained below with reference to the drawings and an embodiment.
Fig. 1 is a schematic diagram of the working principle of the search engine. There are four modules in total: data acquisition (collecting information), web page filtering (organizing information), information extraction (accepting queries) and visualization. Data acquisition obtains relevant web pages according to keywords; the web page filtering module removes irrelevant pages, such as real estate home pages, from the pages obtained; information extraction extracts geographic information such as addresses and place names from the obtained web pages; visualization displays the obtained geographic information on a map for easy comparison and lookup.
The method solves the problem of how to choose query terms and, given the query terms, deletes useless results and filters out the useful URLs. After the URLs are extracted, they are parsed differently depending on the case, that is, according to whether the complete geographic information is on one line or on multiple lines. On this basis, whether one address or several are parsed, useful place names and the corresponding geographic information can finally be extracted and visualized through geocoding. In the embodiment below, 10 areas were selected for testing, and precision, recall and the F value (the harmonic mean of precision and recall) were computed. The conclusion is that, with the geographic information in Google panorama data as ground truth, the method is clearly better than OSM, roughly on par with Google Maps overall, and better than Google Maps in some areas, which shows that the method has high practical value.
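The evaluation measures named above can be written down as a short sketch. It assumes that extracted place names and ground-truth place names are compared by exact match as sets, which the patent does not specify; the function name and the example values are hypothetical and do not reproduce the patent's results.

```python
def precision_recall_f(extracted, ground_truth):
    """Compute precision, recall and their harmonic mean (the F value)."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical example: 4 extracted names, 3 true names, 2 in common.
p, r, f = precision_recall_f({"A", "B", "C", "D"}, {"A", "B", "E"})
print(p, r, f)
```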
Embodiment
In this embodiment, the method for automatically constructing a place name data set based on the network comprises the following steps:
Step 1: use the Google search engine API to extract geospatial data from the Google database;
Step 2: filter irrelevant web pages out of the extracted data;
Step 3: import the output of step 2 and extract geographic information;
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map. The concrete operations are as follows:
Step 1: this step comprises four concrete operations, A1 to A4. First, the results of steps A1 and A2 are summarized in the following table:
Step A3: select a search engine and extract geospatial data from web search engines. Because this depends largely on how the search engine works, the Google search engine is chosen here;
Step A4: search in the Google search engine with two query formats, <City Name> <Place Types> and <Street Names> <City Name> <Place Types>. The latter returns more complete search results, so the latter is adopted (as shown in Fig. 3).
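The adopted query format can be illustrated with a small sketch that combines every street name with every place type for one city. The function name `build_queries` and the sample values are hypothetical, and the sketch only builds the query strings; it does not call any search engine API.

```python
from itertools import product


def build_queries(street_names, city_name, place_types):
    """Build "<Street Names> <City Name> <Place Types>" query strings."""
    return [
        f"{street} {city_name} {place_type}"
        for street, place_type in product(street_names, place_types)
    ]


queries = build_queries(
    ["Main St", "Oak Ave"], "Springfield", ["restaurant", "bank"]
)
for query in queries:
    print(query)
```

The number of queries grows as the product of street count and place type count, which is worth keeping in mind when a search API is rate limited.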
Step 2: filter irrelevant web pages out of the extracted data;
Here, a search engine is used to search for a recently sold house listed on any real estate company website. For example, searching Google for "2117 Tondolea LN" returns a first page that contains real estate company websites such as Zillow, Redfin, Movoto, Trulia, Realtor and RE/MAX. The URLs of these websites, such as www.zillow.com and www.redfin.com, are then extracted. Next, these real estate company URLs are automatically filtered out of the results of step 1.
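A minimal sketch of this filtering step follows. It assumes the filter works at the level of host names with a leading "www." removed, which is one plausible reading of "filtering out the URLs"; the function names and the example URLs other than the two domains named above are hypothetical.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-case host of a URL without a leading "www."."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_blocklist(real_estate_result_urls):
    """Collect the domains returned when searching for a sold house."""
    return {domain_of(url) for url in real_estate_result_urls}


def filter_results(result_urls, blocklist):
    """Drop step 1 results that point at a blocked real estate domain."""
    return [url for url in result_urls if domain_of(url) not in blocklist]


blocklist = build_blocklist([
    "https://www.zillow.com/homedetails/example",
    "https://www.redfin.com/example",
])
step1_results = [
    "https://www.zillow.com/some-listing",
    "https://example.org/joes-diner",
    "https://redfin.com/another-listing",
]
print(filter_results(step1_results, blocklist))
```

Only exact host matches are removed here; subdomains other than "www." would need a suffix comparison.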
Step 3: import the output of step 2 and extract geographic information (as shown in Fig. 2);
Step 3 comprises six concrete steps, C1 to C6, described below:
Step C1: during address extraction two cases can occur: in the first, the whole address is on a single line; in the second, the address spans multiple lines;
Step C2: in the first case of step C1, check whether a line of the web page begins with a number and contains the city name; the length threshold is set to 100;
Step C3: in the second case of step C1, use the same method as in step C2 to decide whether two lines joined together represent one address;
Step C4: determine whether more than one address was extracted. If the web page contains only one address, the site title is assumed to be the place name; if the page contains multiple addresses, an address list is returned;
Step C5: when step C4 returns an address list, search for each address in the list; among all the returned web pages, if a page contains only one address and that address is identical to the searched address, the title of that website is taken as the place name;
The pseudo-code for steps C4 and C5 is as follows:
Step C6: finally, obtain the corresponding place name for each address in the address list.
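Steps C4 to C6 above can be sketched as follows. This is not the patent's own pseudo-code, only an illustration written from the step descriptions. A page is modelled as a (title, addresses) pair, and `search` is a hypothetical callable standing in for a search engine query plus address extraction on each returned page.

```python
from typing import Callable, Dict, List, Tuple

# A page is modelled as (site title, addresses extracted from the page).
Page = Tuple[str, List[str]]


def resolve_place_names(
    page: Page, search: Callable[[str], List[Page]]
) -> Dict[str, str]:
    """Map each address found on a page to a place name (steps C4 to C6)."""
    title, addresses = page
    # Step C4: a single address means the site title is taken as the place name.
    if len(addresses) == 1:
        return {addresses[0]: title}
    # Step C5: search each address and accept a returned page that holds
    # exactly this one address.
    names: Dict[str, str] = {}
    for address in addresses:
        for hit_title, hit_addresses in search(address):
            if hit_addresses == [address]:
                names[address] = hit_title
                break
    return names


# Hypothetical search results keyed by the searched address.
directory = ("Mall Directory", ["1 A St, Springfield", "2 B St, Springfield"])
fake_index = {
    "1 A St, Springfield": [directory, ("Joe's Diner", ["1 A St, Springfield"])],
    "2 B St, Springfield": [directory],
}
print(resolve_place_names(directory, lambda a: fake_index.get(a, [])))
```

Addresses for which no single-address page is found stay unresolved in this sketch, which is how the second address in the example is handled.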
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map (as shown in Fig. 4). The concrete operations are as follows:
Step D1: upload the data set to Google Fusion Tables (a geocoding tool); the uploaded data then appear in it;
Step D2: Google Fusion Tables automatically detects the location data and presents it in a tab named "Map of <location column name>";
Step D3: click the tab, and the corresponding information is displayed.
Step D4: choose which data are shown and which are not, and how they are displayed. In this work two kinds of pins are defined: blue pins with a Y label show the correct place names obtained in the information extraction step, while red pins with an N label show the information that was filtered out, such as houses for rent or for sale.
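Preparing the data set for upload in step D1 could look like the sketch below. It assumes the geocoding tool accepts a CSV file with one address column, and it encodes the Y/N distinction of step D4 as a `label` column; the column names, the function name `to_csv` and the sample records are all hypothetical. No geocoding service is called.

```python
import csv
import io


def to_csv(records):
    """Serialize (place name, address, keep) records to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "address", "label"])
    for name, address, keep in records:
        # Y marks extracted place names, N marks entries that were filtered out.
        writer.writerow([name, address, "Y" if keep else "N"])
    return buffer.getvalue()


records = [
    ("Joe's Diner", "1 A St, Springfield", True),
    ("House for sale", "9 Z Rd, Springfield", False),
]
print(to_csv(records))
```

Addresses that contain commas are quoted by the `csv` module, so the address column survives a round trip through a CSV reader.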
In summary, the invention exploits the strengths of search engines in the data extraction module, retrieving geographic information from web pages with suitable search query keywords. The web page filtering module proposes a filtering algorithm to exclude useless interfering data. The location information extraction module uses existing algorithms to extract useful information such as place names and addresses. The visualization module displays the extracted place names on a map in order to evaluate the quality of the resulting place name data set.

Claims (5)

1. A method for automatically constructing a place name data set based on the network, characterized in that the automatic construction of the place name data set based on the network comprises the following steps:
Step 1: use the Google search engine API to extract geospatial data from the Google database;
Step 2: filter irrelevant web pages out of the extracted data;
Step 3: import the output of step 2 and extract geographic information;
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map.
2. The method for automatically constructing a place name data set based on the network according to claim 1, characterized in that step 1 specifically comprises the following steps:
Step A1: extract street names from OSM (OpenStreetMap). The OSM data are downloaded as an XML file, which is composed of three primitive data types: nodes, ways and relations. Each primitive carries a series of tags, and each tag is essentially a (k, v) pair. OSM is a leading VGI (Volunteered Geographic Information) project that aims to create a freely editable map platform of the whole world, open to all volunteers. It currently has more than 1.6 million registered users, nearly 30% of whom have made actual contributions to the project;
Step A2: determine the search keywords. A search engine query consists of three parts: street name, city name and business type. The street names are obtained in step A1. For the business types, a set of popular business types is first supplied manually, and missing types are then added based on the results displayed on the map;
Step A3: select a search engine and extract geospatial data from web search engines. This extraction depends on how the search engine works: its operation is divided into collecting information, organizing information and accepting queries. By search method, search engines are further divided into full-text search, directory indexes and meta search.
3. The method for automatically constructing a place name data set based on the network according to claim 1, characterized in that step 2 selects a concrete filtering algorithm according to the specific goal and filters the returned results: the many unwanted items in the search engine results are filtered out and the wanted data are kept. The returned data include a large number of real estate listings, which mainly contain home addresses. Here, a search engine is used to search for a recently sold house listed on any real estate company website, so that the search results contain the URLs of all the real estate companies. The URLs of these websites are then extracted. Next, these real estate company URLs are automatically filtered out of the results of step 1, so that no time or resources are wasted parsing useless geographic information from real estate company websites.
4. The method for automatically constructing a place name data set based on the network according to claim 1, characterized in that step 3 comprises the following steps:
Step C1: during address extraction there are two cases: in the first, the whole address is on a single line; in the second, the address spans multiple lines;
Step C2: in the first case of step C1, check whether a line of the web page begins with a number and contains the city name, and require the length of the line to be below a given threshold; if the line exceeds the threshold, it is very unlikely to hold address information;
Step C3: in the second case of step C1, use the same method as in step C2 to decide whether two lines joined together represent one address: if the first line begins with a number, the second line contains the city name, and the combined length of the two lines is below the given threshold, the two lines are extracted together as an address;
Step C4: determine whether more than one address was extracted. If the web page contains only one address, the site title is taken as the place name, which is very likely to be correct. If the page contains multiple addresses, all of them are extracted and returned as an address list;
Step C5: when step C4 returns an address list, search for each address in the list; among all the returned web pages, if a page contains only one address and that address is identical to the searched address, the title of that website is taken as the place name;
Step C6: finally, obtain the corresponding place name for each address in the address list.
5. The method for automatically constructing a place name data set based on the network according to claim 1, characterized in that step 4 comprises the following steps:
Step D1: upload the data set to a geocoding tool so that the data are displayed in the tool;
Step D2: the geocoding tool automatically detects the location data and presents it in a tab;
Step D3: click the tab, and the corresponding information is displayed;
Step D4: with the location data automatically detected by the geocoding tool in step D2 and presented in a tab, select which data are shown and which are not, or select how they are displayed.
CN201610214120.0A 2016-04-07 2016-04-07 A method of constructing geographical name data collection automatically based on network Expired - Fee Related CN105975477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610214120.0A CN105975477B (en) 2016-04-07 2016-04-07 A method of constructing geographical name data collection automatically based on network

Publications (2)

Publication Number Publication Date
CN105975477A (en) 2016-09-28
CN105975477B (en) 2019-11-08

Family

ID=56989512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610214120.0A Expired - Fee Related CN105975477B (en) 2016-04-07 2016-04-07 A method of constructing geographical name data collection automatically based on network

Country Status (1)

Country Link
CN (1) CN105975477B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153865A (en) * 2017-12-22 2018-06-12 中山市小榄企业服务有限公司 A kind of network application acquisition system of internet
CN108334579A (en) * 2018-01-25 2018-07-27 孙如江 Place name identification number encoder, coding method and equipment based on space-time business
CN108984640A (en) * 2018-06-22 2018-12-11 华北电力大学 A kind of geography information acquisition methods excavated based on web data
CN109974726A (en) * 2017-12-28 2019-07-05 北京搜狗科技发展有限公司 A kind of road state determines method and device
CN112084389A (en) * 2020-08-17 2020-12-15 上海交通大学 Network crawler-based academic institution geographical position information extraction method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130046746A1 (en) * 2007-08-29 2013-02-21 Enpulz, L.L.C. Search engine with geographical verification processing
CN105335468A (en) * 2015-09-28 2016-02-17 北京信息科技大学 Geographic position entity normalized method based on Baidu map API

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宁登鹏: "垂直搜索引擎中的多元化信息融合检索研究", 《中国优秀硕士学位论文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN105975477B (en) 2019-11-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191108

Termination date: 20200407
