CN105975477A - Method for automatically constructing place name data sets on basis of network - Google Patents
- Publication number: CN105975477A
- Application number: CN201610214120.0A
- Authority: CN (China)
- Prior art keywords: data, address, name, network, information
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
Abstract
The invention discloses a method for automatically constructing place name data sets on the basis of a network, and belongs to the technical field of computer applications. The method comprises the following steps: 1, extracting geospatial data from the Google database by using the Google search engine API; 2, filtering out irrelevant web pages from the extracted data; 3, importing the output of step 2 and extracting geographic information; and 4, selecting a geocoding tool, converting the extracted address information into geographic coordinates and marking the coordinates on a map. In the data extraction module, the method takes full advantage of the search engine, retrieving geographic information from web pages with appropriate search query keywords. In the web page filter module, a filter algorithm excludes useless interfering data. By effectively and dynamically extracting geographic information from unstructured data sources such as web pages, the method yields data that is both highly complete and timely, and therefore has high practical value.
Description
Technical field
The invention belongs to the technical field of computer applications, and in particular relates to a method for automatically constructing place name data sets on the basis of a network.
Background art
Place names play an important role in building geographic applications of all kinds, for example location-based services on mobile devices. These demands call for a technical method that can build place name data sets automatically. The amount of valuable geographic information on the network is already huge and grows daily; how to obtain accurate and timely geographic information from the network to build place name data sets is a current difficulty.
Spatial data sources fall into two classes: structured data sources and unstructured data sources. Many researchers have retrieved data from structured data sources. They download geographic data files either through links (e.g. DBpedia) or through interactive interfaces (e.g. LinkedGeoData, Wikimapia and OpenStreetMap: LinkedGeoData provides a SPARQL interface, while Wikimapia and OpenStreetMap each provide a RESTful API). These well-structured data sources provide static information but cannot reflect the latest changes. Moreover, data such as that in OpenStreetMap and Wikimapia is added by individuals and is not verified by an authoritative body. The data in Google Maps, by contrast, is manually verified and has higher precision. For the same reason, the data in Google Maps is updated slowly, because verifying a new place usually takes some time.
By contrast, the unstructured nature of web pages lets geographic information change in real time, and the latest geographic information is often provided on web pages.
Most web-based information retrieval for place names has to deal with ambiguous place names, because place names extracted from web pages are often ambiguous. For example, a place name may also have a non-geographic meaning, because places are often named after objects, people, external features or historical factors. In addition, many places are named after more famous places, so two different places may share the same name. Some scholars have used supervised learning with co-occurrence models derived from Wikipedia to resolve place name ambiguity.
In summary, current research on building place name data sets lacks a method that can effectively extract dynamically changing geographic information from unstructured data sources while also effectively resolving place name ambiguity.
Summary of the invention
The purpose of the present invention is to propose a method for automatically constructing place name data sets on the basis of a network, characterized in that the method comprises the following steps:
Step 1: use the Google search engine API to extract geospatial data from the Google database;
Step 2: filter out irrelevant web pages from the extracted data;
Step 3: import the output of step 2 and extract geographic information;
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map.
Step 1 specifically comprises the following steps:
Step A1: extract street names from OSM (OpenStreetMap), i.e. download the OSM data as an XML file. The file is composed of three primitive data types: nodes, ways and relations. Each primitive carries a series of tags, and each tag is essentially one (k, v) pair. OSM is a leading VGI (Volunteered Geographic Information) project that aims to create a world map platform that all volunteers can edit freely; it currently has more than 1.6 million registered users, nearly 30% of whom have actually contributed to the project;
Step A2: determine the search keywords. A search engine query keyword consists of three parts: street name, city name and business type. The street names come from step A1. For business types, popular business types are first supplied manually, and missing types are then added based on the results later displayed on the map;
Step A3: select a search engine and extract geospatial data from the web search engine. The geospatial data obtained depends on how the search engine works. The operation of a search engine is divided into collecting information, organizing information, accepting queries and visualization; by search method, search engines are further divided into full-text search, directory index and meta search.
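Step A1 can be illustrated with a minimal sketch. It assumes that streets appear in the OSM XML file as ways carrying both a `highway` tag and a `name` tag; the function name `street_names` and the sample document (including its coordinates) are made up for illustration and do not come from the patent.

```python
import xml.etree.ElementTree as ET


def street_names(osm_xml: str) -> list:
    """Return the sorted, de-duplicated names of ways tagged as roads."""
    root = ET.fromstring(osm_xml)
    names = set()
    for way in root.iter("way"):
        # Each <tag> element is one (k, v) pair stored as XML attributes.
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        # Assumption: a named street is a way with both 'highway' and 'name'.
        if "highway" in tags and "name" in tags:
            names.add(tags["name"])
    return sorted(names)


# Hypothetical sample: one road and one building, both with a name tag.
SAMPLE = """<osm version="0.6">
  <node id="1" lat="0.0" lon="0.0"/>
  <way id="10">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Tondolea Lane"/>
  </way>
  <way id="11">
    <tag k="building" v="yes"/>
    <tag k="name" v="City Hall"/>
  </way>
</osm>"""

print(street_names(SAMPLE))
```

Only the road is returned; the named building is skipped because it has no `highway` tag. A real extract would be much larger and might be better handled with a streaming parser.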
In step 2, a concrete filter algorithm is selected according to the specific objective and applied to the returned results: unwanted data is filtered out of the results returned by the search engine and the wanted data is kept. The returned data includes a large number of real estate listings, which mainly contain home addresses. Here, the search engine is used to search for any recently sold house listed on real estate company websites, so that the search results obtained contain all the real estate company website addresses. The URLs of these websites are then extracted. Next, these real estate company websites are automatically filtered out of the results of step 1, which avoids wasting time and resources on parsing useless geographic information from real estate company websites.
Step 3 comprises the following steps:
Step C1: during address extraction there are two cases. In the first case the whole address is on one line; in the second case the address spans multiple lines;
Step C2: in the first case of step C1, judge whether a line in the web page begins with a digit and contains the city name. The length of the line should also be less than a given threshold; if the line exceeds the threshold, it is very unlikely to contain address information;
Step C3: in the second case of step C1, use the same method as in step C2 to decide whether two lines joined together represent one address: if the first line begins with a digit, the second line contains the city name, and the combined length of the two lines is less than the given threshold, the two lines are extracted together as an address;
Step C4: judge whether more than one address was extracted. If the web page contains only one address, the site title is taken as the place name, which is very likely to be correct. If the page contains multiple addresses, all of them are extracted and returned in an address list;
Step C5: when step C4 returns an address list, search for each address in the list. Among all returned web pages, if a page contains only one address and that address is identical to the searched address, the title of that website is taken as the place name;
Step C6: finally, obtain the corresponding place name for each address in the address list.
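The line tests in steps C2 and C3 can be sketched as follows. This is only an illustration under stated assumptions: the threshold is counted in characters and defaults to 100 (the value used in the embodiment), the city match is a case-insensitive substring test, and the names `is_address` and `extract_addresses` are hypothetical.

```python
def is_address(text: str, city: str, max_len: int = 100) -> bool:
    """Step C2: starts with a digit, contains the city name, and is short enough."""
    text = text.strip()
    return (
        bool(text)
        and text[0].isdigit()
        and city.lower() in text.lower()
        and len(text) < max_len
    )


def extract_addresses(lines, city: str, max_len: int = 100) -> list:
    """Scan page lines for one-line (C2) and two-line (C3) addresses."""
    lines = [ln.strip() for ln in lines]
    found = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if is_address(line, city, max_len):
            found.append(line)
            i += 1
            continue
        if i + 1 < len(lines):
            nxt = lines[i + 1]
            # Step C3: digit on the first line, city on the second,
            # combined length below the threshold.
            if (
                line[:1].isdigit()
                and city.lower() in nxt.lower()
                and len(line) + len(nxt) < max_len
            ):
                found.append(f"{line}, {nxt}")
                i += 2
                continue
        i += 1
    return found


# Hypothetical page text, split into lines.
page_lines = [
    "Joe's Diner",
    "12 Main St, Springfield, OH 45501",
    "Open daily",
    "34 Oak Ave",
    "Springfield, OH 45501",
    "2015 was a great year for us",
]
print(extract_addresses(page_lines, "Springfield"))
```

The last line begins with a digit but contains no city name, so it is rejected. Joining the two lines with a comma is a presentation choice of this sketch, not something the method prescribes.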
Step 4 comprises the following steps:
Step D1: upload the data set to a geocoding tool so that the data appears in it;
Step D2: the geocoding tool automatically detects the location data and presents it in a tab;
Step D3: click the tab and the corresponding information is displayed;
Step D4: once the geocoding tool has detected the location data and presented it in a tab as in step D2, select which data is shown and which is not, or select how it is shown.
The beneficial effects of the present invention are as follows. The proposed method for automatically constructing place name data sets on the basis of a network takes full advantage of the search engine in the data extraction module, retrieving geographic information from web pages with appropriate search keywords. In the web page filter module, a filter algorithm excludes useless interfering data. Geographic information is extracted effectively and dynamically from web pages, an unstructured data source, so the data is both highly complete and timely. This overcomes the shortcoming of most geographic data sets, which come from structured data sources and are fairly complete but poorly up to date. The method has high practical value.
Brief description of the drawings
Fig. 1 is a schematic diagram of the working principle of the search engine.
Fig. 2 is a schematic diagram of retrieving geographic information from web pages.
Fig. 3 is a schematic diagram of geographic information from the Google search engine marked on a map.
Fig. 4 is a schematic diagram of the extracted address information converted into geographic coordinates and marked on a map.
Detailed description of the embodiments
The present invention proposes a method for automatically constructing place name data sets on the basis of a network, which is explained below with reference to the accompanying drawings and an embodiment.
Fig. 1 is a schematic diagram of the working principle of the search engine. There are four modules in total: data acquisition (collecting information), web page filtering (organizing information), information extraction (accepting queries) and visualization. Data acquisition obtains relevant web pages from the web according to keywords. The web page filter module removes irrelevant pages, such as real estate pages, from the pages obtained. Information extraction extracts geographic information such as addresses and place names from the remaining web pages. Visualization displays the geographic information obtained on a map for easy comparison and lookup.
The method solves the problem of how to choose query terms and, given the query terms, how to delete useless results and keep useful URLs. After the URLs are extracted, pages are parsed differently depending on the case, namely whether the complete address is on one line or spans multiple lines. On this basis, whether one address or several are parsed, useful place names and the corresponding geographic information can ultimately be extracted and visualized through geocoding. In the embodiment below, 10 areas are selected for testing, and precision, recall and the F value (the harmonic mean of precision and recall) are calculated. The conclusion is that, taking the geographic information in Google's panoramic imagery as ground truth and comparing against OSM and Google Maps, the method is significantly better than the former and broadly on par with the latter, even outperforming the latter in some regions, which shows that the method has high practical value.
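For reference, the three evaluation measures named above can be computed as in the following sketch. It assumes that extracted place names and ground-truth place names are compared as exact strings, which is a simplification; the function name and the sample sets are hypothetical and are not results from the patent.

```python
from typing import Set, Tuple


def precision_recall_f(extracted: Set[str], ground_truth: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F), where F is the harmonic mean of the two."""
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Made-up example: 4 names extracted, 3 names in the ground truth, 2 in common.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f)
```

In this made-up example precision is 2/4, recall is 2/3, and the F value is 4/7.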
Embodiment
The method for automatically constructing place name data sets on the basis of a network comprises the following steps:
Step 1: use the Google search engine API to extract geospatial data from the Google database;
Step 2: filter out irrelevant web pages from the extracted data;
Step 3: import the output of step 2 and extract geographic information;
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map. The specific operations are as follows:
Step 1: this step comprises four concrete operation steps, A1 to A4. The results of steps A1 and A2 are first summarized in the following table:
Step A3: select a search engine and extract geospatial data from the web search engine. Because the data extracted depends to a large extent on how the search engine works, the Google search engine is selected here;
Step A4: search the Google search engine with two kinds of search keywords, <City Name> <Place Types> and <Street Names> <City Name> <Place Types>. The search results of the latter turn out to be more complete, so the latter is adopted (as shown in Fig. 3).
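The three-part keyword of step A4 can be generated mechanically once the street names and place types are known. The sketch below only builds the query strings; it makes no call to any search API, and the function name and sample values are hypothetical.

```python
from itertools import product


def build_queries(street_names, city_name, place_types):
    """Build one '<Street Name> <City Name> <Place Type>' query per combination."""
    return [
        f"{street} {city_name} {place_type}"
        for street, place_type in product(street_names, place_types)
    ]


# Hypothetical inputs: streets from step A1, manually supplied place types.
queries = build_queries(["Main St", "Oak Ave"], "Springfield", ["restaurant", "bank"])
for q in queries:
    print(q)
```

The number of queries grows as the product of the number of streets and the number of place types, so a real run would likely need batching or rate limiting.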
Step 2: filter out irrelevant web pages from the extracted data;
Here, the search engine is used to search for any recently sold house listed on real estate company websites. For example, searching Google for "2117 Tondolea LN" returns a first page of results that contains real estate company websites such as Zillow, Redfin, Movoto, Trulia, Realtor, RE/MAX and so on. The URLs of these websites, such as www.zillow.com and www.redfin.com, are then extracted. Next, these real estate company websites are automatically filtered out of the results of step 1.
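A minimal sketch of this filtering step follows. It assumes that results are available as full URLs and that blocking is done by host name, including subdomains; the function names are hypothetical, and the URL paths are invented for illustration.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host name without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_blocklist(probe_urls) -> set:
    """Collect the hosts returned when searching for a recently sold house."""
    return {host_of(u) for u in probe_urls if host_of(u)}


def filter_results(result_urls, blocklist) -> list:
    """Drop every result whose host is on the blocklist or is a subdomain of it."""
    kept = []
    for url in result_urls:
        host = host_of(url)
        blocked = any(host == b or host.endswith("." + b) for b in blocklist)
        if not blocked:
            kept.append(url)
    return kept


# Invented URLs: the probe results come from the real estate search,
# the remaining URLs stand in for step 1 results.
blocklist = build_blocklist([
    "https://www.zillow.com/homedetails/1",
    "https://www.redfin.com/listing/2",
])
kept = filter_results(
    ["https://www.zillow.com/a", "https://example.org/cafe", "https://m.redfin.com/b"],
    blocklist,
)
print(kept)
```

Matching on the host rather than the full URL means that every page of a real estate site is skipped before it is downloaded and parsed.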
Step 3: import the output of step 2 and extract geographic information (as shown in Fig. 2);
Step 3 comprises six concrete steps, C1 to C6, described below:
Step C1: during address extraction two cases are encountered. In the first case the whole address is on one line; in the second case the address spans multiple lines;
Step C2: in the first case of step C1, judge whether a line in the web page begins with a digit and contains the city name, with the threshold set to 100;
Step C3: in the second case of step C1, use the same method as in step C2 to decide whether two lines joined together represent one address;
Step C4: judge whether more than one address was extracted. If the web page contains only one address, the site title is assumed to be the place name; if it contains multiple addresses, an address list is returned;
Step C5: when step C4 returns an address list, search for each address in the list. Among all returned web pages, if a page contains only one address and that address is identical to the searched address, the title of that website is taken as the place name;
The pseudo-code for steps C4 and C5 is as follows:
Step C6: finally, obtain the corresponding place name for each address in the address list.
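The patent's own pseudo-code for steps C4 and C5 is not reproduced in this text. As a rough stand-in, the logic described in steps C4 to C6 might be sketched as below. The `search` callable, the `Page` tuple layout and the exact string comparison of addresses are all assumptions made for illustration; real addresses would probably need normalizing before comparison.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical page model: (page title, addresses extracted from the page).
Page = Tuple[str, List[str]]


def resolve_place_names(page: Page, search: Callable[[str], Iterable[Page]]) -> Dict[str, str]:
    """Map each address found on a page to a place name (steps C4 to C6)."""
    title, addresses = page
    if len(addresses) == 1:
        # Step C4: a single address, so the page title is taken as the place name.
        return {addresses[0]: title}
    names = {}
    for address in addresses:
        # Step C5: search for the address and look for a returned page that
        # contains exactly this one address.
        for hit_title, hit_addresses in search(address):
            if hit_addresses == [address]:
                names[address] = hit_title
                break
    # Step C6: addresses without a matching page are simply left unnamed.
    return names


# Made-up search index standing in for a real search engine.
index = {
    "1 A St, X": [("Mall", ["1 A St, X", "2 B St, X"]), ("Cafe Uno", ["1 A St, X"])],
    "2 B St, X": [("Directory", ["2 B St, X", "9 Z St, X"])],
}


def fake_search(address: str):
    return index.get(address, [])


print(resolve_place_names(("Directory", ["1 A St, X", "2 B St, X"]), fake_search))
```

In the made-up index only "1 A St, X" has a page dedicated to it, so only that address receives a place name.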
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map (as shown in Fig. 4). The concrete operation steps are as follows:
Step D1: upload the data set to Google Fusion Tables (a geocoding tool); the uploaded data appears in it;
Step D2: Google Fusion Tables automatically detects the location data and presents it in a tab called "Map of <location column name>";
Step D3: click the tab and the corresponding information is displayed.
Step D4: select which data is shown and which is not, and how it is shown. Two kinds of pins are defined in this work: a blue pin with a Y label marks a correct place name obtained in the information extraction step, while a red pin with an N label marks information that was filtered out, such as houses for rent or for sale.
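Before the upload in step D1, the extracted records have to be written in a tabular form that a geocoding tool can read. The sketch below produces CSV text with a Y/N label column matching the two pin types of step D4. The column names and the record layout are assumptions; the patent does not specify a file format.

```python
import csv
import io


def to_csv(records) -> str:
    """Serialize (place name, address, kept) records as CSV with a Y/N label."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "address", "label"])
    for name, address, kept in records:
        # Y marks a place name kept by information extraction,
        # N marks an entry that was filtered out.
        writer.writerow([name, address, "Y" if kept else "N"])
    return buf.getvalue()


# Made-up records for illustration.
csv_text = to_csv([
    ("Cafe Uno", "1 A St, X", True),
    ("", "2 B St, X", False),
])
print(csv_text)
```

The address column is the one a geocoding tool would be expected to detect as location data; the label column can then drive the choice of pin style.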
In summary, the present invention takes full advantage of the search engine in the data extraction module, retrieving geographic information from web pages with appropriate search query keywords. The web page filter module proposes a filter algorithm to exclude useless interfering data. The location information extraction module uses existing algorithms to extract useful information such as place names and addresses. The visualization module displays the extracted place names on a map in order to evaluate the quality of the place name data set produced.
Claims (5)
1. A method for automatically constructing place name data sets on the basis of a network, characterized in that the method comprises the following steps:
Step 1: use the Google search engine API to extract geospatial data from the Google database;
Step 2: filter out irrelevant web pages from the extracted data;
Step 3: import the output of step 2 and extract geographic information;
Step 4: select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map.
2. The method for automatically constructing place name data sets on the basis of a network according to claim 1, characterized in that step 1 specifically comprises the following steps:
Step A1: extract street names from OSM (OpenStreetMap), i.e. download the OSM data as an XML file. The file is composed of three primitive data types: nodes, ways and relations. Each primitive carries a series of tags, and each tag is essentially one (k, v) pair. OSM is a leading VGI (Volunteered Geographic Information) project that aims to create a world map platform that all volunteers can edit freely; it currently has more than 1.6 million registered users, nearly 30% of whom have actually contributed to the project;
Step A2: determine the search keywords. A search engine query keyword consists of three parts: street name, city name and business type. The street names come from step A1. For business types, popular business types are first supplied manually, and missing types are then added based on the results later displayed on the map;
Step A3: select a search engine and extract geospatial data from the web search engine. The geospatial data obtained depends on how the search engine works. The operation of a search engine is divided into collecting information, organizing information and accepting queries; by search method, search engines are further divided into full-text search, directory index and meta search.
3. The method for automatically constructing place name data sets on the basis of a network according to claim 1, characterized in that, in step 2, a concrete filter algorithm is selected according to the specific objective and applied to the returned results: unwanted data is filtered out of the results returned by the search engine and the wanted data is kept. The returned data includes a large number of real estate listings, which mainly contain home addresses. Here, the search engine is used to search for any recently sold house listed on real estate company websites, so that the search results obtained contain all the real estate company website addresses; the URLs of these websites are then extracted. Next, these real estate company websites are automatically filtered out of the results of step 1, which avoids wasting time and resources on parsing useless geographic information from real estate company websites.
4. The method for automatically constructing place name data sets on the basis of a network according to claim 1, characterized in that step 3 comprises the following steps:
Step C1: during address extraction there are two cases. In the first case the whole address is on one line; in the second case the address spans multiple lines;
Step C2: in the first case of step C1, judge whether a line in the web page begins with a digit and contains the city name. The length of the line should also be less than a given threshold; if the line exceeds the threshold, it is very unlikely to contain address information;
Step C3: in the second case of step C1, use the same method as in step C2 to decide whether two lines joined together represent one address: if the first line begins with a digit, the second line contains the city name, and the combined length of the two lines is less than the given threshold, the two lines are extracted together as an address;
Step C4: judge whether more than one address was extracted. If the web page contains only one address, the site title is taken as the place name, which is very likely to be correct. If the page contains multiple addresses, all of them are extracted and returned in an address list;
Step C5: when step C4 returns an address list, search for each address in the list. Among all returned web pages, if a page contains only one address and that address is identical to the searched address, the title of that website is taken as the place name;
Step C6: finally, obtain the corresponding place name for each address in the address list.
5. The method for automatically constructing place name data sets on the basis of a network according to claim 1, characterized in that step 4 comprises the following steps:
Step D1: upload the data set to a geocoding tool so that the data appears in it;
Step D2: the geocoding tool automatically detects the location data and presents it in a tab;
Step D3: click the tab and the corresponding information is displayed;
Step D4: once the geocoding tool has detected the location data and presented it in a tab as in step D2, select which data is shown and which is not, or select how it is shown.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610214120.0A CN105975477B (en) | 2016-04-07 | 2016-04-07 | A method of constructing geographical name data collection automatically based on network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610214120.0A CN105975477B (en) | 2016-04-07 | 2016-04-07 | A method of constructing geographical name data collection automatically based on network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105975477A true CN105975477A (en) | 2016-09-28 |
CN105975477B CN105975477B (en) | 2019-11-08 |
Family
ID=56989512
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610214120.0A Expired - Fee Related CN105975477B (en) | 2016-04-07 | 2016-04-07 | A method of constructing geographical name data collection automatically based on network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975477B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108153865A (en) * | 2017-12-22 | 2018-06-12 | 中山市小榄企业服务有限公司 | A kind of network application acquisition system of internet |
CN108334579A (en) * | 2018-01-25 | 2018-07-27 | 孙如江 | Place name identification number encoder, coding method and equipment based on space-time business |
CN108984640A (en) * | 2018-06-22 | 2018-12-11 | 华北电力大学 | A kind of geography information acquisition methods excavated based on web data |
CN109974726A (en) * | 2017-12-28 | 2019-07-05 | 北京搜狗科技发展有限公司 | A kind of road state determines method and device |
CN112084389A (en) * | 2020-08-17 | 2020-12-15 | 上海交通大学 | Network crawler-based academic institution geographical position information extraction method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130046746A1 (en) * | 2007-08-29 | 2013-02-21 | Enpulz, L.L.C. | Search engine with geographical verification processing |
CN105335468A (en) * | 2015-09-28 | 2016-02-17 | 北京信息科技大学 | Geographic position entity normalized method based on Baidu map API |
Non-Patent Citations (1)
Title |
---|
宁登鹏: "垂直搜索引擎中的多元化信息融合检索研究", 《中国优秀硕士学位论文数据库 信息科技辑》 * |
Also Published As
Publication number | Publication date |
---|---|
CN105975477B (en) | 2019-11-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20191108; Termination date: 20200407 |