CN103077201B

CN103077201B - Unknown location estimation method based on Internet active iterative detection

Info

Publication number: CN103077201B
Application number: CN201210579579.2A
Authority: CN
Inventors: 呙维; 黄亮; 朱欣焰; 陈旭
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2012-12-27
Filing date: 2012-12-27
Publication date: 2016-03-30
Anticipated expiration: 2032-12-27
Also published as: CN103077201A

Abstract

The invention relates to an unknown location estimation method based on Internet active iterative detection, comprising the following steps: 1) check the location entered by the user and, if the database query fails, use a web search engine to obtain a collection of web pages related to the location; 2) extract the location descriptions in the web pages and classify them; 3) calculate the search credibility rate C_s of the results and, if C_s meets the threshold C_min, skip to step 5; 4) repeat steps 1 to 3 for the fuzzy locations in the search results until the credibility rate meets the threshold or the search limit is reached; 5) calculate the geographic range of each location description and fuse the ranges to obtain the approximate geographic range of the target location. The invention makes full use of the abundant and dynamically changing geographic knowledge resources on the Internet to estimate the approximate geographic range of an unknown location. For the varied textual location descriptions found on the Internet, a semantics-based multi-scale location extraction method is adopted, and the approximate geographic range of the location is estimated with the Point-Radius algorithm.

Description

An Unknown Position Estimation Method Based on Internet Active Iterative Detection

Technical Field

The invention relates to a method for estimating an unknown location, and in particular to an unknown location estimation method based on Internet active iterative detection.

Background Art

With the continuous development and improvement of positioning technologies such as GPS, the application fields of location-based services (LBS) keep expanding, for example electronic map service platforms (Baidu Maps, Google Maps, Bing Maps, etc.), tourism information query systems, point-of-interest query systems for daily life, traffic query systems, and social networks. These platforms and systems provide location queries in two main ways: one uses GPS positioning, map operations, and the like to obtain fairly precise coordinates for the query; the other uses natural language location descriptions. Such qualitative or semi-quantitative descriptions carry several kinds of uncertainty, but they match human habits of expression and cognition. To serve natural language location queries, a location database must store the mapping between location names and geographic ranges. Because existing location databases are costly and slow to build, limited in scale, and hard to update, they cannot store every location name and instead concentrate on collecting and storing important locations such as major place names, addresses, and prominent POIs. As a result, the large number of less prominent, less important locations encountered in daily life cannot be queried, which conflicts with the demand for comprehensive, multi-level, multi-granularity location services. (References: Gu Jing, Research and Development of a Location-Based Information Service Application System [D]. Xidian University, 2004; Xia Baoguo, Design and Implementation of a GIS-Based Wuhan Tourism Information Query System [D]. Huazhong University of Science and Technology, 2006; Gao Weisi, Location-Based Services and the Design of an Urban Traffic Navigation System [D]. Yunnan University, 2011; Yang Yuyao et al., A Mobile Internet Social Model Based on Geographic Location Information [J]. Computer Research and Development, 2011.)

As a large knowledge base, the Internet provides rich geographic knowledge and can serve as an extended data source for location services. Using location reference information found through web search requires natural language understanding to extract location descriptions from large volumes of text. Natural language understanding covers the theories and methods that enable effective communication between humans and computers in natural language; for location descriptions, it mainly means recognizing location names and location relations. For location name recognition, existing research focuses on extracting geographic named entities or place names, using two main approaches. The rule-based approach builds a corpus and construction rules for geographic named entities or place names and recognizes them by rule matching; it places strict requirements on the concept construction rules, which improves precision but greatly reduces recall and makes it hard to handle fuzzy locations and new locations. The statistics-based approach ignores syntactic and semantic information, so it inevitably has trouble capturing low-frequency terms and suffers from noise introduced by adjacent high-frequency words. For location relation recognition, existing research mainly extracts basic spatial relations (topological, metric, directional, etc.), again using two main approaches. The sentence-analysis approach requires a thorough understanding of syntactic structure and sentence semantics and is both brittle and prone to ambiguity. The pattern-based approach avoids full sentence analysis, but because natural language is so expressive, the same information can be phrased in many ways and the number of patterns grows rapidly. (References: Le Xiaoqiu et al., Extraction of Natural Language Spatial Concepts Based on Spatial Semantic Roles [J]. Journal of Wuhan University, Information Science Edition, 2005; Jiang Lin et al., Acquisition and Verification of Geographic Entity Concepts and Their Location Relations [J]. Computer Science, 2007; Li Lishuang et al., Recognition of Place Names in Chinese Text Based on Support Vector Machines [J]. Journal of Dalian University of Technology, 2007; Li Hanjing, Research on Spatial Concept Modeling Based on Natural Language Processing [D]. Harbin Institute of Technology, 2007; Li Yusen, Research on GIS-Oriented Geographic Named Entity Recognition [J]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2008; Ma Long, Research on Chinese Place Name Recognition Based on Conditional Random Field Models [D]. Dalian University of Technology, 2009; Tang Xuri et al., Research on Discourse-Based Chinese Place Name Recognition [J]. Journal of Chinese Information Processing, 2010; Jiang Wenming, Research on Extraction Methods for Spatial Orientation Relations in Chinese Text [D]. Nanjing Normal University, 2010; Shen Qijun, Research on Spatial Relation Annotation Methods for Chinese Text [D]. Nanjing Normal University, 2010; Zhang Xueying et al., A Rule-Based Method for Parsing Chinese Address Elements [J]. Journal of Geo-Information Science, 2010; Li Haiguang, Research on Chinese Named Entity Relation Extraction Based on Location and Semantic Features [D]. Hefei University of Technology, 2011; Du Ping et al., Chinese Place Name Recognition and Disambiguation: A Case Study of Administrative Division Names at and above the County Level in China [J]. Remote Sensing Technology and Application, 2011.)

Location databases are limited in scale and hard to update. Geographic queries that rely on them, fuzzy location queries in particular, may fail to recognize a location name or may find that its coverage is missing, so they cannot fully meet user needs. The Internet contains a wealth of geographic knowledge and can supply a large amount of descriptive information about locations of interest, which can be used to estimate the coverage of "unknown" locations. How to search the Internet for location-related information and derive from it the approximate geographic range of an "unknown" location is the main work of the present invention.

Summary of the Invention

The invention mainly solves the technical problems of the prior art by providing a method that makes full use of the abundant, dynamically changing geographic knowledge resources on the Internet to estimate the approximate range of a target location.

The above technical problem is mainly solved by the following technical solution:

An unknown location estimation method based on Internet active iterative detection, characterized in that it comprises the following steps:

Step 1: check the location query term entered by the user; if geographic coverage for the location cannot be obtained from the spatial database, actively start Internet iterative detection, that is, use a web search engine to crawl information related to the target location from the Internet with the target location as the topic;

Step 2: perform the initial detection with the location query term as the topic, using the web search engine to obtain from the Internet a collection of web pages containing descriptions of the target location;

Step 3: perform geographic location parsing on the web documents obtained in step 2, that is, extract natural language location descriptions from them, each comprising a reference location and a spatial relation;

Step 4: classify the natural language location descriptions obtained in step 3; if geographic coverage for the reference location of a description can be obtained from the location database, the description is stored in the precise description set P, otherwise in the fuzzy description set A;

Step 5: evaluate the current search credibility rate C_s; if C_s is less than the search credibility threshold C_min, conduct a new round of Internet text search with the reference locations in the fuzzy description set A as topics; if C_s is greater than or equal to C_min, skip to step 7;

Step 6: repeat steps 1 to 5 until the credibility rate of each round of search results meets the threshold or the search limit is reached;

Step 7: calculate the approximate geographic range and the credibility of every location description;

Step 8: integrate and refine the geographic coverage of the multiple location descriptions to obtain the geographic range of the target location.

In step 3 of the above unknown location estimation method based on Internet active iterative detection, recognizing natural language location descriptions mainly involves recognizing location names and spatial relations. A semantics-based multi-scale extraction method is used to extract the descriptions, with the following sub-steps:

Step 3.1: build a corpus of location descriptions that stores the characteristic vocabulary expressing location names and spatial relations, together with the syntactic patterns of location descriptions; the corpus can be built through manual induction and machine learning.

Step 3.2: with the support of the corpus, perform pattern matching on the web text to obtain location descriptions;

Step 3.3: disambiguate place names on the basis of geographic and non-geographic semantics.
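Steps 3.1 and 3.2 can be illustrated with a minimal sketch. It assumes a toy English corpus with a handful of relation words and a single syntactic pattern; the names `RELATIONS`, `PATTERN`, `Description`, and `extract` are hypothetical and not taken from the patent, and the disambiguation of step 3.3 is not covered.

```python
import re
from typing import List, NamedTuple, Optional


class Description(NamedTuple):
    target: str                   # location being described
    relation: str                 # spatial relation word
    distance_km: Optional[float]  # metric part, if stated
    reference: str                # reference location name


# Hypothetical corpus entries (step 3.1): relation words, longest first so
# that "northeast" is tried before "north".
RELATIONS = ["northeast", "northwest", "southeast", "southwest",
             "north", "south", "east", "west", "near", "inside"]

# One hypothetical syntactic pattern: "<Target> is [<d> km] <relation> [of] <Reference>"
PATTERN = re.compile(
    r"(?P<target>[A-Z][\w ]*?) is "
    r"(?:(?P<dist>\d+(?:\.\d+)?) ?km )?"
    r"(?P<rel>" + "|".join(RELATIONS) + r") (?:of )?"
    r"(?P<ref>[A-Z][\w ]*)"
)


def extract(text: str) -> List[Description]:
    """Step 3.2: match the pattern against text and collect descriptions."""
    found = []
    for m in PATTERN.finditer(text):
        dist = float(m.group("dist")) if m.group("dist") else None
        found.append(Description(m.group("target").strip(), m.group("rel"),
                                 dist, m.group("ref").strip()))
    return found


results = extract("Luojia Hill is 2 km east of Wuhan Railway Station. "
                  "The cafe is near Wuhan University.")
```

A real corpus would hold many patterns per language, which is exactly the pattern-explosion trade-off noted in the background section.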

In step 4 of the above unknown location estimation method based on Internet active iterative detection, the precondition for estimating the target location from a reference location and a spatial relation is that a precise geographic range for the reference location can be obtained from the location database. A single location description is expressed according to Formula 1, where RO is the name of the reference location, SR is the spatial relation, T is the time at which the location description occurred, C is the credibility of the location description, and S is the search reference of the reference object RO. The first K location descriptions Loc_i in the extraction results are taken and classified according to the precondition: when Loc_i.RO satisfies the precondition, Loc_i is stored in the precise description set P; otherwise it is stored in the fuzzy description set A.

$Loc = \{RO, SR, T, C, S\}$ Formula 1
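The record of Formula 1 and the split into P and A can be sketched as follows. The field names follow the formula; the `gazetteer` dictionary standing in for the location database and the function name `classify` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Loc:
    RO: str                   # reference location name
    SR: str                   # spatial relation
    T: Optional[str] = None   # time the description occurred
    C: float = 1.0            # credibility of the description
    S: Any = None             # search reference for RO, filled in by later rounds


def classify(descriptions: List[Loc], gazetteer: Dict[str, Any],
             k: int) -> Tuple[List[Loc], List[Loc]]:
    """Split the first k descriptions into precise set P and fuzzy set A."""
    P, A = [], []
    for loc in descriptions[:k]:
        # Precondition: the reference location has coverage in the database.
        (P if loc.RO in gazetteer else A).append(loc)
    return P, A


# Hypothetical database content: name -> (latitude, longitude).
gazetteer = {"Wuhan University": (30.54, 114.36)}
P, A = classify(
    [Loc("Wuhan University", "near"), Loc("Old Market", "east"), Loc("Pier 3", "north")],
    gazetteer,
    k=2,
)
```

With `k=2` only the first two descriptions are considered, so "Pier 3" is dropped rather than classified.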

In step 5 of the above unknown location estimation method based on Internet active iterative detection, the current search credibility rate C_s is evaluated as follows. The search credibility rate C_s is defined as the evaluation index: it is the ratio of the sum of the credibilities of all location descriptions in P to the total number of location descriptions, as shown in Formula 2, where m is the number of location descriptions in P, K is the total number of location descriptions, and Loc_i.C is the credibility of an individual location description.

$C_{s} = \frac{\sum_{i = 0}^{m - 1} Loc_{i}.C}{K}$ Formula 2

The credibility of a location description is calculated according to Formula 3, where ε is the attenuation parameter and n is the number of searches. The credibility of a location description is set to 1 at the first search and decays as the number of searches increases.

$Loc_{i}.C = 1 \cdot \varepsilon^{n}$ Formula 3

When C_s meets the minimum credibility threshold C_min, the precise description set P is output directly for estimating the target location. When C_s does not meet the condition, repeated iterative Internet searches are used to secure the search credibility rate: the fuzzy reference locations in A are taken as topics for a new round of Internet search, the geographic range of each reference location is first estimated from web resources, and the reference location is then used to estimate the target location.
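A small numeric sketch of Formulas 2 and 3 follows. It assumes that the search count n starts at 0 for the first search, which is one reading of "credibility is 1 at the first search"; the function names and the sample values of ε and C_min are illustrative only.

```python
import math
from typing import List


def description_credibility(n: int, epsilon: float) -> float:
    """Formula 3: credibility decays as epsilon**n (n = 0 assumed for the first search)."""
    return 1.0 * epsilon ** n


def search_credibility(precise_credibilities: List[float], total_k: int) -> float:
    """Formula 2: sum of credibilities in P divided by the total number K."""
    if total_k == 0:
        return 0.0  # no descriptions at all, nothing can be trusted
    return sum(precise_credibilities) / total_k


C_MIN = 0.5      # assumed threshold
EPSILON = 0.8    # assumed attenuation parameter

# First search (n = 0): 3 of 5 descriptions have a known reference location.
first = search_credibility([description_credibility(0, EPSILON)] * 3, 5)

# Second search (n = 1): same counts, but each description is worth 0.8.
second = search_credibility([description_credibility(1, EPSILON)] * 3, 5)

first_ok = first >= C_MIN    # threshold met, estimate directly from P
second_ok = second >= C_MIN  # threshold missed, another round would be needed
```

The same proportion of precise descriptions therefore passes the threshold in the first round (0.6) but fails it one round later (about 0.48), which is how the decay limits how much weight deep searches receive.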

In the above unknown location estimation method based on Internet active iterative detection, step 6 is the iterative search for fuzzy reference locations. Following the processing of steps 4 and 5, each search result is expressed according to Formula 4, where n is the search count, m is the index of the location within the current search round, P is the precise description set, A is the fuzzy description set, and C_s is the search credibility rate.

$WS[n][m] = \{P, A, C_{s}\}$ Formula 4

The iterative search process comprises the following sub-steps:

Step 6.1: store the fuzzy location descriptions WS[0][0].A of the target location search result in the search set Q, and set n=0, m=0;

Step 6.2: take a fuzzy description set WS[n][m].A from Q and check whether n+1 has reached the search limit; if so, exit the search;

Step 6.3: take each location description Loc_i in WS[n][m].A in turn and perform the (n+1)-th search on it, obtain the search result WS[n+1][i], and attach it to the description as the search reference of its reference object RO, that is, Loc_i.S = WS[n+1][i];

Step 6.4: remove the searched fuzzy description set WS[n][m].A from Q, and check whether WS[n+1][i].C_s meets the threshold C_min; if not, put WS[n+1][i].A into the search set Q;

Step 6.5: check whether any fuzzy description set remains in Q; if so, repeat steps 6.2 to 6.4 to continue the iterative search.
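Steps 6.1 to 6.5 amount to a breadth-first traversal driven by the set Q. The sketch below is one possible reading: the `search` callable stands in for a full crawl, extract, and classify round and is purely hypothetical, and results of round n+1 are appended to a single list per round (rather than indexed by i within each A) so that entries from different parents do not collide.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Loc:
    RO: str          # reference location name
    S: Any = None    # search reference, set in step 6.3


@dataclass
class SearchResult:  # Formula 4: {P, A, C_s}
    P: List[Loc]
    A: List[Loc]
    C_s: float


def iterative_search(first: SearchResult,
                     search: Callable[[str, int], SearchResult],
                     c_min: float, limit: int) -> List[List[SearchResult]]:
    """Return WS, where WS[n] lists the results of search round n."""
    WS = [[first]]
    Q = deque([(0, 0)])                  # step 6.1: queue WS[0][0].A
    while Q:                             # step 6.5: continue while Q is non-empty
        n, m = Q.popleft()               # steps 6.2 and 6.4: take and remove from Q
        if n + 1 >= limit:               # step 6.2: search limit reached
            break
        if len(WS) == n + 1:
            WS.append([])
        for loc in WS[n][m].A:           # step 6.3: search each fuzzy reference
            result = search(loc.RO, n + 1)
            WS[n + 1].append(result)
            loc.S = result
            if result.C_s < c_min:       # step 6.4: still not credible enough
                Q.append((n + 1, len(WS[n + 1]) - 1))
    return WS


# Canned results in place of real web searches (illustration only).
canned = {"Old Market": SearchResult(P=[Loc("City Hall")], A=[], C_s=0.8)}
first = SearchResult(P=[], A=[Loc("Old Market")], C_s=0.0)
WS = iterative_search(first, lambda name, n: canned[name], c_min=0.5, limit=3)
```

Because Q is processed first-in, first-out, the round number n never decreases, so stopping at the first entry that hits the limit ends the whole search.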

In step 7 of the above unknown location estimation method based on Internet active iterative detection, the fuzzy location descriptions in the k-th search result depend on the (k+1)-th search result, so the calculation proceeds in reverse order, starting the geographic range calculation from the last search. It comprises the following sub-steps:

Step 7.1: let n be the number of searches in the search results WS and m the number of locations searched in the n-th search, m=WS[n-1].size; define a geographic range set FC that stores the geographic range of each search result;

Step 7.2: take the search result WS[n-1][m-1] for the m-th location of the n-th search;

Step 7.3: take each location Loc_y in WS[n-1][m-1].P in turn, query the coordinates of its reference location in the location database, and use the Point-Radius algorithm to calculate its geographic coverage FP(y) and credibility CP(y);

Step 7.4: take each location Loc_x in WS[n-1][m-1].A in turn and use Loc_x.S to look up the coordinates of its reference location in the geographic range set FC; if the coordinates are obtained, use the Point-Radius algorithm to calculate its geographic coverage FA(x) and credibility CA(x);

Step 7.5: fuse the geographic ranges of all locations in P and A to obtain the geographic range FC(WS[n-1][m-1]) of the current search result;

Step 7.6: check whether m-1 is greater than 0; if so, move on to the next search result by setting m=m-1 and return to step 7.2; if it is less than or equal to 0, go to the next step;

Step 7.7: check whether n-1 is greater than 0; if so, move on to the previous search round by setting n=n-1 and m=WS[n-1].size, and return to step 7.2; if it is less than or equal to 0, go to the next step;

Step 7.8: output FC(WS[0][0]).
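The reverse-order traversal of steps 7.1 to 7.8 can be sketched independently of the geometry. In the sketch below the per-description estimator and the fusion rule are passed in as hypothetical callables `estimate` and `fuse`, since the text does not fix their details here; coverage is represented as a simple one-dimensional interval purely to keep the example checkable, and the credibilities CP and CA are omitted.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Loc:
    RO: str          # reference location name
    S: Any = None    # search result attached during the iterative search


@dataclass(eq=False)  # identity-based hashing so results can be dictionary keys
class SearchResult:
    P: List[Loc] = field(default_factory=list)
    A: List[Loc] = field(default_factory=list)


def compute_ranges(WS: List[List[SearchResult]], gazetteer: Dict[str, Any],
                   estimate: Callable[[Any, Loc], Any],
                   fuse: Callable[[List[Any]], Any]) -> Any:
    FC: Dict[SearchResult, Any] = {}                # step 7.1: range per search result
    for n in range(len(WS) - 1, -1, -1):            # step 7.7: last round first
        for m in range(len(WS[n]) - 1, -1, -1):     # step 7.6: last location first
            result = WS[n][m]                       # step 7.2
            covers = [estimate(gazetteer[loc.RO], loc)
                      for loc in result.P]          # step 7.3: database lookup
            for loc in result.A:                    # step 7.4: lookup in FC via Loc.S
                ref = FC.get(loc.S) if loc.S is not None else None
                if ref is not None:
                    covers.append(estimate(ref, loc))
            FC[result] = fuse(covers) if covers else None   # step 7.5
    return FC[WS[0][0]]                             # step 7.8


# Toy data: intervals (low, high) instead of real geographic coverage.
inner = SearchResult(P=[Loc("City Hall")])
outer = SearchResult(A=[Loc("Old Market", S=inner)])
widen = lambda ref, loc: (ref[0] - 1.0, ref[1] + 1.0)
bounding = lambda covers: (min(c[0] for c in covers), max(c[1] for c in covers))
final = compute_ranges([[outer], [inner]], {"City Hall": (10.0, 12.0)}, widen, bounding)
```

Processing the deepest round first guarantees that every range a fuzzy description needs is already in FC by the time its parent result is handled.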

The invention therefore has the following advantages: it makes full use of the abundant, dynamically changing geographic knowledge resources on the Internet to estimate the approximate range of a target location. Because location information on the Internet is intertwined with non-location information in complex ways and is expressed in many different forms, the invention targets natural language text on the Internet, uses a semantics-based multi-scale extraction method to extract location descriptions from web page text, and uses the Point-Radius algorithm to calculate the approximate geographic range of the target location.

Brief Description of the Drawings

Fig. 1 is a flowchart of the Internet active search method.

Fig. 2 is a flowchart of the location calculation based on Internet search results.

Detailed Description

The technical solution of the invention is described in further detail below through an embodiment and with reference to the accompanying drawings.

Embodiment:

1. Theoretical basis.

1.1. Geographic Information Retrieval (GIR).

Geographic information retrieval returns documents relevant to a geographic query, subject to the constraints of the query's geographic scope. The basic idea is to use a web crawler to collect web pages from the Internet, identify the place names in those pages through named entity recognition and classification together with syntactic analysis, determine the geographic scope of both the query terms and the documents, and finally compute the relevance between documents and query terms (both textual and spatial) in order to return and rank the results. Most current geographic information retrieval relies on keyword matching, and both the query terms and the place names in web documents must have a well-defined geographic coverage before they can be associated. This approach copes poorly with vague place names (for example, "the middle and lower reaches of the Yangtze River") and therefore cannot be used directly for estimating unknown locations from web searches. Drawing on the idea of geographic information retrieval, the invention proposes a multi-scale iterative search algorithm (Fig. 1) that obtains web documents related to the unknown location from the Internet, extracts the location descriptions that contain the unknown location, and then uses the reference locations and spatial relations in those descriptions to calculate the approximate geographic range of the unknown location. In the main flow, a collection of web pages is obtained from the Internet through meta-search, and the location descriptions containing the query term are extracted from the pages on a semantic basis. If the location descriptions do not reach the credibility rate required to estimate the location of the query term, a new round of Internet retrieval is carried out for the fuzzy locations identified. The process is iterative: as long as the credibility condition is not met and the search limit has not been reached, web searches continue in order to obtain reference information from which the geographic range of the fuzzy locations can be estimated.

1.2. Georeferencing Locality Descriptions (GLD).

Georeferencing a locality description converts a location from a textual description into a numerical description in some coordinate system. Ideally, the process turns the textual description into a numerical one that can be mapped, and expresses both the spatial extent of the location and the uncertainty of its distribution. The two most widely used algorithms at present are the Point-Radius algorithm and the Probability algorithm. The Point-Radius method describes a location and its uncertainty with a point and a maximum error. The main sources of uncertainty it considers are the reference location (its spatial extent, geodetic datum, coordinate precision, and map scale) and the spatial relation (uncertainty in distance and uncertainty in direction). All uncertainty measures are projected onto one dimension to give the maximum error of the target location, and the target location is expressed as the circular region centered on the point with the maximum error as its radius. The Probability method expresses the target location and its uncertainty as an uncertainty probability density surface. The main sources of uncertainty it considers are the spatial distribution of the target object, the imprecision and vagueness of spatial relations, the incompleteness of the reference object, and the uncertainty of the location description itself. The Point-Radius method is a quantitative calculation that captures the geographic coverage of every point where the target location might lie, and it suits semi-quantitative textual location descriptions. The Probability method cannot quantitatively calculate the geographic coverage of the target location, but it can give the probability distribution of the target location, and it suits qualitative textual location descriptions.
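The point-plus-maximum-error idea can be illustrated with a deliberately simplified sketch. It works in a local planar coordinate system in kilometers and bounds the error by adding three terms with the triangle inequality: the extent of the reference location, the distance uncertainty, and the chord length 2·d·sin(θ/2) swept by a bearing uncertainty of ±θ. This bound and all names below are assumptions for illustration; datum, coordinate precision, and map scale are ignored, and the published Point-Radius method combines its error terms in more detail.

```python
import math
from typing import NamedTuple


class Circle(NamedTuple):
    x_km: float       # easting of the center in a local planar frame
    y_km: float       # northing of the center
    radius_km: float  # maximum error around the center


def point_radius(ref: Circle, distance_km: float, bearing_deg: float,
                 distance_unc_km: float = 0.0,
                 bearing_unc_deg: float = 0.0) -> Circle:
    """Offset the reference point and accumulate a conservative error radius."""
    b = math.radians(bearing_deg)          # bearing measured clockwise from north
    cx = ref.x_km + distance_km * math.sin(b)
    cy = ref.y_km + distance_km * math.cos(b)
    # Chord between the nominal direction and the most deviated direction.
    direction_err = 2.0 * distance_km * math.sin(math.radians(bearing_unc_deg) / 2.0)
    radius = ref.radius_km + distance_unc_km + direction_err
    return Circle(cx, cy, radius)


# "2 km east of the station", station extent 0.5 km, distance known to 0.2 km.
station = Circle(0.0, 0.0, 0.5)
exact_heading = point_radius(station, 2.0, 90.0, distance_unc_km=0.2)
vague_heading = point_radius(station, 2.0, 90.0, distance_unc_km=0.2,
                             bearing_unc_deg=22.5)
```

Allowing the heading "east" to deviate by 22.5 degrees enlarges the circle without moving its center, which matches the intuition that vaguer descriptions yield wider but still usable ranges.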

2. Implementation process.

(1) Check the target location query term entered by the user. The query term is looked up in the location database; if the location does not exist or its geographic coverage is missing, a query in web search mode is started actively, that is, a web search engine is used to crawl information related to the target location from the Internet with the target location as the topic;

(2) Identify and extract the natural language location descriptions (including reference locations and spatial relations) in the web documents. Recognizing natural language location descriptions mainly involves recognizing location names and spatial relations, and the invention uses a semantics-based multi-scale extraction method for this purpose. First, a corpus of location descriptions is built through manual induction and machine learning; it stores the characteristic vocabulary expressing location names and spatial relations, together with the syntactic patterns of location descriptions. Then, with the support of the corpus, pattern matching is performed on the web text to obtain location descriptions. Finally, place names are disambiguated on the basis of geographic and non-geographic semantics;

(3) Classify the location descriptions. The precondition for estimating the target location from a reference location and a spatial relation is that a precise geographic range for the reference location can be obtained from the location database. A single location description is expressed according to formula (1), where RO is the name of the reference location, SR is the spatial relation, T is the time at which the location description occurred, C is the credibility of the location description, and S is the search reference of the reference object RO. The first K location descriptions Loc_i in the extraction results are taken and classified according to the precondition: when Loc_i.RO satisfies the precondition, Loc_i is stored in the precise description set P; otherwise it is stored in the fuzzy description set A;

$Loc = \{RO, SR, T, C, S\}$ (1)

(4) Calculate the search credibility rate C_s. The location descriptions in a search result can be used to estimate the target location only if their credibility reaches a certain level, so the invention introduces the search credibility rate C_s as the evaluation index. The search credibility rate is the ratio of the sum of the credibility of all location descriptions in P to the total number of location descriptions, as shown in formula (2), where m is the number of location descriptions in P, K is the total number of location descriptions, and Loc_i.C is the credibility of an individual location description.

$C_s = \frac{\sum_{i=0}^{m-1} \mathrm{Loc}_i.C}{K} \quad (2)$

The credibility of a location description is calculated according to formula (3), where ε is the attenuation parameter and n is the number of searches. The credibility of a location description is set to 1 for the first search and decays as the number of searches increases.

Loc_i.C = 1 * (ε)ⁿ (3)

When C_s satisfies the minimum credibility threshold C_min, the precise description set P is output directly for estimating the target location. When C_s does not satisfy the threshold, the invention relies on repeated iterative Internet searches to secure the search credibility rate: the fuzzy reference locations in A are used for a new round of Internet search, the geographic range of each reference location is first estimated from web resources, and the reference location is then used to estimate the target location;
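Formulas (2) and (3) and the threshold test can be worked through numerically. The sketch below assumes that n is counted from 0, so that the first search yields a credibility of 1 as the text requires; the values of `EPSILON` and `C_MIN` and the function names are illustrative assumptions, not values given by the invention.

```python
def description_credibility(n, epsilon):
    """Formula (3): credibility 1 * epsilon**n, with n counted from 0 (assumption)."""
    return 1.0 * epsilon ** n

def search_credibility_rate(precise_credibilities, k):
    """Formula (2): sum of the credibilities in P divided by the total count K."""
    if k <= 0:
        raise ValueError("K must be positive")
    return sum(precise_credibilities) / k

EPSILON = 0.8   # hypothetical attenuation parameter
C_MIN = 0.5     # hypothetical minimum credibility threshold

# First search (n = 0): 3 of the K = 5 extracted descriptions are precise.
credibilities = [description_credibility(0, EPSILON)] * 3
c_s = search_credibility_rate(credibilities, k=5)

# If this is True, the fuzzy set A is sent to a new round of search.
needs_iteration = c_s < C_MIN
```

Because descriptions found in later rounds carry a credibility below 1, the same proportion of precise descriptions gives a lower C_s in deeper rounds, which is what limits how far the iteration is trusted.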

(5) Iteratively search the fuzzy reference locations. Following the processing of steps (3) and (4), a search result is expressed by formula (4), where n is the number of searches, m is the sequence number of the location within the current search, P is the precise description set, A is the fuzzy description set, and C_s is the search credibility rate.

WS[n][m] = {P, A, C_s} (4)

The iterative search process is as follows:

a) Store the fuzzy location descriptions WS[0][0].A of the target-location search result in the search set Q, and set n=0, m=0;

b) Take a fuzzy description set WS[n][m].A from Q and determine whether n+1 reaches the limit on the number of searches; if so, exit the search;

c) Take each location description Loc_i in WS[n][m].A in turn and perform the (n+1)-th search to obtain the search result WS[n+1][i], then associate it with the search reference of the location description's reference object RO, that is, Loc_i.S = WS[n+1][i];

d) Remove from Q the fuzzy description set WS[n][m].A whose search is complete, and check whether WS[n+1][i].C_s satisfies the threshold C_min; if it does not, put WS[n+1][i].A into the search set Q;

e) Check whether any fuzzy description set remains in Q; if so, repeat steps b) to d) to continue the iterative search;
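Steps a) to e) amount to a breadth-first expansion over fuzzy reference locations, which can be sketched as follows. This is a minimal sketch under assumptions: `search` is a hypothetical callable standing in for the web search plus extraction and classification, the limit test is read as "n + 1 >= max_searches", and each new result is appended to the list WS[n+1] rather than indexed by i, so that results from different parent sets do not overwrite one another. Credibility decay is left to the `search` callable.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Loc:
    RO: str            # reference location name
    C: float = 1.0     # credibility
    S: object = None   # search reference, set in step c)

@dataclass
class SearchResult:
    P: list = field(default_factory=list)   # precise descriptions
    A: list = field(default_factory=list)   # fuzzy descriptions
    C_s: float = 0.0                         # search credibility rate

def iterative_search(first, search, c_min, max_searches):
    """Expand fuzzy reference locations round by round, following steps a) to e)."""
    WS = [[first]]                      # WS[n] is the list of results of search n
    Q = deque([(0, first)])             # step a)
    while Q:                            # step e)
        n, result = Q.popleft()         # step b), and the removal in step d)
        if n + 1 >= max_searches:
            break
        for loc in result.A:            # step c)
            if len(WS) <= n + 1:
                WS.append([])
            sub = search(loc.RO, n + 1)
            WS[n + 1].append(sub)
            loc.S = sub
            if sub.C_s < c_min and sub.A:   # step d)
                Q.append((n + 1, sub))
    return WS

def fake_search(ro, n):
    """Canned results used only for this demonstration."""
    table = {
        "cafe": SearchResult(P=[Loc("City Hall")], A=[Loc("kiosk")], C_s=0.3),
        "kiosk": SearchResult(P=[Loc("Main Station")], A=[], C_s=0.9),
    }
    return table[ro]

first = SearchResult(P=[], A=[Loc("cafe")], C_s=0.0)
WS = iterative_search(first, fake_search, c_min=0.5, max_searches=3)
```

In this run the search on "cafe" falls below the threshold, so its fuzzy reference "kiosk" is searched in the next round; that result is credible enough and the queue empties.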

(6) Calculate the approximate geographic range and credibility of all location descriptions. Because the fuzzy location descriptions of the k-th search result need to refer to the (k+1)-th search result, the invention calculates in reverse order, that is, the geographic range calculation starts from the last search. As shown in Figure 2, the calculation process is as follows:

a) Let n be the number of searches in the search result WS and m the number of locations searched in the n-th search, m=WS[n-1].size; define the geographic range set FC to store the geographic range of each search result;

b) Take the search result WS[n-1][m-1] of the m-th location of the n-th search;

c) Take each location Loc_y in WS[n-1][m-1].P in turn, query the coordinates of its reference location in the location database, and use the Point-Radius algorithm to calculate the geographic coverage FP(y) and its credibility CP(y);

d) Take each location Loc_x in WS[n-1][m-1].A in turn and use Loc_x.S to query the coordinates of its reference location in the geographic range set FC; if the coordinates are obtained, use the Point-Radius algorithm to calculate the geographic coverage FA(y) and its credibility CA(y);

e) Fuse the geographic ranges of all locations in P and A to obtain the geographic range FC(WS[n-1][m-1]) of the current search result;

f) Determine whether m-1 is greater than 0; if so, proceed to the calculation for the next search result by setting m=m-1 and returning to step b); if it is less than or equal to 0, go to the next step;

g) Determine whether n-1 is greater than 0; if so, proceed to the calculation for the previous search by setting n=n-1, m=WS[n-1].size and returning to step b); if it is less than or equal to 0, go to the next step;

h) Output FC(WS[0][0]);
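The reverse-order traversal of steps a) to h) can be sketched as two nested countdown loops. The sketch is illustrative only: the functions `point_radius` and `fuse` are toy placeholders and do not reproduce the Point-Radius algorithm or the fusion of the invention, a geographic range is assumed to be a circle `(x, y, r)`, and FC is keyed by object identity because the sample dataclasses are not hashable.

```python
from dataclasses import dataclass, field

@dataclass
class Loc:
    RO: str            # reference location name
    SR: str = "near"   # spatial relationship
    C: float = 1.0     # credibility
    S: object = None   # SearchResult for RO when RO is fuzzy

@dataclass
class SearchResult:
    P: list = field(default_factory=list)
    A: list = field(default_factory=list)

def point_radius(center, loc):
    """Toy placeholder: a circle with a made-up fixed radius around the
    reference coordinate, returned together with the description's credibility."""
    return (center[0], center[1], 1.0), loc.C

def fuse(ranges):
    """Toy placeholder: credibility-weighted mean centre and the largest radius."""
    total = sum(c for _, c in ranges)
    x = sum(r[0] * c for r, c in ranges) / total
    y = sum(r[1] * c for r, c in ranges) / total
    return (x, y, max(r[2] for r, _ in ranges))

def estimate(WS, location_db):
    FC = {}                                        # step a): id(result) -> range
    for n in range(len(WS), 0, -1):                # step g): last search first
        for m in range(len(WS[n - 1]), 0, -1):     # step f)
            ws = WS[n - 1][m - 1]                  # step b)
            ranges = [point_radius(location_db[loc.RO], loc)
                      for loc in ws.P]             # step c)
            for loc in ws.A:                       # step d)
                ref = FC.get(id(loc.S))
                if ref is not None:
                    ranges.append(point_radius(ref, loc))
            if ranges:
                FC[id(ws)] = fuse(ranges)          # step e)
    return FC.get(id(WS[0][0]))                    # step h)

# Illustrative data only.
db = {"City Hall": (0.0, 0.0), "Main Station": (4.0, 0.0)}
inner = SearchResult(P=[Loc("Main Station")])
outer = SearchResult(P=[Loc("City Hall")], A=[Loc("kiosk", S=inner)])
result = estimate([[outer], [inner]], db)
```

The inner result is resolved first, so that by the time the outer result is processed the fuzzy reference "kiosk" already has a range in FC. This ordering is the reason for the reverse traversal described above.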

The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims

1. An unknown location estimation method based on Internet active iterative detection, characterized in that it comprises the following steps:

Step 1: check the location query term entered by the user; if the geographic coverage of the location cannot be obtained from the spatial database, actively start Internet iterative detection, that is, use a web search engine to crawl information related to the target location from the Internet, with the target location as the topic;

Step 2: perform the initial detection with the location query term as the topic, using a web search engine to obtain from the Internet a collection of web pages containing descriptions of the target location;

Step 3: perform geographic location parsing on the web documents obtained in step 2, that is, extract natural-language location descriptions from the web documents, each natural-language location description comprising a reference location and a spatial relationship;

Step 4: classify the natural-language location descriptions obtained in step 3; if the geographic coverage of a location description's reference location can be obtained from the location database, the location description is stored in the precise description set P; otherwise it is stored in the fuzzy description set A;

Step 5: evaluate the credibility rate C_s of the current search; if C_s is less than the search credibility threshold C_min, perform a new round of Internet text search with the reference locations in the fuzzy description set A as the topic; if C_s is greater than or equal to the search credibility threshold C_min, skip to step 7. The credibility rate C_s of the current search is evaluated as follows: the search credibility rate C_s is defined as the evaluation index; it is the ratio of the sum of the credibility of all location descriptions in P to the total number of location descriptions, as shown in formula two, where m is the number of location descriptions in P, K is the total number of location descriptions, and Loc_i.C is the credibility of an individual location description:

$C_s = \frac{\sum_{i=0}^{m-1} \mathrm{Loc}_i.C}{K}$

Formula two

The credibility of a location description is calculated according to formula three, where ε is the attenuation parameter and n is the number of searches; the credibility of a location description is set to 1 for the first search and decays as the number of searches increases;

Loc_i.C = 1 * (ε)ⁿ formula three

When C_s satisfies the minimum credibility threshold C_min, the precise description set P is output directly for estimating the target location; when C_s does not satisfy the threshold, repeated iterative Internet searches are used to secure the search credibility rate, that is, the fuzzy reference locations in A are used for a new round of Internet search, the geographic range of each reference location is first estimated from web resources, and the reference location is then used to estimate the target location.

Step 6: repeat steps 1 to 5 until the credibility rate of each round of search results meets the threshold or the limit on the number of searches is reached;

Step 7: calculate the approximate geographic range and credibility of all location descriptions;

Step 8: integrate and refine the geographic coverage of the multiple location descriptions to obtain the geographic range of the target location;

Said step 6 is the iterative search of fuzzy reference locations; following the processing of steps 4 and 5, a search result is expressed by formula four, where n is the number of searches, m is the sequence number of the location within the current search, P is the precise description set, A is the fuzzy description set, and C_s is the search credibility rate:

WS[n][m] = {P, A, C_s} formula four

Said iterative search process comprises the following sub-steps:

Step 4.1: store the fuzzy location descriptions WS[0][0].A of the target-location search result in the search set Q, and set n=0, m=0;

Step 4.2: take a fuzzy description set WS[n][m].A from Q and determine whether n+1 reaches the limit on the number of searches; if so, exit the search;

Step 4.3: take each location description Loc_i in WS[n][m].A in turn and perform the (n+1)-th search to obtain the search result WS[n+1][i], then associate it with the search reference of the location description's reference object RO, that is, Loc_i.S = WS[n+1][i];

Step 4.4: remove from Q the fuzzy description set WS[n][m].A whose search is complete, and check whether WS[n+1][i].C_s satisfies the threshold C_min; if it does not, put WS[n+1][i].A into the search set Q;

Step 4.5: check whether any fuzzy description set remains in Q; if so, repeat steps 4.2 to 4.4 to continue the iterative search;

In said step 7, because the fuzzy location descriptions of the k-th search result need to refer to the (k+1)-th search result, the calculation proceeds in reverse order, that is, the geographic range calculation starts from the last search; it specifically comprises the following sub-steps:

Step 5.1: let n be the number of searches in the search result WS and m the number of locations searched in the n-th search, m=WS[n-1].size; define the geographic range set FC to store the geographic range of each search result;

Step 5.2: take the search result WS[n-1][m-1] of the m-th location of the n-th search;

Step 5.3: take each location Loc_y in WS[n-1][m-1].P in turn, query the coordinates of its reference location in the location database, and use the Point-Radius algorithm to calculate the geographic coverage FP(y) and its credibility CP(y);

Step 5.4: take each location Loc_x in WS[n-1][m-1].A in turn and use Loc_x.S to query the coordinates of its reference location in the geographic range set FC; if the coordinates are obtained, use the Point-Radius algorithm to calculate the geographic coverage FA(y) and its credibility CA(y);

Step 5.5: fuse the geographic ranges of all locations in P and A to obtain the geographic range FC(WS[n-1][m-1]) of the current search result;

Step 5.6: determine whether m-1 is greater than 0; if so, proceed to the calculation for the next search result by setting m=m-1 and returning to step 5.2; if it is less than or equal to 0, go to the next step;

Step 5.7: determine whether n-1 is greater than 0; if so, proceed to the calculation for the previous search by setting n=n-1, m=WS[n-1].size and returning to step 5.2; if it is less than or equal to 0, go to the next step;

Step 5.8: output FC(WS[0][0]);

2. The unknown location estimation method based on Internet active iterative detection according to claim 1, characterized in that, in said step 3, recognizing a natural-language location description mainly comprises recognizing place names and recognizing spatial relationships, and a semantics-based multi-scale extraction method is used to extract the natural-language location descriptions, specifically comprising the following sub-steps:

Step 3.1: build a corpus of location descriptions, the corpus storing the feature vocabulary that expresses place names and spatial relationships together with the syntactic patterns of location descriptions;

Step 3.2: with the support of the corpus, perform pattern matching on the web text to obtain location descriptions;

Step 3.3: resolve place-name ambiguity using geographic and non-geographic semantics.

3. The unknown location estimation method based on Internet active iterative detection according to claim 1, characterized in that, in said step 3, the precondition for estimating the target location from a reference location and a spatial relationship is that the precise geographic range of the reference location can be obtained from the location database; a single location description is expressed according to formula one, where RO is the name of the reference location, SR is the spatial relationship, T is the time at which the location description occurred, C is the credibility of the location description, and S is the search reference of the reference object RO; the first K location descriptions Loc_i are taken from the extraction results and classified according to the precondition: when Loc_i.RO satisfies the precondition, Loc_i is stored in the precise description set P; otherwise it is stored in the fuzzy description set A;

Loc = {RO, SR, T, C, S} formula one.