CN113673240A - A method and storage medium for inferring text geographic location by integrating spatial entity relationships

Info

Publication number: CN113673240A
Application number: CN202110869708.0A
Authority: CN
Inventors: 曾壮; 陈仁谣; 程旭阳; 李圣文
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2021-11-19
Anticipated expiration: 2041-07-30
Also published as: CN113673240B

Abstract

The invention provides a method for inferring the geographic location of a text by integrating spatial entity relationships, comprising: preparing two types of dictionaries, a target dictionary and a merged dictionary; labeling the place names in the dictionary with weight factors based on different attributes of the place names; extracting candidate place names from the target text with the proposed String Merging (SM) algorithm; filtering noisy place names using the vector representations of the place names; converting the weight factors of the place names in the target text into relative weights with a proposed place name weight calculation formula; and multiplying the relative weights by the coordinates of the place names and accumulating the products to obtain the implicit geographic coordinates of the target text. The proposed technical route effectively addresses two problems of existing methods for acquiring the implicit geographic location of a text, namely the difficulty of acquiring external knowledge and the coarse granularity of the predicted location, and can predict the longitude and latitude coordinates of the target text without supervision.

Description

Method and storage medium for inferring textual geographic location from spatial entity relationships

Technical Field

The invention relates to a method and a storage medium for inferring the geographic location of a text by integrating spatial entity relationships.

Background

Studies have shown that a large part of all data generated today is unstructured data, and that about 60% of the data (text data and other data) can be considered geospatially referenced data. However, since the text directly containing the spatial position information only occupies a small part of the total text quantity, how to accurately extract the spatial position information implied in the remaining text has a very important research value. Existing methods for obtaining the implicit geographic location of the text have problems, such as: difficulty in obtaining external knowledge, large granularity of predicted implied geographic locations, etc.

Disclosure of Invention

The invention mainly addresses two problems of conventional methods for acquiring the implicit geographic location of a text: external knowledge is difficult to acquire, and the predicted implicit geographic location is coarse-grained.

According to one aspect of the invention, the invention provides a method for inferring the geographic location of text from the relationship of spatial entities, comprising:

acquiring a target dictionary by using a standard data set;

labeling a place name in the target dictionary with a weight factor;

segmenting the target text with a word segmentation tool to obtain a segmented text;

matching the segmented text against the place names in the target dictionary by using a String Merging algorithm, thereby acquiring candidate place names corresponding to character strings in the segmented text, and acquiring a merged dictionary consisting of the set of merged character strings;

screening and denoising the candidate place names by using the merged dictionary;

converting the weight factors into the relative weight of each place name in the segmented text by using a conversion formula, wherein the conversion formula is as follows:

wherein N is the number of place names, f_n is the weight factor corresponding to the n-th place name, LCM(W) is the least common multiple of the weight factors corresponding to the N place names, n indexes each of the N place names, and P(n) is the value of the relative weight;

and multiplying the relative weight of each place name by its longitude and latitude coordinates and accumulating the products to obtain the geographic coordinates of the segmented text.

Further, the weighting factors include different attributes of place names in the target dictionary.

Further, the attributes include a word frequency of place names appearing in the target dictionary, a place name category, and a geographical area corresponding to the place names.

Further, acquiring the merged dictionary by using the String Merging algorithm comprises:

acquiring the set of all merged character strings produced by the String Merging algorithm as the merged dictionary.

Further, the target dictionary is an overcomplete dictionary containing all place names in the target text.

Further, matching the segmented text against the place names in the target dictionary by using the String Merging algorithm, thereby acquiring candidate place names corresponding to character strings in the segmented text, comprises:

if a first character string in the segmented text is identical to a first place name in the target dictionary, storing the first character string and the first place name;

if the length of the first character string is greater than 1 and the first character string is contained in a second place name, further judging whether the next character string also belongs to the second place name;

if the next character string does not belong to the second place name but the character string after it does, merging the first character string, the next character string and the character string after it;

if the character string following the first character string is a stop symbol, ending the character string merging;

if neither of the two character strings following the first character string belongs to the second place name, ending the character string merging;

and repeating the character string merging step, and taking all the merged character strings obtained as candidate character strings, so as to obtain all the candidate place names.

Further, the screening and denoising the candidate place names comprises:

calculating the cosine similarity between the place name vector of a place name S_a in the acquired character string list and the place name vector of the corresponding place name S_b in the standard place name list, according to the following formula:

$$\mathrm{Cos\_sim}(B_a, B_b) = \frac{B_a \cdot B_b}{\lVert B_a \rVert \, \lVert B_b \rVert}$$

wherein S_a is a place name in the character string list, B_a is the place name vector corresponding to S_a, S_b is the corresponding place name in the standard place name list, and B_b is the place name vector corresponding to S_b;

after the calculation, if the value of the cosine similarity Cos_sim is not less than a threshold K, retaining the place name S_a and the place name S_b; otherwise, deleting the place name S_a and the place name S_b.

Further, the relative weight of each place name is multiplied by its longitude and latitude coordinates and the products are accumulated to obtain the geographic coordinates of the segmented text, according to the following formula:

$$P(X, Y) = \sum_{n=1}^{N} P(n) \cdot P_n(X_n, Y_n)$$

wherein P(X, Y) is the predicted longitude and latitude coordinate value of the target text, P(n) is the relative weight of the n-th place name, and P_n(X_n, Y_n) is the longitude and latitude coordinates of the n-th place name in the target text.

Furthermore, when the word segmentation tool is used for text word segmentation of the target text, a user-defined dictionary is added to improve the word segmentation accuracy.

According to another aspect of the present invention, the present invention further comprises a storage medium, wherein the storage medium is a computer-readable storage medium, and the computer-readable storage medium stores a method for inferring the geographic location of text from the relationship of spatial entities according to any one of claims 1 to 9.

For a target text, a suitable custom dictionary, namely the target dictionary, is selected; the text is segmented with a word segmentation tool, and the character segments obtained are then merged with the SM algorithm; noisy data is filtered using the vector representations of the place names; the relative weights of the place names are calculated from their weight factors with the weight calculation formula; and the longitude and latitude coordinates of the place names are multiplied by the relative weights and accumulated to obtain the longitude and latitude coordinates of the target text.

In the method for inferring the geographic location of a text by integrating spatial entity relationships, the place name weight factor labeling part obtains the weight factor of each place name from the attributes of the place name itself. Combined with the part that calculates the relative weights of the candidate place names in the target text, it allows those relative weights to be obtained in a simple, direct and effective way.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a schematic diagram of the UFGLI model in an embodiment of the present invention.

Fig. 2 is a schematic flow chart of the SM algorithm in the embodiment of the present invention.

FIG. 3 is a diagram illustrating a method for inferring a geographic location of a text from a spatial entity relationship according to an embodiment of the present invention.

Detailed Description

Various exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.

Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.

In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.

In the first embodiment, as shown in fig. 1, which is a schematic diagram of a UFGLI model, the method for inferring a geographic location of a text from a comprehensive spatial entity relationship proposed in this embodiment includes: dictionary preparation, place name weight factor labeling, candidate place name extraction in the target text, noise place name filtering in the candidate place names, relative weight calculation of the candidate place names and target text implied geographic coordinate prediction.

UFGLI is a model for obtaining the implicit geographic coordinates of an input target text. In addition to the target text, its input includes a place name dictionary labeled with weight factors and longitude and latitude coordinates. The target text passes through four parts to obtain the predicted longitude and latitude coordinates: candidate place name extraction in the target text, noise place name filtering among the candidate place names, relative weight calculation of the candidate place names, and target text position prediction.

Step one, dictionary preparation:

This embodiment uses two types of dictionaries: a target dictionary and a merged dictionary. The target dictionary is an overcomplete dictionary containing all place names that may appear in the target text. The place names in the target dictionary and their longitude and latitude coordinates are taken from the standard data set. The merged dictionary is the set of all merged character strings acquired automatically by the SM algorithm, and is used for calculating the cosine similarity.

Step two, labeling a place name weight factor:

The place names in the target dictionary are labeled with weight factors. Each place name has several different attributes, so each attribute can be labeled once. Attributes of the place names in the target dictionary, such as the word frequency with which they appear in a text set, the place name category, and the geographical area corresponding to the place name, are quantified to serve as weight factors.

In this embodiment, text sets of different sizes are selected for counting word frequency: the word frequency f_1 of the place names in the target dictionary is obtained from statistics over a small number of texts, and the word frequency f_2 is obtained from statistics over a large number of texts. The place names in the target dictionary are classified so that each place name has a category, such as road, residential community, or school, and place names of different categories are assigned different weight factors f_c. The geographic area or road length corresponding to each place name in the target dictionary is obtained and set as the weight factor f_a of that place name; the units of area and road length are further unified, and the result is taken as the weight factor f_a2.
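The labeling step can be pictured as attaching several named factors to each dictionary entry. The following is a minimal sketch under stated assumptions, not the patented implementation: `PlaceEntry`, `CATEGORY_FACTOR`, `label_weight_factors`, the category values, and the use of substring counts as word frequency are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical category factors; the source gives no concrete values.
CATEGORY_FACTOR = {"road": 3, "community": 2, "school": 1}


@dataclass
class PlaceEntry:
    """One target-dictionary entry: name, coordinates, and weight factors."""
    name: str
    lon: float
    lat: float
    category: str
    area: float  # area or road length, assumed already in a unified unit
    factors: dict = field(default_factory=dict)


def label_weight_factors(entries, small_corpus, large_corpus):
    """Label each entry once per attribute (f_1, f_2, f_c, f_a2)."""
    for e in entries:
        # Word frequency approximated by substring counts (an assumption).
        e.factors["f_1"] = sum(doc.count(e.name) for doc in small_corpus)
        e.factors["f_2"] = sum(doc.count(e.name) for doc in large_corpus)
        e.factors["f_c"] = CATEGORY_FACTOR.get(e.category, 1)
        e.factors["f_a2"] = e.area
    return entries
```

Keeping every factor under its own key lets a later step choose which one to feed into the relative weight calculation.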

Step three, extracting candidate place names in the target text:

First, the candidate place names contained in the target text are acquired. Extracting the candidate place names in the target text comprises two parts: word segmentation of the target text and candidate place name matching.

In some embodiments, the word segmentation method is not fixed: any word segmentation tool can be used to segment the text, and an external dictionary (a custom dictionary) can be added to improve the segmentation accuracy. The character string merging algorithm SM designed in this embodiment matches the segmented text of the target text against the place names in the target dictionary, and thereby obtains the candidate place names corresponding to the character strings in the target text.
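To illustrate why a custom dictionary helps, here is a small stand-in segmenter based on forward maximum matching. It is an assumption made for illustration only: the source does not name a segmentation tool, and `segment`, `user_dict`, and `max_len` are hypothetical names. A production system would more likely use an established segmentation library.

```python
def segment(text, user_dict, max_len=10):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or piece in user_dict:
                tokens.append(piece)
                i += size
                break
    return tokens


# Without "university" in the dictionary, that span falls apart into characters.
coarse = segment("wuhanuniversityroad", {"wuhan", "road"})
# Adding the word to the custom dictionary keeps it intact.
fine = segment("wuhanuniversityroad", {"wuhan", "university", "road"})
```

The second call returns three tokens, while the first splits the unknown span into single characters, which is the kind of fragmentation the SM algorithm later has to repair.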

Fig. 2 is a flow chart of the SM algorithm. Its design idea is as follows:

1. If the current character string a is identical to a place name A in the target dictionary, store the current character string and the corresponding place name.

2. If the length of the current character string a is greater than 1 and a is contained in a place name B, further judge whether the next character string a2 also belongs to B; if a2 does not belong to B but a3 does, merge a, a2 and a3 and repeat step 2.

3. If neither of the two character strings following a belongs to B, or the character string adjacent to a is a stop symbol, end the character string merging.

4. Take the merged character string as a candidate character string, and take the candidate character string as a candidate place name.
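The design idea above can be sketched in Python. This is one possible reading, not the authors' implementation. It assumes that "belongs to B" means "is a substring of B", that tokens are concatenated without a separator (as for Chinese text), and that tokens already consumed by a merge are skipped. `string_merging` and `STOP_SYMBOLS` are hypothetical names.

```python
STOP_SYMBOLS = frozenset(",.;!?，。；！？")


def string_merging(tokens, place_names, stops=STOP_SYMBOLS):
    """Return (merged string, matched place name) pairs found in tokens."""
    candidates = []
    i = 0
    while i < len(tokens):
        a = tokens[i]
        step = 1
        if a not in stops:
            # Rule 1: the token itself is a place name.
            if a in place_names:
                candidates.append((a, a))
            # Rule 2: the token is part of a longer place name b.
            for b in place_names:
                if len(a) <= 1 or a == b or a not in b:
                    continue
                merged, j = a, i + 1
                while j < len(tokens) and tokens[j] not in stops:
                    nxt = tokens[j]
                    after = tokens[j + 1] if j + 1 < len(tokens) else None
                    if nxt in b:
                        merged, j = merged + nxt, j + 1
                    elif after is not None and after not in stops and after in b:
                        # a2 is not in b but a3 is: merge both.
                        merged, j = merged + nxt + after, j + 2
                    else:
                        break  # Rule 3: two misses in a row end the merge.
                if merged != a:
                    candidates.append((merged, b))  # Rule 4
                    step = max(step, j - i)  # skip consumed tokens (assumption)
        i += step
    return candidates
```

The returned merged strings correspond to the entries of the merged dictionary, and the paired place names are the candidates passed on to noise filtering.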

Step four, filtering noise place names in the candidate place names:

The candidate place name list obtained from the target text may contain a large amount of noise data, so the place names in the list need to be screened. First, the cosine similarity is calculated between the place name vector of each place name in the character string list sl_list extracted by the SM algorithm and the place name vector of the corresponding place name in the standard place name list Poi_list.

Let the place name vector (BERT vector) corresponding to a place name S_a in sl_list be B_a, and let the BERT vector corresponding to the place name S_b in Poi_list be B_b. The cosine similarity Cos_sim of B_a and B_b is then calculated as follows:

$$\mathrm{Cos\_sim}(B_a, B_b) = \frac{B_a \cdot B_b}{\lVert B_a \rVert \, \lVert B_b \rVert}$$

If the value of Cos_sim is not less than the threshold K, the current place names S_a and S_b are retained; otherwise, the current place names S_a and S_b are deleted.
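The filtering rule can be sketched as follows. The embedding function is left as a parameter because the source only says that BERT vectors are used; `embed`, `filter_candidates`, and the default threshold `k=0.8` are assumptions for illustration, since no value of K is given in the text.

```python
import math


def cos_sim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def filter_candidates(pairs, embed, k=0.8):
    """Keep (S_a, S_b) pairs whose vectors satisfy Cos_sim >= K."""
    return [
        (s_a, s_b)
        for s_a, s_b in pairs
        if cos_sim(embed(s_a), embed(s_b)) >= k
    ]


# Toy vectors standing in for real place name embeddings.
toy_vectors = {"A": [1.0, 0.0], "B": [1.0, 0.1], "C": [0.0, 1.0]}
kept = filter_candidates([("A", "B"), ("A", "C")], toy_vectors.get)
```

Here the pair ("A", "B") survives because its vectors point in nearly the same direction, while ("A", "C") is dropped because its vectors are orthogonal.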

Step five, calculating the relative weight of the candidate place name in the target text

In the second step, the weighting factor corresponding to each place name is already marked, but the weighting factor corresponding to the place name needs to be converted through a weighting calculation formula and is finally converted into the relative weight of each place name in the target text.

In some embodiments, several weight calculation formulas were designed, of which the most effective normalizes by the least common multiple of the weight factors corresponding to the place names contained in the target text. That is, for the N place names contained in each text, the weight factor corresponding to the n-th place name is denoted f_n and the least common multiple of the weight factors corresponding to the N place names is denoted LCM(W); the specific formula for calculating the relative weight P(n) of the n-th place name from the least common multiple is as follows:
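The formula itself does not survive in this text. A minimal sketch of one reading that fits the description is given here, and it is an assumption rather than the confirmed formula: each place name receives the share LCM(W) / f_n, and the shares are normalized to sum to 1, i.e. P(n) = (LCM(W) / f_n) / Σ_i (LCM(W) / f_i). Under this reading a smaller weight factor yields a larger relative weight. The names `lcm_all` and `relative_weights` are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def lcm_all(values):
    """Least common multiple of a non-empty list of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values)


def relative_weights(factors):
    """Assumed reading: P(n) = (LCM / f_n) / sum_i (LCM / f_i)."""
    if not factors or any(f <= 0 for f in factors):
        raise ValueError("weight factors must be positive integers")
    m = lcm_all(factors)
    shares = [m // f for f in factors]  # exact, since f divides the LCM
    total = sum(shares)
    return [Fraction(s, total) for s in shares]
```

With a single place name the result is always 1, which agrees with the second embodiment, where a weight factor of 60 gives a relative weight of 1. Exact fractions are used so that the weights sum to exactly 1.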

Step six, predicting the implicit geographic coordinates of the target text

For the N place names extracted from the target text in the foregoing steps, the relative weight P(n) of each place name has been calculated with the weight calculation formula of step five. Position prediction is now performed using the longitude and latitude coordinates, which are the third attribute of a POI point. That is, the relative weight P(n) of each place name contained in the target text is multiplied by its longitude and latitude coordinates P_n(X_n, Y_n), and the products are accumulated to obtain the final longitude and latitude coordinates P(X, Y) of the target text. The specific formula is as follows:

$$P(X, Y) = \sum_{n=1}^{N} P(n) \cdot P_n(X_n, Y_n)$$
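The multiply-and-accumulate step is a weighted average of the place name coordinates. A minimal sketch follows, assuming the relative weights already sum to 1 and that coordinates are given as (longitude, latitude) pairs; `predict_coordinates` is a hypothetical name.

```python
def predict_coordinates(weights, coords):
    """Return sum_n P(n) * (X_n, Y_n) as a (longitude, latitude) pair."""
    if len(weights) != len(coords):
        raise ValueError("one weight per place name is required")
    x = sum(w * lon for w, (lon, _lat) in zip(weights, coords))
    y = sum(w * lat for w, (_lon, lat) in zip(weights, coords))
    return x, y


# Two place names with equal relative weight give the midpoint.
midpoint = predict_coordinates([0.5, 0.5], [(114.0, 30.0), (116.0, 32.0)])
```

A plain average of longitude and latitude is a reasonable approximation when the place names lie close together, as with points of interest in one city. For a single place name with relative weight 1, the function returns that place name's coordinates unchanged.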

In a second embodiment, as shown in Fig. 3, a method for inferring the geographic location of a text by integrating spatial entity relationships includes the following steps:

after receiving the text "There is singing in the Hongfang Garden community every morning and evening, and we hope the government will actively deal with it.", segmenting it with the word segmentation tool, the segmentation result being "Hongfang Garden", "community", "every day", "morning and evening", "singing", "hope", "government", "actively", "deal with", and ".";

after word segmentation, merging the character strings with the SM algorithm of the previous embodiment to obtain ["Hongfang Garden community"], and performing place name matching to obtain the corresponding place name ["Hongfang Garden Green Electric community"];

after noise is filtered using the place name vectors, the retained character string is ["Hongfang Garden community"];

calculating the relative weight from the weight factor of the place name corresponding to the retained character string: the weight factor of ["Hongfang Garden Green Electric community"] is 60, and the relative weight is 1;

multiplying the relative weight of the candidate place name by the corresponding longitude and latitude coordinates, wherein the relative weight of "Hongfang Garden community" is 1 and the longitude and latitude coordinates are (114.354941, 30.586017);

and finally, obtaining the longitude and latitude coordinates of the target text as (114.354941, 30.586017).

In the method for inferring the geographic location of a text by integrating spatial entity relationships, the place name weight factor labeling part obtains the weight factor of each place name from the attributes of the place name itself. Combined with the part that calculates the relative weights of the candidate place names in the target text, it allows those relative weights to be obtained in a simple, direct and effective way. In addition, the method for extracting candidate place names from the target text allows the candidate place names contained in the target text to be acquired without supervision.

The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and scope of the present invention should be included in the present invention.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims

1. A method for inferring textual geolocation through the synthesis of spatial entity relationships, comprising:

acquiring a target dictionary by using a standard data set;

labeling a place name in the target dictionary with a weight factor;

wherein N is the number of place names, f_n is the weight factor corresponding to each place name, LCM(W) is the least common multiple of the weight factors of the N place names, n denotes each of the N place names, and P(n) is the value of the relative weight;

2. The method of claim 1, wherein the weighting factors include different attributes of place names in the target dictionary.

3. The method as claimed in claim 2, wherein the attributes include word frequency of place names appearing in the target dictionary, place name category and geographical area corresponding to place names.

4. The method of claim 1, wherein the obtaining a merged dictionary using String Merging algorithm comprises:

5. The method for inferring geographic location of text from integrated spatial entity relationships of claim 1 wherein said target lexicon is an overcomplete lexicon containing all place names in said target text.

6. The method of claim 1, wherein matching the segmented text with place names in the target dictionary by using a String Merging algorithm to obtain candidate place names corresponding to character strings in the segmented text comprises:

if the length of the first character string is greater than 1 and the first character string is contained in the second place name, further judging whether the next character string also belongs to the second place name;

7. The method of claim 1, wherein filtering and denoising the candidate place names comprises:

calculating the cosine similarity between the place name vector of a place name S_a in the acquired character string list and the place name vector of the corresponding place name S_b in the standard place name list, according to the following formula:

$$\mathrm{Cos\_sim}(B_a, B_b) = \frac{B_a \cdot B_b}{\lVert B_a \rVert \, \lVert B_b \rVert}$$

wherein B_a is the place name vector corresponding to S_a, S_a is a place name in the character string list, B_b is the place name vector corresponding to S_b, and S_b is the corresponding place name in the standard place name list;

after the calculation, if the value of the cosine similarity Cos_sim is not less than the threshold K, retaining the place name S_a and the place name S_b; otherwise, deleting the place name S_a and the place name S_b.

8. The method of claim 1, wherein the relative weight of each place name is multiplied by its longitude and latitude coordinates and the products are accumulated to obtain the geographic coordinates of the segmented text according to the following formula:

$$P(X, Y) = \sum_{n=1}^{N} P(n) \cdot P_n(X_n, Y_n)$$

9. The method as claimed in claim 1, wherein a self-defined dictionary is added to improve the accuracy of word segmentation when a word segmentation tool is used to perform text word segmentation on the target text.

10. A storage medium, wherein the storage medium is a computer-readable storage medium, and wherein the computer-readable storage medium stores a method for inferring the geographic location of text from the relationship of spatial entities according to any of claims 1-9.