CN112527933A

CN112527933A - Chinese address association method based on space position and text training

Info

Publication number: CN112527933A
Application number: CN202011409893.7A
Authority: CN
Inventors: 董文杰; 何宗; 高翔; 袁超; 张红文; 贾亚辉; 刘建; 韩维喆; 叶胜; 瞿孟; 李胜; 王岚; 肖勇; 钱文进; 王俊; 曾攀; 彭婧
Original assignee: Chongqing Geographic Information And Remote Sensing Application Center
Current assignee: Chongqing Geographic Information And Remote Sensing Application Center
Priority date: 2020-12-04
Filing date: 2020-12-04
Publication date: 2021-03-19

Abstract

The invention discloses a Chinese address association method based on spatial position and text training, comprising the following steps: acquiring the address data to be associated and preprocessing it; segmenting the preprocessed address data into words with a conditional random field model and tagging the part of speech of each resulting word; identifying and extracting main words from the segmentation result based on an eighteen-level address hierarchy model; screening a candidate address set from an existing standard address library according to the thematic classification screening radius and the main word search radius; determining a target address in the candidate address set; and establishing an association relation table between the target address and the address to be associated. The method solves the problem of associating the data of different industry departments with standard address data, so that the data of all industries can be associated uniformly through standard addresses.

Description

Chinese address association method based on space position and text training

Technical Field

The invention relates to the technical field of geographic information, in particular to a Chinese address association method based on spatial position and text training.

Background

With the rapid development of science and technology, industry departments such as natural resources and economic and social affairs have accumulated massive data resources, and they actively share and exchange those resources to maximize the value of the data. However, data from different sources differ in content, organization and accuracy, so effective associations between data sets are difficult to establish. This poses great challenges for the comprehensive application, analysis and management of data, and how to establish associations between data sets and break down data barriers has become an urgent problem. Statistically, more than 80% of human activities are related to geospatial locations, and addresses are the textual representation of geospatial locations. Using the address as the link between different data sets is therefore a feasible and important approach.

To establish associations between data through addresses, the prior art usually follows one of two approaches: dictionary-based or dictionary-independent. In the dictionary-based approach, a key element word bank, matching rules and a geocode bank for address data are established in advance; the key elements of the address to be associated are used as retrieval conditions, the address dictionary is traversed and matched, identical address data are found, and the association between data from different sources is established. This approach parses address data covered by the dictionary well, but it is limited in complex Chinese address scenes. An existing address dictionary cannot contain all elements of all address data, so address data not covered by the dictionary is parsed poorly; and as the dictionary grows, building new content becomes laborious and time-consuming, and the dictionary becomes too large to maintain. The dictionary-independent approach generally analyzes the structural features of address elements by means of natural language processing or similar techniques and then matches addresses against each other. It handles address data with a standard structure well, but it performs poorly on Chinese addresses with fuzzy descriptive semantics and irregular structure, and it can analyze a Chinese address only along the text dimension.

To summarize, the difficulty of address association is mainly reflected in the following three aspects:

1) Because different industry departments have different requirements for addresses, the spatial position and address description of each data set deviate from the standard address, and it is difficult to accurately associate the data of each industry with the standard address by relying only on the spatial position or only on the address description.

2) Owing to historical change and social development, many address names have changed over time, producing a large number of former names, aliases and the like; in addition, address information is often collected in non-standard form, with missing items, wrongly written characters, approximate descriptions of direction or range, and so on.

3) Addresses described in natural language exhibit semantic continuity as well as abbreviation and shortening, which traditional word segmentation and character string matching struggle to identify effectively and accurately.

Based on the above, there is a need for a Chinese address association method that considers both the geospatial position and the text, does not depend solely on an address dictionary, adapts to fuzzy address descriptions, irregular structure and other features of complex Chinese address scenes, and effectively associates the data of different industry departments with standard addresses.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a Chinese address association method based on spatial position and text training that considers both the geospatial position and the text, does not depend solely on an address dictionary, and adapts to complex Chinese address scenes with fuzzy descriptive semantics, irregular structure and the like, so as to solve the technical problem of establishing associations between the data of different industry departments and standard address data.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

a Chinese address association method based on space position and text training is characterized by comprising the following steps:

step 1: acquiring address data to be associated, and preprocessing the data;

step 2: performing word segmentation on the preprocessed address data to be associated by adopting a conditional random field model, and performing part-of-speech tagging on a word segmentation result;

step 3: performing main word recognition and extraction on the word segmentation result based on an eighteen-level address hierarchy model;

step 4: screening a candidate address set from an existing standard address library according to the thematic classification screening radius and the main word search radius;

step 5: determining a target address in the candidate address set;

step 6: and establishing an association relation table between the target address and the address to be associated.

Further, the preprocessing of the address data to be associated in step 1 includes cleaning special characters, completing missing administrative regions, and cleaning meaningless data filled in by users.

Further, the step 2 of using the conditional random field model to perform word segmentation on the preprocessed address data to be associated specifically comprises the following steps:

step 2.1: based on the phrase library content in the initial sample word library, performing word position labeling on each single word in the preprocessed address data to be associated by adopting a conditional random field model;

step 2.2: calculating the continuity probability among the single characters through a characteristic template in the conditional random field model, performing repeated iterative training, and finally calculating different word segmentation combination probabilities;

step 2.3: and selecting the word segmentation combination with the highest probability to form a word segmentation result.
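Step 2.3 can be illustrated with a small decoding sketch. The source does not name the decoding algorithm; Viterbi decoding, the usual choice for linear-chain CRFs, is assumed here. The per-character scores in `emissions` and the names `ALLOWED` and `viterbi` are hypothetical stand-ins for what a trained model would provide.

```python
import math

LABELS = ["B", "M", "E", "S"]
# Label transitions permitted by the B/M/E/S word-position scheme.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}

def viterbi(emissions):
    """Return the highest-scoring label path.

    emissions: one dict per character mapping label -> log-score
    (hypothetical values; a trained CRF would supply them).
    """
    neg = -math.inf
    # A sequence can only start with B (word start) or S (single-character word).
    first = {lab: (emissions[0].get(lab, neg) if lab in ("B", "S") else neg, None)
             for lab in LABELS}
    best = [first]
    for i in range(1, len(emissions)):
        column = {}
        for lab in LABELS:
            # Best predecessor among labels allowed to precede `lab`.
            score, prev = max((best[i - 1][p][0], p)
                              for p in LABELS if lab in ALLOWED[p])
            column[lab] = (score + emissions[i].get(lab, neg), prev)
        best.append(column)
    # A sequence can only end with E (word end) or S.
    last = max(("E", "S"), key=lambda lab: best[-1][lab][0])
    path = [last]
    for i in range(len(emissions) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]
```

With scores that favor a single three-character word, the decoder returns the labels B, M, E, which correspond to one segmented phrase.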

Further, the step of performing part-of-speech tagging on the segmentation result in the step 2 is as follows:

step S1: constructing a part-of-speech dependency template between phrases based on the eighteen-level address hierarchy model;

step S2: during training of the conditional random field model, preliminarily assigning the part of speech of each phrase to one of the eighteen levels, and then iterating under the part-of-speech dependency template to determine the final level label;

step S3: feeding the word segmentation result corresponding to the final level label back into the initial sample word stock to enrich it.

Further, the training process of the conditional random field model is as follows:

step A1: based on the sample address data, obtaining address labeling data according to the eighteen-level address hierarchy model;

step A2: according to the address labeling data, counting and summarizing the various feature templates and forming feature functions;

step A3: training on Chinese addresses with the feature functions to obtain the conditional random field model.

Further, the process of identifying and extracting the main words from the word segmentation result in the step 3 is as follows:

step 3.1: according to the word segmentation result, where the address contains content at the main word levels, extraction starts from the thirteenth level of the eighteen-level address hierarchy model; if several main words exist at the same level, they are extracted one by one;

step 3.2: if the thirteenth level is absent, moving up one level at a time until all main words have been identified and extracted;

step 3.3: where the address contains no content at the main word levels, the spatial range it describes is too large for the address to be of much use.

Further, the screening process of the candidate address set in step 4 is as follows:

step 4.1: taking the larger value of the thematic classification screening radius and the main word searching radius as a screening radius;

step 4.2: drawing a circular buffer centered on the coordinate point of the address to be associated, with the screening radius as the buffer radius, and selecting all standard-library address data that fall within it;

step 4.3: screening out the candidate address set from the selected standard address data through the main word index.

Further, the determination process of the target address in step 5 is as follows:

step 5.1: constructing a candidate address data index according to an eighteen-level address grading model based on the candidate address set screened in the step 4;

step 5.2: searching the candidate address data index with the word segmentation result of the address to be associated; if a candidate matches completely, that is, an address identical to the address to be associated exists among the candidates, it is directly determined to be the target address; otherwise, proceeding to step 5.3;

step 5.3: searching the candidate address data index again with the main word information of the address to be associated, and taking the candidate address data whose main words intersect with the main words of the address to be associated as the initial recommended candidate addresses;

step 5.4: within the same main word level, sorting the initial recommended candidate addresses by spatial distance from the address to be associated, from near to far, and keeping the top several as the final recommended candidate addresses;

step 5.5: calculating the text similarity of the address to be associated and the final recommended candidate address by adopting an edit distance algorithm;

step 5.6: and taking the candidate address with the highest similarity value as the target address.

Further, the calculation formula of the edit distance algorithm is as follows:

sim = 1 - dis / max(len(s1), len(s2)),

where sim is the text similarity between the character string s1 of the address to be associated and the character string s2 of the candidate address, dis is the edit distance between the two strings, and max(len(s1), len(s2)) is the length of the longer of the two strings.

Further, in establishing the association relation table in step 6, the address code of the standard address and the unique element code of the address to be associated are put in one-to-one correspondence and written into the table, and an index is built to ensure the retrieval efficiency of the table.

The invention has the following remarkable effects:

The method solves the problem of associating the data of different industry departments with standard address data, realizes the unified association of the data of each industry through standard addresses, offers good accuracy and stability, and lays a foundation for the subsequent linked updating and cross-analysis of the data of each industry. Compared with existing address association techniques, the method combines geospatial position screening with text training: before the text matching similarity is calculated, a buffer is selected according to spatial position, which effectively filters out redundant interfering data, reduces the amount of data in subsequent calculation, and provides the precondition for accurate text matching. Moreover, a word segmentation algorithm combining deep learning and machine learning is adopted; through part-of-speech tagging and main word recognition and extraction it handles complex Chinese address conditions such as irregular addresses and fuzzy expression of key elements. Each association run is also a training run of the model, so the sample word segmentation library is enriched automatically, which saves a large amount of manpower and material resources compared with manually maintaining an address dictionary.

Drawings

FIG. 1 is a flow chart of a method of the present invention;

FIG. 2 is a schematic diagram of a conditional random field model training process;

FIG. 3 is a schematic diagram of global feature function formation;

FIG. 4 is a flow chart of the determination of the target address.

Detailed Description

The following provides a more detailed description of the embodiments and the operation of the present invention with reference to the accompanying drawings.

As shown in FIG. 1, a Chinese address association method based on spatial position and text training specifically includes the following steps:

step 1: acquiring address data to be associated, and preprocessing the data;

The address data to be associated generally originates from the thematic data of various industry departments, such as medical care, education and buildings, that contains address information. In such thematic data the address is often not a core business attribute, so the quality of the address information is uneven and the address data must be preprocessed first. The preprocessing specifically comprises:

1) cleaning special characters, for example converting full-width symbols to half-width and removing characters such as ", $, & and %;

2) filling in missing administrative regions such as province/city/district (county)/street (township, town);

3) cleaning up meaningless data filled in by users themselves (invalid addresses at the level of an administrative district such as province/city/district (county), other irrelevant data, and the like).
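The three preprocessing operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `SPECIAL_CHARS` pattern (limited to the four symbols the text names) and the `default_region` parameter are hypothetical, and the prefix completion is deliberately naive (it does not handle an address that already carries part of the region).

```python
import re
import unicodedata

# Hypothetical symbol set; the text names ", $, & and % only as examples.
SPECIAL_CHARS = re.compile(r'["$&%]')

def clean_address(raw: str, default_region: str = "") -> str:
    """Normalize one raw address string (illustrative sketch)."""
    # NFKC folds full-width letters, digits and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", raw)
    text = SPECIAL_CHARS.sub("", text)
    text = re.sub(r"\s+", "", text)
    # Complete a missing administrative prefix; the region is assumed to be
    # known from the thematic data set the record came from.
    if text and default_region and not text.startswith(default_region):
        text = default_region + text
    return text

def is_meaningless(text: str, region: str) -> bool:
    """Treat an empty address, or one naming only the region, as invalid."""
    return text == "" or text == region
```

Records for which `is_meaningless` returns true would be dropped before word segmentation.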

A conditional random field (CRF) is a discriminative probabilistic model commonly used to label or analyze sequence data such as natural language text or biological sequences. It is used here to segment the preprocessed address data to be associated. The CRF model treats segmentation as a word-position classification problem over characters. Four labels, B, M, E and S, are generally used to define the position of a character, where B marks the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.
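The B/M/E/S scheme can be made concrete with a pair of helper functions that convert between a list of words and per-character labels. The function names are illustrative and do not come from the source.

```python
def words_to_bmes(words):
    """Map segmented words to one B/M/E/S label per character."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("S")  # single-character word
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels

def bmes_to_words(chars, labels):
    """Rebuild words from characters and their B/M/E/S labels."""
    words, buffer = [], ""
    for ch, label in zip(chars, labels):
        buffer += ch
        if label in ("E", "S"):  # a word closes on E or S
            words.append(buffer)
            buffer = ""
    if buffer:  # tolerate a label sequence that ends mid-word
        words.append(buffer)
    return words
```

Training data for the CRF is produced in the first direction, and a predicted label sequence is turned back into a segmentation in the second.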

The specific process of adopting CRF word segmentation comprises the following steps:

The first step of word segmentation tags each single character preliminarily according to the content of the sample word stock, and phrases are then formed according to the computed probabilities. At this point, however, no dependency exists between the phrases and their order cannot be determined, so the phrases must be tagged again with their part of speech. The specific process is as follows:

step S1: based on the eighteen-level address hierarchy model, constructing a part-of-speech dependency template between phrases (for example, a district or county generally follows a city, and a street follows a district (county); a street generally can neither directly follow a city nor precede it; the corresponding level dependency template is formed on this basis);
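One way to picture such a dependency template is as an ordering check over level names. The encoding below is purely hypothetical: the source does not specify how the template is represented, and `RANK`, `FORBIDDEN_DIRECT` and `follows_template` are names invented for this sketch, covering only the three levels mentioned in the example.

```python
# Hypothetical relative order of the three levels named in the example.
RANK = {"city": 1, "district": 2, "street": 3}
# Pairs that may not be directly adjacent even though the order is right.
FORBIDDEN_DIRECT = {("city", "street")}

def follows_template(levels):
    """Check a sequence of level names against the toy dependency template."""
    for prev, cur in zip(levels, levels[1:]):
        if RANK[cur] <= RANK[prev]:
            return False  # e.g. a street placed before a city
        if (prev, cur) in FORBIDDEN_DIRECT:
            return False  # e.g. a street directly after a city
    return True
```

A labelling that fails the check would be revised in the next iteration of step S2.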

As shown in FIG. 2, the training process of the conditional random field model (CRF model) is as follows:

The CRF feature template describes the relation between one or more positions in a sentence and the information at the position currently being trained. Because the same feature is defined at every position of the conditional random field, it can be summed over all positions, which converts the local feature function into a global feature function, as shown in FIG. 3.
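In standard linear-chain CRF notation, which the source does not spell out and which is assumed here, the summation over positions can be written as follows. The symbols $f_k$ (local feature function), $F_k$ (global feature function), $w_k$ (weight), $x$ (character sequence of length $n$), $y$ (label sequence) and $Z(x)$ (normalizer) are introduced for illustration only.

```latex
F_k(y, x) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x, i),
\qquad
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{k} w_k \, F_k(y, x) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{k} w_k \, F_k(y', x) \Big)
```

Under this formulation, the word segmentation combination with the highest probability is the label sequence $y$ that maximizes $P(y \mid x)$.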

Step A3: and training the Chinese addresses by adopting a characteristic function to obtain a conditional random field model for word segmentation.

It should be noted that no feature templates or feature functions exist when training first starts, so the initial values must be compiled manually: in the first round, address labeling data is obtained according to the eighteen-level address hierarchy model, and feature functions are formed directly from that labeling data. In other words, the first training pass consists of forming the initial data, the address labeling data and the feature functions, and each subsequent iteration applies and enriches the sample word stock and the feature templates on the basis of the previous result. Word segmentation follows the same path: when a Chinese address arrives, sample address data is formed first, then address labeling data, then feature templates and feature functions, and the model is trained on those feature functions to complete the segmentation. The training process is therefore also the word segmentation process.

Based on the description, the word segmentation algorithm combining deep learning and machine learning is adopted in the embodiment, the complicated Chinese address conditions such as irregular addresses and fuzzy key element expression are effectively solved through part-of-speech tagging and main word recognition and extraction, meanwhile, each association process is also a model training process, a sample word segmentation library can be automatically enriched, and compared with a manual operation and maintenance address dictionary, a large amount of manpower and material resources are saved.

In this example, the eighteen-level address hierarchy model subdivides addresses into 18 levels according to the characteristics of Chinese addresses: province (municipality directly under the central government/special administrative region), prefecture-level city, district (county/county-level city), development area (industrial park, etc.), street (township/town), community (village), group (team), business district, main road, branch road, house number, sub-number, residential area (point of interest), building number, unit number, floor number, room number, and address description information, as shown in Table 1.

The address grading process takes the address as reference, and the combination is carried out according to the characteristics of the entries and the context relationship, and finally the corresponding grade is given. It should be noted that although the model subdivides eighteen levels of hierarchical levels, a specific address generally cannot contain all the hierarchical levels, and particularly after the ninth level, the number of entries is generally large, which is why it is difficult to accurately match address data only by relying on an address dictionary and text analysis. In addition, the eighteenth-level address hierarchy model is only one possible exemplary model, and in other possible practical cases, other hierarchy modes can be adopted.

TABLE 1 eighteen-level Address hierarchy

Level	Content	Example
1	Province, municipality directly under the central government, special administrative region	Hubei Province, Chongqing City, Hong Kong Special Administrative Region
2	Prefecture-level city	Mourning city
3	District, county, county-level city	Yubei District, Yuyangxian county
4	Development area, industrial park, etc.	High-tech Zone, Liangjiang New Area
5	Street, town, township	Longshan Street
6	Community, village	Ranjiaba Community
7	Group, team	Huzhu Village Group 4
8	Business district	Longhu Times Sky Street
9	Main road	Elaphe carinata (Benth.) Merr
10	Branch road	2 branches of the clubmoss
11	House number	Yusong Road No. 123
12	Sub-number	Yusong Road No. 123, Sub-No. 4
13	Residential area, point of interest	Liangjiang Chuncheng
14	Building number	Liangjiang Chuncheng, Building 13
15	Unit number	Liangjiang Chuncheng, Building 13, Unit 2
16	Floor number	Liangjiang Chuncheng, Building 13, Unit 2, Floor 3
17	Room number	Liangjiang Chuncheng, Building 13, Unit 2, Floor 3, Room 301
18	Address description information	Beside the Great Hall of the People

After the address to be associated has been segmented, main words are identified and extracted based on the eighteen-level address hierarchy model. The main words mainly correspond to the ninth to thirteenth levels of the model, namely main road, branch road, house number, sub-number and residential area (point of interest).

The process of identifying and extracting the main words comprises the following steps:

For example, for "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101, Chunfeng Chengshi Xinzhu", the main words are "Longshan Avenue No. 101" and "Chunfeng Chengshi Xinzhu"; for "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101", the main word is "Longshan Avenue No. 101"; for "Chongqing, Yubei District, Longshan Street, Chunfeng Chengshi Xinzhu", the main word is "Chunfeng Chengshi Xinzhu".
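The extraction rule can be sketched as a scan from level 13 up to level 9 over phrases that already carry a level label. This is one reading of steps 3.1 to 3.3, not a definitive implementation; the input format of `(phrase, level)` pairs and the names `MAIN_WORD_LEVELS` and `extract_main_words` are assumptions.

```python
# Main word levels of the eighteen-level model, scanned from 13 up to 9.
MAIN_WORD_LEVELS = range(13, 8, -1)

def extract_main_words(tagged):
    """Collect main words from (phrase, level) pairs.

    Starts at level 13 and moves up one level at a time; several phrases
    at the same level are all extracted. Returns an empty list when the
    address has no content at the main word levels (step 3.3).
    """
    main_words = []
    for level in MAIN_WORD_LEVELS:
        main_words.extend(phrase for phrase, lv in tagged if lv == level)
    return main_words
```

For the first example address, a tagging that places "Longshan Avenue No. 101" at level 11 and "Chunfeng Chengshi Xinzhu" at level 13 yields both phrases as main words, with the level 13 phrase first because of the scan order.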

In specific implementation, a standard address library is generally very large, and retrieving directly through the main word index would incur high time and space complexity. Therefore, before the text matching similarity is calculated, a thematic classification screening radius and a main word search radius are introduced and a buffer is selected according to spatial position, which effectively filters out redundant interfering data, reduces the amount of data in subsequent calculation, and provides the precondition for accurate text matching.

The screening process of the candidate address set comprises the following steps:

These steps greatly improve retrieval efficiency and filter out a large amount of redundant data. Taking education thematic data as an example, the address of the Huxi Campus of Chongqing University is "No. 55 Daxuecheng South Road, Huxi Street, Shapingba District, Chongqing". The thematic classification radius for colleges and universities in the education theme is set to 5 kilometers, the main word of the address is "Daxuecheng South Road No. 55" with a search radius of 3 kilometers, and therefore the coordinate point of this address is taken as the origin and the address data within a buffer radius of 5 kilometers are used as the candidate address set.

It should be emphasized that the thematic classification screening radius and the main word search radius are not fixed parameters and can be adjusted according to the data in practical application. Likewise, which of the two radii is larger depends on the actual application.
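The radius selection and buffer screening of steps 4.1 and 4.2 can be sketched as below. Great-circle (haversine) distance on latitude and longitude is an assumption; the source does not state the coordinate system or distance measure. The record layout with `lat` and `lon` keys, the function names, and the optional `keep` predicate standing in for the main word index lookup of step 4.3 are likewise hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def screen_candidates(origin, library, topic_radius_km, main_word_radius_km, keep=None):
    """Select standard addresses inside the buffer around `origin`.

    origin: (lat, lon) of the address to be associated.
    library: iterable of dicts with 'lat' and 'lon' keys (assumed layout).
    keep: optional predicate standing in for the main word index lookup.
    """
    # Step 4.1: the larger of the two radii is the screening radius.
    radius_km = max(topic_radius_km, main_word_radius_km)
    lat0, lon0 = origin
    # Step 4.2: keep every record inside the circular buffer.
    circled = [rec for rec in library
               if haversine_km(lat0, lon0, rec["lat"], rec["lon"]) <= radius_km]
    # Step 4.3 (placeholder): further filtering through the main word index.
    return circled if keep is None else [rec for rec in circled if keep(rec)]
```

With the radii from the example above (5 km and 3 km), the buffer radius evaluates to 5 km.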

Step 5: determining a target address in the candidate address set;

the determination process is as follows:

step 5.3: for the incomplete matching condition, searching in the candidate address data index again according to the main word information of the address to be associated to obtain candidate address data with intersection between the main word of the address to be associated and the main word of the candidate address as an initial recommended candidate address;

For example, the address to be associated is "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101, Sub-No. 1" (main word "Longshan Avenue No. 101, Sub-No. 1"), and the candidate addresses are "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101" (main word "Longshan Avenue No. 101"), "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101, Sub-No. 2" (main word "Longshan Avenue No. 101, Sub-No. 2") and "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101, Sub-No. 3" (main word "Longshan Avenue No. 101, Sub-No. 3"). The main words of all three candidates intersect with the main word of the address to be associated, so all three are taken as initial recommended candidate addresses.

Matching uses the word segmentation information of the address to be associated: the deeper the matched main word level, that is, the closer it is to the eighteenth level, the higher the candidate ranks in the similarity ordering.

Step 5.4: within the same main word level, sorting the initial recommended candidate addresses by spatial distance from the address to be associated, from near to far, to obtain the final recommended candidate addresses, and outputting the top 10 (the number of recommended candidate addresses to output can be set as required);
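A minimal sketch of this ordering, assuming each candidate already carries the level of its matched main word and its distance from the address to be associated. The keys `main_level` and `distance_km` and the function name are hypothetical, and ranking deeper levels first is one reading of the preceding paragraph.

```python
def rank_candidates(candidates, top_n=10):
    """Order candidates by main word level (deeper first), then by distance.

    candidates: dicts with 'main_level' (9 to 13) and 'distance_km' keys.
    top_n: how many recommended candidates to keep (10 in the text).
    """
    ordered = sorted(candidates, key=lambda c: (-c["main_level"], c["distance_km"]))
    return ordered[:top_n]
```

The candidates returned here are the ones passed on to the text similarity calculation of step 5.5.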

The edit distance between two character strings is the minimum number of editing operations required to convert one string into the other. The allowed editing operations are replacing one character with another, inserting a character, and deleting a character; the number of characters replaced, inserted and deleted is accumulated as the edit distance between the two strings. The similarity is calculated as follows:

sim = 1 - dis / max(len(s1), len(s2)).

For example, take "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101" and "Chongqing, Yubei District, Longshan Street, Longshan Avenue". The two addresses differ only by the trailing "No. 101", which is four characters in the original Chinese, so dis = 4; with lengths counted in Chinese characters, max(len(s1), len(s2)) = 18, and sim = 1 - 4/18 ≈ 0.78.
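The formula can be checked with a short Levenshtein implementation. The Chinese strings in the usage note are a reconstruction of the example address from its translation (the source text here gives only the English rendering), chosen so that the lengths and the distance agree with the figures in the example; they are an assumption, as are the function names.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost replace, insert and delete."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                # delete c1
                           cur[j - 1] + 1,             # insert c2
                           prev[j - 1] + (c1 != c2)))  # replace, or keep if equal
        prev = cur
    return prev[-1]

def similarity(s1: str, s2: str) -> float:
    """sim = 1 - dis / max(len(s1), len(s2)); two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - edit_distance(s1, s2) / longest
```

Calling `similarity("重庆市渝北区龙山街道龙山大道101号", "重庆市渝北区龙山街道龙山大道")` compares an 18-character string with a 14-character one at edit distance 4, which gives 1 - 4/18.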

Step 5.6: and taking the candidate address with the highest similarity value sim as the target address.

Step 6: establishing an association relation table between the target address and the address to be associated: the address code of the target address and the unique element code of the address to be associated are put in one-to-one correspondence and written into the association relation table, which completes the association; a B-TREE index or another index structure is then built to ensure the retrieval efficiency of the table.
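As an illustration of such a table, the sketch below uses SQLite from the Python standard library, whose ordinary indexes are stored as B-trees. The source does not name a database; the table name, column names and helper functions are hypothetical.

```python
import sqlite3

def create_association_table(conn):
    """Create the association table and an index on the standard address code."""
    # element_code: unique code of the record to be associated (assumed unique).
    # address_code: code of the matched target address in the standard library.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS address_association ("
        "element_code TEXT PRIMARY KEY, "
        "address_code TEXT NOT NULL)"
    )
    # SQLite implements this index as a B-tree, which serves lookups by address.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_assoc_address "
        "ON address_association(address_code)"
    )

def associate(conn, element_code, address_code):
    """Write one element-to-address association (overwrites an earlier one)."""
    conn.execute(
        "INSERT OR REPLACE INTO address_association VALUES (?, ?)",
        (element_code, address_code),
    )

def elements_at(conn, address_code):
    """Return the element codes associated with one standard address."""
    rows = conn.execute(
        "SELECT element_code FROM address_association "
        "WHERE address_code = ? ORDER BY element_code",
        (address_code,),
    )
    return [row[0] for row in rows]
```

Looking up all thematic records attached to one standard address is the query that linked updating and cross-analysis across industries would rely on.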

In conclusion, the method solves the problem of establishing the association relationship between the data of different industry departments and the target address data in the standard address library, realizes the unified association of the data of all industries through the standard addresses, and lays a foundation for the linkage update and the cross analysis application of the data of all the industries subsequently. The method is applied to establishing the association relationship between tens of millions of real standard addresses and each thematic data, the accuracy of the association result reaches more than 90%, and the method has good accuracy and stability.

The technical solution provided by the present invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A Chinese address association method based on space position and text training is characterized by comprising the following steps:

step 1: acquiring address data to be associated, and preprocessing the data;

step 4: screening a candidate address set in a standard address library according to the thematic classification screening radius and the main word searching radius;

step 5: determining a target address in the candidate address set;

2. The Chinese address association method based on space position and text training of claim 1, wherein: the preprocessing of the address data to be associated in step 1 comprises cleaning special characters, completing missing administrative regions, and cleaning meaningless data filled in by users.

3. The Chinese address association method based on space position and text training of claim 1, wherein: the specific steps of adopting the conditional random field model to perform word segmentation on the preprocessed address data to be associated in step 2 are as follows:

4. The Chinese address association method based on space position and text training of claim 1, wherein: the part-of-speech tagging is performed on the segmentation result in step 2 as follows:

5. The Chinese address association method based on space position and text training of claim 1, 3 or 4, wherein: the training process of the conditional random field model comprises the following steps:

6. The Chinese address association method based on space position and text training of claim 1, wherein: the process of identifying and extracting the main words from the word segmentation result in step 3 is as follows:

7. The Chinese address association method based on space position and text training of claim 1, wherein: the screening process of the candidate address set in step 4 is as follows:

8. The Chinese address association method based on space position and text training of claim 1, wherein: the determination process of the target address in step 5 is as follows:

step 5.4: sorting the addresses to be associated and the initial recommended candidate addresses in the same level of the main words from near to far according to the spatial position distance, and taking a plurality of parts sorted in the front to obtain the final recommended candidate addresses;

9. The Chinese address association method based on space position and text training of claim 1 or 8, wherein: the calculation formula of the edit distance algorithm is as follows:

sim = 1 - dis / max(len(s1), len(s2)),

10. The Chinese address association method based on space position and text training of claim 1, wherein: in step 6, in the process of establishing the association relation table, the address codes of the standard addresses and the unique element codes of the addresses to be associated are put in one-to-one correspondence and written into the association relation table, and an index is established to ensure the retrieval efficiency of the association relation table.