CN112527933A - Chinese address association method based on space position and text training - Google Patents

Chinese address association method based on space position and text training

Info

Publication number
CN112527933A
CN112527933A CN202011409893.7A CN202011409893A CN112527933A CN 112527933 A CN112527933 A CN 112527933A CN 202011409893 A CN202011409893 A CN 202011409893A CN 112527933 A CN112527933 A CN 112527933A
Authority
CN
China
Prior art keywords
address
data
word
candidate
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011409893.7A
Other languages
Chinese (zh)
Inventor
董文杰
何宗
高翔
袁超
张红文
贾亚辉
刘建
韩维喆
叶胜
瞿孟
李胜
王岚
肖勇
钱文进
王俊
曾攀
彭婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Geographic Information And Remote Sensing Application Center
Original Assignee
Chongqing Geographic Information And Remote Sensing Application Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Geographic Information And Remote Sensing Application Center
Priority to CN202011409893.7A
Publication of CN112527933A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese address association method based on spatial position and text training, comprising the following steps: acquiring the address data to be associated and preprocessing it; segmenting the preprocessed address data into words with a conditional random field model and tagging the segmentation result with parts of speech; identifying and extracting main words from the segmentation result based on an eighteen-level address hierarchy model; screening a candidate address set from an existing standard address library according to the thematic classification screening radius and the main word search radius; determining a target address in the candidate address set; and establishing an association table between the target address and the address to be associated. The method solves the problem of associating data from different industry departments with standard address data, so that data from all industries can be uniformly associated through standard addresses.

Description

Chinese address association method based on space position and text training
Technical Field
The invention relates to the technical field of geographic information, in particular to a Chinese address association method based on spatial position and text training.
Background
With the rapid development of science and technology, industry departments in fields such as natural resources and the economy and society have accumulated massive data resources, and they actively share and exchange these resources to maximize the value of the data. However, because data from different sources differ in content, organization and accuracy, effective associations between datasets are difficult to establish, which poses great challenges for the integrated application, analysis and management of data. How to establish such associations effectively and break down data barriers is a problem that urgently needs solving. Statistically, more than 80% of human activities are related to geospatial locations, and an address is a textual representation of a geospatial location. Using addresses as the link between different datasets is therefore an important and feasible approach.
To associate data through addresses, the prior art usually relies on methods that either depend on an address dictionary or are independent of one. Dictionary-based methods build, in advance, a lexicon of key address elements, matching rules and a geocoding library; the key elements of the address to be associated are then used as retrieval conditions for a traversal search and match in the address dictionary, identical address data are found, and the association between data from different sources is established. These methods parse addresses covered by the dictionary well, but they are limited in complex Chinese address scenarios: an existing dictionary cannot contain every element of every address, so addresses it does not cover are parsed poorly, and as the dictionary grows, building new content becomes laborious and time-consuming and the dictionary becomes too large to maintain easily. Dictionary-independent methods generally analyze the compositional features of address elements with natural language processing or similar techniques and then match addresses against each other. They handle addresses with a well-formed, standard structure well, but perform poorly on Chinese addresses with semantically vague descriptions and irregular structure, and they can only analyze an address along the text dimension.
To summarize, the difficulty of address association is mainly reflected in the following three aspects:
1) Because different industry departments have different requirements for addresses, the spatial position and address description of each dataset deviate from the standard address, so the association between industry data and standard addresses is difficult to establish accurately from the spatial position or the address description alone;
2) Owing to historical change and social development, many address names have changed over time, giving rise to large numbers of old names, aliases and the like; in addition, addresses collected in some processes are often non-standard, with missing items, wrongly written characters, or only an approximate description of direction and range;
3) Addresses described in natural language exhibit semantic continuity as well as abbreviations and short forms, which traditional word segmentation and string matching struggle to recognize effectively and accurately.
In view of the above, there is a need for a Chinese address association method that considers both the geospatial position and the text, does not depend solely on an address dictionary, tolerates vague address wording and irregular structure, adapts to complex Chinese address scenarios, and effectively associates data from different industry departments with standard addresses.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a Chinese address association method based on spatial position and text training that adapts to complex Chinese address scenarios, does not depend solely on an address dictionary, copes with vague address semantics and irregular structure, and combines the two dimensions of geospatial position and text, thereby solving the technical problem of associating data from different industry departments with standard address data.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a Chinese address association method based on space position and text training is characterized by comprising the following steps:
step 1: acquiring address data to be associated, and preprocessing the data;
step 2: performing word segmentation on the preprocessed address data to be associated by adopting a conditional random field model, and performing part-of-speech tagging on a word segmentation result;
step 3: performing main word recognition and extraction on the word segmentation result based on an eighteen-level address hierarchy model;
step 4: screening a candidate address set from an existing standard address library according to the thematic classification screening radius and the main word search radius;
step 5: determining a target address in the candidate address set;
step 6: establishing an association table between the target address and the address to be associated.
Further, the preprocessing of the address data to be associated in step 1 includes cleaning special characters, completing missing administrative regions, and cleaning meaningless data entered by users.
Further, the step 2 of using the conditional random field model to perform word segmentation on the preprocessed address data to be associated specifically comprises the following steps:
step 2.1: based on the phrases in the initial sample word library, labeling the word position of each single character in the preprocessed address data to be associated with a conditional random field model;
step 2.2: calculating the continuity probability between characters through the feature templates of the conditional random field model, training iteratively, and finally calculating the probability of each candidate word segmentation combination;
step 2.3: selecting the word segmentation combination with the highest probability as the word segmentation result.
Further, the step of performing part-of-speech tagging on the segmentation result in the step 2 is as follows:
step S1: constructing a part-of-speech dependency template between phrases based on the eighteen-level address hierarchy model;
step S2: during training of the conditional random field model, preliminarily assigning each phrase a part of speech according to the eighteen levels, then iterating under the part-of-speech dependency template to determine the final level label;
step S3: feeding the word segmentation results with their final level labels back into the initial sample word library to enrich it.
Further, the training process of the conditional random field model is as follows:
step A1: obtaining address labeling data from the sample address data according to the eighteen-level address hierarchy model;
step A2: deriving the feature templates from the address labeling data by statistics and induction, and forming the feature functions;
step A3: training on Chinese addresses with the feature functions to obtain the conditional random field model.
Further, the process of identifying and extracting the main words from the word segmentation result in the step 3 is as follows:
step 3.1: according to the word segmentation result, where the address contains content at the main word levels, extracting main words starting from the thirteenth level of the eighteen-level address hierarchy model; if several main words exist at the same level, extracting them one by one;
step 3.2: if there is no thirteenth-level content, moving up one level at a time until all main words have been identified and extracted;
step 3.3: where the address contains no content at the main word levels, the spatial range it describes is too large and the address is of little use for association.
Further, the screening process of the candidate address set in step 4 is as follows:
step 4.1: taking the larger value of the thematic classification screening radius and the main word searching radius as a screening radius;
step 4.2: taking the coordinate point of the address to be associated as the origin and the screening radius as the buffer radius, and selecting all standard library address data that fall within the resulting circular buffer;
step 4.3: and screening out a candidate address set through the main word index on the basis of the circled standard address data.
Further, the determination process of the target address in step 5 is as follows:
step 5.1: constructing a candidate address data index according to an eighteen-level address grading model based on the candidate address set screened in the step 4;
step 5.2: searching the candidate address data index with the word segmentation result of the address to be associated; if a candidate matches completely, that is, a candidate address identical to the address to be associated is found, taking it directly as the target address; otherwise, proceeding to step 5.3;
step 5.3: searching the candidate address data index again with the main word information of the address to be associated, and taking the candidate addresses whose main words intersect with those of the address to be associated as the initial recommended candidate addresses;
step 5.4: among the initial recommended candidate addresses whose main words are at the same level as those of the address to be associated, sorting by spatial distance to the address to be associated from near to far, and taking the top several as the final recommended candidate addresses;
step 5.5: calculating the text similarity of the address to be associated and the final recommended candidate address by adopting an edit distance algorithm;
step 5.6: and taking the candidate address with the highest similarity value as the target address.
Further, the calculation formula of the edit distance algorithm is as follows:
sim = 1 - dis / max(len(s1), len(s2)),
where sim is the text similarity between the character string s1 of the address to be associated and the character string s2 of the candidate address, dis is the edit distance between s1 and s2, and max(len(s1), len(s2)) is the length of the longer of the two character strings.
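The similarity formula can be illustrated with a short sketch. The source only says "edit distance algorithm"; the code below assumes the usual Levenshtein distance (unit-cost insertion, deletion and substitution), and the function names `edit_distance` and `similarity` are illustrative rather than taken from the patent.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit costs (an assumed choice of edit distance)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2
            ))
        previous = current
    return previous[-1]


def similarity(s1: str, s2: str) -> float:
    """sim = 1 - dis / max(len(s1), len(s2)); two empty strings count as identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(s1, s2) / longest


def best_candidate(query: str, candidates: list) -> str:
    """Step 5.6: pick the candidate address with the highest similarity."""
    return max(candidates, key=lambda candidate: similarity(query, candidate))
```

With this definition, two strings of length 4 that differ in one character have a similarity of 0.75.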
Further, when the association table is established in step 6, the address code of the standard address and the unique element code of the address to be associated are written into the table in one-to-one correspondence, and an index is built on the table to ensure retrieval efficiency.
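As a minimal sketch of such an association table, the example below uses SQLite from the Python standard library. The table name, column names and the choice of an in-memory database are assumptions for illustration; the patent does not specify a storage engine or schema.

```python
import sqlite3
from typing import Iterable, Tuple


def build_association_table(pairs: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Store (element_code, standard_address_code) pairs and index them.

    Table and column names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE address_association ("
        " element_code TEXT PRIMARY KEY,"           # unique code of the record to be associated
        " standard_address_code TEXT NOT NULL)"     # address code in the standard address library
    )
    # The primary key already indexes element_code; this index speeds up
    # lookups in the other direction (standard address -> industry records).
    conn.execute(
        "CREATE INDEX idx_standard_address_code"
        " ON address_association (standard_address_code)"
    )
    conn.executemany("INSERT INTO address_association VALUES (?, ?)", pairs)
    conn.commit()
    return conn


conn = build_association_table([("school-001", "ADDR-100"), ("clinic-007", "ADDR-200")])
row = conn.execute(
    "SELECT standard_address_code FROM address_association WHERE element_code = ?",
    ("school-001",),
).fetchone()
```

Here `row` holds the standard address code linked to the hypothetical record `school-001`.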
The invention has the following remarkable effects:
The method solves the problem of associating data from different industry departments with standard address data, achieves unified association of industry data through standard addresses with good accuracy and stability, and lays a foundation for subsequent linked updating and cross-analysis of industry data. Compared with existing address association techniques, the method combines geospatial position screening with text training: before text matching similarity is calculated, candidates are selected with a buffer around the spatial position, which filters out redundant interfering data, reduces the amount of data for subsequent computation, and provides the precondition for accurate text matching. In addition, a word segmentation algorithm combining deep learning and machine learning is adopted; through part-of-speech tagging and main word recognition and extraction, complex Chinese address cases such as irregular addresses and vaguely expressed key elements are handled effectively. Each association run is also a training run of the model, so the sample word segmentation library is enriched automatically, saving considerable manpower and material resources compared with manually maintaining an address dictionary.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram of a conditional random field model training process;
FIG. 3 is a schematic diagram of global feature function formation;
FIG. 4 is a flow chart of the determination of a target address.
Detailed Description
The following provides a more detailed description of the embodiments and the operation of the present invention with reference to the accompanying drawings.
As shown in FIG. 1, a Chinese address association method based on spatial position and text training specifically includes the following steps:
step 1: acquiring address data to be associated, and preprocessing the data;
The address data to be associated generally comes from thematic data of various industry departments that contains address information, such as medical, education and building data. In such thematic data the address is often not a core business attribute, so its quality is uneven and the address data must first be preprocessed, specifically:
1) cleaning special characters, for example converting full-width symbols to half-width and removing characters such as ", $, & and %;
2) completing missing administrative regions such as province/city/district (county)/street (township, town);
3) cleaning meaningless data entered by users (invalid addresses that give only administrative regions such as province/city/district (county), other irrelevant data, and the like).
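The three preprocessing steps can be sketched as follows. This is a simplified illustration: the placeholder list, the function name `preprocess_address` and the rule of prepending a single city name are all assumptions, and a real implementation would need a proper administrative-division lookup.

```python
import re
import unicodedata
from typing import Optional

# Characters named in the text (", $, &, %) plus whitespace.
SPECIAL_CHARS = re.compile(r'["$&%\s]')
# Hypothetical examples of meaningless user input; not taken from the patent.
PLACEHOLDERS = {"", "无", "不详", "none", "n/a"}


def preprocess_address(raw: str, city: str = "") -> Optional[str]:
    """Clean one raw address; return None when nothing meaningful is left."""
    # 1) NFKC normalization maps full-width letters, digits and symbols to half-width.
    text = unicodedata.normalize("NFKC", raw)
    text = SPECIAL_CHARS.sub("", text)
    # 3) Drop meaningless entries.
    if text.lower() in PLACEHOLDERS:
        return None
    # 2) Naive completion of a missing administrative region: prepend the
    #    city name when it does not occur anywhere in the address.
    if city and city not in text:
        text = city + text
    return text
```

For instance, `preprocess_address("龙山大道１０１号＄", city="重庆市")` converts the full-width digits, removes the dollar sign and prepends the city name.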
Step 2: performing word segmentation on the preprocessed address data to be associated by adopting a conditional random field model, and performing part-of-speech tagging on a word segmentation result;
A conditional random field (CRF) is a discriminative probabilistic model commonly used to label or analyze sequence data such as natural language text or biological sequences. It can be used to segment the preprocessed address data to be associated: the CRF model treats word segmentation as a character-level word-position classification problem, usually with the four labels B, M, E and S, where B marks the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.
The specific process of adopting CRF word segmentation comprises the following steps:
step 2.1: based on the phrases in the initial sample word library, labeling the word position of each single character in the preprocessed address data to be associated with a conditional random field model;
step 2.2: calculating the continuity probability between characters through the feature templates of the conditional random field model, training iteratively, and finally calculating the probability of each candidate word segmentation combination;
step 2.3: selecting the word segmentation combination with the highest probability as the word segmentation result.
The first step of word segmentation performs preliminary word-position labeling on single characters according to the content of the sample word library, after which the phrases are formed by probability calculation. At this point, however, there is no dependency between the phrases and their order cannot be determined, so the phrases must additionally be tagged with parts of speech, as follows:
step S1: based on the eighteen-level address hierarchy model, constructing a part-of-speech dependency template between phrases (for example, a district or county generally follows a city, and a street follows a district (county); a street generally neither directly follows a city nor precedes it; the corresponding hierarchical dependency template is built on such rules);
step S2: during training of the conditional random field model, preliminarily assigning each phrase a part of speech according to the eighteen levels, then iterating under the part-of-speech dependency template to determine the final level label;
step S3: feeding the word segmentation results with their final level labels back into the initial sample word library to enrich it.
Word segmentation example: "重庆大学张三收" (Chongqing University, recipient Zhang San). The CRF model can initialize the label of each character in several ways; taking one of them as an example, the characters are initially labeled "重|B 庆|B 大|B 学|B 张|B 三|B 收|B" and, after repeated iterative training, finally labeled "重|B 庆|M 大|M 学|E 张|B 三|E 收|S", which gives the final word segmentation result "重庆大学 / 张三 / 收" (Chongqing University / Zhang San / recipient).
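How a B/M/E/S label sequence turns into words can be shown with a small decoder. This sketch covers only the final decoding step, not the CRF itself; the function name `decode_bmes` is illustrative, and the Chinese string is an assumed reconstruction of the example above.

```python
from typing import List


def decode_bmes(chars: str, tags: List[str]) -> List[str]:
    """Group characters into words according to their B/M/E/S labels."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    buffer = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if buffer:
                words.append(buffer)
                buffer = ""
            words.append(ch)
        elif tag == "B":          # a new word begins
            if buffer:
                words.append(buffer)
            buffer = ch
        elif tag == "M":          # the current word continues
            buffer += ch
        elif tag == "E":          # the current word ends
            words.append(buffer + ch)
            buffer = ""
        else:
            raise ValueError(f"unknown tag: {tag}")
    if buffer:                    # tolerate a sequence that ends without E
        words.append(buffer)
    return words


print(decode_bmes("重庆大学张三收", list("BMMEBES")))
```

The decoder is deliberately tolerant of malformed sequences (for example a B directly followed by another B), since an intermediate labeling such as the all-B initialization is not a valid final segmentation.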
As shown in FIG. 2, the training process of the conditional random field model (CRF model) is as follows:
step A1: obtaining address labeling data from the sample address data according to the eighteen-level address hierarchy model;
step A2: deriving the feature templates from the address labeling data by statistics and induction, and forming the feature functions;
A CRF feature template expresses the relationship between one or more positions in a sentence and the information at the position currently being trained. Because the same feature is defined at every position of the sequence, it can be summed over all positions, which converts the local feature functions into global feature functions, as shown in FIG. 3.
Step A3: training on Chinese addresses with the feature functions to obtain the conditional random field model used for word segmentation.
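The summation of local feature functions into global ones described under step A2 can be written in the standard linear-chain CRF notation. The symbols below are not defined in the source and are introduced only for illustration: x is the character sequence of length n, y the label sequence, f_k a local feature function, w_k its weight, and Z(x) the normalization factor.

```latex
F_k(\mathbf{y}, \mathbf{x}) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, \mathbf{x}, i),
\qquad
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{k} w_k \, F_k(\mathbf{y}, \mathbf{x}) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{k} w_k \, F_k(\mathbf{y}', \mathbf{x}) \Big)
```

Training estimates the weights w_k from the address labeling data, and segmentation selects the label sequence y with the highest probability P(y | x).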
It should be noted that at the first training there are no feature templates or feature functions yet, so the address labeling data must first be produced manually according to the eighteen-level address hierarchy model, and the feature functions are then formed directly from this labeling data. In other words, forming the initial data, the address labeling data and the feature functions constitutes the first training pass, and each subsequent iteration applies and enriches the sample word library and the feature templates on the basis of the earlier results. When a new Chinese address arrives for segmentation, the same process is followed: sample address data is formed first, then the address labeling data, then the feature templates and feature functions, and the model is trained on the feature functions to complete the segmentation; the training process is thus also the segmentation process.
Based on the above, this embodiment adopts a word segmentation algorithm combining deep learning and machine learning; through part-of-speech tagging and main word recognition and extraction, it effectively handles complex Chinese address cases such as irregular addresses and vaguely expressed key elements. Each association run is also a training run of the model, so the sample word segmentation library is enriched automatically, which saves considerable manpower and material resources compared with manually maintaining an address dictionary.
In this example, the eighteen-level address hierarchy model divides an address into 18 levels according to the characteristics of Chinese addresses: province (municipality directly under the central government / special administrative region), prefecture-level city, district (county / county-level city), development area (industrial park, etc.), street (township / town), community (village), group (team), business circle, main road, branch road, house number, sub-number, zone (point of interest), building number, unit number, floor number, room number, and address description information, as shown in Table 1.
Address grading takes the address as the unit of reference, combines the entries according to their characteristics and context, and finally assigns each entry its level. Note that although the model defines eighteen levels, a specific address generally does not contain all of them; moreover, from the ninth level onward the number of entries is generally large, which is why address data is difficult to match accurately by relying only on an address dictionary and text analysis. The eighteen-level address hierarchy model is only one possible example; other hierarchies may be used in other practical cases.
TABLE 1 Eighteen-level address hierarchy
Level | Content | Example
1 | Province, municipality directly under the central government, special administrative region | Hubei Province, Chongqing City, Hong Kong Special Administrative Region
2 | Prefecture-level city | Mourning city
3 | District, county, county-level city | Yubei District, Yuyangxian county
4 | Development area, industrial park, etc. | High-tech Zone, Liangjiang New Area
5 | Street, township, town | Longshan Street
6 | Community, village | Ranjiaba village
7 | Group, team | Mutual Help Village, Group 4
8 | Business circle | Longhu Times Sky Street
9 | Main road | Elaphe carinata (Benth.) Merr
10 | Branch road | 2 branches of the clubmoss
11 | House number | Yusonlu No. 123
12 | Sub-number | Yusonlu No. 123, sub-No. 4
13 | Zone, point of interest | Two Rivers Spring City
14 | Building number | Two Rivers Spring City, Building 13
15 | Unit number | Two Rivers Spring City, Building 13, Unit 2
16 | Floor number | Two Rivers Spring City, Building 13, Unit 2, Floor 3
17 | Room number | Two Rivers Spring City, Building 13, Unit 2, Floor 3, Room 301
18 | Address description information | Beside the Great Hall of the People
Step 3: performing main word recognition and extraction on the word segmentation result based on the eighteen-level address hierarchy model;
After the address to be associated has been segmented, main words are identified and extracted based on the eighteen-level address hierarchy model. The main words correspond mainly to the ninth to thirteenth levels of the model, namely main road, branch road, house number, sub-number and zone (point of interest).
The process of identifying and extracting the main words comprises the following steps:
step 3.1: according to the word segmentation result, where the address contains content at the main word levels, extracting main words starting from the thirteenth level of the eighteen-level address hierarchy model; if several main words exist at the same level, extracting them one by one;
step 3.2: if there is no thirteenth-level content, moving up one level at a time until all main words have been identified and extracted;
step 3.3: where the address contains no content at the main word levels, the spatial range it describes is too large and the address is of little use for association.
For example, for "Chongqing, Longshan Street, Longshan Avenue No. 101, Spring Wind City Core Building", the main words are "Longshan Avenue No. 101" and "Spring Wind City Core Building"; for "Chongqing, Yubei District, Longshan Street, Longshan Avenue No. 101", the main word is "Longshan Avenue No. 101"; for "Chongqing, Longshan Street, Spring Wind City Core Building", the main word is "Spring Wind City Core Building".
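The extraction rule can be sketched on a segmentation result given as (word, level) pairs. Joining the road-related levels 9 to 12 into a single main word is an assumption inferred from the examples above, and the function name and the English tokens are illustrative.

```python
from typing import List, Tuple

ROAD_LEVELS = (9, 10, 11, 12)  # main road, branch road, house number, sub-number
POI_LEVEL = 13                 # zone / point of interest


def extract_main_words(tokens: List[Tuple[str, int]], sep: str = " ") -> List[str]:
    """Return the main words of a segmented address; empty if none exist."""
    # Step 3.1: every level-13 entry is a main word of its own.
    pois = [word for word, level in tokens if level == POI_LEVEL]
    # Step 3.2: entries at levels 9..12 are joined, in address order,
    # into one road-based main word (an assumed grouping).
    road_parts = [word for word, level in tokens if level in ROAD_LEVELS]
    main_words = [sep.join(road_parts)] if road_parts else []
    # Step 3.3: an empty list signals an address too coarse to associate.
    return main_words + pois


tokens = [
    ("Chongqing", 1), ("Longshan Street", 5),
    ("Longshan Avenue", 9), ("No. 101", 11),
    ("Spring Wind City Core Building", 13),
]
print(extract_main_words(tokens))
```

An address that stops at the street level yields an empty list, which corresponds to the case in step 3.3.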
Step 4: screening a candidate address set from an existing standard address library according to the thematic classification screening radius and the main word search radius;
In a specific implementation, a standard address library is generally very large, and retrieving directly through the main word index would incur high time and space complexity. Therefore, before text matching similarity is calculated, a thematic classification screening radius and a main word search radius are introduced and candidates are selected with a buffer around the spatial position, which filters out redundant interfering data, reduces the amount of data for subsequent computation, and provides the precondition for accurate text matching.
The screening process of the candidate address set comprises the following steps:
step 4.1: taking the larger value of the thematic classification screening radius and the main word searching radius as a screening radius;
step 4.2: taking the coordinate point of the address to be associated as the origin and the screening radius as the buffer radius, and selecting all standard library address data that fall within the resulting circular buffer;
step 4.3: and screening out a candidate address set through the main word index on the basis of the circled standard address data.
These steps greatly improve retrieval efficiency and filter out a large amount of redundant data. Taking education thematic data as an example, the address of the Huxi Campus of Chongqing University is "Chongqing, Shapingba District, Huxi Street, Daxuecheng South Road No. 55". Suppose the classification radius for colleges and universities in the education theme is set to 5 kilometers, and the main word of the address, "Daxuecheng South Road No. 55", has a search radius of 3 kilometers. The coordinate point of the address is then taken as the origin, and the address data within a buffer radius of 5 kilometers is selected as the candidate address set.
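Steps 4.1 and 4.2 can be sketched as a simple distance filter. The sketch assumes that coordinates are longitude/latitude in degrees and uses the haversine great-circle distance; the patent does not state the coordinate system or distance measure, and the record layout `(code, lon, lat)` and the sample coordinates are hypothetical. The main word index filter of step 4.3 is not covered.

```python
import math
from typing import Iterable, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

Record = Tuple[str, float, float]  # (address code, longitude, latitude)


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance between two lon/lat points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def screen_candidates(origin: Tuple[float, float],
                      standard_addresses: Iterable[Record],
                      topic_radius_km: float,
                      main_word_radius_km: float) -> List[Record]:
    """Keep the standard addresses inside the circular buffer around origin."""
    # Step 4.1: the screening radius is the larger of the two radii.
    radius = max(topic_radius_km, main_word_radius_km)
    # Step 4.2: select everything within that radius of the origin.
    return [rec for rec in standard_addresses
            if haversine_km(origin[0], origin[1], rec[1], rec[2]) <= radius]


library = [("A", 106.30, 29.60), ("B", 106.30, 29.63), ("C", 106.30, 29.70)]
selected = screen_candidates((106.30, 29.60), library, topic_radius_km=5, main_word_radius_km=3)
```

With a 5 km thematic radius and a 3 km main word radius, the buffer radius is 5 km, so the points roughly 0 km and 3.3 km away are kept and the point roughly 11 km away is dropped. For a large library, a spatial index would replace the linear scan.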
It should be emphasized that the thematic classification screening radius and the main word search radius are not fixed parameters and can be adjusted according to the data in practical applications. Likewise, which of the two radii is larger depends on the actual application.
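The screening in steps 4.1 to 4.3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `StandardAddress` record, the haversine distance, the linear scan, and the use of a set intersection in place of the main word index are all assumptions made for the example. A production system would more likely use a spatial index for the buffer query.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


@dataclass(frozen=True)
class StandardAddress:
    """Hypothetical record for one entry of the standard address library."""
    code: str              # address code in the standard library
    text: str              # full address text
    lon: float             # longitude in degrees
    lat: float             # latitude in degrees
    main_words: frozenset  # main words extracted from the address


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def screen_candidates(origin, query_main_words, library,
                      thematic_radius_km, main_word_radius_km):
    """Return the candidate address set for one address to be associated."""
    # Step 4.1: the screening radius is the larger of the two radii.
    radius = max(thematic_radius_km, main_word_radius_km)

    # Step 4.2: keep every standard address inside the circular buffer.
    lon0, lat0 = origin
    in_buffer = [a for a in library
                 if haversine_km(lon0, lat0, a.lon, a.lat) <= radius]

    # Step 4.3: keep addresses sharing at least one main word with the query
    # (an assumed stand-in for a lookup in the main word index).
    return [a for a in in_buffer if a.main_words & query_main_words]
```

With a thematic radius of 5 km and a main word radius of 3 km, as in the example above, the function filters with a 5 km buffer.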
Step 5: determine the target address in the candidate address set.
The determination process is as follows:
Step 5.1: based on the candidate address set screened in step 4, construct a candidate address data index according to the eighteen-level address hierarchical model;
Step 5.2: search the candidate address data index for the word segmentation result of the address to be associated. If the result matches completely, that is, a candidate address identical to the address to be associated is found, that candidate is directly determined to be the target address; otherwise, proceed to step 5.3;
Step 5.3: for an incomplete match, search the candidate address data index again using the main word information of the address to be associated, and take the candidate addresses whose main words intersect with the main word of the address to be associated as the initial recommended candidate addresses.
For example, the address to be associated is "Longshan Road No. 101, Annex No. 1, Longshan Subdistrict, Yubei District, Chongqing" (main word: "Longshan Road No. 101, Annex No. 1"), and the candidate addresses are "Longshan Road No. 101, Longshan Subdistrict, Yubei District, Chongqing" (main word: "Longshan Road No. 101"), "Longshan Road No. 101, Annex No. 2, Longshan Subdistrict, Yubei District, Chongqing" (main word: "Longshan Road No. 101, Annex No. 2"), and "Longshan Road No. 101, Annex No. 3, Longshan Subdistrict, Yubei District, Chongqing" (main word: "Longshan Road No. 101, Annex No. 3"). The main words of all three candidates intersect with the main word of the address to be associated, so all three are taken as initial recommended candidate addresses.
Based on the word segmentation information of the address to be associated, the deeper the matched main word level (that is, the closer the matching level is to the eighteenth level), the higher the candidate ranks in the matching similarity ordering.
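Steps 5.2 and 5.3 can be sketched as below. The `Segmented` record, the representation of main words as a token-to-level mapping, and the level numbers used in any example are hypothetical; the text does not specify how the intersection of main words is computed, so a shared-token test is assumed here.

```python
from dataclasses import dataclass


@dataclass
class Segmented:
    """Hypothetical container for one segmented address."""
    text: str
    tokens: tuple     # word segmentation result, in order
    main_words: dict  # main word token -> level in the 18-level model (assumed)


def match_candidates(query, candidates):
    """Return (exact_match, initial_recommended_candidates)."""
    # Step 5.2: a candidate with an identical segmentation is the target.
    for cand in candidates:
        if cand.tokens == query.tokens:
            return cand, []

    # Step 5.3: keep candidates whose main words intersect the query's,
    # remembering the deepest level at which they agree.
    scored = []
    for cand in candidates:
        shared = set(query.main_words) & set(cand.main_words)
        if shared:
            deepest = max(query.main_words[w] for w in shared)
            scored.append((deepest, cand))

    # Deeper matching level ranks first; the sort is stable, so candidates
    # with equal levels keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return None, [cand for _, cand in scored]
```

Returning the exact match separately mirrors the early exit in step 5.2, so later ranking steps run only when no identical address exists.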
Step 5.4: within the same main word level, sort the initial recommended candidate addresses from near to far by their spatial distance to the address to be associated to obtain the final recommended candidate addresses, and output the top 10 (the number of recommended candidate addresses output can be set as required);
Step 5.5: calculate the text similarity between the address to be associated and each final recommended candidate address using an edit distance algorithm.
The edit distance between two strings is the minimum number of editing operations required to convert one string into the other. The allowed editing operations are replacing a character with another character, inserting a character, and deleting a character. The total number of characters replaced, inserted, and deleted is the edit distance between the two strings. The text similarity is then calculated as follows:
sim=1-dis/max(len(s1),len(s2)),
where sim is the text similarity between the character string s1 of the address to be associated and the character string s2 of the candidate address, dis is the edit distance between the two strings, and max(len(s1), len(s2)) is the length of the longer of the two strings.
For example, take the two addresses "Longshan Avenue No. 101, Longshan Subdistrict, Yubei District, Chongqing" and "Longshan Avenue, Longshan Subdistrict, Yubei District, Chongqing". In the original Chinese strings the two addresses differ only by the four characters of "No. 101" (101号), so dis = 4; max(len(s1), len(s2)) = 18, and sim = 1 - 4/18 ≈ 0.78.
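The similarity formula can be checked with a short sketch. The Levenshtein routine below is the standard two-row dynamic programming formulation; the patent does not say which implementation is used. The Chinese strings in the usage note are an assumed reconstruction of the machine-translated example, chosen because they reproduce the stated values dis = 4 and length 18.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance: minimum number of single-character
    substitutions, insertions, and deletions turning s1 into s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    prev = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        curr = [i]
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(s1: str, s2: str) -> float:
    """sim = 1 - dis / max(len(s1), len(s2)); two empty strings give 1.0."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / longest
```

For the assumed strings `"重庆市渝北区龙山街道龙山大道101号"` and `"重庆市渝北区龙山街道龙山大道"`, `edit_distance` returns 4 and `similarity` returns 1 - 4/18. Lengths are counted in characters, not bytes, which matters for Chinese text.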
Step 5.6: take the candidate address with the highest similarity value sim as the target address.
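Steps 5.4 and 5.6 together amount to a ranking followed by a selection, which can be sketched as below. The `Candidate` record and its fields are hypothetical, and the composite sort key (deeper main word level first, then nearer distance) is one reading of the ordering described above rather than a confirmed detail. The similarity function is passed in as a parameter so that any implementation of the edit distance formula can be used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    """Hypothetical initial recommended candidate address."""
    text: str
    level: int          # deepest main word level shared with the query
    distance_km: float  # distance to the address to be associated


def final_candidates(initial: List[Candidate], top_n: int = 10) -> List[Candidate]:
    """Step 5.4: deeper level first; within a level, nearer first."""
    ranked = sorted(initial, key=lambda c: (-c.level, c.distance_km))
    return ranked[:top_n]


def pick_target(query_text: str,
                initial: List[Candidate],
                sim: Callable[[str, str], float],
                top_n: int = 10) -> Optional[Candidate]:
    """Step 5.6: the shortlisted candidate with the highest similarity."""
    shortlist = final_candidates(initial, top_n)
    if not shortlist:
        return None
    # max() returns the first maximal element, so ties are resolved in
    # favour of the candidate ranked higher in step 5.4.
    return max(shortlist, key=lambda c: sim(query_text, c.text))
```

Keeping `top_n` as a parameter reflects the note that the number of recommended candidates output can be set as required.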
Step 6: establish an association table between the target address and the address to be associated. The address code of the target address and the unique element code of the address to be associated are matched one to one and written into the association table, which completes the association. A B-tree index or another index structure is then built to ensure efficient retrieval from the association table.
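A minimal sketch of such an association table, using Python's built-in `sqlite3` module, is shown below. The table name, column names, and sample codes are hypothetical. SQLite stores its indexes as B-trees, which matches the index structure named above; the choice to make only the element code unique (so that several thematic records may point to one standard address) is an assumption.

```python
import sqlite3


def create_association_table(conn: sqlite3.Connection) -> None:
    """Create the association table and its indexes (names are hypothetical)."""
    conn.execute("""
        CREATE TABLE address_association (
            element_code TEXT NOT NULL,  -- unique element code of the address to be associated
            address_code TEXT NOT NULL   -- address code of the target standard address
        )
    """)
    # Each thematic element is linked to exactly one standard address.
    conn.execute("CREATE UNIQUE INDEX idx_assoc_element "
                 "ON address_association(element_code)")
    # Speeds up the reverse lookup: all elements tied to one standard address.
    conn.execute("CREATE INDEX idx_assoc_address "
                 "ON address_association(address_code)")


def link(conn: sqlite3.Connection, element_code: str, address_code: str) -> None:
    """Write one association; a duplicate element code raises IntegrityError."""
    conn.execute("INSERT INTO address_association (element_code, address_code) "
                 "VALUES (?, ?)", (element_code, address_code))


def lookup_by_address(conn: sqlite3.Connection, address_code: str) -> list:
    """Return the element codes associated with one standard address."""
    rows = conn.execute(
        "SELECT element_code FROM address_association "
        "WHERE address_code = ? ORDER BY element_code", (address_code,))
    return [row[0] for row in rows]
```

The reverse lookup is what would support linked updates and cross-analysis across thematic data sets that share a standard address.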
In conclusion, the method solves the problem of establishing associations between the data of different industry departments and the target address data in the standard address library, achieves unified association of the data of all industries through standard addresses, and lays a foundation for subsequent linked updating and cross-analysis of that data. The method has been applied to establish associations between tens of millions of real standard addresses and various thematic data; the accuracy of the association results exceeds 90%, showing good accuracy and stability.
The technical solution provided by the present invention is described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A Chinese address association method based on space position and text training, characterized by comprising the following steps:
Step 1: acquiring address data to be associated, and preprocessing the data;
Step 2: performing word segmentation on the preprocessed address data to be associated using a conditional random field model, and performing part-of-speech tagging on the word segmentation result;
Step 3: performing main word recognition and extraction on the word segmentation result based on an eighteen-level address hierarchical model;
Step 4: screening a candidate address set from a standard address library according to a thematic classification screening radius and a main word search radius;
Step 5: determining a target address in the candidate address set;
Step 6: establishing an association table between the target address and the address to be associated.
2. The Chinese address association method based on space position and text training of claim 1, wherein: the preprocessing of the address data to be associated in step 1 comprises cleaning special characters, completing missing administrative regions, and cleaning meaningless data entered by users.
3. The Chinese address association method based on space position and text training of claim 1, wherein: the specific steps of performing word segmentation on the preprocessed address data to be associated using the conditional random field model in step 2 are as follows:
Step 2.1: based on the phrases in an initial sample lexicon, labeling the word position of each single character in the preprocessed address data to be associated using the conditional random field model;
Step 2.2: calculating the connection probability between single characters through the feature templates in the conditional random field model, training iteratively, and finally calculating the probabilities of the different word segmentation combinations;
Step 2.3: selecting the word segmentation combination with the highest probability as the word segmentation result.
4. The Chinese address association method based on space position and text training of claim 1, wherein: the part-of-speech tagging of the word segmentation result in step 2 is performed as follows:
Step S1: constructing a part-of-speech dependency template between phrases based on the eighteen-level address hierarchical model;
Step S2: during training of the conditional random field model, preliminarily classifying the part of speech of each phrase into the eighteen levels, and iterating according to the part-of-speech dependency template to determine the final level label;
Step S3: feeding the word segmentation results corresponding to the final level labels back into the initial sample lexicon to enrich it.
5. The Chinese address association method based on space position and text training of claim 1, 3 or 4, wherein: the training process of the conditional random field model comprises the following steps:
Step A1: annotating sample address data according to the eighteen-level address hierarchical model to obtain address annotation data;
Step A2: deriving feature templates from statistics and summaries of the address annotation data, and forming feature functions;
Step A3: training on Chinese addresses using the feature functions to obtain the conditional random field model.
6. The Chinese address association method based on space position and text training of claim 1, wherein: the process of identifying and extracting the main words from the word segmentation result in step 3 is as follows:
Step 3.1: where the word segmentation result contains main-word-level content, starting from the thirteenth level of the eighteen-level address hierarchical model, and, if several main words exist at the same level, extracting them one by one;
Step 3.2: if the thirteenth level is absent, moving up one level at a time until all main words are identified and extracted;
Step 3.3: where the word segmentation result contains no main-word-level content, the spatial range described by the address is too large and the address is of little practical value.
7. The Chinese address association method based on space position and text training of claim 1, wherein: the screening process of the candidate address set in step 4 is as follows:
Step 4.1: taking the larger of the thematic classification screening radius and the main word search radius as the screening radius;
Step 4.2: with the coordinate point of the address to be associated as the origin and the screening radius as the buffer radius, drawing a circular buffer and selecting all standard library address data within it;
Step 4.3: from the selected standard address data, screening out the candidate address set through the main word index.
8. The Chinese address association method based on space position and text training of claim 1, wherein: the determination process of the target address in step 5 is as follows:
Step 5.1: based on the candidate address set screened in step 4, constructing a candidate address data index according to the eighteen-level address hierarchical model;
Step 5.2: searching the candidate address data index for the word segmentation result of the address to be associated; if the result matches completely, that is, a candidate address identical to the address to be associated is found, directly determining that candidate to be the target address; otherwise, proceeding to step 5.3;
Step 5.3: searching the candidate address data index again using the main word information of the address to be associated, and taking the candidate addresses whose main words intersect with the main word of the address to be associated as initial recommended candidate addresses;
Step 5.4: within the same main word level, sorting the initial recommended candidate addresses from near to far by their spatial distance to the address to be associated, and taking a number of the top-ranked ones as the final recommended candidate addresses;
Step 5.5: calculating the text similarity between the address to be associated and each final recommended candidate address using an edit distance algorithm;
Step 5.6: taking the candidate address with the highest similarity value as the target address.
9. The Chinese address association method based on space position and text training of claim 1 or 8, wherein: the calculation formula of the edit distance algorithm is as follows:
sim=1-dis/max(len(s1),len(s2)),
where sim is the text similarity between the character string s1 of the address to be associated and the character string s2 of the candidate address, dis is the edit distance between the two strings, and max(len(s1), len(s2)) is the length of the longer of the two strings.
10. The Chinese address association method based on space position and text training of claim 1, wherein: in the process of establishing the association table in step 6, the address code of the standard address and the unique element code of the address to be associated are matched one to one and written into the association table, and an index is built to ensure the retrieval efficiency of the association table.
CN202011409893.7A 2020-12-04 2020-12-04 Chinese address association method based on space position and text training Pending CN112527933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011409893.7A CN112527933A (en) 2020-12-04 2020-12-04 Chinese address association method based on space position and text training

Publications (1)

Publication Number Publication Date
CN112527933A true CN112527933A (en) 2021-03-19

Family

ID=74997661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011409893.7A Pending CN112527933A (en) 2020-12-04 2020-12-04 Chinese address association method based on space position and text training

Country Status (1)

Country Link
CN (1) CN112527933A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933023A (en) * 2015-05-12 2015-09-23 深圳市华傲数据技术有限公司 Chinese address word segmentation and annotation method
CN111209362A (en) * 2020-01-07 2020-05-29 苏州城方信息技术有限公司 Address data analysis method based on deep learning
CN111625732A (en) * 2020-05-25 2020-09-04 鼎富智能科技有限公司 Address matching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵英: "基于条件随机场的中文地址解析", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361233A (en) * 2021-06-08 2021-09-07 广州城市规划技术开发服务部有限公司 Standard address and building association matching method and device
CN113361233B (en) * 2021-06-08 2024-01-26 广州城市规划技术开发服务部有限公司 Standard address and building association matching method and device
CN113656531A (en) * 2021-08-12 2021-11-16 南方电网数字电网研究院有限公司 Processing method and device for power grid address structuralization
CN113656531B (en) * 2021-08-12 2024-06-14 南方电网数字电网研究院有限公司 Power grid address structuring processing method and device
CN114003812A (en) * 2021-10-29 2022-02-01 深圳壹账通智能科技有限公司 Address matching method, system, device and storage medium
CN116402050A (en) * 2022-12-26 2023-07-07 北京码牛科技股份有限公司 Address normalization and supplement method and device, electronic equipment and storage medium
CN116402050B (en) * 2022-12-26 2023-11-10 北京码牛科技股份有限公司 Address normalization and supplement method and device, electronic equipment and storage medium
CN116258141A (en) * 2023-05-16 2023-06-13 青岛海信网络科技股份有限公司 Text data processing method, server and device
CN116258141B (en) * 2023-05-16 2023-09-26 青岛海信网络科技股份有限公司 Text data processing method, server and device
CN116992294A (en) * 2023-09-26 2023-11-03 成都国恒空间技术工程股份有限公司 Satellite measurement and control training evaluation method, device, equipment and storage medium
CN116992294B (en) * 2023-09-26 2023-12-19 成都国恒空间技术工程股份有限公司 Satellite measurement and control training evaluation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112527933A (en) Chinese address association method based on space position and text training
CN106777275B (en) Entity attribute and property value extracting method based on more granularity semantic chunks
CN106777274B (en) A kind of Chinese tour field knowledge mapping construction method and system
CN112329467B (en) Address recognition method and device, electronic equipment and storage medium
CN110298042A (en) Based on Bilstm-crf and knowledge mapping video display entity recognition method
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN110781670B (en) Chinese place name semantic disambiguation method based on encyclopedic knowledge base and word vectors
CN104933039A (en) Entity link system for language lacking resources
CN107004000A (en) A kind of language material generating means and method
CN113191148B (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN111625732B (en) Address matching method and device
CN104679885A (en) User search string organization name recognition method based on semantic feature model
CN111339318B (en) University computer basic knowledge graph construction method based on deep learning
CN110888989B (en) Intelligent learning platform and construction method thereof
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
CN106897274B (en) Cross-language comment replying method
CN110781681A (en) Translation model-based elementary mathematic application problem automatic solving method and system
CN108733810A (en) A kind of address date matching process and device
CN114780680A (en) Retrieval and completion method and system based on place name and address database
CN114662495A (en) English literature pollutant information extraction method based on deep learning
CN115438674A (en) Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN115630648A (en) Address element analysis method and system for man-machine conversation and computer readable medium
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN112528642B (en) Automatic implicit chapter relation recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210319