CN111460069A - Address correction method based on weighted directed acyclic graph - Google Patents

Address correction method based on weighted directed acyclic graph Download PDF

Info

Publication number
CN111460069A
CN111460069A CN202010241121.0A CN202010241121A CN111460069A CN 111460069 A CN111460069 A CN 111460069A CN 202010241121 A CN202010241121 A CN 202010241121A CN 111460069 A CN111460069 A CN 111460069A
Authority
CN
China
Prior art keywords
address
acyclic graph
directed acyclic
weighted
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010241121.0A
Other languages
Chinese (zh)
Inventor
姚伟烈
彭艳兵
邱秀连
许云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fenghuo Tiandi Communication Technology Co ltd
Original Assignee
Nanjing Fenghuo Tiandi Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fenghuo Tiandi Communication Technology Co ltd filed Critical Nanjing Fenghuo Tiandi Communication Technology Co ltd
Priority to CN202010241121.0A priority Critical patent/CN111460069A/en
Publication of CN111460069A publication Critical patent/CN111460069A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29Geographical information databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an address correction method based on a weighted directed acyclic graph, which relates to the field of data management and specifically comprises the following steps: performing tail-free processing on the address data subjected to pre-normalization, and performing TF-IDF feature word extraction; calculating an address similarity threshold based on the extracted feature words and the weights; calculating similarity between addresses in the tailless address base and the pre-normalized addresses: performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph; and generating a directed acyclic graph based on the calculation of the edge weight data, outputting a weighted maximum path, and completing address normalization. The address completion correction algorithm provided by the invention can effectively solve the problems and can solve the address normalization problem on the basis of no standard address level division data.

Description

Address correction method based on weighted directed acyclic graph
Technical Field
The invention belongs to the field of data management, and particularly relates to an address rectification method based on a weighted directed acyclic graph.
Background
In the field of data management, address normalization analysis is often involved, the most common normalized address data is logistics address data, early logistics address information is input by self, and the problem that address information is not standard due to the fact that people miss part of address information or fill part of address information by mistake is solved, so that an address normalization method is provided.
At present, most address normalization methods are fast normalization methods based on a dictionary tree, wherein each layer of the dictionary tree represents an address level, and the final normalized address is obtained by traversing from a root node to a leaf node. The main drawback of this method is the dependence on standard and complete address-level partitioning data for constructing the dictionary tree, and secondly the space consumption of the algorithm is excessive.
Disclosure of Invention
The invention aims to solve the technical problem of providing an address rectification method based on a weighted directed acyclic graph aiming at the defects of the background technology, wherein when no standard and complete address grade divides data, the weighted directed acyclic graph is established through the redundancy relation among mass data, and the weighted maximum path of the directed acyclic graph is the address after normalization.
The invention adopts the following technical scheme for solving the technical problems:
an address rectification method based on a weighted directed acyclic graph specifically comprises the following steps:
step 1, performing tailless processing on address data subjected to pre-normalization, and performing TF-IDF feature word extraction;
step 2, calculating an address similarity threshold value based on the extracted feature words and the weight;
step 3, calculating the similarity between the address in the non-tail address base and the pre-normalized address:
step 4, performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph;
and 5, generating a directed acyclic graph based on the calculated edge weight data, outputting a weighted maximum path, and completing address normalization.
As a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, the step 1 specifically includes:
carrying out tailless processing on the address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF characteristic word extraction { (F)1,W1),(F2,W2),...,(Fn,Wn) In which FnRepresenting a characteristic word, WnThe TF-IDF value representing the feature word.
As a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, the step 2 is specifically as follows: calculating an address similarity threshold θ based on the extracted TF-IDF feature words and a weight FW, wherein FW { (F)1,W1),(F2,W2),...,(Fn,Wn) And, calculating specifically as follows:
Figure BDA0002432599980000021
as a further preferable solution of the address rectification method based on the weighted directed acyclic graph of the present invention, in step 3, the similarity W between the address (Sadd) in the tailless address base and the pre-normalized address is calculated, which is specifically as follows:
Figure BDA0002432599980000022
extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S1,s2,...,s10}。
As a further preferable scheme of the address rectification method based on the weighted directed acyclic graph, in step 4, word segmentation analysis is carried out on 11 addresses S and preprocessed addresses which meet the requirement that the similarity W is more than or equal to theta, and the two addresses are combined in sequence to form an edge set of the directed graph:
Figure BDA0002432599980000023
all the above-mentioned edge data are formed into edge weight data E, E { < w { [1,w2,c1>,<w2,w3,c2>,...,<wk-1,wk,c3> -, where w1,w2Respectively representing the start node of an edge, c1Representing the weight of an edge, i.e. w1,w2The number of occurrences.
As a further preferable scheme of the address rectification method based on the weighted directed acyclic graph of the present invention, in step 5, a directed acyclic graph G is generated based on the calculated edge weight data E, a weighted maximum path P is output, and address normalization is completed:
Figure BDA0002432599980000024
wherein the node information on the P path is the normalized result of the address Add, i.e. w1w2...wp
Compared with the prior art, the invention adopting the technical scheme has the following technical effects:
1. aiming at the defects that address normalization excessively depends on a standard and complete address grade data division, the invention provides a weighted directed acyclic graph established based on the redundancy relation among massive address data, wherein the weighted maximum path of the directed acyclic graph is a normalized labeled standard address;
2. the address normalization method is a method for optimizing address by segments, words forming the address are sequentially combined and divided into address sections, and the optimal address section relation forms the final normalized address, so that the error rate of address normalization can be reduced, and the better the address normalization effect is improved; when no standard and complete address grade division data exists, the weighted directed acyclic graph is established through the redundancy relation among the mass data, and the weighted maximum path of the directed acyclic graph is the normalized address.
Drawings
FIG. 1 is a flow chart of an address completion correction algorithm for a weighted directed acyclic graph of the present invention;
FIG. 2 is a weighted directed acyclic graph constructed based on an address in accordance with the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the attached drawings:
the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the defects that address normalization excessively depends on a standard and complete address grade data division, the invention provides a weighted directed acyclic graph established based on the redundancy relation among massive address data, wherein the weighted maximum path of the directed acyclic graph is a normalized labeled standard address.
FIG. 1 is a flow chart of an address completion correction algorithm for a weighted directed acyclic graph, including a precedence relationship between data processing modules; fig. 2 is a weighted directed acyclic graph constructed based on an address, in which part of the private information has been replaced with a #.
An address rectification method based on a weighted directed acyclic graph is shown in fig. 1, and specifically includes the following steps:
step 1, carrying out tailless processing on address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF feature word extraction { (F)1,W1),(F2,W2),...,(Fn,Wn) In which FiRepresenting a characteristic word, WiA TF-IDF value representing the feature word;
step 2, calculating an address similarity threshold value theta based on the extracted TF-IDF characteristic words and a weight FW, wherein FW { (F)1,W1),(F2,W2),...,(Fn,Wn) And, calculating specifically as follows:
Figure BDA0002432599980000031
step 3, calculating the similarity W between the address (Sadd) in the non-tail address base and the pre-normalized address, which is as follows:
Figure BDA0002432599980000032
extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S1,s2,...,s10};
Step 4, performing word segmentation analysis on 11 addresses including the first 10S and the preprocessing address which meet the similarity W being more than or equal to theta, and combining every two addresses in sequence to form an edge set of the directed graph:
Figure BDA0002432599980000041
all the above-mentioned edge data are formed into edge weight data E, E { < w { [1,w2,c1>,<w2,w3,c2>,...,<wk-1,wk,c3> -, where w1,w2Respectively representing the start node of an edge, c1Representing the weight of an edge, i.e. w1,w2The number of occurrences;
step 5, generating a directed acyclic graph G based on the calculated edge weight data E, and outputting a weighted maximum path P based on a weighted directed acyclic graph constructed on a certain address, as shown in fig. 2, to complete address normalization:
Figure BDA0002432599980000042
the normalization result w of whether the nodes on the weighted maximum path are sequentially connected or the address Add is1w2...wp
The details of the technical solution of the invention are illustrated with reference to the accompanying drawings:
fig. 1 shows the overall process of the address completion algorithm, for example, the address to be completed is "oasis home yard in large watt kiln" unit cell "; the result after deleting the tail information is: "Luzhou home garden on the way in a big tile kiln"; the result of the word segmentation TF-IDF without the tail address is: 2.987, 'homeland', 2.178, 'oasis', 2.108, 'middle road', 2.064 }; the similarity threshold is: 5.051, respectively; the top 10 tailless addresses that meet the threshold are shown in table 1.
TABLE 1
Serial number Tailless address
1 Middle road of large tile kiln in Toyotai district of Beijing City
2 Middle road of large tile kiln in Toyotai district of Beijing City
3 Large tile kiln middle road of Fengtai district of Haizui district of Beijing
4 Zhonglu Fengze home garden with large tile kiln in Fengtai district of Beijing City
5 Great watt kiln green land home garden between five rings and six rings of Fengtai district in Beijing City
6 Luzhou home garden of Lugou bridge tile kiln Zhongyao of Fengtai district of Beijing
7 Large tile kiln middle road at west road junction of Toyotai district of Beijing City
8 Great tile kiln oasis home garden in Fengtai district of Beijing City
9 Middle road of large tile kiln in east city of Beijing City
10 Middle road of Rugou bridge large tile kiln in Fengtu platform area of Beijing City
The set of directed edge weights obtained by word segmentation is shown in table 2:
TABLE 2
Directed graph edges Directed graph edges Directed graph edges Directed graph edges
<Beginning, Beijing, 10> <Dongchuan area, large brickkiln, 1> <To, six rings, 1> <Feng ze, Home 1>
<Start, large brickkiln, 1> <Fengcai, Lugou bridge, 3> <Six rings, between, 1> <Oasis, home, 4>
<Beijing City, Fengtai, 1> <Rugou bridge, large brickkiln, 3> <Chinese, Large Tile kiln 1> <Zhongbaodao, oasis, 1>
<Beijing City, Tokyo, 1> <Fengcai, West, 1> <Fengtai district, Large Tile kiln, 3> <Home, date, 1>
<Beijing City, Fengtai district, 7> <West, crossing, 1> <Large tile kiln, middle road, 8> <Period, cell, 1>
<Beijing, Haihe district, 1> <Level crossing, large brickkiln, 1> <Large Tile kiln, Medium court, 1> <Cell, 1>
<Fengtai, Fengtai district, 1> <Fengcai region, five rings, 1> <Zhongluo, oasis, 1> <Chamber, end, 1>
<Haihai region, large brickkiln, 1> <Pentacyclic ring to, 1> <Zhonglu, Fengze, 1>
The formed weighted directed acyclic graph is shown in fig. 2 in the attached drawing, and finally the most weighted path node value is output as the result of address normalization, that is: "the Luzhou home garden of Lugou bridge large tile kiln of Fengti district of Beijing unit.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention. While the embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (6)

1. A method for correcting addresses based on a weighted directed acyclic graph is characterized in that: the method specifically comprises the following steps:
step 1, performing tailless processing on address data subjected to pre-normalization, and performing TF-IDF feature word extraction;
step 2, calculating an address similarity threshold value based on the extracted feature words and the weight;
step 3, calculating the similarity between the address in the non-tail address base and the pre-normalized address:
step 4, performing word segmentation analysis on the address of the preprocessed address, and then combining every two in sequence to form an edge set of the directed graph;
and 5, generating a directed acyclic graph based on the calculated edge weight data, outputting a weighted maximum path, and completing address normalization.
2. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: the step 1 is specifically as follows:
carrying out tailless processing on the address data subjected to pre-normalization, deleting unit building number information behind the address to obtain a tailless address (Add), segmenting words of the Add, and carrying out TF-IDF characteristic word extraction { (F)1,W1),(F2,W2),...,(Fn,Wn) In which FnRepresenting a characteristic word, WnThe TF-IDF value representing the feature word.
3. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: the step 2 is specifically as follows: calculating an address similarity threshold θ based on the extracted TF-IDF feature words and a weight FW, wherein FW { (F)1,W1),(F2,W2),...,(Fn,Wn) And, calculating specifically as follows:
Figure FDA0002432599970000011
4. the address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 3, the similarity W between the address (Sadd) in the tailless address base and the pre-normalized address is calculated as follows:
Figure FDA0002432599970000012
extracting the first 10 addresses S satisfying the similarity W ≧ theta, wherein S ═ S1,s2,...,s10}。
5. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 4, performing word segmentation analysis on 11 addresses including the first 10 addresses S and the preprocessing address satisfying the similarity W being more than or equal to theta, and combining every two addresses in sequence to form an edge set of the directed graph:
Figure FDA0002432599970000013
Figure FDA0002432599970000014
......
Figure FDA0002432599970000015
all the above-mentioned edge data are formed into edge weight data E, E { < w { [1,w2,c1>,<w2,w3,c2>,...,<wk-1,wk,c3> -, where w1,w2Respectively representing the start node of an edge, c1Representing the weight of an edge, i.e. w1,w2The number of occurrences.
6. The address rectification method based on the weighted directed acyclic graph according to claim 1, wherein: in step 5, generating a directed acyclic graph G based on the calculation edge weight data E, outputting a weighted maximum path P, and completing address normalization:
Figure FDA0002432599970000021
wherein the node information on the P path is the normalized result of the address Add, i.e. w1w2...wp
CN202010241121.0A 2020-03-31 2020-03-31 Address correction method based on weighted directed acyclic graph Pending CN111460069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010241121.0A CN111460069A (en) 2020-03-31 2020-03-31 Address correction method based on weighted directed acyclic graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010241121.0A CN111460069A (en) 2020-03-31 2020-03-31 Address correction method based on weighted directed acyclic graph

Publications (1)

Publication Number Publication Date
CN111460069A true CN111460069A (en) 2020-07-28

Family

ID=71685100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010241121.0A Pending CN111460069A (en) 2020-03-31 2020-03-31 Address correction method based on weighted directed acyclic graph

Country Status (1)

Country Link
CN (1) CN111460069A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970518A (en) * 2022-02-15 2022-08-30 北京青萌数海科技有限公司 Method and device for correcting address data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804398A (en) * 2017-05-03 2018-11-13 阿里巴巴集团控股有限公司 The similarity calculating method and device of address text
CN110175216A (en) * 2019-05-15 2019-08-27 腾讯科技(深圳)有限公司 Coordinate error correction method, device and computer equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804398A (en) * 2017-05-03 2018-11-13 阿里巴巴集团控股有限公司 The similarity calculating method and device of address text
CN110175216A (en) * 2019-05-15 2019-08-27 腾讯科技(深圳)有限公司 Coordinate error correction method, device and computer equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970518A (en) * 2022-02-15 2022-08-30 北京青萌数海科技有限公司 Method and device for correcting address data
CN114970518B (en) * 2022-02-15 2022-12-16 北京青萌数海科技有限公司 Method and device for correcting address data

Similar Documents

Publication Publication Date Title
CN110941722A (en) Knowledge graph fusion method based on entity alignment
CN110955780A (en) Entity alignment method for knowledge graph
WO2020119619A1 (en) Network optimization structure employing 3d target classification and scene semantic segmentation
CN108614997B (en) Remote sensing image identification method based on improved AlexNet
CN109636817B (en) Lung nodule segmentation method based on two-dimensional convolutional neural network
CN103377237B (en) The neighbor search method of high dimensional data and fast approximate image searching method
Zhang et al. Development of a supervised software tool for automated determination of optimal segmentation parameters for ecognition
CN104281617A (en) Domain knowledge-based multilayer association rules mining method and system
CN113159451B (en) Long-term prediction method for drainage basin drought and flood events based on event knowledge graph construction
WO2022262320A1 (en) Information completion method for knowledge graph-based power distribution network cim model, and system
CN104317908B (en) Outlier detection method based on three decisions and distance
CN111597331A (en) Judgment document classification method based on Bayesian network
CN111460069A (en) Address correction method based on weighted directed acyclic graph
CN103729467A (en) Community structure discovery method in social network
CN108268460A (en) A kind of method for automatically selecting optimal models based on big data
CN113761221A (en) Knowledge graph entity alignment method based on graph neural network
CN104346442B (en) A kind of Rules extraction method of Process-Oriented object data
CN114491200A (en) Method and device for matching heterogeneous interest points based on graph neural network
CN116089627A (en) Knowledge graph construction method based on big data technology
CN111192154A (en) Social network user node matching method based on style migration
CN113807366B (en) Point cloud key point extraction method based on deep learning
CN110399455A (en) A kind of deep learning data digging method based on CNN and LSTM
CN113505261A (en) Data annotation method and device and data annotation model training method and device
CN103164487A (en) Clustering algorithm based on density and geometrical information
CN113159044A (en) Deep learning-based road material identification method for convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination