CN112364113A - Address error correction method and system - Google Patents

Address error correction method and system

Info

Publication number
CN112364113A
Authority
CN
China
Prior art keywords
address
error correction
place
place name
place names
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011271106.7A
Other languages
Chinese (zh)
Inventor
陈奇宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202011271106.7A
Publication of CN112364113A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application relates to an address error correction method and system. The method comprises: a data acquisition step of acquiring address data to be corrected; an administrative division address error correction step of segmenting the address data into words according to a place name dictionary, identifying wrong place names among the first three levels of administrative division place names according to an address tree, and correcting them through full-text retrieval and similarity comparison; and a detailed address error correction step of segmenting the address data output by the administrative division address error correction step into place names according to an address standardized segmentation model, identifying wrong place names in the detailed address according to a detailed address index, and correcting them through full-text retrieval and similarity comparison. The method and system improve the quality of address data so that its value can be better exploited.

Description

Address error correction method and system
Technical Field
The present application relates to the field of internet technologies, and in particular, to an address error correction method and system.
Background
At present, many enterprises and departments hold large amounts of address data, whose value as the most important information in the spatial dimension is self-evident. However, address data comes from many sources: some addresses are selected by users in an app, but many are entered manually or recognized by OCR from photographs. Such addresses are prone to place name errors, which makes the address data difficult to use effectively later. It is therefore necessary to correct erroneous addresses using NLP techniques.
Because Chinese place names are extremely numerous, especially village names, residential community names, and road names, duplicate and similar names are very common, which makes place name error correction difficult. Some prior-art text error correction techniques are described below.
The N-Gram language model segments text into words and counts adjacent words: a statistical model over pairs of adjacent words is a bi-Gram, and a statistical model over three adjacent words is a tri-Gram. Counting word frequencies over correct text yields the adjacency relations between words, from which the conditional probability of word adjacency can be computed and used to judge whether adjacent words in a new text are plausible. If adjacent words score low, the text is likely erroneous; the most likely candidate neighbors are then obtained from the adjacency relations, and the candidate with the highest probability, computed using edit distance, is taken as the most likely correct text.
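The bi-Gram scoring described above can be sketched as follows. This is a minimal illustration and not code from the application: the function names, the toy corpus, and the add-alpha smoothing are assumptions made for the example.

```python
from collections import Counter

def train_bigram(corpus):
    """Count single words and adjacent word pairs over pre-segmented texts."""
    unigrams, bigrams = Counter(), Counter()
    for words in corpus:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def cond_prob(unigrams, bigrams, prev, word, vocab_size, alpha=1.0):
    """Estimate P(word | prev) with add-alpha smoothing (an assumed choice)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

# Hypothetical, already segmented training addresses.
corpus = [
    ["北京市", "海淀区", "中关村大街"],
    ["北京市", "朝阳区", "建国路"],
    ["上海市", "浦东新区", "世纪大道"],
]
unigrams, bigrams = train_bigram(corpus)
vocab_size = len(unigrams)

# A pair seen in training scores higher than a pair never seen together.
p_seen = cond_prob(unigrams, bigrams, "北京市", "海淀区", vocab_size)
p_unseen = cond_prob(unigrams, bigrams, "北京市", "浦东新区", vocab_size)
```

A low score such as `p_unseen` flags a suspicious adjacency, but, as the following paragraphs explain, the model only sees neighboring words and has no explicit notion of which names are valid in which region.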
This approach is commonly used for text error correction; it works well on ordinary text and can locate erroneous text accurately. Address data is different: place names are extremely numerous, village names, residential community names, and road names are often similar, and place names are strongly regional, so a name that is valid in one city may be an error in another. For example, "longyun" is valid in Beijing but wrong in Shanghai, where it should be "Longyun". In this situation an ordinary N-Gram language model has difficulty capturing the regional information of the corresponding city and cannot correct the place name.
In addition, address data often contains omissions (for example, the township or town is left out and the road or residential community is written directly) and many abbreviated place names. An N-Gram model trained on ordinary standard addresses therefore has little effect.
Furthermore, wrong place names are difficult to segment correctly, and wrong segmentation results strongly affect the N-Gram model. When the N-Gram model is used to correct place names on top of word segmentation, it tends either to misjudge correct names or to miss many place names that need correction.
Disclosure of Invention
The embodiments of the present application provide an address-tree-based address error correction method, system, computer device, and computer-readable storage medium that correct wrong place names both in the first three levels of the address and in the detailed address, thereby improving the quality of address data so that its value can be better exploited.
In a first aspect, an embodiment of the present application provides an address error correction method, including:
a data acquisition step for acquiring an address to be corrected;
an administrative division address error correction step of segmenting the address data into words according to a place name dictionary, identifying wrong place names among the first three levels of administrative division place names according to an address tree, and correcting them through full-text retrieval and similarity comparison, wherein the first three levels of place names mainly refer to the levels above the township: provinces and municipalities, prefecture-level cities, and districts and counties;
and a detailed address error correction step of segmenting the address data output by the administrative division address error correction step into place names according to an address standardized segmentation model, identifying wrong place names in the detailed address according to a detailed address index, and correcting them through full-text retrieval and similarity comparison.
Through the above steps, on the basis of layering the address by level, the administrative division information (province, city, county) and the detailed address information are corrected separately: place names are corrected after word segmentation and place name segmentation of the address data, and the relationships between address levels are corrected by means of the address tree.
In some embodiments, the method further includes an address database establishing step, configured to collect address data in advance and establish an address database, where the address database at least includes: one or any combination of the place name dictionary, the address tree, an address index and the detailed address index.
In some embodiments, the administrative division address error correction step further includes:
an administrative division address word segmentation step of performing forward maximum matching word segmentation on the address data according to the place name dictionary to obtain a word segmentation list containing the segmented place name entries;
a word segmentation position recognition step of matching the word segmentation list against the place name dictionary to obtain the first three levels of place names in the word segmentation list;
a wrong place name identification step of verifying the first three levels of place names against the address tree and identifying the wrong place names among them;
and a first-three-level place name error correction step of correcting the wrong place names among the first three levels of place names through full-text retrieval over the address index and similarity comparison, taking the place name in the address index with the highest similarity to each wrong place name as the correct place name.
Through the above steps, the first three levels of place names are identified and corrected, and each corrected place name is both a valid place name and consistent with the place name hierarchy, which improves the correction accuracy of the embodiments.
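The forward maximum matching segmentation named in the steps above can be sketched as follows. This is a generic illustration of the technique under stated assumptions: the function name, the tiny dictionary, and the fallback of emitting single characters for unmatched text are choices made for the example, not details given in the application.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary entry that matches; fall back to a single character."""
    max_len = max(len(word) for word in dictionary)
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                pos += size
                break
    return tokens

# Hypothetical place name dictionary holding both full and suffix-free names.
place_names = {"北京", "北京市", "海淀", "海淀区", "中关村"}
tokens = forward_max_match("北京市海淀区中关村大街1号", place_names)
# tokens == ["北京市", "海淀区", "中关村", "大", "街", "1", "号"]
```

Because the longest entry wins, "北京市" is preferred over "北京" whenever both match at the same position; text not covered by the dictionary is left as single characters for later steps to handle.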
In some embodiments, the detailed address error correction step further comprises:
a detailed address segmentation step of segmenting the address data output by the first-three-level place name error correction step into place names based on an address standardized segmentation model to obtain a segmentation result;
a detailed address checking step of checking the place names in the segmentation result against the detailed address index to obtain the wrong place names in the detailed address;
and a detailed address place name error correction step of performing full-text retrieval over the detailed address index and similarity comparison for each wrong place name in the detailed address, and taking the place name in the detailed address index with the highest similarity to the wrong place name as the correct place name.
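The checking and correction of detailed address place names can be sketched as follows. The sketch makes several assumptions that are not in the application: the detailed address index is modeled as a plain dictionary keyed by (city, district), the segmentation result is given as a ready-made list, and Python's `difflib.get_close_matches` stands in for full-text retrieval plus similarity comparison.

```python
from difflib import get_close_matches

# Hypothetical detailed address index: known place names per (city, district).
DETAIL_INDEX = {
    ("北京市", "海淀区"): {"中关村大街", "学院路", "知春路"},
    ("上海市", "浦东新区"): {"世纪大道", "张杨路"},
}

def check_and_correct(region, segments, index, cutoff=0.5):
    """Keep segments found in the region's index; replace an unknown segment
    with its closest known name if one is similar enough."""
    known = index.get(region, set())
    corrected = []
    for segment in segments:
        if segment in known:
            corrected.append(segment)
            continue
        match = get_close_matches(segment, sorted(known), n=1, cutoff=cutoff)
        corrected.append(match[0] if match else segment)
    return corrected

fixed = check_and_correct(("北京市", "海淀区"), ["中关村大接", "27号"], DETAIL_INDEX)
# fixed == ["中关村大街", "27号"]
```

Scoping the lookup to the already corrected administrative division is what lets the same string be accepted in one city and corrected in another; the `cutoff` value is an arbitrary illustrative threshold.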
In some embodiments, the address database establishing step further comprises:
a place name dictionary obtaining step, which is used for obtaining the place name of each level of the administrative division and establishing a place name dictionary;
an address tree obtaining step, configured to perform place name expansion on the place name according to a hierarchy suffix of the place name, and establish a dependency relationship between hierarchy place names to obtain the address tree;
an address index construction step, configured to simplify suffixes of place names of the first three levels in the administrative division to obtain simplified place names, and establish a full-text index between the simplified place names and the place names of the first three levels to obtain the address index;
and a detailed address index construction step, which is used for establishing a full-text index for the detailed address place name in the administrative division to obtain the detailed address index.
In some of these embodiments, the similarity comparison is performed by calculating an edit distance and/or a Jaccard similarity measure.
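The two similarity measures can be sketched as follows. The edit distance is the standard Levenshtein distance and the Jaccard measure is taken over character sets; the equal-weight combination in `similarity` is an assumption made for illustration, since the application does not say how the two measures are combined.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity of the two strings' character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def similarity(a, b):
    """Assumed combination: average of normalized edit similarity and Jaccard."""
    longest = max(len(a), len(b)) or 1
    return 0.5 * (1 - edit_distance(a, b) / longest) + 0.5 * jaccard(a, b)

dist = edit_distance("海定区", "海淀区")   # one substituted character
jac = jaccard("海定区", "海淀区")          # 2 shared of 4 distinct characters
```

Edit distance is sensitive to character order, while character-set Jaccard is not, so the two measures penalize different kinds of errors.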
In a second aspect, an embodiment of the present application provides an address error correction system, including:
a data acquisition module, configured to acquire an address to be corrected;
an administrative division address error correction module, configured to segment the address data into words according to a place name dictionary, identify wrong place names among the first three levels of administrative division place names according to an address tree, and correct them through full-text retrieval and similarity comparison, wherein the first three levels of place names mainly refer to the levels above the township: provinces and municipalities, prefecture-level cities, and districts and counties;
and a detailed address error correction module, configured to segment the address data output by the administrative division address error correction module into place names according to an address standardized segmentation model, identify wrong place names in the detailed address according to a detailed address index, and correct them through full-text retrieval and similarity comparison.
Through the above modules, on the basis of layering the address by level, the administrative division information (province, city, county) and the detailed address information are corrected separately: place names are corrected after word segmentation and place name segmentation of the address data, and the relationships between address levels are corrected by means of the address tree.
In some embodiments, the system further includes an address database establishing module, configured to collect address data in advance and establish an address database, where the address database at least includes: one or any combination of the place name dictionary, the address tree, an address index and the detailed address index.
In some embodiments, the administrative division address error correction module further comprises:
an administrative division address word segmentation module, configured to perform forward maximum matching word segmentation on the address data according to the place name dictionary to obtain a word segmentation list containing the segmented place name entries;
a word segmentation position recognition module, configured to match the word segmentation list against the place name dictionary to obtain the first three levels of place names in the word segmentation list;
a wrong place name identification module, configured to verify the first three levels of place names against the address tree and identify the wrong place names among them;
and a first-three-level place name error correction module, configured to correct the wrong place names among the first three levels of place names through full-text retrieval over the address index and similarity comparison, taking the place name in the address index with the highest similarity to each wrong place name as the correct place name.
Through the above modules, the first three levels of place names are identified and corrected, and each corrected place name is both a valid place name and consistent with the place name hierarchy, which improves the correction accuracy of the embodiments.
In some embodiments, the detailed address error correction module further comprises:
a detailed address segmentation module, configured to segment the address data output by the first-three-level place name error correction module into place names based on an address standardized segmentation model to obtain a segmentation result;
a detailed address checking module, configured to check the place names in the segmentation result against the detailed address index to obtain the wrong place names in the detailed address;
and a detailed address place name error correction module, configured to perform full-text retrieval over the detailed address index and similarity comparison for each wrong place name in the detailed address, and to take the place name in the detailed address index with the highest similarity to the wrong place name as the correct place name.
In some embodiments, the address database establishment module further comprises:
a place name dictionary acquisition module, configured to acquire the place names at each level of the administrative divisions and establish the place name dictionary;
an address tree acquisition module, configured to expand the place names according to their hierarchy suffixes and establish the dependency relationships among the place names of each level to obtain the address tree;
an address index building module, configured to simplify the suffixes of the first three levels of place names in the administrative divisions to obtain simplified place names, and to build a full-text index over the simplified place names and the first three levels of place names to obtain the address index;
and a detailed address index building module, configured to build a full-text index over the detailed address place names in the administrative divisions to obtain the detailed address index.
In some of these embodiments, the similarity comparison is performed by calculating an edit distance and/or a Jaccard similarity measure.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the address error correction method according to the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the address error correction method according to the first aspect.
Compared with the related art, the address-tree-based address error correction method, system, computer device, and computer-readable storage medium provided by the embodiments solve the prior-art problem that place names cannot be corrected when similar names or region-specific information are involved, and improve the accuracy of place name error correction. In addition, the corrected address data can be used much more efficiently, which further reduces the cost of processing address data in applications.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart illustrating an address error correction method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating a sub-step of an address error correction method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating another substep of an address error correction method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating another substep of an address error correction method according to an embodiment of the present application;
FIG. 5 is a block diagram illustrating the structure of an address error correction system according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an address tree in accordance with a preferred embodiment of the present application;
FIG. 7 is a flowchart illustrating an address error correction method according to a preferred embodiment of the present application.
Description of the drawings:
10. an address database establishing module; 11. a data acquisition module;
12. an administrative division address error correction module; 13. a detailed address error correction module;
101. a place name dictionary obtaining module; 102. an address tree acquisition module;
103. an address index construction module; 104. a detailed address index building module;
121. an administrative division address word segmentation module; 122. a word segmentation position identification module;
123. a wrong place name identification module; 124. a first-three-level place name error correction module;
131. a detailed address segmentation module; 132. a detailed address checking module;
133. a detailed address place name error correction module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.
It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.
Address data generally consists of administrative division information at the province, city, and county levels together with detailed address information. Place names in the administrative division information are highly stable and distinctive, and their form of expression is relatively fixed. Detailed address information, by contrast, is expressed very flexibly: omitted place name levels, place name aliases, abbreviations, and similar place names are all common. The address error correction method and system of the embodiments therefore correct addresses in two parts: the province, city, and county administrative division address, and the detailed address.
The present embodiment provides an address error correction method, and fig. 1 to 4 are flowcharts of an address error correction method according to an embodiment of the present application, and referring to fig. 1 to 4, the flowcharts include the following steps:
an address database establishing step S10, configured to collect address data in advance and establish an address database, where the address database includes one or any combination of: a place name dictionary, an address tree, an address index, a detailed address index, and a dictionary of visually similar characters;
a data obtaining step S11, configured to obtain an address to be error-corrected;
an administrative division address error correction step S12, configured to segment the address data into words according to the place name dictionary, identify wrong place names among the first three levels of administrative division place names according to the address tree, and correct them through full-text retrieval and similarity comparison, where the first three levels of place names mainly refer to the levels above the township: provinces and municipalities, prefecture-level cities, and districts and counties;
and a detailed address error correction step S13, configured to segment the address data output by the administrative division address error correction step into place names according to an address standardized segmentation model, identify wrong place names in the detailed address according to the detailed address index, and correct them through full-text retrieval and similarity comparison.
Through the above steps, on the basis of layering the address by level, the administrative division information (province, city, county) and the detailed address information are corrected separately: place names are corrected after word segmentation and place name segmentation of the address data, and the relationships between address levels are corrected by means of the address tree.
Fig. 2 is a schematic flowchart of a substep of step S10 of the address error correction method according to an embodiment of the application, referring to fig. 2, in some embodiments, the address database establishing step S10 further includes:
a place name dictionary obtaining step S101, which is used for obtaining the place name of each level of the administrative division and establishing a place name dictionary;
an address tree obtaining step S102, configured to perform place name expansion on the place name according to the hierarchy suffix of the place name and establish a dependency relationship between the place names of each hierarchy to obtain an address tree;
an address index construction step S103, configured to simplify suffixes of place names of the first three levels in the administrative division to obtain simplified place names, and establish full-text indexes between the simplified place names and the place names of the first three levels to obtain address indexes;
and a detailed address index construction step S104, configured to construct a full-text index for the detailed address place name in the administrative division, so as to obtain a detailed address index.
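Steps S101 to S103 can be sketched as follows. Everything concrete here is an assumption for illustration: the suffix list, the input format of (province, city, district) tuples, the nested-dictionary tree, and the use of a plain mapping from simplified names to full names in place of a real full-text index.

```python
# Assumed hierarchy suffixes; longer suffixes are tried first.
SUFFIXES = ("特别行政区", "自治区", "省", "市", "区", "县")

def simplify(name):
    """Strip a hierarchy suffix if at least two characters remain."""
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix) + 1:
            return name[:-len(suffix)]
    return name

def build_address_db(records):
    """Build a place name dictionary, an address tree, and a name index."""
    dictionary, tree, index = set(), {}, {}
    for path in records:
        node = tree
        for name in path:
            short = simplify(name)
            dictionary.update({name, short})           # full and simplified forms
            index.setdefault(short, set()).add(name)   # simplified -> full names
            index.setdefault(name, set()).add(name)
            node = node.setdefault(name, {})           # child hangs under its parent
    return dictionary, tree, index

# Hypothetical records; note that "长安区" exists under two different cities.
records = [("河北省", "石家庄市", "长安区"), ("陕西省", "西安市", "长安区")]
dictionary, tree, index = build_address_db(records)
```

Keeping the hierarchy in the tree means a duplicated name such as "长安区" is stored once per parent city, so later verification can tell which parent a given occurrence belongs to.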
Fig. 3 is a schematic flowchart of a substep of step S12 of the address error correction method according to an embodiment of the present application, and referring to fig. 3, in some embodiments, the step S12 of correcting the address of the administrative division further includes:
an administrative division address word segmentation step S121, configured to perform forward maximum matching word segmentation on address data according to a place name dictionary to obtain a word segmentation list, where the word segmentation list includes a place name entry after word segmentation;
a word segmentation position recognition step S122, configured to match the word segmentation list against the place name dictionary to obtain the first three levels of place names in the word segmentation list;
a wrong place name identification step S123, configured to verify the first three levels of place names based on the address tree and identify the wrong place names among them;
and a first-three-level place name error correction step S124, configured to perform full-text retrieval and similarity comparison on the wrong place names among the first three levels of place names based on the address index, and to take the place name in the address index with the highest similarity to a wrong place name as the correct place name; optionally, the similarity comparison is performed by calculating an edit distance and/or a Jaccard similarity measure.
Through the above steps, the first three levels of place names are identified and corrected, and the corrected place names are both valid place names and consistent with the place name hierarchy, which improves the error correction accuracy of the embodiment of the application.
FIG. 4 is a schematic flowchart of a sub-step of step S13 of the address error correction method according to the embodiment of the present application, and referring to FIG. 4, in some embodiments, the detailed address error correction step S13 further includes:
a detailed address splitting step S131, configured to perform place name splitting on the address data subjected to the first three-level place name error correction step based on an address standardized splitting model, so as to obtain a splitting result;
a detailed address verification step S132, configured to perform place name verification on the segmentation result based on the detailed address index, so as to obtain a wrong place name in the detailed address;
a detailed address place name error correction step S133, configured to perform full-text retrieval and similarity comparison on the wrong place names in the detailed address based on the detailed address index, and to take the place name in the detailed address index with the highest similarity to a wrong place name as the correct place name for error correction, where optionally the similarity comparison is performed by calculating an edit distance and/or a Jaccard similarity measure.
The embodiments of the present application are described and illustrated below by means of preferred embodiments.
Data preparation is performed through step S10, address data is collected in advance, and an address database is established, which is specifically exemplified as follows:
obtaining a place name dictionary through step S101, the place name dictionary specifically comprising province, city, district/county, town and road names, place name suffixes, and the like;
in the address tree obtaining step S102, for every administrative division address in the country, from the provincial level down to the village/residential community level, each place name is expanded according to its suffix, and the dependency relationships between the address levels are established. Each level of address is processed as follows:
1. First-level addresses comprise provinces and municipalities directly under the central government. Processing mainly removes the trailing "province" or "city"; for example, "Beijing City" is expanded into { "Beijing City", "Beijing" }. Special addresses are handled individually; for example, "Inner Mongolia Autonomous Region" is expanded into { "Inner Mongolia Autonomous Region", "Inner Mongolia" };
2. Second-level addresses comprise prefecture-level cities, prefectures, and the like. Processing mainly removes the trailing "city" or "prefecture"; for example, "Hefei City" is expanded into { "Hefei City", "Hefei" }. Special addresses are handled individually; for example, "Liangshan Yi Autonomous Prefecture" is expanded into { "Liangshan Yi Autonomous Prefecture", "Liangshan" };
3. Third-level addresses comprise districts, counties, and the like. Processing mainly removes the trailing "district" or "county"; for example, "Haidian District" is expanded into { "Haidian District", "Haidian" }. Special addresses are handled individually; for example, "Pudong New Area" is expanded into { "Pudong New Area", "Pudong" };
4. Fourth-level addresses comprise townships, towns, subdistricts, and the like. Processing mainly removes the trailing "township" or "town"; for example, "Octave Town" is expanded into { "Octave Town", "Octave" }. Subdistrict offices are expanded; for example, "Gulou Subdistrict Office" is expanded into { "Gulou Subdistrict Office", "Gulou Subdistrict" }. Special addresses are handled individually; for example, "East Vibration Community Working Committee" is expanded into { "East Vibration Community Working Committee", "East Vibration Community" };
5. Fifth-level addresses comprise villages, residential communities, and the like. Processing mainly extracts the village name; for example, "Dongfeng New Village Villagers' Committee" is expanded into { "Dongfeng New Village Villagers' Committee", "Dongfeng New Village" }. Optionally, the residential community name is extracted according to the character count; for example, "Jinao Mountain Community Residents' Committee" is expanded into { "Jinao Mountain Community Residents' Committee", "Jinao Mountain Community", "Jinao Mountain" }.
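The suffix-based expansion rules above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the implementation of the application: the name `expand_place_name`, the per-level suffix table, and the special-case table are all hypothetical, and the Chinese strings are assumed original-language forms of the translated examples.

```python
# Hypothetical suffix table per address level (1 = province ... 4 = town).
LEVEL_SUFFIXES = {
    1: ("省", "市"),      # province, city
    2: ("市", "地区"),    # city, prefecture
    3: ("区", "县"),      # district, county
    4: ("镇", "乡"),      # town, township
}

# Hypothetical table of special names whose short form is not a plain suffix cut.
SPECIAL_ALIASES = {
    "内蒙古自治区": ["内蒙古"],  # Inner Mongolia Autonomous Region
    "浦东新区": ["浦东"],        # Pudong New Area
}


def expand_place_name(name, level):
    """Return the full name followed by its shortened variants."""
    if name in SPECIAL_ALIASES:
        return [name] + SPECIAL_ALIASES[name]
    for suffix in LEVEL_SUFFIXES.get(level, ()):
        short = name[: -len(suffix)]
        # Keep at least two characters so the short form stays recognizable.
        if name.endswith(suffix) and len(short) >= 2:
            return [name, short]
    return [name]


print(expand_place_name("北京市", 1))
print(expand_place_name("浦东新区", 3))
```

Handling special names through a lookup table before the generic suffix rule keeps the rule itself simple; whether the application does it this way is not stated in the text.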
Fig. 6 is a schematic diagram of an address tree according to a preferred embodiment of the present application. By way of example and not limitation, fig. 6 illustrates the address tree obtained by converting each level of the administrative divisions in step S102 described above; the figure shows an example from Henan Province down to the subdistrict/town level and serves only to illustrate the address tree structure of the present embodiment.
In the address index construction step S103, a full-text index is built from the first three levels of addresses and their suffix-stripped place names, yielding a nationwide index of first-three-level addresses. By way of example and not limitation, the address index includes entries such as "Henan Province Luoyang City Luolong District", "Luoyang City Luolong District", "Henan Luoyang Luolong District", "Henan Luolong District", "Luoyang Luolong District" …, so that full-text search queries can be run on the first three levels of addresses.
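As a rough sketch of how such index entries could be generated, the following combines the full and suffix-stripped spelling of each of the three levels. It is only an illustration: a plain dictionary stands in for the full-text index, the function name `build_top3_index` and the input layout are assumptions, and entries that omit a level entirely are not generated here.

```python
from itertools import product


def build_top3_index(paths):
    """Map every full/short spelling of a province-city-county path to its full form.

    `paths` is a list of three-element lists; each element lists the name
    variants of one level, with the full name first.
    """
    index = {}
    for levels in paths:
        canonical = "".join(variants[0] for variants in levels)
        for combo in product(*levels):
            # setdefault keeps the first canonical form if two paths collide.
            index.setdefault("".join(combo), canonical)
    return index


# Assumed Chinese forms of Henan Province / Luoyang City / Luolong District.
paths = [[["河南省", "河南"], ["洛阳市", "洛阳"], ["洛龙区", "洛龙"]]]
index = build_top3_index(paths)
print(len(index))
print(index["河南洛阳洛龙"])
```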
The detailed address index is constructed in the detailed address index construction step S104. Specifically, for each province/city/county combination (first three levels of address), a full-text index is built over the names of the towns, subdistricts, roads, villages, residential communities and the like under it. The first-three-level address is used as a folder name, and the detailed address index corresponding to that address is stored in that folder.
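The per-region organization of the detailed address index can be pictured with an in-memory stand-in. In this sketch a dictionary keyed by the (province, city, county) tuple replaces the folder-per-region layout, and a set of names replaces the full-text index; both substitutions, as well as the function names and sample records, are assumptions made for illustration.

```python
from collections import defaultdict


def build_detail_index(records):
    """Group detailed place names by their (province, city, county) key."""
    index = defaultdict(set)
    for province, city, county, place in records:
        index[(province, city, county)].add(place)
    return index


def find_unknown(index, top3, parts):
    """Return the detailed place names that are absent from the region's index."""
    known = index.get(top3, set())
    return [part for part in parts if part not in known]


# Hypothetical sample records.
records = [
    ("河南省", "洛阳市", "洛龙区", "诸葛镇"),
    ("河南省", "洛阳市", "洛龙区", "人民路"),
]
detail_index = build_detail_index(records)
print(find_unknown(detail_index, ("河南省", "洛阳市", "洛龙区"), ["诸葛镇", "人民陆"]))
```

Scoping the lookup to one region keeps each index small and prevents a place name from one county from validating an address in another.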
Step S10 of this embodiment also obtains a dictionary of visually similar characters, which contains groups of characters with similar shapes, for example "Yang, Tang, Fang, Chang" and "remote, Shao, Yao". The dictionary is mainly used when comparing the similarity of place names.
After the address database is established through the above steps, the address data to be corrected is acquired through step S11, and the administrative division address correction of the province, city and county district is performed on the address data to be corrected through step S12.
Fig. 7 is a schematic flowchart of another substep of step S12 of this embodiment, and referring to fig. 7, step S12 specifically includes the following steps:
step S121, the address is segmented by forward maximum matching against the place name dictionary. The place name dictionary includes province, city and district names, their abbreviations, and place name suffixes, such as "Zheng", "Beijing", "province", "city" and "municipality". Segmenting the address with these entries yields a word segmentation list of place name entries. For example, matching an address of the form "Henan Luoyang City Luolong District Zhuge Town Renmin Road No. 32", in which the city name is miswritten, yields a word segmentation list such as { "Henan", the fragments of the miswritten city name, "City", "Luolong District", "Zhuge Town", "Renmin Road", "3", "2", "No." }.
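Forward maximum matching itself is a standard greedy technique and can be sketched in a few lines. The dictionary contents and the miswritten city name ("洛阴" in place of "洛阳") below are hypothetical; characters that match no dictionary entry fall out as single-character tokens.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation that prefers the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical place name dictionary (names, abbreviations, suffixes).
place_names = {"河南", "河南省", "洛阳", "洛阳市", "洛龙区", "诸葛镇", "人民路", "省", "市", "区", "号"}
print(forward_max_match("河南洛阴市洛龙区诸葛镇人民路32号", place_names))
```

Because the miswritten city name is not in the dictionary, it is split into single characters, which is what lets the later steps notice that something is wrong at the city level.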
Step S122, the position at which the first three levels of place names (province, city, district/county) end in the word segmentation list is recognized; specifically, the end position is determined by matching against the place name dictionary and the province, city and district/county suffixes. The word segmentation list obtained in step S121 is traversed from the first segment. If a segment matches the first-three-level place name dictionary, the next segment is examined. If a segment is judged to be a third-level address or a third-level address suffix, that is, a county-level address, the judgment ends and the position of the current segment is returned. If a segment is judged to be a detailed address, that is, a town, subdistrict, road, village or the like, the position of the previous segment is returned. For example, when this is applied to the word segmentation list obtained in step S121, the third-level address is considered matched when "Luolong District" is reached, and the judgment ends.
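The traversal described in step S122 can be sketched as follows. The function name `top3_end` and the three lookup collections are hypothetical; the source does not say how county-level names and detailed-address suffixes are represented.

```python
def top3_end(tokens, county_names, county_suffixes, detail_suffixes):
    """Return the index of the last token belonging to the first three levels.

    Returns -1 when the very first token is already a detailed address.
    """
    for pos, token in enumerate(tokens):
        # County-level name or bare county suffix: the first three levels end here.
        if token in county_names or token in county_suffixes:
            return pos
        # Town/road/village token: the first three levels ended one token earlier.
        if token.endswith(detail_suffixes):
            return pos - 1
    return len(tokens) - 1


tokens = ["河南", "洛", "阴", "市", "洛龙区", "诸葛镇", "人民路", "3", "2", "号"]
end = top3_end(tokens, {"洛龙区", "洛龙"}, {"区", "县"}, ("镇", "乡", "街道", "路", "村"))
print(tokens[: end + 1])
```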
In step S123, address tree level matching is used to verify whether the province, city and district/county contain errors, that is, whether the first three levels of the address match completely. Specifically: if a first-level address (province or municipality directly under the central government) is matched, it is checked whether the next segment matches a second-level address under that first-level address; if a prefecture-level city is matched, it is then checked whether the next segment matches a third-level (district/county) address under that city. If any of these checks fails, the first three levels of the address are considered to contain an error. For example, when "Henan" is matched to the first-level address "Henan" in the address tree, the next segment "Luo" is looked up among the addresses subordinate to "Henan"; it is not found, so the address is judged to contain an error.
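A minimal sketch of this level-by-level check, assuming the address tree is held as nested dictionaries and that shortened names are resolved through an alias map. Both data structures, the sample tree and the function name `verify_top3` are assumptions; a single global alias map is a simplification that ignores names shared between regions.

```python
# Hypothetical fragment of the address tree: province -> city -> county.
ADDRESS_TREE = {
    "河南省": {
        "洛阳市": {"洛龙区": {}, "涧西区": {}},
        "郑州市": {"金水区": {}},
    },
}

# Hypothetical map from shortened names to full names.
ALIASES = {"河南": "河南省", "洛阳": "洛阳市", "洛龙": "洛龙区", "郑州": "郑州市"}


def verify_top3(names, tree, aliases):
    """Walk the tree level by level; return the index of the first mismatch, or None."""
    node = tree
    for depth, name in enumerate(names):
        canonical = aliases.get(name, name)
        if canonical not in node:
            return depth
        node = node[canonical]
    return None


print(verify_top3(["河南", "洛阳", "洛龙区"], ADDRESS_TREE, ALIASES))
print(verify_top3(["河南", "洛阴", "洛龙区"], ADDRESS_TREE, ALIASES))
```

Descending into the matched child at each level is what enforces the hierarchy: a valid county name under the wrong city is reported as a mismatch.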
In step S124, the first-three-level address judged erroneous in step S123 is corrected: the first-three-level address with the highest similarity is obtained through full-text retrieval and edit distance. Specifically, the first three levels are concatenated into a single string, which is then queried against the nationwide first-three-level address index to obtain a list of the top 10 candidate similar place names. In the above example, full-text retrieval returns candidates such as "Henan Luoyang Luolong District" and "Henan Luoyang City Luolong District"; the edit distance between each candidate and the concatenated string is computed, and the candidates at edit distance 1 are kept. If the character in which such a candidate differs from the concatenated string is a visually similar character or a homophone of the corresponding character, the candidate is taken as the corrected place name.
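The candidate filtering in step S124 can be sketched with a standard Levenshtein distance. This sketch makes several assumptions that go beyond the text: candidates are passed in directly instead of coming from a full-text search, only single-character substitutions are accepted, the homophone check is left out, and the confusable character group is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def pick_correction(wrong, candidates, similar_groups):
    """Return the first candidate that differs by one visually similar character."""
    for cand in candidates:
        # Equal length plus distance 1 means exactly one substituted character.
        if len(cand) != len(wrong) or edit_distance(wrong, cand) != 1:
            continue
        diffs = [(x, y) for x, y in zip(wrong, cand) if x != y]
        x, y = diffs[0]
        if any(x in group and y in group for group in similar_groups):
            return cand
    return None


candidates = ["河南洛阳洛龙区", "河南洛阳市洛龙区"]
print(pick_correction("河南洛阴市洛龙区", candidates, [{"阳", "阴"}]))
```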
After step S12 is completed, error correction continues with the detailed address. The detailed address refers to the specific address components other than the province, city and district/county, namely the town or township, road, village group, residential community and the like. Error correction of the detailed address depends on the first three levels of the address. The detailed address error correction step S13 specifically includes the following steps:
step S131, a standard place name segmentation result is obtained using a predetermined address standardized segmentation model, for example: "Henan Province, Luoyang City, Luolong District, Licun Town, Chanxiu Road, Meijing Yuan";
step S132, the detailed address index corresponding to the current province, city and district/county is obtained, and each detailed address place name produced by the address standardized segmentation is checked against it;
step S133, for each wrong place name, that is, a place name not found during the check, full-text retrieval is performed based on the detailed address index to find the 10 place names with the highest similarity; the edit distance between each of these similar place names and the wrong place name is then calculated, and if the edit distance is 1 and the differing characters are visually similar characters or homophones, the wrong place name is replaced with the correct place name, completing the error correction of the detailed address.
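The retrieve-then-filter flow of step S133 can be approximated with the Python standard library. Here `difflib.get_close_matches` is swapped in for the full-text retrieval of the top 10 similar names; it is not the search engine of the application, and the function name, cutoff value, sample names and confusable group are all assumptions.

```python
import difflib


def correct_detail_name(wrong, known_names, confusable_groups, top_n=10):
    """Replace `wrong` by a known name that differs in one confusable character."""
    # Stand-in for full-text retrieval: up to `top_n` roughly similar known names.
    candidates = difflib.get_close_matches(wrong, known_names, n=top_n, cutoff=0.5)
    for cand in candidates:
        if len(cand) != len(wrong):
            continue
        diffs = [(x, y) for x, y in zip(wrong, cand) if x != y]
        if len(diffs) == 1 and any(
            diffs[0][0] in group and diffs[0][1] in group for group in confusable_groups
        ):
            return cand
    # No acceptable candidate: leave the name unchanged.
    return wrong


known = ["长兴路", "长江路", "人民路"]
print(correct_detail_name("长兴陆", known, [{"陆", "路"}]))
```

Returning the input unchanged when no candidate passes the check is a conservative choice for this sketch, so that an uncertain match never overwrites the original text.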
It should be noted that, in steps S124 and S133 above, Jaccard similarity may be used instead of edit distance for the similarity comparison. In that case the visually similar character and homophone check is no longer performed; the most similar result is taken directly as the result of error correction.
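For reference, Jaccard similarity between two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below applies it to the sets of characters of two place names; treating each character as a set element is an assumption, since the text does not say which units are compared, and the sample names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def most_similar(wrong, candidates):
    """Return the candidate with the highest Jaccard similarity, or None if there is none."""
    if not candidates:
        return None
    return max(candidates, key=lambda cand: jaccard(wrong, cand))


print(most_similar("长兴陆", ["长兴路", "长江路", "兴隆路"]))
```

Character-set Jaccard ignores character order and repetition, so it is cheaper but coarser than edit distance.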
Through the above steps, for place names containing OCR recognition errors or user input errors, this embodiment makes it easier for users to verify place names, improves the accuracy of place name error correction, repairs erroneous addresses, and improves the usability of address data. In this way, the labor cost of processing address data can be greatly reduced.
It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.
The embodiment also provides an address error correction system. Fig. 5 is a block diagram of an address error correction system according to an embodiment of the present application. As shown in fig. 5, the address error correction system includes: the system comprises an address database establishing module 10, a data acquisition module 11, an administrative division address error correction module 12, a detailed address error correction module 13 and the like. Those skilled in the art will appreciate that the address error correction system architecture shown in FIG. 5 does not constitute a limitation of the address error correction system and may include more or fewer modules than shown, or some modules in combination, or a different arrangement of modules.
The following describes each constituent module of the address error correction system in detail with reference to fig. 5:
an address database establishing module 10, configured to collect address data in advance and establish an address database, where the address database at least includes: one or any combination of a place name dictionary, an address tree, an address index, a detailed address index and a dictionary of visually similar characters;
the data acquisition module 11 is used for acquiring an address to be corrected;
the administrative division address error correction module 12 is configured to, after word segmentation is performed on the address data according to a place name dictionary, identify wrong place names among the first three levels of place names of the administrative division according to an address tree, and correct those wrong place names through full-text retrieval and similarity comparison, where the first three levels of place names mainly refer to the province or municipality directly under the central government, the city, and the district or county, that is, the levels above the township/town level;
and the detailed address error correction module 13 is configured to perform place name segmentation on the address data obtained by the administrative division address error correction module 12 according to an address standardized segmentation model, identify wrong place names in the detailed address according to a detailed address index, and correct them through full-text retrieval and similarity comparison.
Through the above modules, on the basis of hierarchical address levels, error correction is performed separately on the province, city and county administrative division information and on the detailed address information. Place names are corrected after word segmentation and segmentation of the address data, and the relationships between address levels are also corrected by means of the address tree.
Wherein, the address database establishing module 10 further comprises: a place name dictionary obtaining module 101, configured to obtain a place name of each level of an administrative division and establish a place name dictionary; the address tree obtaining module 102 is configured to perform place name expansion on place names according to hierarchy suffixes of the place names and establish a dependency relationship between the place names of each hierarchy to obtain an address tree; the address index building module 103 is configured to simplify suffixes of place names of the first three levels in the administrative division to obtain simplified place names, and build a full-text index between the simplified place names and the place names of the first three levels to obtain an address index; and the detailed address index building module 104 is configured to build a full-text index for the detailed address place name in the administrative division to obtain a detailed address index.
Wherein, the administrative division address error correction module 12 further includes: an administrative division address word segmentation module 121, configured to perform forward maximum matching word segmentation on the address data according to the place name dictionary to obtain a word segmentation list, where the word segmentation list includes the place name entries obtained by segmentation; a word segmentation position recognition module 122, configured to match the word segmentation list against the place name dictionary to obtain the first three levels of place names in the word segmentation list; a wrong place name identification module 123, configured to verify the first three levels of place names based on the address tree and identify the wrong place names among them; and a first-three-level place name error correction module 124, configured to perform full-text retrieval and similarity comparison on the wrong place names based on the address index, and to take the place name in the address index with the highest similarity to a wrong place name as the correct place name, where the similarity comparison is performed by calculating an edit distance and/or a Jaccard similarity measure. Through these modules, the first three levels of place names are identified and corrected, and the corrected place names are both valid place names and consistent with the place name hierarchy, which improves the error correction accuracy of the embodiment of the application.
Wherein, the detailed address error correction module 13 further includes: a detailed address segmentation module 131, configured to perform place name segmentation on the address data processed by the first-three-level place name error correction module based on an address standardized segmentation model to obtain a segmentation result; a detailed address checking module 132, configured to check the place names in the segmentation result against the detailed address index to obtain the wrong place names in the detailed address; and a detailed address place name error correction module 133, configured to perform full-text retrieval and similarity comparison on the wrong place names in the detailed address based on the detailed address index, and to take the place name in the detailed address index with the highest similarity to a wrong place name as the correct place name for error correction, where the similarity comparison is performed by calculating an edit distance and/or a Jaccard similarity measure.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
In addition, the address error correction method of the embodiment described in conjunction with fig. 1 to 4 may be implemented by a computer device. The computer device may include a processor and a memory storing computer program instructions. The memory may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by the processor. The processor reads and executes the computer program stored in the memory.
In addition, in combination with the address error correction method in the foregoing embodiment, the embodiment of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the address error correction methods in the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. An address error correction method, comprising:
a data acquisition step for acquiring an address to be corrected;
an administrative division address error correction step, which is used for recognizing wrong place names in the first three-level place names of the administrative division according to an address tree after word segmentation is carried out on the address data according to a place name dictionary, and correcting the wrong place names in the first three-level place names through full text retrieval and similarity comparison;
and a detailed address error correction step, which is used for carrying out place name segmentation on the address data subjected to the administrative division address error correction step according to an address standardized segmentation model, then identifying wrong place names in the detailed addresses according to a detailed address index, and carrying out full text retrieval and similarity contrast error correction.
2. The address error correction method according to claim 1, further comprising an address database establishing step for collecting address data in advance and establishing an address database, wherein the address database at least comprises: one or any combination of the place name dictionary, the address tree, an address index and the detailed address index.
3. The address error correction method of claim 2, wherein the administrative division address error correction step further comprises:
an administrative division address word segmentation step, which is used for performing forward maximum matching word segmentation on the address data according to the place name dictionary to obtain a word segmentation list;
a word segmentation position recognition step, which is used for matching the word segmentation list based on the place name dictionary to obtain the first three-level place names in the word segmentation list;
a wrong place name identification step, which is used for verifying the first three-level place names based on the address tree and identifying the wrong place names in the first three-level place names;
and a first three-level place name error correction step, which is used for performing error correction on the wrong place names in the first three-level place names based on the address index through full text retrieval and similarity comparison, to obtain the place names in the address index with the highest similarity to the wrong place names in the first three-level place names as correct place names.
4. The address error correction method of claim 3, wherein the detailed address error correction step further comprises:
a detailed address segmentation step, which is used for performing place name segmentation on the address data subjected to the first three-level place name error correction step based on an address standardized segmentation model to obtain a segmentation result;
a detailed address checking step, configured to perform place name checking on the segmentation result based on the detailed address index, so as to obtain a wrong place name in the detailed address;
and a detailed address place name error correction step, which is used for carrying out full text retrieval and similarity comparison on the wrong place names in the detailed addresses based on the detailed address indexes, and obtaining place names with the highest similarity with the wrong place names in the detailed addresses in the detailed address indexes as correct place names to carry out error correction.
5. The address error correction method according to claim 1, wherein the address database establishing step further comprises:
a place name dictionary obtaining step, which is used for obtaining the place name of each level of the administrative division and establishing a place name dictionary;
an address tree obtaining step, configured to perform place name expansion on the place name according to a hierarchy suffix of the place name, and establish a dependency relationship between hierarchy place names to obtain the address tree;
an address index construction step, configured to simplify suffixes of place names of the first three levels in the administrative division to obtain simplified place names, and establish a full-text index between the simplified place names and the place names of the first three levels to obtain the address index;
and a detailed address index construction step, which is used for establishing a full-text index for the detailed address place name in the administrative division to obtain the detailed address index.
6. An address error correction system, comprising:
the data acquisition module is used for acquiring an address to be corrected;
the administrative division address error correction module is used for identifying wrong place names in the first three-level place names of the administrative division according to an address tree after word segmentation is carried out on the address data according to a place name dictionary, and correcting the wrong place names in the first three-level place names through full-text retrieval and similarity comparison;
and the detailed address error correction module is used for carrying out place name segmentation on the address data obtained by the administrative division address error correction module according to an address standardized segmentation model, identifying wrong place names in the detailed addresses according to a detailed address index, and carrying out full-text retrieval and similarity contrast error correction.
7. The address error correction system according to claim 6, further comprising an address database establishing module for collecting address data in advance and establishing an address database, wherein the address database at least comprises: one or any combination of the place name dictionary, the address tree, an address index and the detailed address index.
8. The address error correction system of claim 7, wherein the administrative division address error correction module further comprises:
the administrative division address word segmentation module is used for performing forward maximum matching word segmentation on the address data according to the place name dictionary to obtain a word segmentation list;
the word segmentation position recognition module is used for matching the word segmentation list based on the place name dictionary to obtain the place names of the first three levels in the word segmentation list;
the wrong place name identification module is used for verifying the place names in the first three levels based on the address tree and identifying the wrong place names in the first three levels;
and the first three-level place name error correction module is used for performing error correction on the place names with errors in the first three-level place names based on the address index by full-text retrieval and similarity comparison to obtain the place names with the highest similarity with the place names with errors in the first three-level place names in the address index as correct place names.
9. The address error correction system of claim 8, wherein the detailed address error correction module further comprises:
the detailed address segmentation module is used for performing place name segmentation on the address data passing through the first three-level place name error correction module based on an address standardized segmentation model to obtain a segmentation result;
the detailed address checking module is used for checking the place name of the segmentation result based on the detailed address index to obtain a wrong place name in the detailed address;
and the detailed address and place name error correction module is used for carrying out full-text retrieval and similarity comparison on the wrong place names in the detailed addresses based on the detailed address indexes to obtain the place names with the highest similarity with the wrong place names in the detailed addresses in the detailed address indexes, and using the place names as correct place names to carry out error correction.
10. The address error correction system of claim 6, wherein the address database building module further comprises:
the place name dictionary acquisition module is used for acquiring the place names of each level of an administrative division and establishing the place name dictionary;
the address tree acquisition module is used for carrying out place name expansion on the place names according to the hierarchy suffixes of the place names and establishing the dependency relationship among the place names of each hierarchy to obtain the address tree;
the address index building module is used for simplifying the suffixes of the place names of the first three levels in the administrative division to obtain simplified place names, and building a full-text index by using the simplified place names and the place names of the first three levels to obtain the address index;
and the detailed address index building module is used for building a full-text index for the detailed address place name in the administrative division to obtain the detailed address index.
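The address database construction of claim 10 can be sketched in the same spirit. The code below is an assumption-laden illustration, not the claimed implementation: the suffix list, the function names, and the sample rows are hypothetical, a plain dict that maps each simplified or full name to its full names stands in for a full-text index, and the suffix-based place name expansion step is omitted because the claim does not detail it.

```python
# Hypothetical level suffixes for the first three administrative levels.
SUFFIXES = (" Province", " City", " District", " County")


def simplify(name):
    """Strip a trailing level suffix, e.g. 'Hangzhou City' -> 'Hangzhou'."""
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def build_address_database(rows):
    """Build the dictionary, address tree, and address index from 3-level rows."""
    dictionary = [set(), set(), set()]  # place names per level
    tree = {}                           # nested dict: parent -> children
    index = {}                          # lookup key -> set of full place names
    for row in rows:
        node = tree
        for level, name in enumerate(row):
            dictionary[level].add(name)
            node = node.setdefault(name, {})
            # Index both the full name and its simplified form.
            index.setdefault(name, set()).add(name)
            index.setdefault(simplify(name), set()).add(name)
    return dictionary, tree, index


rows = [
    ("Zhejiang Province", "Hangzhou City", "Xihu District"),
    ("Zhejiang Province", "Ningbo City", "Haishu District"),
]
dictionary, tree, index = build_address_database(rows)
print(sorted(index["Hangzhou"]))
```

Index values are sets because two different full names can share one simplified form, in which case a lookup returns every candidate for later similarity comparison.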
CN202011271106.7A 2020-11-13 2020-11-13 Address error correction method and system Pending CN112364113A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011271106.7A CN112364113A (en) 2020-11-13 2020-11-13 Address error correction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011271106.7A CN112364113A (en) 2020-11-13 2020-11-13 Address error correction method and system

Publications (1)

Publication Number Publication Date
CN112364113A true CN112364113A (en) 2021-02-12

Family

ID=74515568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011271106.7A Pending CN112364113A (en) 2020-11-13 2020-11-13 Address error correction method and system

Country Status (1)

Country Link
CN (1) CN112364113A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112353A (en) * 2021-06-15 2021-07-13 红盾大数据(北京)有限公司 Address information perfecting method and device, electronic equipment and readable storage medium
CN113204606A (en) * 2021-04-30 2021-08-03 武汉大学 Address position presumption method based on semantic position network
CN113221558A (en) * 2021-05-28 2021-08-06 中邮信息科技(北京)有限公司 Express delivery address error correction method and device, storage medium and electronic equipment
CN113434708A (en) * 2021-05-25 2021-09-24 北京百度网讯科技有限公司 Address information detection method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000011096A (en) * 1998-06-23 2000-01-14 Canon Inc Character recognizing processor, its method and storage medium
CN101719128A (en) * 2009-12-31 2010-06-02 浙江工业大学 Fuzzy matching-based Chinese geo-code determination method
CN108369582A (en) * 2018-03-02 2018-08-03 福建联迪商用设备有限公司 Address error correction method and terminal
CN110704564A (en) * 2019-09-27 2020-01-17 北京沃东天骏信息技术有限公司 Address error correction method and device
CN111291277A (en) * 2020-01-14 2020-06-16 浙江邦盛科技有限公司 Address standardization method based on semantic recognition and high-level language search

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204606A (en) * 2021-04-30 2021-08-03 武汉大学 Address position presumption method based on semantic position network
CN113434708A (en) * 2021-05-25 2021-09-24 北京百度网讯科技有限公司 Address information detection method and device, electronic equipment and storage medium
CN113221558A (en) * 2021-05-28 2021-08-06 中邮信息科技(北京)有限公司 Express delivery address error correction method and device, storage medium and electronic equipment
CN113221558B (en) * 2021-05-28 2023-09-19 中邮信息科技(北京)有限公司 Express address error correction method and device, storage medium and electronic equipment
CN113112353A (en) * 2021-06-15 2021-07-13 红盾大数据(北京)有限公司 Address information perfecting method and device, electronic equipment and readable storage medium
CN113112353B (en) * 2021-06-15 2021-11-23 红盾大数据(北京)有限公司 Address information perfecting method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN112364113A (en) Address error correction method and system
CN103440312B (en) System and terminal for querying postcodes by mailing address
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN107145577A (en) Address standardization method, device, storage medium and computer
WO2016050088A1 (en) Address search method and device
CN108369582B (en) Address error correction method and terminal
WO2015027836A1 (en) Method and system for place name entity recognition
CN104699835A (en) Method and device used for determining webpages including POI (point of interest) data
CN101794307A (en) Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
CN106874287B (en) Method and device for processing POI address codes
CN109145073A (en) Address resolution method and device based on segmentation methods
CN109933797A (en) Geocoding and system based on Jieba word segmentation and address dictionary
CN110990520B (en) Address coding method and device, electronic equipment and storage medium
CN107463711A (en) Tag matching method and device for data
CN112528174A (en) Address finishing and complementing method based on knowledge graph and multiple matching and application
Chen et al. Georeferencing places from collective human descriptions using place graphs
CN112256817A (en) Geocoding method, system, terminal and storage medium
CN104679801A (en) Point of interest searching method and point of interest searching device
CN111008625B (en) Address correction method, device, equipment and storage medium
CN114168705B (en) Chinese address matching method based on address element index
CN112069824A (en) Region identification method, device and medium based on context probability and citation
CN111190937B (en) Method and device for inquiring native information, electronic equipment and storage medium
CN111611793B (en) Data processing method, device, equipment and storage medium
CN115759055A (en) English place name proofreading method considering multi-dimensional character characteristics
JP4510792B2 (en) LOCATION ANALYSIS DEVICE, LOCATION ANALYSIS METHOD, ITS PROGRAM, AND RECORDING MEDIUM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination