CN111159973B - Administrative division alignment and standardization method for Chinese addresses

Publication number
CN111159973B
Authority
CN
China
Prior art keywords
address
community
data
standard
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911280553.6A
Other languages
Chinese (zh)
Other versions
CN111159973A (en)
Inventor
贾晓光
张磊
寇志刚
李圣亮
罗群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongguancun Technology Software Co ltd
Original Assignee
Zhongguancun Technology Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongguancun Technology Software Co ltd
Priority to CN201911280553.6A
Publication of CN111159973A
Application granted
Publication of CN111159973B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for aligning and standardizing the administrative divisions of Chinese addresses, comprising the following steps: S1: collecting real users' original address data over a network and storing the data in a base raw-data database; S2: normalizing the original address data collected in step S1; S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses; S4: building a standard address library and obtaining standard provincial-level address elements through an address recognition and expression model; S5: training on an existing sample set of real community address data with an AI intelligent community algorithm to build a model and obtain community address elements; S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.

Description

Administrative division alignment and standardization method for Chinese addresses
Technical Field
The invention relates to the technical field of address standardization and information processing, and in particular to a method for completing and standardizing the administrative divisions of Chinese addresses.
Background
In China, address models have not received enough attention in urban planning. Taking house-number management as an example, the address of Tsinghua University is "No. 30 Shuangqing Road, Haidian District, Beijing", which the address model can express either as "city | district | road | house number" or as "Beijing, Haidian District, Shuangqing Road, Tsinghua University"; both expressions identify the same geographic location. This makes the address model uncertain and diverse, a conspicuous problem that leaves it unable to meet the demands of urban planning and informatization. For historical and cultural reasons, Chinese addresses also have a number of other problems. First, street names in China are highly varied and do not follow a standard "such-and-such street" or "such-and-such road" pattern. Second, people name and write addresses irregularly, which makes Chinese addresses, and the business data that contain them, complex.
Analysis of various data shows that non-standard addresses in stored business address data have the following characteristics, which give rise to a variety of address models that are not standardized in a strict sense. First, business address data are typically stored in a three-segment structure, that is, in three fields, and the three segments of address elements are not standard enough for address matching, address element division, and similar tasks. Second, the structure of address data elements is diverse: different regions divide address elements in different ways, so the division is varied and non-standard. Third, address data are written irregularly, so the same place appears in different forms: some address elements are stored in non-standard spellings or altered for other reasons, and synonymous address elements end up expressed differently. Fourth, some address elements are missing from the business address data, so the address is incomplete. Fifth, an address may directly use the smallest address element type, which leaves its division uncertain. Sixth, administrative divisions serve as the main spatial constraint, and because no unique, standard address expression exists, address descriptions can carry redundant information and become ambiguous. Seventh, people often describe addresses with place names of low stability, which can leave addresses imperfect and lose information. Eighth, old and new urban areas share no unified standard, and the coding scheme of villages in some urban areas is disordered. Ninth, much community information is held by grassroots-level government agencies, is not uniformly named, and is hard to collect, and community boundaries tend to change over time.
The address model is the core of administrative division alignment and standardization for Chinese addresses, and also the core of geocoding. Building an address model presupposes a sound planning scheme; it must also take users' spatial cognition habits into account and rely mainly on guidance, so that address standardization is advanced step by step. Given that non-standard addresses exist in large numbers, effective administrative division alignment and standardization of Chinese addresses is the only solution.
Disclosure of Invention
To address the technical problems in the related art, the invention provides a method for aligning and standardizing the administrative divisions of Chinese addresses that overcomes the shortcomings of the prior art.
To achieve this technical purpose, the technical solution of the invention is implemented as follows:
A method for aligning and standardizing the administrative divisions of Chinese addresses comprises the following steps:
S1: collecting real users' original address data over a network and storing the data in a base raw-data database;
S2: normalizing the original address data collected in step S1;
S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses;
S4: building a standard address library and obtaining standard provincial-level address elements through an address recognition and expression model;
S5: training on an existing sample set of real community address data with an AI intelligent community algorithm to build a model and obtain community address elements;
S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
Further, in step S1, the real users' original address data include address sample data containing a basic base name, as well as address data from the government field, the real-estate field, and the like.
Further, in step S2, when an address is judged to be non-standard, it is first restructured according to the standard address specification, and its address element names are then corrected.
Non-standard addresses include address data that the system cannot use because of misspelling, redundancy, incompleteness, ambiguity, mixed full-width and half-width characters, or inconsistent expression.
Further, in step S3, when the standard address coordinate library is built, a unique coordinate, namely a longitude and latitude, is matched in the standard address coordinate library, and the spatial expression of the current normalized address is then obtained; if no unique coordinate is matched, the longitude and latitude are obtained from a third-party GIS service.
Further, in step S5, the AI intelligent community algorithm comprises three parts: sample training, model construction, and test verification.
Sample training builds a rule model of community geographic place-name information by training on a sample library, and forms a community place-name address recognition rule library and a dynamic community association library.
Model construction designs an AI intelligent community model starting from real application requirements, based on the relation constraints, inheritance, and rule coding among address elements at all levels and combined with the real rules and actual situation of community data.
Test verification checks the effect of the AI intelligent community algorithm on a test community address data set.
The beneficial effects of the invention are as follows: it automatically recognizes and standardizes Chinese geographic addresses for the government and real-estate fields and improves the accuracy with which geographic address information is described; it helps governments identify resident locations and manage public opinion in smart-city systems; it helps e-commerce enterprises carry out precision marketing such as regional grid sales analysis and collaborative recommendation; and it helps individuals with location queries, hot-spot searches, and the like.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for administrative division alignment and standardization of Chinese addresses according to an embodiment of the present invention;
FIG. 2 is a schematic diagram showing the effects of a method for administrative division alignment and standardization of Chinese addresses according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
As shown in FIGS. 1-2, the method for completing and standardizing the administrative divisions of Chinese addresses according to the embodiment of the present invention includes the following steps:
S1: collecting real users' original address data over a network and storing the data in a base raw-data database;
S2: normalizing the original address data collected in step S1;
S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses;
S4: building a standard address library and obtaining standard provincial-level address elements through an address recognition and expression model;
S5: training on an existing sample set of real community address data with an AI intelligent community algorithm to build a model and obtain community address elements;
S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
In one embodiment, for step S1, the real users' original address data include address sample data containing a basic base name, as well as address data from the government field, the real-estate field, and the like.
In a specific embodiment, for step S2, when an address is judged to be disordered, not to conform to the standard address specification, to be unusable by the system, and to have a relatively low matching rate, it is first restructured according to the standard address specification, and its address element names are then corrected.
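The second half of this step, correcting address element names once the address has been structured, can be illustrated with a small lookup table. The sketch below is only an assumption about how such a correction might look: the function `correct_element_names`, the level keys, and the alias table are hypothetical and are not taken from the patent.

```python
def correct_element_names(elements: dict[str, str],
                          aliases: dict[str, dict[str, str]]) -> dict[str, str]:
    """Replace known non-standard element names with their standard form.

    `elements` maps a level (e.g. "city") to the name parsed from the address.
    `aliases` maps a level to a {variant: standard name} table (hypothetical data).
    Names without a known alias are returned unchanged.
    """
    return {level: aliases.get(level, {}).get(name, name)
            for level, name in elements.items()}


# Illustrative alias table: abbreviated names mapped to full standard names.
ALIASES = {"city": {"北京": "北京市"}, "county": {"海淀": "海淀区"}}

parsed = {"city": "北京", "county": "海淀", "street": "双清路"}
print(correct_element_names(parsed, ALIASES))
```

Keeping the alias table per level avoids rewriting a street name that happens to equal an abbreviated city name.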
Preferably, common non-standard addresses include those with spelling errors, redundancy, missing parts, ambiguity, mixed full-width and half-width characters, inconsistent expression, and the like.
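Of the problems listed, mixed full-width and half-width characters is the one that can be handled mechanically. A minimal sketch, assuming Unicode NFKC normalization is an acceptable way to fold widths and that whitespace inside an address carries no meaning (both are assumptions, and `normalize_width` is a hypothetical helper name):

```python
import re
import unicodedata


def normalize_width(raw: str) -> str:
    """Fold full-width characters to half-width and remove whitespace."""
    # NFKC maps full-width digits and letters (e.g. "３", "Ａ") and the
    # ideographic space to their ordinary half-width counterparts.
    folded = unicodedata.normalize("NFKC", raw)
    # Assumption: whitespace is noise in a Chinese address, so drop it all.
    return re.sub(r"\s+", "", folded)


print(normalize_width("北京市海淀区双清路３０号　Ａ座"))  # 北京市海淀区双清路30号A座
```

Addresses that legitimately contain spaces, such as embedded Latin-script names, would need a gentler whitespace rule than the one assumed here.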
Preferably, address elements follow the standard specification, and spatial objects generally nest by size in a containment relationship, for example: province > administrative division (city) > county > street > community > street lane > cell > building > unit > room.
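The containment chain lends itself to an ordered list, which makes it easy to see which superior levels a parsed address lacks. The following is a sketch under assumptions: the English level keys and the helper `missing_levels` are illustrative names, not part of the patent.

```python
# Levels ordered from largest to smallest, following the containment chain above.
LEVELS = ["province", "city", "county", "street", "community",
          "street_lane", "cell", "building", "unit", "room"]


def missing_levels(elements: dict[str, str], finest: str) -> list[str]:
    """List the levels from province down to `finest` that `elements` lacks."""
    depth = LEVELS.index(finest)  # raises ValueError for an unknown level
    return [level for level in LEVELS[: depth + 1] if level not in elements]


parsed = {"city": "北京市", "street": "双清路"}
print(missing_levels(parsed, "street"))  # ['province', 'county']
```

The returned list is what a later completion step would try to fill in.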
In a specific embodiment, for step S3, when the standard address coordinate library is built, a unique coordinate, namely a longitude and latitude, is matched in the standard address coordinate library, and the spatial expression of the current normalized address is then obtained; if no unique coordinate is matched, the longitude and latitude are obtained from a third-party GIS service.
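The lookup-then-fallback behaviour of step S3 can be sketched as follows. This is an illustration under assumptions: the library is modelled as an in-memory dictionary, the third-party GIS service is represented by a caller-supplied function, storing the fallback result back into the library is a design choice of this sketch, and the coordinates in the usage lines are made-up placeholders.

```python
from typing import Callable, Optional

Coordinate = tuple[float, float]  # (longitude, latitude)


def locate(address: str,
           coordinate_library: dict[str, Coordinate],
           gis_lookup: Callable[[str], Optional[Coordinate]]) -> Optional[Coordinate]:
    """Return the coordinate for a normalized address, or None if unknown."""
    hit = coordinate_library.get(address)
    if hit is not None:
        return hit  # matched in the standard address coordinate library
    found = gis_lookup(address)  # stand-in for the third-party GIS service
    if found is not None:
        # Assumption: remember the result so the library grows over time.
        coordinate_library[address] = found
    return found


# Usage with placeholder data and a stub in place of a real GIS client.
library = {"示例市示例路1号": (100.0, 30.0)}


def stub_gis(address: str) -> Optional[Coordinate]:
    return (101.0, 31.0) if address == "示例市示例路2号" else None


print(locate("示例市示例路1号", library, stub_gis))
print(locate("示例市示例路2号", library, stub_gis))
```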
In a specific embodiment, for step S4, the standard address library is built from several different standards, including those of the National Bureau of Statistics, the homeowner office, the soil commission, Taobao, JD.com (Jingdong), and the like. Different standards cover different geographic levels, and when an address cannot be matched in the standard address library, a third-party GIS service is needed.
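Matching an address against a standard address library can be pictured as walking a tree of division names. The sketch below assumes a single library stored as nested dictionaries and uses longest-prefix matching at each level; the patent does not specify the matching technique, so `extract_divisions` and the sample library are hypothetical. An empty result is the signal that a fallback, such as the third-party GIS service, would be needed.

```python
def extract_divisions(address: str, library: dict) -> tuple[list[str], str]:
    """Split off the leading administrative divisions found in `library`.

    `library` is a nested dict: each key is a division name and each value
    holds the divisions directly beneath it. Returns the matched names in
    order and the unmatched remainder of the address.
    """
    found: list[str] = []
    node, rest = library, address
    while node:
        # Prefer the longest name so that a short name cannot shadow a longer one.
        match = max((name for name in node if rest.startswith(name)),
                    key=len, default=None)
        if match is None:
            break
        found.append(match)
        rest = rest[len(match):]
        node = node[match]
    return found, rest


LIBRARY = {"北京市": {"海淀区": {}, "朝阳区": {}}}

print(extract_divisions("北京市海淀区双清路30号", LIBRARY))
print(extract_divisions("双清路30号", LIBRARY))  # nothing matched: fall back
```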
In a specific embodiment, for step S5, the AI intelligent community algorithm comprises three parts: sample training, model construction, and test verification.
Preferably, sample training builds a rule model of community geographic place-name information by training on a sample library; at the same time it extracts information such as each address's superior province, city, county, and street and the longitude and latitude of the community boundary, and forms a community place-name address recognition rule library and a dynamic community association library from the relationships among this information.
Preferably, the community place-name address recognition rule library is formed as follows: community-related data are collected in various ways from different authorities to form a community corpus; the required training samples are extracted from the corpus; statistical results are obtained through statistics and analysis; and the rules summarized for community address information make up the rule library.
Preferably, the dynamic community association library collects the superior geographic-level information of a large number of communities, so that the community information obtained is more accurate and communities can be located precisely.
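One way to picture the dynamic community association library is as a mapping from a community name to its superior levels, which can then fill in elements that an address omits. This is a sketch under assumptions: the dictionary layout, the helper `complete_from_community`, and the placeholder names are all hypothetical, and real community names are not unique nationwide, so a production key would need more context than the name alone.

```python
def complete_from_community(elements: dict[str, str],
                            associations: dict[str, dict[str, str]]) -> dict[str, str]:
    """Fill missing superior levels using the community's known parents.

    Values already parsed from the address take precedence over the library.
    """
    parents = associations.get(elements.get("community", ""), {})
    completed = dict(parents)
    completed.update(elements)  # parsed values win over looked-up ones
    return completed


# Placeholder association data (not real place names).
ASSOCIATIONS = {
    "示例社区": {"province": "示例省", "city": "示例市",
                 "county": "示例区", "street": "示例街道"},
}

parsed = {"city": "示例市", "community": "示例社区", "building": "3号楼"}
print(complete_from_community(parsed, ASSOCIATIONS))
```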
Preferably, model construction designs an AI intelligent community model starting from real application requirements, based on the relation constraints, inheritance, and rule coding among address elements at all levels and combined with the real rules and actual situation of community data.
Preferably, test verification checks the effect of the AI intelligent community algorithm on a test community address data set; the accuracy of the algorithm can reach 90 percent.
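The patent does not define how the accuracy figure is computed. A common reading, assumed here, is the share of test addresses for which the predicted community equals the labelled one; `accuracy` and the toy predictor below are hypothetical and only show the arithmetic.

```python
from typing import Callable


def accuracy(predict: Callable[[str], str],
             test_set: list[tuple[str, str]]) -> float:
    """Fraction of (address, expected community) pairs predicted correctly."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(1 for address, expected in test_set
                  if predict(address) == expected)
    return correct / len(test_set)


# Toy predictor and data, for illustration only.
def toy_predict(address: str) -> str:
    return "示例社区" if "示例" in address else "未知"


samples = [("示例路1号", "示例社区"), ("其他路2号", "另一社区")]
print(accuracy(toy_predict, samples))  # 0.5
```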
To make the above technical solution easier to understand, it is described in detail below through a specific mode of use.
In use, the administrative division completion and standardization method for Chinese addresses proceeds as follows. First, users' original addresses are collected. The collected addresses are then standardized: when an address is judged to be disordered, not to conform to the standard address specification, to be unusable by the system, and to have a low matching rate, it is first restructured according to the standard address specification, and its address element names are then corrected. Next, the spatial expression of the current normalized address is obtained: a unique coordinate, namely a longitude and latitude, is matched in the standard address coordinate library, and the library is built up so that geographic coordinates are mapped to natural-language addresses; if no unique coordinate is matched, the longitude and latitude are obtained from a third-party GIS service. A standard address library is then built to obtain standard address elements at the province, city, and other levels; when an address cannot be matched in the standard address library, a third-party GIS service is used. Unlike districts and streets, communities are not stable: most community data are held by grassroots-level government agencies, are hard to collect, and are messy, and many community boundaries change over the years. Community address elements are therefore obtained through the AI intelligent community algorithm, whose effectiveness is verified on a test community address data set. Finally, the standardized address data are obtained.
In summary, the AI intelligent community algorithm used by the method can reach an accuracy of 90 percent. The method automatically recognizes and standardizes Chinese geographic addresses for the government, real-estate, and similar fields and improves the accuracy with which geographic address information is described. It helps governments identify resident locations and manage public opinion in smart-city and similar systems, helps e-commerce businesses carry out precision marketing such as regional grid sales analysis and collaborative recommendation, and helps individuals with location queries, hot-spot searches, and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (4)

1. The administrative division alignment and standardization method for the Chinese address is characterized by comprising the following steps:
S1: collecting real users' original address data over a network and storing the data in a base raw-data database;
S2: normalizing the original address data collected in step S1;
S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses, wherein, when the standard address coordinate library is built, a unique coordinate, namely a longitude and latitude, is matched in the standard address coordinate library and the spatial expression of the current normalized address is then obtained, and if no unique coordinate is matched, the longitude and latitude are obtained from a third-party GIS service;
S4: building a standard address library and obtaining standard provincial-level address elements through an address recognition and expression model;
S5: training on an existing sample set of real community address data with an AI intelligent community algorithm to build a model and obtain community address elements, wherein the AI intelligent community algorithm comprises three parts, namely sample training, model construction, and test verification; the sample training builds a rule model of community geographic place-name information by training on a sample library and at the same time forms a community place-name address recognition rule library and a dynamic community association library; the model construction designs an AI intelligent community model starting from real application requirements, based on the relation constraints, inheritance, and rule coding among address elements at all levels and combined with the real rules and actual situation of community data; and the test verification verifies the effect of the AI intelligent community algorithm on a test community address data set;
S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
2. The method for administrative division alignment and standardization of Chinese addresses according to claim 1, wherein in step S1 the real users' original address data include address sample data containing a basic base name, as well as address data from the government field, the real-estate field, and the like.
3. The method for administrative division alignment and standardization of Chinese addresses according to claim 1, wherein in step S2, when an address is judged to be non-standard, it is first restructured according to the standard address specification, and its address element names are then corrected.
4. The method for administrative division alignment and standardization of Chinese addresses according to claim 3, wherein the non-standard addresses include address data that the system cannot use because of misspelling, redundancy, incompleteness, ambiguity, mixed full-width and half-width characters, or inconsistent expression.
CN201911280553.6A 2019-12-13 2019-12-13 Administrative division alignment and standardization method for Chinese addresses Active CN111159973B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911280553.6A CN111159973B (en) 2019-12-13 2019-12-13 Administrative division alignment and standardization method for Chinese addresses

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911280553.6A CN111159973B (en) 2019-12-13 2019-12-13 Administrative division alignment and standardization method for Chinese addresses

Publications (2)

Publication Number Publication Date
CN111159973A CN111159973A (en) 2020-05-15
CN111159973B true CN111159973B (en) 2023-06-02

Family

ID=70557064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911280553.6A Active CN111159973B (en) 2019-12-13 2019-12-13 Administrative division alignment and standardization method for Chinese addresses

Country Status (1)

Country Link
CN (1) CN111159973B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753035A (en) * 2020-06-16 2020-10-09 福建票付通信息科技有限公司 Scenic spot name matching method based on administrative district division completion
CN112380858A (en) * 2020-11-12 2021-02-19 中国科学技术大学智慧城市研究院(芜湖) Address completion and correction method based on government affair big data
CN113642313B (en) * 2021-09-02 2024-03-29 阿里巴巴达摩院(杭州)科技有限公司 Address text processing method, device, equipment, storage medium and program product
CN115809315B (en) * 2022-11-24 2024-06-21 中科星图智慧科技安徽有限公司 Standardized matching algorithm for place name and address
CN117251554B (en) * 2023-11-16 2024-02-20 中科星图智慧科技安徽有限公司 Method for converting non-standard address into standard address

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882163A (en) * 2010-06-30 2010-11-10 中国科学院地理科学与资源研究所 Fuzzy Chinese address geographic evaluation method based on matching rule
CN102169498A (en) * 2011-04-14 2011-08-31 中国测绘科学研究院 Address model constructing method and address matching method and system
CN106682175A (en) * 2016-12-29 2017-05-17 华南师范大学 Method and system for matching address
CN106874384A (en) * 2017-01-10 2017-06-20 广东精规划信息科技股份有限公司 A kind of isomery address standard handovers and matching process
CN109815498A (en) * 2019-01-25 2019-05-28 深圳市小赢信息技术有限责任公司 A kind of Chinese address standardized method, device and electronic equipment

Also Published As

Publication number Publication date
CN111159973A (en) 2020-05-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 6th floor, Building 1, Zone 2, No. 81 Beiqing Road, Haidian District, Beijing, 100036

Patentee after: Zhongguancun Technology Software Co.,Ltd.

Address before: Building 2, Building C, Zhongguancun Software Park, Shangdi Information Industry Base, Haidian District, Beijing, 100193

Patentee before: Zhongguancun Technology Software Co.,Ltd.
