CN111159973B - Administrative division alignment and standardization method for Chinese addresses - Google Patents
Administrative division alignment and standardization method for Chinese addresses
- Publication number
- CN111159973B
- Authority
- CN
- China
- Prior art keywords
- address
- community
- data
- standard
- library
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an administrative division alignment and standardization method for Chinese addresses, comprising the following steps: S1: collecting original address data from real users over a network and storing it in a base raw-data database; S2: normalizing the original address data collected in step S1; S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses; S4: building a standard address library and obtaining standard address elements at the province, city, and other levels through an address identification and expression model; S5: training on a sample set of existing real community address data with an AI intelligent community algorithm, building a model, and obtaining community address elements; S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
Description
Technical Field
The invention relates to the technical field of address standardization and information processing, and in particular to an administrative division alignment and standardization method for Chinese addresses.
Background
In China, address models have not received enough attention in urban planning. Taking house-number management as an example, the address of Tsinghua University is "No. 30 Shuangqing Road, Haidian District, Beijing", which the address model can express either as "city | district | road | house number" or as "Tsinghua University, Shuangqing Road, Haidian District, Beijing"; both expressions identify the same geographic location. This makes the address model uncertain and diverse, and such a model can no longer meet the demands of urban planning and informatization. Owing to historical and cultural factors, addresses in China currently suffer from several further problems. First, street names take many forms and do not follow a uniform "X Street" or "X Road" naming convention; second, non-standard naming and writing habits make Chinese addresses, and the business data that contains them, complex.
Analysis of various data shows that the lack of address standardization in stored business address data has the following characteristics, which give rise to a variety of address models that are not strictly standardized. First, business address data is typically stored in a three-segment (three-field) structure, and three-segment address elements are not standardized enough for address matching or address element division. Second, the structure of address data elements is diverse: different regions divide address elements in different ways, so the division is varied and non-standard. Third, address data is written irregularly, producing forms that mean the same thing but look different: some address elements are written non-normatively, or altered for other reasons during storage, so the same address element is expressed in different ways. Fourth, some address elements are missing from business address data, leaving the address expression incomplete. Fifth, address elements directly use the smallest address element type, which makes address division uncertain. Sixth, administrative divisions serve as the main spatial constraint elements, but for lack of a unique, standard address expression, address descriptions may carry redundant information and become ambiguous. Seventh, people mostly choose place names with low stability to describe addresses, which may leave addresses incomplete and lose information. Eighth, new and old urban areas lack a unified standard, and the coding of villages within some urban areas is chaotic. Ninth, much community information is held by relatively grassroots-level government agencies, is not uniformly named and is hard to collect, and community boundaries tend to change over time.
The address model is the core of administrative division alignment and standardization for Chinese addresses, and also the core of geocoding. Building an address model presupposes a sound planning scheme; it should also take users' spatial cognition habits into account, rely mainly on guidance, and advance address standardization step by step. Given that non-standard addresses already exist in large numbers, effective administrative division alignment and standardization of Chinese addresses is the only solution.
Disclosure of Invention
Aiming at the technical problems in the related art, the invention provides a Chinese address administrative division alignment and standardization method which can overcome the defects in the prior art.
In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:
An administrative division alignment and standardization method for Chinese addresses comprises the following steps:
S1: collecting original address data from real users over a network and storing it in a base raw-data database;
S2: normalizing the original address data collected in step S1;
S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses;
S4: building a standard address library and obtaining standard address elements at the province, city, and other levels through an address identification and expression model;
S5: training on a sample set of existing real community address data with an AI intelligent community algorithm, building a model, and obtaining community address elements;
S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
Further, in step S1, the original address data from real users includes address sample data containing basic place names, as well as address data from fields such as government and real estate.
Further, in step S2, when an address is determined to be non-standard, it is first restructured according to the standard address specification, and then its address element names are corrected.
Non-standard addresses are address data that the system cannot use because of spelling errors, redundancy, incompleteness, ambiguity, mixed full-width and half-width characters, or inconsistent expression.
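Mixed full-width and half-width characters are one of the non-standard forms named here. As a minimal sketch, and not the patent's own implementation, Unicode NFKC normalization from Python's standard library folds full-width letters, digits, and spaces into their half-width forms. The function name `normalize_width` and the decision to drop all whitespace are assumptions made for illustration; NFKC also rewrites some other compatibility characters, which may or may not be wanted in a real system.

```python
import unicodedata


def normalize_width(address: str) -> str:
    """Fold full-width characters to half-width and drop whitespace.

    Assumption: whitespace carries no meaning in the address string.
    """
    # NFKC maps full-width digits/letters and the ideographic space
    # (U+3000) to their ordinary half-width counterparts.
    folded = unicodedata.normalize("NFKC", address)
    # str.split() with no argument splits on any run of whitespace.
    return "".join(folded.split())


if __name__ == "__main__":
    print(normalize_width("北京市海淀区　双清路３０号"))
```

Spelling errors, redundancy, and ambiguity need dictionary or model support and are not addressed by this step.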
Further, in step S3, when the standard address coordinate library is built, unique coordinates, namely longitude and latitude, are matched in the library to obtain the spatial expression of the current normalized address; if no unique coordinates are matched, the longitude and latitude are obtained from a third-party GIS service.
Further, in step S5, the AI intelligent community algorithm comprises three parts: sample training, model construction, and test verification.
Sample training builds a rule model of community geographic place-name information by training on a sample library, and forms a community place-name address identification rule library and a dynamic community association library.
Model construction designs the AI intelligent community model from practical application requirements, based on the relation constraints, inheritance, and rule coding among address elements at all levels, combined with the real-world rules and actual state of community data.
Test verification checks the effectiveness of the AI intelligent community algorithm against a test set of community address data.
The beneficial effects of the invention are as follows: the invention automatically identifies and standardizes Chinese geographic addresses oriented to fields such as government and real estate, and improves the accuracy with which geographic address information is described. It helps governments perform resident location identification and public opinion management in smart-city systems, helps e-commerce enterprises carry out precision marketing such as regional grid sales analysis and collaborative recommendation, and helps individuals perform location queries, hot-spot searches, and the like.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the administrative division alignment and standardization method for Chinese addresses according to an embodiment of the present invention;
FIG. 2 is a schematic diagram showing the effect of the administrative division alignment and standardization method for Chinese addresses according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
As shown in FIGS. 1-2, the administrative division alignment and standardization method for Chinese addresses according to an embodiment of the present invention includes the following steps:
S1: collecting original address data from real users over a network and storing it in a base raw-data database;
S2: normalizing the original address data collected in step S1;
S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses;
S4: building a standard address library and obtaining standard address elements at the province, city, and other levels through an address identification and expression model;
S5: training on a sample set of existing real community address data with an AI intelligent community algorithm, building a model, and obtaining community address elements;
S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
In a specific embodiment, for step S1, the original address data from real users includes address sample data containing basic place names, as well as address data from fields such as government and real estate.
In a specific embodiment, for step S2, when an address is determined to be disordered, non-compliant with the standard address specification, unusable by the system, and low in matching rate, it is first restructured according to the standard address specification, and then its address element names are corrected.
Preferably, common non-standard addresses include spelling errors, redundancy, incompleteness, ambiguity, mixed full-width and half-width characters, inconsistent expression, and the like.
Preferably, address elements basically follow the standard specification, and the spatial objects they denote are generally nested by size, for example: province > administrative division (city) > county > street > community > street or lane > residential compound > building > unit > room.
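The containment order above suggests a simple check for which upper levels an address still lacks. The following is a minimal sketch under stated assumptions: the English level keys in `LEVELS` and the function name `missing_levels` are invented for illustration, and the patent does not specify how gaps are detected.

```python
# Hypothetical English keys for the ten levels, coarsest first.
LEVELS = [
    "province", "city", "county", "street", "community",
    "street_lane", "compound", "building", "unit", "room",
]


def missing_levels(elements: dict) -> list:
    """Return the levels coarser than the finest stated level that are empty."""
    present = [i for i, level in enumerate(LEVELS) if elements.get(level)]
    if not present:
        return list(LEVELS)
    finest = max(present)
    # Every level above the finest stated one should be filled in.
    return [LEVELS[i] for i in range(finest) if not elements.get(LEVELS[i])]


if __name__ == "__main__":
    parsed = {"city": "北京市", "county": "海淀区", "street_lane": "双清路"}
    print(missing_levels(parsed))
```

Levels reported as missing are the candidates for the automatic completion in step S6.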
In a specific embodiment, for step S3, when the standard address coordinate library is built, unique coordinates, namely longitude and latitude, are matched in the library to obtain the spatial expression of the current normalized address; if no unique coordinates are matched, the longitude and latitude are obtained from a third-party GIS service.
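The lookup-then-fallback behaviour of step S3 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the library is modelled as an in-memory dictionary, `gis_lookup` stands in for an unspecified third-party GIS service, and caching the fallback result back into the library is an added design choice that the source does not state.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude)


def resolve_coordinate(
    address: str,
    coordinate_library: Dict[str, Coordinate],
    gis_lookup: Callable[[str], Optional[Coordinate]],
) -> Optional[Coordinate]:
    """Match an address in the local library, else ask the GIS service."""
    hit = coordinate_library.get(address)
    if hit is not None:
        return hit
    # Hypothetical external call; any callable with this shape will do.
    found = gis_lookup(address)
    if found is not None:
        # Assumption: remember the answer so the library grows over time.
        coordinate_library[address] = found
    return found


if __name__ == "__main__":
    library = {"known address": (1.0, 2.0)}  # placeholder coordinates
    print(resolve_coordinate("known address", library, lambda a: None))
    print(resolve_coordinate("new address", library, lambda a: (3.0, 4.0)))
```

Passing the GIS service as a callable keeps the sketch independent of any particular provider.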
In a specific embodiment, for step S4, the standard address library incorporates several different standards, such as those of the National Bureau of Statistics, the housing authority, the land authority, Taobao, and JD.com; different standards cover different geographic levels, and when an address cannot be matched in the standard address library, a third-party GIS service is used instead.
In a specific embodiment, for step S5, the AI intelligent community algorithm includes three parts, i.e., sample training, model construction, and test verification.
Preferably, sample training builds a rule model of community geographic place-name information by training on a sample library; at the same time, it extracts the higher-level province, city, county, and street of each address together with the longitude and latitude of the community boundary, and forms the community place-name address identification rule library and the dynamic community association library from the associations among this information.
Preferably, the community place-name address identification rule library is built by collecting community-related data from various authorities in various ways to form a community corpus, extracting the required training samples from the corpus, obtaining statistical results through statistics and analysis, and summarizing rules specific to community address information.
Preferably, the dynamic community association library collects the higher-level geographic information of a large number of communities, so that the community information obtained is more accurate and communities can be located precisely.
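One way to picture the dynamic community association library is as a set of records linking each community to its higher-level divisions, which can then fill in levels an address omits. The sketch below is an assumption-laden illustration: `CommunityRecord`, `complete_from_community`, and the sample place names are all invented, and the patent does not describe its data structures at this level of detail. Because community names can repeat across cities, the sketch completes an address only when exactly one record is consistent with what the address already states.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CommunityRecord:
    province: str
    city: str
    county: str
    street: str
    community: str


def complete_from_community(
    partial: dict, library: List[CommunityRecord]
) -> Optional[dict]:
    """Fill missing upper levels when exactly one library record fits."""
    candidates = []
    for record in library:
        fields = asdict(record)
        if fields["community"] != partial.get("community"):
            continue
        # Reject records that contradict a level the address already states.
        if any(partial.get(k) not in (None, "", v) for k, v in fields.items()):
            continue
        candidates.append(fields)
    if len(candidates) != 1:
        return None  # unknown or ambiguous community
    stated = {k: v for k, v in partial.items() if v}
    return {**candidates[0], **stated}


if __name__ == "__main__":
    # Made-up sample records, not real administrative data.
    library = [
        CommunityRecord("示例省", "甲市", "东区", "一号街道", "和平社区"),
        CommunityRecord("示例省", "乙市", "西区", "二号街道", "和平社区"),
    ]
    print(complete_from_community({"city": "乙市", "community": "和平社区"}, library))
```

Returning `None` for ambiguous matches leaves room for a later step, such as coordinate-based disambiguation, to decide.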
Preferably, model construction designs the AI intelligent community model from practical application requirements, based on the relation constraints, inheritance, and rule coding among address elements at all levels, combined with the real-world rules and actual state of community data.
Preferably, test verification checks the effectiveness of the AI intelligent community algorithm against a test set of community address data; the accuracy of the algorithm can reach 90 percent.
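The source does not say how accuracy is measured. A common reading, assumed here, is the share of test addresses for which the predicted community equals the labelled one. The sketch below uses hypothetical names (`accuracy`, `predict`, `labelled_samples`) and does not reproduce the 90 percent figure, which is the patent's own claim.

```python
from typing import Callable, Sequence, Tuple


def accuracy(
    predict: Callable[[str], str],
    labelled_samples: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of (raw_address, expected_community) pairs predicted exactly."""
    if not labelled_samples:
        raise ValueError("test set must not be empty")
    correct = sum(
        1 for raw, expected in labelled_samples if predict(raw) == expected
    )
    return correct / len(labelled_samples)


if __name__ == "__main__":
    # Toy predictor and labels, for illustration only.
    samples = [("a", "A"), ("b", "B"), ("c", "X")]
    print(accuracy(str.upper, samples))
```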
In order to facilitate understanding of the above technical solutions of the present invention, the following describes the above technical solutions of the present invention in detail by a specific usage manner.
In use, the administrative division alignment and standardization method for Chinese addresses proceeds as follows. First, users' original addresses are collected. The collected addresses are then standardized: when an address is determined to be disordered, non-compliant with the standard address specification, unusable by the system, and low in matching rate, it is first restructured according to the standard address specification, and then its address element names are corrected. Next, the spatial expression of the current normalized address is obtained by matching its unique coordinates, namely longitude and latitude, in the standard address coordinate library; building this library establishes the mapping between geographic coordinates and natural-language addresses, and if no unique coordinates are matched, the longitude and latitude are obtained from a third-party GIS service. A standard address library is then built to obtain standard address elements at the province, city, and other levels; when an address cannot be matched in the standard address library, a third-party GIS service is used. Communities, unlike districts and streets, are not fixed: most community data is held by relatively grassroots-level government agencies, is hard to collect and rather messy, and the boundaries of many communities change over the years. Community address elements are therefore obtained through the AI intelligent community algorithm, whose effectiveness is verified on a test set of community address data. Finally, the final standardized address data is obtained.
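The flow described above can be sketched as a small pipeline in which each step is supplied as a function. This is only an illustration of how steps S2 to S6 might be chained: the function `standardize`, its parameter names, the `ORDER` keys, and the rule of joining elements from coarsest to finest are all assumptions, not details given in the source.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude)

# Hypothetical element keys, coarsest first; "detail" is everything finer.
ORDER = ["province", "city", "county", "street", "community", "detail"]


def standardize(
    raw: str,
    normalize: Callable[[str], str],
    find_coordinate: Callable[[str], Optional[Coordinate]],
    find_elements: Callable[[str], Dict[str, str]],
    find_community: Callable[[str, Optional[Coordinate]], Optional[str]],
) -> dict:
    """Chain the steps: normalize, locate, split into elements, complete."""
    text = normalize(raw)                          # S2
    coordinate = find_coordinate(text)             # S3
    elements = dict(find_elements(text))           # S4
    community = find_community(text, coordinate)   # S5
    if community is not None:
        elements["community"] = community
    # S6: assemble the standardized address from coarsest to finest level.
    standardized = "".join(elements[k] for k in ORDER if elements.get(k))
    return {
        "standardized": standardized,
        "coordinate": coordinate,
        "elements": elements,
    }


if __name__ == "__main__":
    result = standardize(
        " 海淀区双清路30号 ",
        normalize=str.strip,
        find_coordinate=lambda text: None,  # stub: no coordinate available
        find_elements=lambda text: {
            "city": "北京市", "county": "海淀区", "detail": "双清路30号",
        },
        find_community=lambda text, coordinate: None,
    )
    print(result["standardized"])
```

Injecting the steps as callables lets each library or model be replaced or tested on its own.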
In summary, the AI intelligent community algorithm used by the method can reach an accuracy of 90 percent. The method automatically identifies and standardizes Chinese geographic addresses oriented to fields such as government and real estate, improves the accuracy with which geographic address information is described, helps governments perform resident location identification and public opinion management in smart-city and similar systems, helps the e-commerce industry carry out precision marketing such as regional grid sales analysis and collaborative recommendation, and helps individuals perform location queries, hot-spot searches, and the like.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (4)
1. An administrative division alignment and standardization method for Chinese addresses, characterized by comprising the following steps:
S1: collecting original address data from real users over a network and storing it in a base raw-data database;
S2: normalizing the original address data collected in step S1;
S3: building a standard address coordinate library from the normalized address data, thereby establishing a mapping between geographic coordinates and natural-language addresses, wherein, when the standard address coordinate library is built, unique coordinates, namely longitude and latitude, are matched in the library to obtain the spatial expression of the current normalized address, and if no unique coordinates are matched, the longitude and latitude are obtained from a third-party GIS service;
S4: building a standard address library and obtaining standard address elements at the province, city, and other levels through an address identification and expression model;
S5: training on a sample set of existing real community address data with an AI intelligent community algorithm to build a model and obtain community address elements, wherein the AI intelligent community algorithm comprises three parts, namely sample training, model construction, and test verification; the sample training builds a rule model of community geographic place-name information by training on a sample library and forms a community place-name address identification rule library and a dynamic community association library; the model construction designs the AI intelligent community model from practical application requirements, based on the relation constraints, inheritance, and rule coding among address elements at all levels, combined with the real-world rules and actual state of community data; and the test verification checks the effectiveness of the AI intelligent community algorithm against a test set of community address data;
S6: standardizing the obtained address geographic information and automatically completing missing information to obtain the final standardized address.
2. The method according to claim 1, wherein in step S1 the original address data from real users includes address sample data containing basic place names, as well as address data from fields such as government and real estate.
3. The administrative division alignment and standardization method for Chinese addresses according to claim 1, wherein in step S2, when an address is determined to be non-standard, it is first restructured according to the standard address specification, and then its address element names are corrected.
4. The method according to claim 3, wherein non-standard addresses are address data that the system cannot use because of spelling errors, redundancy, incompleteness, ambiguity, mixed full-width and half-width characters, or inconsistent expression.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911280553.6A CN111159973B (en) | 2019-12-13 | 2019-12-13 | Administrative division alignment and standardization method for Chinese addresses |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911280553.6A CN111159973B (en) | 2019-12-13 | 2019-12-13 | Administrative division alignment and standardization method for Chinese addresses |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111159973A (en) | 2020-05-15 |
CN111159973B (en) | 2023-06-02 |
Family
ID=70557064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911280553.6A Active CN111159973B (en) | 2019-12-13 | 2019-12-13 | Administrative division alignment and standardization method for Chinese addresses |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111159973B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753035A (en) * | 2020-06-16 | 2020-10-09 | 福建票付通信息科技有限公司 | Scenic spot name matching method based on administrative district division completion |
CN112380858A (en) * | 2020-11-12 | 2021-02-19 | 中国科学技术大学智慧城市研究院(芜湖) | Address completion and correction method based on government affair big data |
CN113642313B (en) * | 2021-09-02 | 2024-03-29 | 阿里巴巴达摩院(杭州)科技有限公司 | Address text processing method, device, equipment, storage medium and program product |
CN115809315B (en) * | 2022-11-24 | 2024-06-21 | 中科星图智慧科技安徽有限公司 | Standardized matching algorithm for place name and address |
CN117251554B (en) * | 2023-11-16 | 2024-02-20 | 中科星图智慧科技安徽有限公司 | Method for converting non-standard address into standard address |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101882163A (en) * | 2010-06-30 | 2010-11-10 | 中国科学院地理科学与资源研究所 | Fuzzy Chinese address geographic evaluation method based on matching rule |
CN102169498A (en) * | 2011-04-14 | 2011-08-31 | 中国测绘科学研究院 | Address model constructing method and address matching method and system |
CN106682175A (en) * | 2016-12-29 | 2017-05-17 | 华南师范大学 | Method and system for matching address |
CN106874384A (en) * | 2017-01-10 | 2017-06-20 | 广东精规划信息科技股份有限公司 | A kind of isomery address standard handovers and matching process |
CN109815498A (en) * | 2019-01-25 | 2019-05-28 | 深圳市小赢信息技术有限责任公司 | A kind of Chinese address standardized method, device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111159973A (en) | 2020-05-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111159973B (en) | Administrative division alignment and standardization method for Chinese addresses | |
Lozano et al. | A complex network analysis of global tourism flows | |
WO2016150407A1 (en) | Address resolution data-based construction land type rapid identification method | |
CN101350012B (en) | Method and system for matching address | |
US11966424B2 (en) | Method and apparatus for dividing region, storage medium, and electronic device | |
US9270712B2 (en) | Managing moderation of user-contributed edits | |
CN103514235B (en) | A kind of method for building up of incremental code library and device | |
Yin et al. | A deep learning approach for rooftop geocoding | |
Zandbergen et al. | Positional accuracy of TIGER 2000 and 2009 road networks | |
CN108268445A (en) | A kind of method and device for handling address information | |
Kitamoto et al. | Toponym-based geotagging for observing precipitation from social and scientific data streams | |
CN108345662A (en) | A kind of microblog data weighted statistical method of registering considering user distribution area differentiation | |
Cetl et al. | A comparison of address geocoding techniques–case study of the city of Zagreb, Croatia | |
Pan et al. | Impact of Check‐In Data on Urban Vitality in the Macao Peninsula | |
Chen | Delineating the spatial boundaries of megaregions in China: A city network perspective | |
Dumedah | Address points of landmarks and paratransit services as a credible reference database for geocoding | |
Roongpiboonsopit et al. | Quality assessment of online street and rooftop geocoding services | |
Murray et al. | Spatial optimization and geographic uncertainty: implications for sex offender management strategies | |
Dumedah et al. | The case of electoral polling station data for geocoding in facilitating accessibility to social, economic and cultural opportunities in Ghana | |
Sarretta et al. | Towards the integration of authoritative and OpenStreetMap geospatial datasets in support of the European strategy for data | |
CN101567150A (en) | Method for accurately positioning digital map | |
Iannacchione et al. | Comparing the coverage of a household sampling frame based on mailing addresses to a frame based on field enumeration | |
Hung et al. | Assessing the quality of building footprints on OpenStreetMap: a case study in Taiwan | |
CN114153897A (en) | Electricity-consumption structured address cascade search method, system, device and storage medium | |
CN107135281B (en) | IP region feature extraction method based on multi-data source fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CP02 | Change in the address of a patent holder | Address after: 6th floor, Building 1, Zone 2, No. 81 Beiqing Road, Haidian District, Beijing, 100036. Patentee after: Zhongguancun Technology Software Co.,Ltd. Address before: Building 2, Building C, Zhongguancun Software Park, Shangdi Information Industry Base, Haidian District, Beijing, 100193. Patentee before: Zhongguancun Technology Software Co.,Ltd. |