
CN116414824A - Administrative division information identification and standardization processing method, device and storage medium

Info

Publication number: CN116414824A
Application number: CN202111658596.0A
Authority: CN
Inventors: 高志; 周训飞; 王小龙
Original assignee: Fengtu Technology Shenzhen Co Ltd
Current assignee: Fengtu Technology Shenzhen Co Ltd
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2023-07-11

Abstract

The application provides a method, a device and a storage medium for identifying and standardizing administrative division information. The method comprises the following steps: acquiring a verified original address text and original address code; traversing the characters in the original address text and preprocessing each character to obtain a first address text; segmenting the first address text and extracting a plurality of administrative division fields; performing named entity recognition on the administrative division fields to obtain a second address text; expanding the original address code to form an administrative division code sequence that covers and corresponds to the second address text; and replacing the administrative division fields identified by named entity recognition in the second address text with standard administrative division fields according to the administrative division code sequence. The method and device effectively avoid missed or false recognition after named-entity text matching and improve the extraction and standardization of administrative divisions.

Description

Administrative division information identification and standardization processing method, device and storage medium

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a method, an apparatus, and a storage medium for identifying and standardizing administrative division information.

Background

With the rapid development of modern communication technology (computer networks, satellite technology and optical cables) and of modern transportation, the space of human activity is expanding rapidly and social interaction is increasingly frequent. Place names are among the most frequently and widely used tools in social interaction. On the one hand, their social value and status keep rising: they are used over a wider range, more often and through more channels. On the other hand, the requirements for unified place names and consistent writing are becoming stricter. Problems nevertheless persist, including non-uniform place names, non-uniform writing and transliteration, one place with multiple names (renaming), and one name with multiple written forms. In geographic information navigation and delivery services, user habits, writing errors and similar causes leave some input addresses with missing, wrong or redundant information. An algorithm then struggles to identify and locate the address accurately and cannot return an accurate unit area; an inaccurate or wrong unit area both delays the express shipment and increases transportation or forwarding costs.

Non-standard addresses can be standardized by related systems and manual methods, but the effect is poor. In addition, because existing systems are incomplete, existing methods often introduce reverse errors, that is, some correct addresses are modified into wrong ones during rewriting.

Disclosure of Invention

In view of the foregoing, there is a need for a method for extracting and standardizing administrative division addresses, which includes:

acquiring the verified original address text and the original address code;

traversing characters in the original address text, and preprocessing each character to obtain a first address text;

segmenting the first address text, and extracting a plurality of administrative division fields; performing named entity recognition on the plurality of administrative division fields to obtain a second address text;

expanding the original address code to form an administrative division code sequence that covers and corresponds to the second address text;

replacing the second address text with a standard administrative division according to the administrative division code sequence and the standard administrative division metadata; wherein each item of the standard administrative division metadata is defined by at least one administrative division code sequence.

Further, performing named entity recognition on the administrative division fields to obtain the second address text includes:

performing text matching according to the extracted administrative division fields and pre-constructed place name entity metadata;

deleting repeated administrative division fields in the first address text after text matching, and matching administrative division fields in the first address text that could not be text-matched with at least one similar administrative division place name, to obtain at least one second address text.

Further, expanding the original address code to form an administrative division code sequence that covers and corresponds to the second address text includes:

parsing the administrative division fields identified by named entity recognition in the second address text into corresponding codes, and supplementing these codes in administrative division level order on the basis of the original address code to form the administrative division code sequence.

Further, the administrative division field after the named entity recognition in the second address text at least includes: one or more of a standard administrative division field, a homonymous administrative division field, an aliased administrative division field, and an abbreviated administrative division field.

Further, replacing the second address text with the standard administrative division according to the administrative division code sequence and the standard administrative division metadata includes:

loading the administrative division code sequence as the key and the standard administrative division as the value into a hash tree, and replacing the second address text with the standard administrative division held in the corresponding value.
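The code expansion and key-value replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the original address code is a six-digit national administrative division code whose first two, four and six digits identify the province, city and county levels, it uses a plain Python dict in place of the hash tree, and the names `expand_division_code`, `STANDARD_DIVISIONS` and `standardize` are hypothetical. The method in the text also derives codes from the recognized fields, which this sketch omits.

```python
def expand_division_code(code: str) -> list:
    """Expand a 6-digit division code into an ordered province/city/county sequence."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a 6-digit numeric code")
    candidates = [code[:2] + "0000", code[:4] + "00", code]
    sequence = []
    for item in candidates:
        # Skip repeats so a province-level or city-level input stays short.
        if item not in sequence:
            sequence.append(item)
    return sequence


# Hypothetical metadata: code sequence (key) -> standard administrative divisions (value).
STANDARD_DIVISIONS = {
    ("510000", "510700", "510704"): ("四川省", "绵阳市", "游仙区"),
}


def standardize(code: str):
    """Look up the standard divisions for the expanded code sequence, or None."""
    return STANDARD_DIVISIONS.get(tuple(expand_division_code(code)))
```

Keying the lookup on the whole sequence rather than on a place name is what lets homonymous or abbreviated names resolve to one standard division.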

Further, obtaining the verified original address text and original address code includes:

judging whether the input original address text is blank, garbled, made up of special symbols, or an irregular sequence of letters, and if so, failing the verification;

and judging whether the input original address code is blank or does not conform to the rules of the standard address code, and if so, failing the verification.

The preprocessing of each character comprises at least one of the following steps:

performing redundancy deletion on punctuation characters, English characters, Arabic numerals and format characters in the original address text, and performing full-width to half-width and lower-case to upper-case conversion;

performing traditional-to-simplified conversion on Chinese characters in the original address text;

reducing repeated content in, expanding the meaning of, and deleting non-address text from continuous text sequences in the original address text.

Further, performing redundancy deletion on punctuation characters, English characters, Arabic numerals and format characters in the original address text includes:

deleting punctuation characters in the original address text other than those immediately preceded and followed by Arabic numerals or English characters; and deleting English characters, Arabic numerals and format characters at the beginning or the end of the original address text.

Further, converting traditional Chinese characters in the original address text into simplified Chinese characters comprises the following steps:

loading the traditional Chinese characters as keys and the simplified Chinese characters as values into a hash table, traversing the original address text, and replacing each traditional Chinese character with the corresponding simplified Chinese character in the value.

A second aspect of the present application provides an apparatus for extracting and standardizing an administrative division address, including:

the verification module is used for acquiring an original address text and an original address code and verifying the original address text and the original address code;

the preprocessing module is used for traversing characters in the original address text, and performing program conversion, expansion or deletion on each character to obtain a first address text;

the first matching module is used for segmenting the first address text and extracting a plurality of administrative division fields; carrying out named entity recognition on a plurality of administrative division fields to obtain a second address text;

the code generation module is used for expanding the original address code to form an administrative division code sequence that covers and corresponds to the second address text;

and the second matching module is used for replacing the administrative division fields identified by the named entities in the second address text with standard administrative division fields according to the administrative division coding sequence.

A third aspect of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method as described above.

According to the above technical scheme, the administrative division extraction and standardization method provided by the application not only preprocesses the address text and performs named-entity text matching, but also converts the preliminarily matched address text into an address code sequence. Address name standardization steps such as filtering homonymous addresses, inferring address aliases, recognizing and correcting wrongly written characters, and inferring special (abbreviated) place names are completed by replacing the various address code sequences with their corresponding standard addresses. Missed or false recognition after named-entity text matching is thereby effectively avoided, and the extraction and standardization of administrative divisions is effectively improved.

Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a method for identifying and normalizing administrative division information according to one embodiment of the present application;

FIG. 2 is a flowchart of step S2 of the method for identifying and standardizing administrative division information according to an embodiment of the present application;

FIG. 3 is a flowchart of step S5 of the method for identifying and standardizing administrative division information according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. In addition, technical features described below in the various embodiments of the present application may be combined with each other as long as they do not conflict with each other.

The method for extracting and standardizing the administrative division place name in the embodiment may be configured in an administrative division extracting and standardizing device, and the device may be disposed in a server or a micro-service architecture, or may be disposed in an electronic device, which is not limited in this application.

In the prior art, administrative divisions are generally matched and identified by exact keyword matching or regular-expression matching, so missed or false recognition occurs easily and the recognition effect is poor.

To solve the above technical problems, an administrative division extraction and standardization method is provided, taking its configuration in a micro-service architecture as an example. The method not only preprocesses the address text and performs named-entity text matching, but also converts the preliminarily matched address text into an address code sequence. Address name standardization steps such as filtering homonymous addresses, inferring address aliases, recognizing and correcting wrongly written characters, and inferring special (abbreviated) place names are completed by replacing the various address code sequences with their corresponding standard addresses. Missed or false recognition after named-entity text matching is thereby effectively avoided, and the recognition and standardization of administrative divisions is effectively improved.

As shown in fig. 1, which is a flow chart of a method for extracting and normalizing an administrative division address according to a first embodiment of the present application, the method for extracting and normalizing an administrative division address is configured in a micro-service architecture, and the method includes the following steps:

S1: acquiring the verified original address text and original address code. The micro-service provides an interface to obtain the original address to be processed, and verifies the address text information and address code information in the original address.

Specifically, in step S1, the micro-service may provide an interface to a web or mobile front-end page; the user inputs an original address through a PC or mobile terminal, and the micro-service calls the interface to obtain the original address information. After the original address is obtained, the micro-service performs a preliminary verification on it. The original address entered by the user should include an original address text and an original address code, where the original address text contains address text information and the original address code contains address code information that is valid and conforms to the standard address code rules. More specifically, the address text information is text describing a specific geographic location, for example, XXX Residential Community, XXX Street, in a district of Mianyang City, Sichuan Province. The address code information is a code, composed of Arabic numerals, that digitizes the address network; more specifically, it is a national administrative division code or a postal code designated and used in the People's Republic of China. For example, the national administrative division code of Youxian District, Mianyang City, Sichuan Province is 510704, and the postal code of Mianyang City, Sichuan Province is 621000.

Specifically, in step S1, after the original address to be processed is acquired, it is necessary to verify the address text information and the address coding information in the original address, respectively:

The address text information is verified mainly as follows: it is judged whether the input original address text is blank, garbled, made up of special symbols, or an irregular sequence of letters; if so, verification fails. Special symbols here means characters other than punctuation marks, English letters and Arabic numerals. For example, if the input address text is only "AAAAA" or "% > and #", verification determines that it cannot be an address text, and the subsequent steps are not entered.

The address code information is verified mainly as follows: it is judged whether the input original address code (national administrative division code or postal code) is blank or does not conform to the rules of the standard address code; if so, verification fails. Here, the standard address code includes, but is not limited to, a national administrative division code or a national postal code. More specifically, the original address code must consist entirely of Arabic numerals; it may not be blank, nor may it mix numerals with other characters. For example, if the entered postal code is "4300ab" or "4,300,00", verification determines that it cannot be address code information, and the subsequent steps are not entered.

More specifically, if verification fails, the micro-service architecture returns an error or "unrecognizable" message to the interface, so that the original address can be re-entered and verified again.
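The two verification checks can be sketched as below. This is a minimal sketch under stated assumptions rather than the embodiment's actual rules: it assumes a valid address text must contain at least one Chinese character (a simple stand-in for the "garbled, special symbols or irregular letters" checks), and that both the national administrative division code and the postal code are exactly six ASCII digits. The function names are hypothetical.

```python
import re

# Assumption: a plausible address text contains at least one CJK ideograph.
_CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
# Assumption: division codes and postal codes are exactly six ASCII digits.
_SIX_DIGITS = re.compile(r"[0-9]{6}")


def verify_address_text(text) -> bool:
    """Return False for blank text or text with no Chinese characters at all."""
    if text is None or not text.strip():
        return False
    return _CJK_CHAR.search(text) is not None


def verify_address_code(code) -> bool:
    """Return True only if the code is made of exactly six Arabic numerals."""
    return _SIX_DIGITS.fullmatch(code or "") is not None
```

A caller would reject the request and ask for re-entry whenever either function returns False.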

S2: traversing the characters in the original address text and preprocessing each character to obtain a first address text. The original address text is preprocessed after it passes verification. FIG. 2 is a flowchart of preprocessing the original address text: after character processing, traditional-to-simplified conversion and text replacement, the preprocessed address is obtained. Specifically, the preprocessing comprises at least one of the following steps S201 to S203:

S201: performing redundancy deletion on punctuation characters, English characters, Arabic numerals and format characters in the original address text, and performing full-width to half-width and lower-case to upper-case conversion;

S202: performing traditional-to-simplified conversion on Chinese characters in the original address text;

S203: reducing repeated content in, expanding the meaning of, and deleting non-address text from continuous text sequences in the original address text.

Specifically, in step S201, redundancy deletion includes deleting redundant format characters such as spaces and tabs, and deleting redundant characters left over from formats such as HTML/XML files transmitted by different platforms. Full-width to half-width conversion means converting full-width punctuation marks, English letters and Arabic numerals to half-width. Lower-case to upper-case conversion means converting lower-case English letters to upper case.

According to a specific embodiment, step S201 mainly performs character processing, specifically the parsing of punctuation marks, upper-case and lower-case letters, Arabic numerals, full-width symbols, spaces, tabs, and HTML/XML entity symbols. To obtain standardized address text, the application defines a unified symbol conversion rule that retains all upper-case letters and Arabic numerals in the original address and processes the other characters. The specific processing logic comprises the following steps:

Retaining all Chinese characters, upper-case English letters and Arabic numerals in the original address text: the address text information of the original address is split into character strings and/or single characters. More specifically, grammatically connected Chinese text in the address text is split into character strings, and punctuation marks, upper-case and lower-case English letters, Arabic numerals and format symbols are split into single characters. The Chinese text strings, upper-case English letters and Arabic numerals are then retained, and the other character strings and/or single characters go on to the remaining processing stages.

Deleting punctuation marks other than those adjacent to Arabic numerals or English letters on both sides, and deleting English letters, Arabic numerals and format symbols at the beginning or end of the address text information: among the other split character strings and/or single characters, punctuation characters that are not immediately preceded and followed by Arabic numerals or English letters are deleted, because this step treats such punctuation as meaningless. In addition, consecutive Arabic numerals or English letters at the start of the input address are deleted, as are format symbols at the beginning and end; the format symbols commonly encountered are spaces and tabs. For example, a Hubei Province address entered with a leading tab (\t) and trailing "()))" is preprocessed to the address text alone.

Converting full-width English letters, Arabic numerals and punctuation marks in the original address text to half-width: full-width characters among the split English letters, Arabic numerals and punctuation characters are converted to half-width. This is because English letters, Arabic numerals and punctuation marks are normally half-width, and a half-width character is encoded in one byte. The micro-service architecture can treat these three kinds of characters as basic codes, but cannot recognize them as such when they are full-width, so they must be converted to half-width.

Converting lower-case English letters in the original address text to upper case: lower-case English characters are converted to upper case. This is because the standardized address text later serves matching and search against the database/metadata maintained in the micro-service architecture in different application scenarios, and the English address metadata in that database/metadata is uniformly upper case.
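The character processing of step S201 can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not fix: all whitespace is treated as a redundant format character, Chinese characters are taken to be the basic CJK range U+4E00 to U+9FFF, and everything before the first Chinese character is treated as a leading run to delete. The names `to_half_width` and `clean_characters` are hypothetical.

```python
import re

CJK = r"\u4e00-\u9fff"  # assumption: basic CJK ideograph range only


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0x3000:                # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= cp <= 0xFF5E:    # full-width forms of '!' .. '~'
            out.append(chr(cp - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_characters(text: str) -> str:
    """Apply half-width and upper-case conversion, then redundancy deletion."""
    text = to_half_width(text).upper()
    text = re.sub(r"\s+", "", text)     # drop spaces, tabs and other whitespace
    kept = []
    for i, ch in enumerate(text):
        if re.fullmatch(f"[{CJK}A-Z0-9]", ch):
            kept.append(ch)             # Chinese, upper-case letters, digits
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        # Keep other characters only between two letters/digits, e.g. "12-3".
        if re.fullmatch("[A-Z0-9]", before) and re.fullmatch("[A-Z0-9]", after):
            kept.append(ch)
    # Delete any leading run that precedes the first Chinese character.
    return re.sub(f"^[^{CJK}]+", "", "".join(kept))
```

For instance, `clean_characters("\t湖北省武汉市洪山区()))")` strips the tab and the trailing brackets, while a hyphen between two digits survives.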

Specifically, in step S202, the chinese traditional characters in the original address text information are converted into simplified characters, so as to be matched and identified with the standard address library in the micro-service architecture in the later stage. After the step S202, the original address text does not contain traditional Chinese characters any more, and the original traditional Chinese characters are converted into corresponding simplified Chinese characters in the Xinhua dictionary.

According to one specific embodiment, the traditional-to-simplified conversion uses a hash lookup, a typical algorithm that trades space for time. The full set of traditional-to-simplified Chinese character mappings can be recorded in advance, as a hash table, in the database of the micro-service architecture, with each traditional character as the key and the corresponding simplified character as the value. The original address text is then traversed and each character is looked up as a key in turn: if a value is found, the character is replaced; if not, the traversal moves on to the next character. Because the micro-service accesses the data structure directly at its in-memory storage location by key, the lookup is fast.
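The hash lookup can be sketched with an ordinary Python dict, which is itself a hash table. The mapping below is a deliberately tiny, illustrative subset; the embodiment would load the complete mapping set, and the names `TRAD_TO_SIMP` and `to_simplified` are hypothetical.

```python
# Illustrative subset only: traditional character (key) -> simplified character (value).
TRAD_TO_SIMP = {
    "區": "区",
    "縣": "县",
    "鎮": "镇",
    "廣": "广",
    "東": "东",
    "灣": "湾",
}


def to_simplified(text: str) -> str:
    """Replace each character found as a key; leave every other character as is."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
```

Each lookup is constant time on average, so the conversion is linear in the length of the address text, at the cost of holding the whole mapping in memory.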

Specifically, in step S203, continuous text sequences containing punctuation marks in the original address text information are reduced, expanded or deleted. Text replacement in the preprocessing stage mainly targets continuous text sequences containing punctuation marks and lays the necessary foundation for making the user's address text standard and complete. Its logic is derived from summarizing and analyzing massive volumes of user-entered original addresses, and its data sources include manually written rules and changes to the national administrative divisions. During operation, text replacement has three processing types, reduction, expansion and deletion, with the following processing logic:

Reduction: content that the user entered repeatedly is deleted. For example, a repeated "XX cell XX cell" would be reduced to "XX cell" by preprocessing.

Expansion: content is expanded according to its meaning where the user entered non-normative address text or ambiguous symbols. For example, the "-" may be extended to "11".

Deletion: text entered by the user that does not refer to address information is deleted. For example, "delivery gate" would be deleted, leaving "". In practice, the expansion described above can be regarded as two steps, a deletion followed by an addition.
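The three processing types can be sketched as below. The rule entries are purely illustrative assumptions, since the text does not list the actual rule tables: the sketch expands "#" to "号" and deletes the non-address phrase "送货上门", and it detects repetition with a regular-expression backreference on immediately repeated runs of two or more characters. The names `EXPAND_RULES`, `DELETE_PHRASES` and `replace_text` are hypothetical.

```python
import re

# Hypothetical rule tables; a real deployment would derive them from data analysis.
EXPAND_RULES = {"#": "号"}
DELETE_PHRASES = ["送货上门"]

# A run of two or more characters that is immediately repeated one or more times.
_REPEATED_RUN = re.compile(r"(.{2,}?)\1+")


def replace_text(text: str) -> str:
    """Apply reduction, then expansion, then deletion to an address text."""
    text = _REPEATED_RUN.sub(r"\1", text)    # reduction: keep one copy
    for short, full in EXPAND_RULES.items():
        text = text.replace(short, full)     # expansion: delete, then add
    for phrase in DELETE_PHRASES:
        text = text.replace(phrase, "")      # deletion: drop non-address text
    return text
```

Note that a repetition check this simple would also collapse legitimately repeated names, which is one reason the later matching step handles repeated division fields with more care.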

It should be noted that, in this embodiment, steps S201, S202 and S203 of step S2 are parallel preprocessing stages in the micro-service architecture. During operation they may therefore be performed in any order, or in an order chosen according to the specific original address text data, preferably step S201 → step S202 → step S203. After the preprocessing of step S2, the retained and replaced characters are recombined to form the first address text.

S3: segmenting the first address text and extracting a plurality of administrative division fields, then performing named entity recognition on the administrative division fields to obtain a second address text. The preprocessed address is segmented by administrative division into a plurality of administrative division address fields; each address field is then text-matched against the place name entity metadata, repeated administrative division fields are deleted, and the fields undergo preliminary level matching, without yet verifying the parent-child logic of the administrative division chain in detail, to obtain the second address text. This step specifically comprises the following two steps:

S301: segmenting the first address text, extracting a plurality of administrative division fields, and text-matching the extracted administrative division fields of every level against the place name entity metadata;

S302: deleting repeated administrative division fields from the text-matched first address text, and matching the administrative division fields that could not be matched with similar administrative division names to obtain a second address text, where one first address text may correspond to several different second address texts.

Specifically, in step S301, the micro-service semantically segments the preprocessed address text, extracts the administrative division fields, and text-matches them against the pre-constructed place name entity metadata, preliminarily normalizing address names that are not yet normalized. Fast text matching of the address text against the place name entity metadata implements named entity recognition of administrative division place names.

More specifically, because administrative division follows standard specifications, spatial objects generally form multiple levels with containment relationships. The levels are: province (municipality directly under the central government, special administrative region) > city > county (district) > township (subdistrict, town) > village (community), and so on. The address names of each level of administrative division are obtained accordingly, and the address field carrying the name of each level, that is, the corresponding administrative division field, is extracted. Place name entity metadata is established in advance from the address fields of each level and is then matched against the extracted administrative division fields to produce a preliminarily normalized second address text. Both the names and the corresponding short names of every level of administrative division may be included as fields, which effectively improves field accuracy for non-standard address data.

According to a specific embodiment, the text matching in step S301 mainly uses the dictionary tree (trie) concept: the configured standard administrative division place name data is imported into the micro-service architecture as a hash tree to reduce the time complexity of the algorithm, and the administrative divisions in the first address text are replaced with the place names that appear in the place name entity metadata. A dictionary tree, also called a word search tree or trie, is a tree structure and a variant of the hash table; it uses the common prefixes of character strings to reduce query time and minimize unnecessary string comparisons. Each node of the dictionary tree has several attributes, mainly the character value, whether a phrase ends at the node, the child node addresses, and the path length to the root node. The character value of the root node is empty; its child nodes hold the first character of each text in the configuration file, its grandchild nodes hold the second character, and so on until the end of each text. The core idea of the dictionary tree is to speed up queries by using the common prefixes of character strings to cut query time. Its basic properties are: (1) the root node contains no character, and every other node contains exactly one character; (2) the characters along the path from the root node to a given node, concatenated, form the character string corresponding to that node; (3) the child nodes of any one node all contain different characters.
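A dictionary tree with the properties listed above can be sketched as follows. This is a minimal illustration, not the embodiment's data structure: it stores only the child map and the end-of-phrase flag per node (the character value is the key in the parent's map, and the path length is omitted), and it extracts fields with a greedy longest-match scan, which is an assumption about the matching strategy. The class and method names are hypothetical.

```python
class TrieNode:
    """One node: children keyed by character, plus an end-of-phrase flag."""
    __slots__ = ("children", "is_end")

    def __init__(self):
        self.children = {}
        self.is_end = False


class PlaceNameTrie:
    def __init__(self, names=()):
        self.root = TrieNode()          # the root holds no character
        for name in names:
            self.insert(name)

    def insert(self, name: str) -> None:
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def longest_match(self, text: str, start: int):
        """Return the longest stored name beginning at text[start], or None."""
        node, end = self.root, -1
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_end:
                end = i + 1
        return text[start:end] if end != -1 else None

    def extract(self, text: str) -> list:
        """Scan left to right, collecting the longest place name at each position."""
        found, i = [], 0
        while i < len(text):
            match = self.longest_match(text, i)
            if match:
                found.append(match)
                i += len(match)
            else:
                i += 1
        return found
```

Because names that share a prefix, such as a full name and its short form, share a path, a single scan from each position finds the longest candidate without comparing the text against every stored name.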

Specifically, in step S302, the above extracted administrative division information is subjected to preliminary hierarchical combination matching, the possible administrative division level of each entity and the complete administrative division chain to which the possible administrative division level belongs are determined, and then the correct administrative division address described by the address text information is obtained after processing by the duplicate discrimination algorithm. The duplicate removal authentication algorithm is to delete the administrative division fields repeatedly of the first address text after text matching, and match the administrative division fields which cannot be matched with the similar administrative division names. The specific processing logic is as follows:

repeated administrative division field delete: the method mainly comprises the steps of choosing and simplifying repeated administrative division fields, wherein the user input address can describe provincial and municipal information for multiple times, and the complete one time is reserved in general. Providing an example, the "south mountain area" may be reduced to "south mountain area" by preprocessing. In some special cases, however, duplicate administrative division fields cannot be deleted; for example, the front administrative division field and the rear administrative division field are combined together to form an administrative division entity; providing an example, "south mountain (regional) south mountain community" is a complete five-level administrative division description, cannot be simplified to "community", otherwise it may be confusing or indistinguishable from other communities in understanding.

Matching unmatched administrative division fields against similar administrative division names: this processing targets administrative division fields that may have been misrecognized during the text matching of step S301; such unmatched fields are matched against similar administrative division names. For example, "Changbai Mountain" may not match any administrative division name in the place name entity metadata, but it may match one or more similar standard administrative division names, such as "Changbai (County)" or "Baishan (City)".

According to a specific embodiment, corresponding metadata may also be created for the similar standard administrative division names used in step S302. The configuration file of the deduplication and discrimination algorithm may be preloaded into computer memory when the micro-service starts, and it is likewise processed by text matching in combination with the dictionary tree structure.

It should be noted that "administrative division" is short for "administrative region division", that is, the division of territory carried out by the state for hierarchical management. In the embodiments of the present application, five levels of administrative division may be extracted from the preprocessed address text. The embodiment sets the four kinds of provincial administrative regions nationwide (provinces, autonomous regions, municipalities directly under the central government, and special administrative regions) as first-level administrative divisions (the provincial level); the four kinds of prefecture-level administrative regions (prefecture-level cities, prefectures, autonomous prefectures, and leagues) as second-level administrative divisions (the city level); municipal districts, counties, banners, autonomous banners, special districts, forestry districts, and the districts and counties under municipalities directly under the central government as third-level administrative divisions (the district/county level); townships, towns, streets, sumu, ethnic townships, and district public offices as fourth-level administrative divisions (the street/township level); and communities and village committees as fifth-level administrative divisions (the community/village level).

S4: expanding the original address code to form an administrative division code sequence that covers and corresponds to the second address text. Because the original address code can only express the province, city, and county (district) levels, the streets and communities entered in the second address text cannot be represented by it. The original address code is therefore expanded by supplementing codes for those streets and communities: the administrative division fields identified by named entity recognition in the second address text are resolved into their corresponding codes, and these codes are appended to the original address code in order of administrative division level to form the administrative division code sequence.

Specifically, in step S4, following the national administrative division coding rules, the five levels of administrative division fields obtained by named entity recognition from the second address text, for example "province, city, county (district), township (street), village (community)", are each resolved into their corresponding codes (fewer than five levels may be identified and extracted if the address text does not involve all five levels). The codes are merged into a twelve-digit code; that is, the original six-digit address code is extended to twelve digits.

The administrative division place name set refers to the names of administrative divisions at every level, for example Hubei Province (province/municipality), Wuhan City (city), Jiangxia District (district), nude street (street), and Financial Harbor (community/industrial park). More specifically, embodiments of the present application extend the national six-digit administrative division code to twelve digits, with the objective of ensuring that every administrative division down to the street level has a unique administrative division code (adcode). The twelve-digit code has the format AABBCCDDDEEE and covers the five levels of administrative division: AA represents the province or municipality; BB represents the city; CC represents the county or district; DDD represents the township, town, or street; and EEE represents the village, community, and the like. In addition, administrative division codes (adcode) are updated as the national administrative divisions change, and business processing is not affected by such changes. If one administrative division has several different names, different attribute values under the same administrative division code (adcode) are used to distinguish the standard name from common aliases.
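The AABBCCDDDEEE layout can be illustrated with a short sketch that splits a twelve-digit code into its five segments, pads a six-digit code to twelve digits, and lists the zero-padded code of each ancestor level. The function names and the convention that an all-zero segment means "level not specified" are assumptions for illustration.

```python
from typing import Dict, List

# (level name, number of digits) following the AABBCCDDDEEE layout
LEVEL_WIDTHS = (
    ("province", 2),
    ("city", 2),
    ("county", 2),
    ("township", 3),
    ("village", 3),
)


def split_adcode(adcode: str) -> Dict[str, str]:
    """Split a twelve-digit code into its five level segments."""
    if len(adcode) != 12 or not (adcode.isascii() and adcode.isdigit()):
        raise ValueError("expected a twelve-digit code")
    parts = {}
    pos = 0
    for level, width in LEVEL_WIDTHS:
        parts[level] = adcode[pos:pos + width]
        pos += width
    return parts


def extend_adcode(six_digit: str, township: str = "000", village: str = "000") -> str:
    """Extend a six-digit code with township and village segments."""
    if len(six_digit) != 6 or len(township) != 3 or len(village) != 3:
        raise ValueError("expected 6 + 3 + 3 digits")
    return six_digit + township + village


def ancestor_codes(adcode: str) -> List[str]:
    """Zero-padded codes of each level, stopping at the first all-zero segment."""
    parts = split_adcode(adcode)
    chain = []
    prefix = ""
    for level, _width in LEVEL_WIDTHS:
        segment = parts[level]
        if int(segment) == 0:       # assumed to mean "level not specified"
            break
        prefix += segment
        chain.append(prefix.ljust(12, "0"))
    return chain
```

For instance, `ancestor_codes("532529205000")` yields the province, city, county, and township codes in that order, which is one way to walk an administrative division chain from a single code.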

More specifically, in step S3 the same first address text may yield a plurality of second address texts after named entity recognition; in step S4 these second address texts accordingly form a plurality of administrative division code sequences, all of which proceed together to step S5.

S5: judging the administrative division fields in the second address text, determining the corresponding preset standard administrative division metadata according to the judgment result, and replacing the second address text with the standard administrative division according to the administrative division code sequence and the corresponding preset standard administrative division metadata. Here, the preset standard administrative division metadata comprises the preset national administrative division codes or postal codes that conform to the specifications of the People's Republic of China and are in circulation, the preset codes designed for streets and communities, and the preset character segments, such as province, city, county (district), township (street), and village (community), that correspond to those codes. The standard administrative division refers to the character segment information, such as province, city, county (district), township (street), and village (community), extracted for the corresponding administrative division codes after the administrative division code sequence is compared with the preset standard administrative division metadata.

Specifically, in step S5, a hash tree of administrative division fields is first built over the metadata according to the same administrative division code sequence rule, so that a mapping between the extracted candidate administrative division fields (that is, the second address text) and the preset standard administrative division fields can be established quickly through the administrative division code sequence.

More specifically, after steps S1 to S3, the administrative divisions in the address text have been extracted and preliminarily standardized, but administrative division address errors such as homonymous addresses, alias addresses, misspelled addresses, and abbreviated addresses may still occur. To address these problems, as shown in fig. 3, a homonymous administrative division hash table, a misspelled-word administrative division hash table, an alias administrative division hash table, and an abbreviated administrative division hash table are established in addition to the standard administrative division hash table, and the following steps are performed:

S501: performing matching and replacement according to the administrative division code sequence and the standard administrative division hash table to form a complete administrative division chain.

S502: screening and combining ambiguous homonymous address words according to the administrative division code sequence and the homonymous administrative division hash table to form a complete administrative division chain.

S503: inferring alias addresses according to the administrative division code sequence and the alias administrative division hash table to form a complete administrative division chain.

S504: correcting misspelled addresses according to the administrative division code sequence and the misspelled-word administrative division hash table to form a complete administrative division chain.

S505: expanding abbreviated addresses according to the administrative division code sequence and the abbreviated administrative division hash table to form a complete administrative division chain.

Specifically, in step S501, each level of standard administrative division in the standard administrative division hash table has a corresponding code in the administrative division code sequence; the codes are traversed, and the administrative division fields in the second address text are replaced with the standard administrative division fields.

According to a specific embodiment, the administrative division code sequence serves as the key and the standard administrative division field as the value when loading the hash table; the standard administrative division fields found in the values replace the corresponding fields of the second address text, yielding administrative division information that is complete at multiple levels.

Specifically, in step S502, the same text sequence may have different meanings in different contexts. For example, a user enters "Nanshan District", which may denote "Nanshan District, Shenzhen City, Guangdong Province", a Nanshan District in a city of Heilongjiang Province, or the Nanshan area of Zhuolu County, Hebei Province; the attribution of the field is then decided from the superior and subordinate administrative divisions matched in the address text, in combination with the administrative division place name set. Some homonym ambiguities are more complex. For example, a user enters "Honghe", which can denote several administrative divisions: "Honghe Hani and Yi Autonomous Prefecture" has the alias "Honghe", and "Honghe County" also has the alias "Honghe", and the two stand in a superior-subordinate relationship, so simple matching of administrative division text cannot determine the depth of inference. The contextual features of administrative division place names (the relationships among them) and their internal features (their level and whether aliases exist) are therefore fully mined and used to screen and combine ambiguous homonymous address words into a complete administrative division chain.

More specifically, all possible homonym cases need to be collected at this point. Every potentially ambiguous homonymous place name is resolved into an administrative division code sequence as in step S4, and a mapping between each resolved code sequence and the standard administrative division field is established to form the homonymous administrative division hash table. The administrative division code sequences formed in step S4 are then traversed, and homonymous administrative division fields are replaced with standard administrative division fields.
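One way to picture the context-based screening of homonyms is sketched below. It keeps only those candidate codes that lie on the same administrative division chain as every code already matched elsewhere in the address. The table contents, the function names, and the "ancestor or descendant" compatibility rule are assumptions for illustration; the code "230404000000" in particular is an illustrative stand-in for a second Nanshan District.

```python
from typing import Dict, List

WIDTHS = (2, 2, 2, 3, 3)  # AABBCCDDDEEE

# Hypothetical homonym table: surface form -> candidate twelve-digit codes
HOMONYMS: Dict[str, List[str]] = {
    "南山区": ["440305000000", "230404000000"],   # second code is illustrative
    "红河": ["532500000000", "532529000000"],     # prefecture and county
}


def significant_prefix(code: str) -> str:
    """Digits up to and including the deepest non-zero level segment."""
    pos = 0
    end = 0
    for width in WIDTHS:
        if int(code[pos:pos + width]) != 0:
            end = pos + width
        pos += width
    return code[:end]


def on_same_chain(a: str, b: str) -> bool:
    """True if one code is an ancestor of (or equal to) the other."""
    return b.startswith(significant_prefix(a)) or a.startswith(significant_prefix(b))


def resolve_homonym(name: str, context_codes: List[str]) -> List[str]:
    """Candidates for `name` that are compatible with every context code."""
    return [
        cand
        for cand in HOMONYMS.get(name, [])
        if all(on_same_chain(cand, ctx) for ctx in context_codes)
    ]
```

With "Shenzhen City" (`440300000000`) as context, only one candidate for "南山区" remains. With only "Yunnan Province" (`530000000000`) as context, both candidates for "红河" remain, which mirrors the point above that plain text matching cannot settle the depth of inference in that case.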

According to a specific embodiment, as shown in Table 1 below, all possible ambiguous homonymous address words may be recorded in a configuration file in advance to establish a homonymous administrative division hash table, which is stored in a database of the micro-service architecture as a hash table data structure. The key of the hash table is a twelve-digit administrative division code, and the value is the input address together with the corresponding extracted administrative division. During administrative division inference, if the complete "province, city, district, township" place name is not matched, the twelve-digit codes of the administrative divisions that were matched are used in turn as keys to look up values; if the character string immediately following the administrative division is consistent with the homonymous address in a value, the user-input address is judged to contain a homonymous address, and the input address is replaced with the extracted administrative division address in that value. The screening logic for the ambiguous homonymous address "Honghe" in the address text is shown in Table 1:

Table 1: Homonymous administrative division hash table for "Honghe" in the address text

It can be seen that entries 2 and 3 of the table both contain "Honghe", but according to steps S3 to S4 the administrative division code sequences formed are "532500000000" and "532529000000" respectively, and the corresponding inference depths are "city" and "county" respectively. The code sequences of entries 4 and 5 are "532529205000" and "532500005000" respectively, yet the standard administrative division fields extracted for both are "Azahe Township, Honghe County, Honghe Hani and Yi Autonomous Prefecture, Yunnan Province".

Specifically, in step S503, it is taken into account that one administrative division often has several names and that different administrative divisions may share a name. The embodiments of the present application therefore add address alias inference logic: a corresponding alias administrative division hash table is established for each administrative division field; a field is given a higher confidence when it matches the standard administrative division hash table and a lower confidence when it matches the alias administrative division hash table; the confidence is propagated along the administrative division chain, and the chain with the highest confidence is selected as the final result.
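The confidence scheme can be sketched as follows. The application only states that standard matches receive a higher confidence than alias matches and that confidence is propagated along the chain; the concrete values (1.0 and 0.6) and the choice of multiplication as the propagation rule are assumptions made purely for illustration.

```python
from math import prod
from typing import List, Tuple

# Hypothetical confidence values: the source only says "higher" and "lower".
CONFIDENCE = {"standard": 1.0, "alias": 0.6}

# One chain = list of (standard field, table that produced the match)
Chain = List[Tuple[str, str]]


def chain_confidence(chain: Chain) -> float:
    """Propagate confidence along a chain by multiplying per-field values."""
    return prod(CONFIDENCE[source] for _, source in chain)


def best_chain(chains: List[Chain]) -> Chain:
    """Select the candidate chain with the highest propagated confidence."""
    if not chains:
        raise ValueError("no candidate chains")
    return max(chains, key=chain_confidence)


candidates = [
    [("广东省", "standard"), ("深圳市", "standard"), ("罗湖区", "alias")],
    [("广东省", "alias"), ("深圳市", "alias"), ("罗湖区", "alias")],
]
selected = best_chain(candidates)
```

Under these assumed values, a chain with a single alias match outranks a chain built entirely from alias matches, so fields confirmed by the standard table dominate the final choice.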

According to a specific embodiment, as shown in Table 2 below, all possible ambiguous alias address words may be recorded in a configuration file in advance to establish an alias administrative division hash table, which is stored in a database of the micro-service architecture as a hash table data structure. The key of the hash table is a twelve-digit administrative division code, and the value is the input address together with the corresponding extracted administrative division. During administrative division inference, if the complete administrative division place name is not matched, the twelve-digit codes of the matched administrative divisions are used in turn as keys to look up values; if the character string immediately following the administrative division is consistent with the alias address in a value, the user-input address is judged to contain an alias, and the input address is replaced with the extracted administrative division address in that value. The alias screening logic for the address text "Guangdong Shenzhen City Luohu XX Community" is shown in Table 2:

Table 2: Alias administrative division hash table for "Guangdong Shenzhen City Luohu XX Community" in the address text

More specifically, in general an alias that immediately follows a prefix administrative division chain starting at the province level is inferred, as illustrated by entry 1 of Table 2: the user enters "Guangdong Shenzhen City Luohu XX Community"; the identified alias address is "Luohu", whose standard name after named entity recognition is "Luohu District" with the code "440303"; it is immediately preceded by the city-level place name "Shenzhen City" with the code "4403" and immediately followed by "XX Community". The complete administrative division chain "Luohu District, Shenzhen City, Guangdong Province" is therefore inferred. However, Luohu District contains too many geographic entities named "XX Community", so no finer administrative division can be identified.

More specifically, when the alias is not preceded by an administrative division chain, or when it is followed by a common township-level address suffix word such as "street", "town", "township", "road", "office", "banner", "prefecture", or "league", the alias is not inferred, so as to prevent misjudgment, as illustrated by entries 2 and 3 of Table 2: the user enters "Guangdong Shenzhen City Luohu Street XX Community" or "Guangdong Shenzhen XX Community Luohu". No complete administrative division chain is formed in these inputs, although the identifiable alias address in both is "Luohu"; "Luohu District" is then judged according to the preceding city-level place name "Shenzhen City" with the code "4403".

More specifically, when the administrative division chain is already complete, a redundant alias of a same-level address is not identified as an alias, as illustrated by entry 4 of Table 2: the user enters "Guangdong Shenzhen City Nanshan District Luohu XX Community", which contains the two same-level addresses "Nanshan" and "Luohu"; only "Nanshan District", with the code "440305", is identified.

More specifically, because the text descriptions of township-level addresses are more complex and varied, and the existing administrative division set may not include every place name, township-level address aliases are in most cases not inferred, as illustrated by entry 5 of Table 2: the user enters "Guangdong Province Shenzhen City Luohu District Emerald XX Community"; the alias "Emerald" lies at the end of the administrative division chain and is not inferred by extension.

Specifically, in step S504, misspelled-word correction covers errors in address combinations, logical errors in phrases, and wrongly written characters; suspected place-name phrases are discriminated on the basis of the administrative divisions already inferred. For example, "Wuchang District, Wuhan City" may be miswritten with a wrong character in "Wuchang", and "Shaoling District, Luohe City" may be miswritten with wrong characters in the city and district names.

According to a specific embodiment, as shown in Table 3 below, common erroneous address expressions may be recorded in a configuration file in advance to establish a set of common erroneous address expressions, which is stored in a database of the micro-service architecture as a hash table data structure. The key of the hash table is a twelve-digit administrative division code, and the value is a character string composed of the erroneous address, the corresponding correct address, and the corresponding administrative division code. During administrative division inference, if the complete "province, city, district, township" place name is not matched, the twelve-digit codes of the matched discontinuous or trailing administrative divisions are used in turn as keys to look up values; if the character string immediately following the administrative division is consistent with the erroneous address in a value, the user-input address is judged to contain wrongly written characters, and the erroneous address is replaced with the correct address in that value.

Table 3: Misspelled-word administrative division hash table for "Henan Province Luohe City XX District XX Building" in the address text
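The lookup pattern described for step S504 (use the code of the matched division as the key, then compare the text that immediately follows it with the stored erroneous address) can be sketched as below. The table entry, including the misspelling "罗胡区", is hypothetical, and the value is modelled as a tuple rather than a single concatenated string for readability.

```python
from typing import Dict, List, Optional, Tuple

# (erroneous text, correct text, code of the corrected division)
Entry = Tuple[str, str, str]

# Hypothetical table keyed by the twelve-digit code of the already matched division.
TYPO_TABLE: Dict[str, List[Entry]] = {
    "440300000000": [("罗胡区", "罗湖区", "440303000000")],
}


def correct_after(
    matched_code: str, remainder: str, table: Dict[str, List[Entry]]
) -> Tuple[str, Optional[str]]:
    """Correct the text that immediately follows a matched division.

    Returns the (possibly corrected) remainder and the code of the corrected
    division, or None when no stored erroneous address applies.
    """
    for wrong, right, code in table.get(matched_code, []):
        if remainder.startswith(wrong):
            return right + remainder[len(wrong):], code
    return remainder, None
```

Keying the table by the preceding division keeps each correction local to its parent: the same wrong string under a different city is left untouched.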

Specifically, step S505 is mainly used to infer place names from common abbreviated phrases. For example, the complete location descriptions of "Fengxian Nanqiao", "Conghua Street", "Tonglu Square", "Pukou Bridge", "Fengxian Industry", and "Jianggan Airport" are, respectively, "Shanghai City Fengxian District Nanqiao", "Guangzhou City Conghua District Street", "Zhejiang Province Hangzhou City Tonglu County Square", "Jiangsu Province Nanjing City Pukou District Bridge", "Shanghai City Fengxian District Industry", and "Zhejiang Province Hangzhou City Jianggan District Airport".

According to a specific embodiment, all possible abbreviated address terms may be recorded in a configuration file in advance to establish an abbreviated administrative division hash table, which is stored in a database of the micro-service architecture as a hash table data structure. The key of the hash table is a twelve-digit administrative division code, and the value is the input address together with the corresponding extracted administrative division. During administrative division inference, if the complete administrative division place name is not matched, the twelve-digit codes of the matched administrative divisions are used in turn as keys to look up values; if the character string immediately following the administrative division is consistent with the abbreviated address in a value, the user-input address is judged to contain an abbreviated place name, and the input address is replaced with the extracted administrative division address in that value. The abbreviated place name phrases are stored in computer memory and matched against the text using a dictionary tree; the one or more matched phrases are mapped to their complete location descriptions, which then take part in the administrative division matching process.

More specifically, because one first address text may correspond to a plurality of second address texts in step S4, a weight accumulation table for similarity matching is created on the basis of the twelve-digit administrative division codes, in which each level of administrative division has a corresponding code and an accumulated weight. All administrative division fields in the plurality of second address texts are then traversed, and the address code information corresponding to each field is resolved. All code information and weights in the second address texts are traversed, the code information is split into the five administrative division levels, and the code and weight for each level are accumulated under the corresponding key of the weight accumulation table. Finally, for each level, the administrative division with the highest accumulated weight is taken from the table as the target administrative division, which indicates that the target administrative division has the highest degree of field matching.
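A minimal sketch of such a weight accumulation table follows. Each candidate second address text is modelled as a list of (twelve-digit code, weight) pairs; every code contributes its weight to each of its ancestor levels, and the heaviest entry per level is selected. The data model, the weights, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

LEVELS = ("province", "city", "county", "township", "village")
WIDTHS = (2, 2, 2, 3, 3)  # AABBCCDDDEEE


def level_codes(code: str) -> List[Tuple[int, str]]:
    """(level index, zero-padded code) for each non-zero level of `code`."""
    out = []
    pos = 0
    for idx, width in enumerate(WIDTHS):
        if int(code[pos:pos + width]) == 0:
            break
        pos += width
        out.append((idx, code[:pos].ljust(12, "0")))
    return out


def pick_targets(candidates: List[List[Tuple[str, float]]]) -> Dict[str, str]:
    """Accumulate weights per (level, code) and keep the heaviest code per level."""
    table: Dict[Tuple[int, str], float] = defaultdict(float)
    for chain in candidates:
        for code, weight in chain:
            for key in level_codes(code):
                table[key] += weight
    best: Dict[int, Tuple[str, float]] = {}
    for (idx, code), total in table.items():
        if idx not in best or total > best[idx][1]:
            best[idx] = (code, total)
    return {LEVELS[idx]: best[idx][0] for idx in sorted(best)}


# Two candidate readings of one address; the weights are hypothetical.
targets = pick_targets([
    [("440300000000", 1.0), ("440305000000", 1.0)],
    [("440300000000", 1.0), ("440303000000", 0.6)],
])
```

In this example both readings agree on the province and city, so those levels accumulate weight from every field, while the county level is decided by the reading with the larger weight.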

A second embodiment of the present application provides an apparatus for extracting and standardizing an administrative division address, including:

For specific limitations on the apparatus for extracting and standardizing the administrative division addresses, reference may be made to the above limitations on the method for extracting and standardizing the administrative division addresses, and no further description is given here.

A third embodiment of the present application provides a system for extracting and standardizing administrative division addresses under a micro-service architecture, including the apparatus provided in the second embodiment,

the device is used for: acquiring the verified original address text and the original address code; traversing the characters in the original address text, and performing conversion, expansion, or deletion on each character to obtain a first address text; segmenting the first address text and extracting a plurality of administrative division fields; performing named entity recognition on the plurality of administrative division fields to obtain a second address text; expanding the original address code to form an administrative division code sequence corresponding to the second address text; and replacing the administrative division fields identified by named entity recognition in the second address text with standard administrative division fields according to the administrative division code sequence.

A fourth embodiment of the present application provides a method for address normalization processing, including:

S6: providing a micro-service architecture, and inputting an original address text and an original address code through a front-end interface;

S7: invoking the extraction and standardization processing system of the administrative division addresses according to the second embodiment to obtain a standard administrative division;

S8: completing the data iteration of the micro-service architecture through a data backtracking and index verification interface.

Specifically, in step S8, the main function is to collect address data, and the microservice provides a data backtracking and index verification interface to enrich the address library.

More specifically, the method for extracting and standardizing administrative division addresses in the present application aims to extract as much valid address information as possible from an address. New versions are released to accommodate the address writing habits of different users, and the ultimate purpose of the standardization method is to ensure that the address code returned for a user-input address can be matched to a logistics service outlet. Updating the algorithm configuration of the standardization method can improve the coverage and accuracy of outlet identification; the sources of such updates include manual rules and changes to the national administrative divisions. After a manual rule requirement is received, the corresponding configuration table is modified as required to generate a new version. The new and old versions of the standardization are then tested on the outlet identification coverage and accuracy metrics, each version being matched by geocoding, to ensure that the new version does not negatively affect the existing outlet identification results; only requirements with a positive effect are released as a new version.

The above device for extracting and standardizing the administrative division addresses and the respective modules in the system for extracting and standardizing the administrative division addresses under the micro-service architecture may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing a trained address text field model and a sequence annotation model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an address text field method.

It will be appreciated by persons skilled in the art that the structure of the apparatus described above is not limiting as to the computer device to which the present application applies, and that a particular computer device may include more or less components than those shown in the figures, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.

Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, or the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like.

The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.

The foregoing examples represent only a few embodiments of the present application, which are described in more detail and are not thereby to be construed as limiting the scope of the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims

1. A method for administrative division address extraction and normalization processing, comprising:

acquiring the verified original address text and the original address code;

2. The method for extracting and normalizing an administrative division address according to claim 1, wherein the performing named entity recognition on the administrative division field, obtaining a second address text includes:

3. The method for administrative division address extraction and normalization processing according to claim 1, wherein expanding the original address code to form an administrative division code sequence that covers and corresponds to the second address text comprises:

4. The method for administrative division address extraction and normalization processing according to claim 3, wherein the administrative division fields identified by named entity recognition in the second address text include one or more of: a standard administrative division field, a homonymous administrative division field, an aliased administrative division field, and an abbreviated administrative division field.

5. The method for administrative division address extraction and normalization processing according to claim 1, wherein replacing the second address text with the standard administrative division according to the administrative division code sequence and standard administrative division metadata comprises:

6. The method for administrative division address extraction and normalization processing according to claim 1, wherein acquiring the verified original address text and the original address code comprises:

7. The method for administrative division address extraction and normalization processing according to claim 1, wherein preprocessing each of the characters includes at least one of:

8. The method for administrative division address extraction and normalization processing according to claim 7, wherein the redundancy pruning of punctuation characters, English characters, Arabic numerals and format characters in the original address text comprises:

9. The method for administrative division address extraction and normalization processing according to claim 7, wherein performing traditional-to-simplified conversion of Chinese characters in the original address text comprises:

10. An apparatus for administrative division address extraction and normalization processing, comprising:

and a second matching module, configured to replace the second address text with the standard administrative division according to the administrative division code sequence and the standard administrative division metadata.

11. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
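
By way of illustration only, the code-expansion step recited in claim 3 and the replacement step recited in claim 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: it assumes the original address code is a six-digit county-level code in the style of GB/T 2260 (two digits each for province, city and county), and that the recognized fields arrive in province, city, county order with no level missing. The table `STANDARD_NAMES` and the functions `expand_code`, `standard_fields` and `standardize` are hypothetical names introduced for this example.

```python
# Hypothetical stand-in for the standard administrative division metadata.
# A real system would load the full code-to-name table.
STANDARD_NAMES = {
    "440000": "广东省",
    "440300": "深圳市",
    "440305": "南山区",
}


def expand_code(code: str) -> list[str]:
    """Expand a six-digit code into province, city and county codes (assumed layout)."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a six-digit code, got {code!r}")
    sequence = [code[:2] + "0000", code[:4] + "00", code]
    # Drop duplicates that arise when the input is already a province or city code.
    return list(dict.fromkeys(sequence))


def standard_fields(code: str) -> list[str]:
    """Look up the standard name for each level of the expanded code sequence."""
    return [STANDARD_NAMES[c] for c in expand_code(code) if c in STANDARD_NAMES]


def standardize(text: str, recognized: list[str], code: str) -> str:
    """Replace each recognized field with the standard field of the same level.

    Assumes `recognized` is ordered province, city, county and that each
    recognized field occurs in `text`; only the first occurrence is replaced.
    """
    for field, name in zip(recognized, standard_fields(code)):
        text = text.replace(field, name, 1)
    return text


if __name__ == "__main__":
    print(expand_code("440305"))
    print(standardize("广东深圳南山科技园", ["广东", "深圳", "南山"], "440305"))
```

In this sketch the abbreviated fields 广东, 深圳 and 南山 are each replaced by the standard field at the same level of the expanded sequence. Handling of homonymous or aliased fields, and of addresses that omit a level, would require the matching logic of the embodiments and is not attempted here.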