CN116414823A - Address positioning method and device based on word segmentation model - Google Patents

Address positioning method and device based on word segmentation model

Info

Publication number
CN116414823A
Authority
CN
China
Prior art keywords
address
similarity value
information
matching
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111658539.2A
Other languages
Chinese (zh)
Inventor
卢林
周训飞
陈宇
王小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fengtu Technology Shenzhen Co Ltd
Original Assignee
Fengtu Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fengtu Technology Shenzhen Co Ltd filed Critical Fengtu Technology Shenzhen Co Ltd
Priority to CN202111658539.2A priority Critical patent/CN116414823A/en
Publication of CN116414823A publication Critical patent/CN116414823A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an address positioning method and device based on a word segmentation model. The method comprises the following steps: dividing original address information into a plurality of address elements of different types based on a word segmentation model; constructing an index table from the address elements of different types, and constructing a standard address library by mapping the original address information to the index table one to one; acquiring at least one address text to be retrieved; dividing the address text to be retrieved into a plurality of search elements based on the word segmentation model; and matching each search element against the information in the standard address library, and outputting a positioning result according to the total similarity value calculated by the matching. The method and device can effectively improve the accuracy of address positioning.

Description

Address positioning method and device based on word segmentation model
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an address positioning method and apparatus based on a word segmentation model.
Background
With the rapid development of modern communication technologies and modern transportation, represented by computer networks, satellite technologies and optical cables, the space of human activity is expanding rapidly, social interaction is increasingly frequent, and spatial data is produced and applied ever more widely, so that the scale of spatial data keeps growing. Given the complexity and sheer volume of spatial data, improving the efficiency and speed of spatial data retrieval is a difficult point in current research and application, especially in navigation and service address query. For example, a user can initiate a navigation request on a mobile terminal by text input; a navigation application selects a route according to the received request and pushes the selected route to the user. Alternatively, administrative authorities provide related business services by retrieving target entity information in the vicinity of a specified geographic location.
With the advent of more and more applications and products related to geographic location, particularly mobile applications, retrieval services that provide information about target entities near a given geographic location are becoming increasingly important. On the one hand, existing retrieval systems generally run interval queries on the two-dimensional longitude and latitude of the site to be retrieved in a traditional geographic coordinate system in order to provide accurate and efficient results. However, the search location is often vague, and because object names are not unified, stricter requirements are placed on the consistency of the search input; ambiguous or inconsistent names mean that the corresponding accurate geographic location often cannot be retrieved. For example, in geographic information navigation, police navigation, and public facility maintenance queries, staff or users often know only the name, type or approximate location of an entity, so the input is incomplete, wrong or redundant. The algorithm then has difficulty identifying and locating the address, and it returns an inaccurate or wrong unit area instead of the correct one. On the other hand, when the target address is determined, only the address name (or building name/unit name) extracted from the text or voice is matched against the address information in the address database, so the resulting target address set contains a large number of irrelevant addresses and retrieval precision drops.
In addition, in administrative scenarios, the corresponding service or administrative platform is not linked to the corresponding geographic location, so a database for the service system must be built to assist users or staff in querying the relevant information. Existing query systems can query only by address description name or by attribute or service description information, and both kinds of input must be accurate; this may affect the timeliness of the work or increase manpower or transportation costs.
Disclosure of Invention
In view of the foregoing, it is necessary to provide an address location method and device based on a word segmentation model for improving the accuracy of address location.
In a first aspect, the present application provides an address positioning method based on a word segmentation model, including:
dividing the original address information into a plurality of address elements of different types based on a word segmentation model; constructing an index table according to a plurality of address elements of different types, and constructing a standard address library by corresponding original address information to the index table one by one;
acquiring at least one address text to be retrieved;
dividing an address text to be searched into a plurality of search elements based on a word segmentation model; and respectively matching each search element with information in the standard address library, and outputting a positioning result according to the total similarity value calculated by matching.
In some embodiments of the present application, the address element includes: at least two of AOI information, building information, room information, street information, and administrative division codes; the search element includes: at least two of AOI description information, building description information, room description information, street description information, and digital codes.
In some embodiments of the present application, matching each search element with information in a standard address library includes: when the retrieval element comprises AOI description information, the AOI information in the standard address library is used as an index to be matched with the address text to be retrieved, and a first similarity value is obtained; and/or when the retrieval element comprises building description information, matching building information in the standard address library as an index with the address text to be retrieved to obtain a second similarity value; and/or when the retrieval element comprises room description information, matching the room information in the standard address library with the address text to be retrieved as an index to obtain a third similarity value; and/or when the retrieval element comprises street description information, the street information in the standard address library is used as an index to be matched with the address text to be retrieved, and a fourth similarity value is obtained.
In some embodiments of the present application, the address positioning method based on the word segmentation model further includes: when the retrieval element comprises a digital code, matching an administrative division code in a standard address library as an index with an address text to be retrieved to obtain a code matching result; and if the code matching result is an invalid matching result, matching at least one address element except the administrative division code with the address text to be searched as an index according to a preset element matching sequence until an effective matching result is obtained.
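The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the field names, the `FALLBACK_ORDER` list, the `exact_match` helper, and the rule that an empty hit list counts as an invalid match are all assumptions, since the source only mentions "a preset element matching sequence".

```python
# Hypothetical fallback order; the source does not specify the preset sequence.
FALLBACK_ORDER = ["aoi", "building", "room", "street"]


def exact_match(element, query, library):
    """Hypothetical matcher: return library records whose field equals the query's."""
    value = query.get(element)
    if value is None:
        return []
    return [record for record in library if record.get(element) == value]


def match_with_fallback(query, library, match_fn, order=FALLBACK_ORDER):
    """Try the administrative division code first, then the other elements in order."""
    for element in ["admin_code", *order]:
        hits = match_fn(element, query, library)
        if hits:  # assumption: an empty hit list is an invalid matching result
            return element, hits
    return None, []
```

With this shape, a query whose digital code matches nothing in the library still resolves through the next element that produces a valid match.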
In some embodiments of the present application, the address positioning method based on the word segmentation model further includes: obtaining any one of the first similarity value, the second similarity value and the third similarity value through a preset edit distance algorithm; obtaining the fourth similarity value through a preset Jaccard distance algorithm; the total similarity value is the sum of at least one similarity value among the first similarity value, the second similarity value, the third similarity value and the fourth similarity value; the first similarity value, the second similarity value, the third similarity value and the fourth similarity value are each larger than the corresponding preset similarity threshold value.
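A minimal sketch of the total similarity value, under one possible reading of the paragraph above: only similarity values that exceed their own threshold contribute to the sum. The dictionary keys, the threshold values, and the choice to return `None` when nothing qualifies are illustrative assumptions.

```python
def total_similarity(scores, thresholds):
    """Sum the per-element similarity values that exceed their preset thresholds.

    scores:     e.g. {"aoi": 0.75, "building": 1.0} (only elements present in the query)
    thresholds: preset threshold per element type (values here are hypothetical)
    """
    kept = [value for name, value in scores.items() if value > thresholds[name]]
    # Assumption: with no qualifying similarity value the candidate is not scored.
    return sum(kept) if kept else None
```

For example, with a threshold of 0.5 for every element, scores of 0.75 (AOI), 1.0 (building) and 0.25 (street) give a total of 1.75, because the street score does not exceed its threshold.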
In some embodiments of the present application, obtaining any one of the first similarity value, the second similarity value and the third similarity value through a preset edit distance algorithm includes: determining a first length of a first character string and a second length of a second character string, where the first character string is any one of AOI information, building information and room information, and the second character string is any one of AOI description information, building description information and room description information; calculating the edit distance between the first character string and the second character string through the preset edit distance algorithm, and taking the maximum of the first length and the second length; and analyzing the quotient of the edit distance and the maximum length to obtain any one of the first similarity value, the second similarity value and the third similarity value.
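The computation can be sketched with the classic Levenshtein edit distance. The source only says the quotient of the edit distance and the maximum length is "analyzed"; converting it to a similarity as 1 minus the quotient is an assumption made here so that identical strings score 1.0.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(standard: str, query: str) -> float:
    """Assumed form: 1 - edit_distance / max(len(standard), len(query))."""
    longest = max(len(standard), len(query))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(standard, query) / longest
```

Dividing by the longer length keeps the value in the range 0 to 1 regardless of how long the AOI, building or room strings are.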
In some embodiments of the present application, obtaining the fourth similarity value through a preset Jaccard distance algorithm includes: determining a first character string and a second character string, where the first character string is street information and the second character string is street description information; and analyzing the Jaccard distance between the first character string and the second character string through the preset Jaccard distance algorithm to obtain the fourth similarity value.
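A minimal sketch of the street comparison, assuming the strings are compared as sets of characters and that the similarity is the Jaccard coefficient (1 minus the Jaccard distance). The source does not state the tokenization or the exact conversion from distance to similarity, so both are assumptions.

```python
def jaccard_similarity(standard: str, query: str) -> float:
    """Jaccard coefficient over the character sets of the two strings."""
    a, b = set(standard), set(query)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


def jaccard_distance(standard: str, query: str) -> float:
    """Jaccard distance, the complement of the coefficient."""
    return 1.0 - jaccard_similarity(standard, query)
```

Because sets ignore order and repetition, this measure tolerates reordered or partially omitted street descriptions better than an edit distance would.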
In some embodiments of the present application, outputting a positioning result according to the total similarity value calculated by matching includes: according to the total similarity value calculated by matching, the total similarity value is arranged to obtain a total similarity value sequence; wherein, the arrangement mode comprises ascending arrangement or descending arrangement; when the addresses are arranged in ascending order, outputting the address positioning results corresponding to the matching from high to low; and outputting the address positioning result corresponding to the matching from low to high when the addresses are arranged in descending order.
In some embodiments of the present application, constructing an index table according to a plurality of address elements of different types, and constructing a standard address library by one-to-one correspondence between original address information and the index table includes: normalizing the address elements to obtain standard address elements; establishing an index by using standard address elements, and accessing corresponding original address information to construct a standard address library; wherein, the normalization processing includes: and carrying out normalized recognition on the address elements based on a standard address word stock.
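The library construction can be sketched as an inverted index from normalized address elements to the original records. The `ALIASES` word stock, the record layout, and the function names are hypothetical; the source does not describe the data structures at this level.

```python
from collections import defaultdict

# Hypothetical standard address word stock: non-standard spelling -> standard form.
ALIASES = {"Wanke Garden": "Wanke City Garden"}


def normalize(value: str) -> str:
    """Trim whitespace and replace known aliases with the standard form."""
    value = value.strip()
    return ALIASES.get(value, value)


def build_library(records):
    """records: iterable of (original_address, {element_type: raw_value}).

    Returns (index, originals): index maps (element_type, standard_value) to the
    set of record ids, and originals maps each record id to its original address.
    """
    index = defaultdict(set)
    originals = {}
    for record_id, (original, elements) in enumerate(records):
        originals[record_id] = original
        for element_type, raw_value in elements.items():
            index[(element_type, normalize(raw_value))].add(record_id)
    return index, originals
```

Keeping the original address text separate from the index means every index entry can be traced back to exactly one original record.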
In a second aspect, the present application provides an address location device based on a word segmentation model, including:
the address library construction module is used for dividing the original address information into a plurality of address elements of different types based on the word segmentation model; constructing an index table according to a plurality of address elements of different types, and constructing a standard address library by corresponding original address information to the index table one by one;
The positioning acquisition module is used for acquiring at least one address text to be retrieved;
the positioning matching module is used for dividing the address text to be searched into a plurality of search elements based on the word segmentation model; and respectively matching each search element with information in the standard address library, and outputting a positioning result from high to low according to the total similarity value calculated by matching.
In a third aspect, the present application also provides a computer device comprising:
one or more processors;
a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the word segmentation model-based address location method described above.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program to be loaded by a processor for performing steps in a word segmentation model based address location method.
In a fifth aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided in the first aspect.
According to the address positioning method, device, computer equipment and storage medium based on the word segmentation model, the original address information is divided into a plurality of address elements of different types, an index table is constructed from those address elements, and the original address information is mapped one to one to the index table to obtain a standard address library. After at least one address text to be retrieved is acquired, it is divided into a plurality of search elements based on the word segmentation model, each search element is matched against the information in the standard address library, and a positioning result is output according to the total similarity value calculated by the matching. By segmenting the address text to be retrieved, the method obtains search elements suited to the standard address library set up for each geographic scene, avoids errors caused by inconsistent search input, and ignores redundant information contained in the address text, thereby effectively improving the accuracy of address positioning.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of an address positioning method based on a word segmentation model according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of step S1 in the address positioning method based on the word segmentation model according to the embodiment of the present application;
Fig. 3 is a schematic structural diagram of an address positioning device based on a word segmentation model according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application. In addition, technical features described below in the various embodiments of the present application may be combined with each other as long as they do not conflict with each other.
Note that, in the function equations related to the present application, the symbol "x" represents multiplication of two constants or vectors before and after the operation symbol, and "/" represents division of two constants or vectors before and after the operation symbol, and all the function equations in the present application follow the mathematical addition, subtraction, multiplication and division algorithm.
It should be noted that, the term "first\second" referred to in this application merely distinguishes between similar objects, and does not represent a specific ordering for the objects, and it is understood that "first\second" may interchange a specific order or sequence, where allowed. It is to be understood that the "first\second" distinguishing objects may be interchanged where appropriate to enable embodiments of the present application described herein to be implemented in sequences other than those described or illustrated herein.
The method for address location analysis, matching, extraction and standardization processing based on the word segmentation model in the embodiment of the application can be configured in an address location device based on the word segmentation model, and the device can be arranged in a server, a service architecture or a micro-service architecture or in computer equipment, and is not limited in this application.
In the embodiments of the application, the address information is the address description information of a geographic entity. A geographic entity is an entity in a geographic database, that is, a real-world phenomenon that cannot be further divided into phenomena of the same kind. Geographic entities include buildings, for example communities, restaurants, malls, hospitals and industrial parks; scenic spots, for example mountains, rivers, lakes, seas, temples, museums and parks; and utilities, for example urban roads, bridges, ports, squares, street lamps, guideboards and air defense facilities. The standard address description information comprises standard address text information and administrative division codes.
Specifically, the administrative division code in the present application may be a national administrative division code currently in circulation, that is, a postal code, and the standard address text information is formed as follows:
< standard address text information > = < administrative district name > [ basic district qualifier name ] [ local point location description ]
wherein:
< administrative district name > = < province level > [ prefecture level ] < county level > [ township level ]
< basic district qualifier name > = < street > | < lane > | < residential district > | < village >
< local point location description > = < door (building) address > | < marker name > | < point of interest >
On the one hand, existing retrieval systems generally run interval queries on the two-dimensional longitude and latitude of the site to be retrieved in a traditional geographic coordinate system in order to provide accurate and efficient results. However, the search location is often vague, and because object names are not unified, stricter requirements are placed on the consistency of the search input, which then runs into ambiguous or inconsistent names. On the other hand, when the target address is determined, only the address name (or building name/unit name) extracted from the text is matched against the address information in the address database, so the resulting target address set contains a large number of irrelevant addresses and retrieval precision drops. This may affect the timeliness of the work or increase manpower or transportation costs.
To solve these technical problems, the application provides an address positioning method based on a word segmentation model. A corresponding original address metadata base is obtained; the administrative division, address prefix (street), cell, building and room text information is extracted from it based on the address word segmentation model to construct a standard address library; and the corresponding information is obtained by matching during retrieval. At retrieval time, the relevant specific information can be obtained simply by entering the administrative division, address prefix (street), cell, building or room text data, or a combination of them. In addition, addresses are normalized both when the standard address library is constructed and when the address information to be retrieved is entered, which effectively avoids missed or wrong identification after entity name text is matched and improves the extraction and further normalization of address or service information.
Referring to fig. 1, fig. 1 is a flow chart of an address positioning method based on a word segmentation model according to an embodiment of the present application, where the embodiment is mainly illustrated by applying the method to a server, and the method includes the following steps:
S1, dividing original address information into a plurality of address elements of different types based on a word segmentation model; and constructing an index table according to a plurality of address elements of different types, and constructing a standard address library by corresponding the original address information to the index table one by one. As shown in fig. 2, step S1 may include steps S11 to S13, which are specifically as follows:
S11, dividing the original address information into a plurality of address elements of different types based on a word segmentation model.
The word segmentation model includes, but is not limited to, natural language processing (Natural Language Processing, NLP) models, latent Dirichlet allocation (Latent Dirichlet Allocation, LDA) models, hidden Markov (Hidden Markov Model, HMM) models, conditional random field (Conditional Random Field, CRF) models, maximum entropy Markov (Maximum Entropy Markov Model, MEMM) models, and the like.
Specifically, the original address information is as described above and is not repeated here. The address elements include at least two of AOI information, building information, room information, street information, and administrative division codes. AOI is a GIS (Geographic Information System) term; its full name is "area of interest", and it can be understood as a POI (point of interest) with geographic boundary data, the unique identifier poi_id of the POI being one of the attributes of the AOI information. POIs are core elements of an electronic map and refer to any point on the map that is meaningful without being geographic in nature, such as shops, bars, gas stations, hospitals and stations; coordinates that are geographic in nature, such as cities, rivers and mountains, are not POIs. In particular, AOI information therefore refers to regional geographic entities in map data, such as a particular scenic spot, government agency, company, mall, restaurant, residential district or industrial park on an electronic map.
For example, the original address information is traversed to identify place-name suffixes such as province, city, district, county, street, lane, residential district, door and building; semantic segmentation is then performed based on the word segmentation model to obtain a plurality of address elements of different types.
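The suffix-driven splitting can be illustrated with a small rule-based sketch. It stands in for the word segmentation model only to show the idea; a trained model (for example a CRF) would be needed for names that themselves contain suffix characters. The suffix table, the type labels, and the sample address string are illustrative assumptions and are not taken from the source.

```python
import re

# Hypothetical suffix table: place-name suffix -> address element type.
SUFFIX_TYPES = {
    "省": "province", "市": "city", "区": "district", "县": "county",
    "街道": "street", "路": "road", "巷": "lane",
    "号": "door", "栋": "building", "室": "room",
}

# Longer suffixes are tried first so that multi-character suffixes win.
_SUFFIXES = "|".join(sorted(SUFFIX_TYPES, key=len, reverse=True))
_PATTERN = re.compile(f"(.+?)({_SUFFIXES})")


def split_by_suffix(address: str):
    """Cut the address after each recognized suffix and label the piece.

    Text after the last recognized suffix is dropped in this sketch.
    """
    return [(m.group(0), SUFFIX_TYPES[m.group(2)]) for m in _PATTERN.finditer(address)]
```

Each returned pair holds the text of one address element together with its type, which is the shape the later normalization and indexing steps work on.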
S12, carrying out standardization processing on the address elements to obtain standard address elements.
Specifically, the normalization processing includes performing normalized recognition on the address elements based on a standard address word stock. That is, each piece of original address information is normalized and segmented; the administrative division, address prefix, cell, building and room information in the address is extracted according to the word segmentation model; and the address is thereby divided into a plurality of fields including, but not limited to, administrative division fields, basic area definition fields, and local point location description information, which are explained in detail below.
More specifically, for the administrative division field, the sizes of the spatial objects generally form a multi-level or containment relationship according to the basic rules of administrative division, where the multi-level division relationship includes: province (municipality directly under the central government, special administrative region) > city > county (district) > township (street, town) > village (community), etc.
Further, the basic area definition field is closer to the fourth-level and fifth-level administrative divisions, and its multi-level division relationship includes: township (street, town) > village (community), etc. That is, townships, streets, sumu, ethnic townships and district public offices are set as fourth-level administrative divisions (the street and township level); communities and village committees are set as fifth-level administrative divisions (the community and village level). In this embodiment, the field is identified as the prefix of a detailed address.
Further, the local point location description information covers the building number, unit number, floor number and room number within a given AOI address, and may go further to the owner, tenant, contact person or legal representative of a room and the corresponding telephone contact details. Such detailed addresses can be applied in scenarios such as logistics, services provided by management departments (e.g. a power supply office, a water company, a property manager), and administration by administrative institutions (e.g. a public security office, a government department).
Finally, in this step, a text matching operation can be performed between the address elements and pre-constructed place-name entity metadata, non-standard address names are preliminarily normalized, and the segmented administrative division fields are mapped to city-level administrative division codes.
S13, establishing an index by using standard address elements, and accessing corresponding original address information to construct a standard address library.
Specifically, in this embodiment, the standard address elements may be loaded into the server as a hash tree using the dictionary tree (trie) concept, which reduces the time complexity of the algorithm, and the original address in the original address information is replaced with the standard address that appears in the place-name entity metadata. The dictionary tree, also called a word search tree, is a tree structure and a variant of the hash tree; it uses the common prefixes of character strings to reduce query time and minimize unnecessary string comparisons. Each node of the dictionary tree holds several attributes, chiefly the character value, whether a phrase ends at the node, the child node addresses, and the path length to the root node. The character value of the root node is empty; its child nodes hold the first character of each text in the configuration file, its grandchild nodes hold the second character, and so on until the end of the text.
It should be noted that, in this embodiment, each spatial region address is defined by unique original address information. The original address information is split by place-name suffix and then matched against the standard address elements (standard address names) in the place-name entity metadata: the key address elements in the original address information are used as indexes to search and match the corresponding text in the data set, and each is replaced by the standard address element with the highest matching degree. The principle to follow is that the uniqueness of the processed address is guaranteed: each spatial region address may have various semantic expressions, but every semantic expression of an address element is bound to a unique standard detailed address.
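A minimal dictionary tree with the four node attributes listed above (character value, end-of-phrase flag, child nodes, path length to the root) might look like the sketch below. The class and method names are illustrative, and a Python dict stands in for the child node addresses.

```python
class TrieNode:
    def __init__(self, char: str = "", depth: int = 0):
        self.char = char      # character value (empty for the root node)
        self.is_end = False   # whether a phrase ends at this node
        self.children = {}    # child nodes keyed by their character
        self.depth = depth    # path length to the root node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        """Add a standard address element, sharing nodes for common prefixes."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode(ch, node.depth + 1)
            node = node.children[ch]
        node.is_end = True

    def contains(self, word: str) -> bool:
        """Return True only if the whole word was inserted as a phrase."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end
```

A lookup visits one node per character, so its cost depends on the length of the query string rather than on the number of stored address elements.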
In one embodiment, if the original address information is "Room 102, Building 117, Wanke City Garden, No. 1 University Garden Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province", the specific operations corresponding to steps S11 to S13 include:
based on the word segmentation model, the original address information is segmented and normalized, and the address is divided into a plurality of address elements according to the word segmentation specification: Hubei Province (1) / Wuhan City (2) / Donghu New Technology Development Zone (3) / University Garden Road (9) / No. 1 (11) / Wanke City Garden (13) / Building 117 (14) / Room 102 (15). The segmented words are then identified according to rules. Administrative division information is identified: "Hubei Province (1), Wuhan City (2), Donghu New Technology Development Zone (3)", which is mapped to the administrative division code (city level) 420100. Detailed address information is identified: "Wanke City Garden" as AOI information, "117" as building information, and "102" as room information. Address prefix information is identified: "No. 1 University Garden Road" as street information. Finally, index information for querying is established according to the standard address elements, as shown in the following table:
ID | AOI               | BUILDING     | HOUSE    | PREFIX                       | ADCODE
1  | Wanke City Garden | Building 117 | Room 102 | No. 1 University Garden Road | 420100
Specifically, in actual operation, English letters are used as the field identifiers of the index information in the standard address library: "AOI" represents AOI information, "BUILDING" represents building information, "HOUSE" represents room information, "PREFIX" represents street information, and "ADCODE" represents the administrative division code.
S2, obtaining at least one address text to be retrieved;
Specifically, this step is mainly the process of entering a search query. During retrieval, at least two of the AOI description information, building description information, room description information, street description information, or digital code can be entered, so that a corresponding positioning result can be obtained through the following steps.
Specifically, in this step, the address text to be retrieved after verification needs to be obtained; the server can provide an interface to obtain the address text to be searched, and verify the information in the address text to be searched so as to obtain the verified address text to be searched as a follow-up analysis basis.
More specifically, the address text to be retrieved is obtained through the interface provided by the server, and comprises at least one of AOI description information, building description information, room description information, street description information, or a digital code. The server can provide an interface to a web front-end or mobile front-end page; the user enters the original address text to be retrieved through a PC terminal or mobile terminal, and the server calls the interface to obtain that text and performs a preliminary verification on it. The digital code is the administrative division code that identifies the address entity and consists of Arabic numerals; the other text description information consists of fields that correspond to the indexes of the standard address library for the address entity.
Thus, verification of the digital code mainly comprises: first determining whether a digital code is present; if so, it must consist entirely of Arabic numerals and cannot be blank or a concatenation of numerals and other characters; otherwise, the input is treated as other text description information. Verification of the text description information mainly comprises: determining whether the input field is blank, garbled, made up of special symbols, or an irregular sequence of letters; if so, verification fails. Special characters here refer to characters other than punctuation marks, English letters, and Arabic numerals. For example, if the input address text is only "AAAAA" or "% > and #", verification determines that it cannot be address text, and the subsequent steps are not entered. If verification fails, the server feeds back error/unrecognizable information to the interface, and the original address must be entered again for re-verification.
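The two verification steps can be sketched as below. The text does not define exactly what counts as garbled input or irregular letter ordering, so the rules in `validate_text_field` (reject blank input, input with no letters or digits, and one character repeated four or more times) are assumptions chosen only to reproduce the two rejected examples; both function names are hypothetical.

```python
import re


def is_digital_code(field: str) -> bool:
    """True when the field is non-blank and made up only of Arabic numerals."""
    return re.fullmatch(r"[0-9]+", field) is not None


def validate_text_field(field: str) -> bool:
    """Assumed heuristic check for a text description field."""
    text = field.strip()
    if not text:
        return False  # blank input
    if not any(ch.isalnum() for ch in text):
        return False  # symbols only, no letters or digits
    if len(text) >= 4 and len(set(text)) == 1:
        return False  # a single character repeated, e.g. "AAAAA"
    return True
```

A field that fails `is_digital_code` is not rejected outright; as described above, it is treated as text description information and checked with the text rules instead.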
S3, segmenting the address text to be searched into a plurality of search elements based on a word segmentation model; and respectively matching each search element with information in the standard address library, and outputting a positioning result according to the total similarity value calculated by matching. Step S3 may include steps S31 to S34, which are specifically as follows:
S31, segmenting an address text to be searched into a plurality of search elements based on a word segmentation model;
S32, when the retrieval element comprises AOI description information, matching the AOI information in the standard address library, as an index, with the address text to be retrieved to obtain a first similarity value; and/or when the retrieval element comprises building description information, matching the building information in the standard address library, as an index, with the address text to be retrieved to obtain a second similarity value; and/or when the retrieval element comprises room description information, matching the room information in the standard address library, as an index, with the address text to be retrieved to obtain a third similarity value; and/or when the retrieval element comprises street description information, matching the street information in the standard address library, as an index, with the address text to be retrieved to obtain a fourth similarity value;
specifically, the first similarity value, the second similarity value, or the third similarity value may be obtained by analysis through a preset edit distance algorithm (Levenshtein Distance); the fourth similarity value may be obtained by analysis of a predetermined Jaccard distance algorithm (Jaccard).
Specifically, obtaining any one of the first similarity value, the second similarity value, and the third similarity value through the preset edit distance algorithm includes calculating the edit distance (Levenshtein distance). Assume there are two character strings $A$ and $B$ with lengths $L_A$ and $L_B$ respectively; the edit distance recurrence is as follows:

$$
ED(A_i, B_j) = \min
\begin{cases}
ED(A_{i-1}, B_j) + 1 \\
ED(A_i, B_{j-1}) + 1 \\
ED(A_{i-1}, B_{j-1}) & \text{if } a_i = b_j \\
ED(A_{i-1}, B_{j-1}) + 1 & \text{if } a_i \neq b_j
\end{cases}
$$

where $A_i$ denotes the first $i$ bytes of string $A$, $B_j$ denotes the first $j$ bytes of string $B$, $a_i$ and $b_j$ are the $i$-th byte of $A$ and the $j$-th byte of $B$, and $ED(A_i, B_j)$ is the distance from the first $i$ bytes of $A$ to the first $j$ bytes of $B$. $ED(A_{i-1}, B_j) + 1$ represents deleting one byte from $A$ to match $B$; $ED(A_i, B_{j-1}) + 1$ represents adding one byte to $A$ to match $B$; $ED(A_{i-1}, B_{j-1})$ and $ED(A_{i-1}, B_{j-1}) + 1$ represent a match or a mismatch, depending on whether the respective symbols are identical. Then, when the first character string $A$ is any one of AOI information, building information, and room information, and the second character string $B$ is the corresponding one of AOI description information, building description information, and room description information, the required similarity is:
$$
\mathrm{sim}(A, B) = 1 - \frac{ED_{AB}}{\max(L_A, L_B)}
$$
where $ED_{AB}$ is the edit distance between the first character string $A$ and the second character string $B$, $L_A$ is the length of the first character string $A$, and $L_B$ is the length of the second character string $B$. For example, when the first character string $A$ is AOI information, the second character string $B$ should be AOI description information.
Thus, the first similarity value is recorded as $S_{aoi} = 100 \times \mathrm{sim}(A, B)$, the second similarity value is recorded as $S_{building} = 100 \times \mathrm{sim}(A, B)$, and the third similarity value is recorded as $S_{house} = 100 \times \mathrm{sim}(A, B)$.
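The edit distance recurrence and the scaled similarity values can be sketched as follows. Two assumptions are made: the similarity is taken to be one minus the quotient of the edit distance and the larger of the two lengths, which is one reading of the quotient described in this application, and the code compares characters rather than bytes. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row from the recurrence."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or mismatch
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed form: 1 - ED / max(L_A, L_B); two empty strings count as equal."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def scaled_similarity(a: str, b: str) -> float:
    """S_aoi, S_building, or S_house, depending on which fields are passed."""
    return 100 * similarity(a, b)
```

Under this assumption, identical strings score 100 and strings with no characters in common positions score 0, so a higher value always means a closer match.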
Specifically, obtaining the fourth similarity value $S_{prefix}$ through the preset Jaccard distance algorithm includes calculating the Jaccard coefficient (the related index is called the Jaccard distance), with the equation as follows:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
$$

where $A$ is the first character string and $B$ is the second character string; the first character string $A$ is street information and the second character string $B$ is street description information; $A \cap B$ is the intersection and $A \cup B$ is the union of $A$ and $B$; $|A|$ and $|B|$ represent the number of elements in each set, that is, the number of characters in $A$ and in $B$ respectively; and $J(A, B)$ is defined as 1 when the sets $A$ and $B$ are both empty.
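A minimal sketch of the Jaccard coefficient on two strings follows. It assumes that the elements of each set are the distinct characters of the string, which is one plausible reading of the description; the function name `jaccard` is hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over the sets of characters of a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # defined as 1 when both sets are empty
    return len(set_a & set_b) / len(set_a | set_b)
```

Unlike the edit distance, this measure ignores character order, so reordered street fragments still score highly as long as they share the same characters.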
It should be noted that the total similarity value is a sum of at least one similarity value among the first similarity value, the second similarity value, the third similarity value, and the fourth similarity value; the first similarity value, the second similarity value, the third similarity value and the fourth similarity value are respectively larger than the corresponding preset similarity threshold value.
Wherein, the formula of the total similarity value is as follows:
$$
S_{total} = (S_{aoi} + S_{building} + S_{house}) \times 100 + S_{prefix}
$$

where $S_{total}$ is the total similarity value, $S_{aoi}$ is the first similarity value, $S_{building}$ is the second similarity value, $S_{house}$ is the third similarity value, and $S_{prefix}$ is the fourth similarity value. In practical application, matching requires at least two search elements. When calculating the total similarity value $S_{total}$, the similarity value of any search element that is not involved is taken as 0, the other similarity values are calculated as described in the steps above, and all of them are substituted into the formula for $S_{total}$.
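The total similarity formula can be sketched as below. Representing a search element that is not involved as `None` and raising an error when fewer than two elements are supplied are illustrative choices; the function name is hypothetical.

```python
from typing import Optional


def total_similarity(
    s_aoi: Optional[float] = None,
    s_building: Optional[float] = None,
    s_house: Optional[float] = None,
    s_prefix: Optional[float] = None,
) -> float:
    """S_total = (S_aoi + S_building + S_house) * 100 + S_prefix."""
    provided = [s for s in (s_aoi, s_building, s_house, s_prefix) if s is not None]
    if len(provided) < 2:
        raise ValueError("at least two search elements are required")
    # A search element that is not involved contributes 0.
    body = (s_aoi or 0.0) + (s_building or 0.0) + (s_house or 0.0)
    return body * 100 + (s_prefix or 0.0)
```

The factor of 100 on the first three terms means the address body dominates the ranking, with the street similarity acting mainly as a tie-breaker.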
Specifically, when calculating each similarity value, it is further required to determine whether the obtained similarity value is valid, and if not, it is required to recalculate to obtain a valid similarity value, where the validity determination method includes: if at least one similarity value among the first similarity value, the second similarity value, the third similarity value and the fourth similarity value is smaller than or equal to a corresponding preset similarity threshold value, abnormal prompt information is generated; the abnormal prompting information is used for prompting to acquire a corresponding similarity value again to serve as a target similarity value until the target similarity value is larger than a corresponding preset similarity threshold value. That is, it is necessary to determine whether the similarity value is greater than a set similarity threshold; if yes, the value is judged to be valid, and the subsequent steps can be continuously executed; if not, the server judges that the value is invalid, and can feed back error/unrecognizable information to the interface until a similarity value larger than a similarity threshold value is obtained. The similarity threshold value can be set according to actual service requirements, and the application is not limited.
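The validity check can be sketched as a function that reports which similarity values fail their thresholds. The dictionary keys and the threshold values used in any call are hypothetical, since the application leaves the thresholds to actual service requirements.

```python
from typing import Dict, List


def below_threshold(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of similarity values that are less than or equal to their threshold."""
    return [name for name, value in scores.items() if value <= thresholds[name]]
```

A non-empty result corresponds to the case in which the abnormal prompt information is generated and the affected similarity value has to be obtained again.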
S33, outputting a positioning result according to the total similarity value calculated by matching.
Specifically, the step may include: according to the total similarity value calculated by matching, the total similarity value is arranged to obtain a total similarity value sequence; wherein, the arrangement mode comprises ascending arrangement or descending arrangement; when the addresses are arranged in ascending order, outputting the address positioning results corresponding to the matching from high to low; and outputting the address positioning result corresponding to the matching from low to high when the addresses are arranged in descending order.
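One way to read this step is that, whichever way the sequence is arranged, the best match is output first. The sketch below assumes that reading and simply sorts candidates by total similarity in descending order; the function name and the `(address, score)` pair layout are hypothetical.

```python
from typing import List, Tuple


def rank_results(scored: List[Tuple[str, float]]) -> List[str]:
    """Return addresses ordered from highest to lowest total similarity."""
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return [address for address, _ in ordered]
```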
In one embodiment, step S3 may further include step S34, which is specifically as follows:
S34, when the retrieval element comprises a digital code, matching the administrative division code in the standard address library, as an index, with the address text to be retrieved to obtain a code matching result; and if the code matching result is an invalid matching result, matching at least one address element other than the administrative division code, as an index, with the address text to be retrieved according to a preset element matching sequence until a valid matching result is obtained.
Specifically, when the search element includes a digital code, preferentially matching the address text to be searched with an administrative division code as an index; and then matching other retrieval elements. That is, screening and filtering are performed according to administrative division codes of addresses to be searched, and if administrative division codes are identified, matching of other search elements is searched in an index table after screening and filtering; if no administrative division code is identified, matches for other retrieval elements will be found in the index table of the full library.
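The screening by administrative division code can be sketched as follows. Records are modeled as dictionaries keyed by the index fields of the standard address library; the second record and the fallback to the full library when the code matches nothing are illustrative assumptions, and the function name is hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]

# The first record follows the example table; the second is purely illustrative.
LIBRARY: List[Record] = [
    {"ID": "1", "AOI": "Wanke City Garden", "BUILDING": "117", "HOUSE": "102",
     "PREFIX": "No. 1 University Garden Road", "ADCODE": "420100"},
    {"ID": "2", "AOI": "Example Garden", "BUILDING": "3", "HOUSE": "501",
     "PREFIX": "No. 8 Example Road", "ADCODE": "440300"},
]


def candidate_records(library: List[Record], adcode: Optional[str]) -> List[Record]:
    """Narrow the library by administrative division code when one is identified."""
    if adcode:
        filtered = [r for r in library if r["ADCODE"] == adcode]
        if filtered:
            return filtered
    # No code identified (or no record matched it): search the full library.
    return library
```

Filtering first keeps the more expensive similarity calculations confined to records in the same administrative division.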
In this embodiment, when the verified address text to be retrieved contains any two or three fields among AOI description information, building description information, room description information, or the digital code, administrative division code matching is performed first. When no administrative division code index match is identified, AOI information matching is performed; when no AOI information index match is identified, building information matching is performed; when no building information index match is identified, room information matching is performed. These steps together constitute matching on the address body. In addition, street information matching is generally also required, and street information matching is combined with address body matching. When a match is identified, the server returns the corresponding list of matching results through the interface, ordered by similarity.
In the embodiment of the application, the standard address library is constructed based on the address word segmentation model, and the corresponding information is obtained by matching through a retrieval method, so that missed recognition or misrecognition after matching named-entity text can be avoided, and the effect of extracting address or service information and further normalizing addresses can be improved.
In one embodiment, referring to fig. 3, the present application further provides an address location device 300 based on a word segmentation model, and the method described above is applied, including:
an address library construction module 310, configured to segment original address information into a plurality of different types of address elements based on a word segmentation model; constructing an index table according to a plurality of address elements of different types, and constructing a standard address library by corresponding original address information to the index table one by one;
a location obtaining module 320, configured to obtain at least one address text to be retrieved;
the positioning matching module 330 is configured to segment the address text to be retrieved into a plurality of retrieval elements based on the word segmentation model; and respectively matching each search element with information in the standard address library, and outputting a positioning result from high to low according to the total similarity value calculated by matching.
In one embodiment, the address element includes: at least two of AOI information, building information, room information, street information, and administrative division codes; the search element includes: at least two of AOI description information, building description information, room description information, street description information, and digital codes.
In one embodiment, the location matching module 330 is further configured to: when the retrieval element comprises AOI description information, the AOI information in the standard address library is used as an index to be matched with the address text to be retrieved, and a first similarity value is obtained; and/or when the retrieval element comprises building description information, matching building information in the standard address library as an index with the address text to be retrieved to obtain a second similarity value; and/or when the retrieval element comprises room description information, matching the room information in the standard address library with the address text to be retrieved as an index to obtain a third similarity value; and/or when the retrieval element comprises street description information, the street information in the standard address library is used as an index to be matched with the address text to be retrieved, and a fourth similarity value is obtained.
In one embodiment, the location matching module 330 is further configured to: when the retrieval element comprises a digital code, matching an administrative division code in a standard address library as an index with an address text to be retrieved to obtain a code matching result; and if the code matching result is an invalid matching result, matching at least one address element except the administrative division code with the address text to be searched as an index according to a preset element matching sequence until an effective matching result is obtained.
In one embodiment, the location matching module 330 is further configured to: analyzing and obtaining any one similarity value among the first similarity value, the second similarity value and the third similarity value through a preset editing distance algorithm; analyzing to obtain a fourth similarity value through a preset Jacquard distance algorithm; the total similarity value is the sum of at least one similarity value among the first similarity value, the second similarity value, the third similarity value and the fourth similarity value; the first similarity value, the second similarity value, the third similarity value and the fourth similarity value are respectively larger than the corresponding preset similarity threshold value.
In one embodiment, the location matching module 330 is further configured to: determining a first length of the first string and determining a second length of the second string; the first character string is any one of AOI information, building information and room information, and the second character string is any one of AOI description information, building description information and room description information; calculating the editing distance between the first character string and the second character string through a preset editing distance algorithm, and acquiring the maximum value of the length in the first length and the second length; and analyzing the quotient between the editing distance and the length maximum value to obtain any one similarity value among the first similarity value, the second similarity value and the third similarity value.
In one embodiment, the location matching module 330 is further configured to: determining a first character string and a second character string; the first character string is street information, and the second character string is street description information; and analyzing the Jacquard distance between the first character string and the second character string through a preset Jacquard distance algorithm to obtain a fourth similarity value.
In one embodiment, the location matching module 330 is further configured to: according to the total similarity value calculated by matching, the total similarity value is arranged to obtain a total similarity value sequence; wherein, the arrangement mode comprises ascending arrangement or descending arrangement; when the addresses are arranged in ascending order, outputting the address positioning results corresponding to the matching from high to low; and outputting the address positioning result corresponding to the matching from low to high when the addresses are arranged in descending order.
In one embodiment, the address library construction module 310 is further configured to: normalizing the address elements to obtain standard address elements; establishing an index by using standard address elements, and accessing corresponding original address information to construct a standard address library; wherein, the normalization processing includes: and carrying out normalized recognition on the address elements based on a standard address word stock.
More specifically, the normalization and preprocessing method in this application aims to extract the valid address from the input to the greatest extent. The normalization process is versioned and iterated, and new versions are released to accommodate the address writing habits of different users, ultimately ensuring that an address processed by the normalization method can be matched to the corresponding service outlet (network point) according to the geocoding of the address entered by the user. Configuring the algorithm of the normalization method improves the coverage and accuracy of outlet identification; the data sources include manual rules and changes to national administrative divisions. After a manual rule requirement is received, the corresponding configuration table is modified as required to generate a new version. The new and old versions of the normalization are then tested on outlet identification coverage and accuracy, with geocoding matching performed separately for each version, to ensure that the new version does not negatively affect the original outlet identification results and that only requirements with a positive effect are released as a new version.
Specifically, each module in the address location device based on the word segmentation model may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing a trained address text field model and a sequence annotation model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements an address location method based on a word segmentation model.
It will be appreciated by persons skilled in the art that the structure of the apparatus described above is not limiting as to the computer device to which the present application applies, and that a particular computer device may include more or less components than those shown in the figures, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the program may carry out the steps of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The foregoing examples represent only a few embodiments of the present application. They are described in relative detail but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and these would fall within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (10)

1. An address positioning method based on a word segmentation model is characterized by comprising the following steps:
dividing the original address information into a plurality of address elements of different types based on a word segmentation model; constructing an index table according to the plurality of address elements of different types, and constructing a standard address library by corresponding the original address information to the index table one by one;
acquiring at least one address text to be retrieved;
Dividing the address text to be searched into a plurality of search elements based on a word segmentation model; and matching each search element with the information in the standard address library respectively, and outputting a positioning result according to the total similarity value calculated by matching.
2. The method of claim 1, wherein the address element comprises: at least two of AOI information, building information, room information, street information, and administrative division codes;
the search element includes: at least two of AOI description information, building description information, room description information, street description information, and digital codes.
3. The method of claim 2, wherein said matching each of said search elements with information in said standard address library, respectively, comprises:
when the retrieval element comprises the AOI description information, matching the AOI information in the standard address library, as an index, with the address text to be retrieved to obtain a first similarity value; and/or,
when the retrieval element comprises the building description information, matching the building information in the standard address library, as an index, with the address text to be retrieved to obtain a second similarity value; and/or,
when the retrieval element comprises the room description information, matching the room information in the standard address library, as an index, with the address text to be retrieved to obtain a third similarity value; and/or,
when the retrieval element comprises the street description information, matching the street information in the standard address library, as an index, with the address text to be retrieved to obtain a fourth similarity value.
4. A method according to claim 3, characterized in that the method further comprises:
when the retrieval element comprises a digital code, matching an administrative division code in the standard address library with the address text to be retrieved as an index to obtain a code matching result;
and if the code matching result is an invalid matching result, matching at least one address element except the administrative division code with the address text to be searched as an index according to a preset element matching sequence until an effective matching result is obtained.
5. A method according to claim 3, characterized in that the method further comprises:
analyzing and obtaining any one similarity value among the first similarity value, the second similarity value and the third similarity value through a preset editing distance algorithm;
Analyzing and obtaining the fourth similarity value through a preset Jacquard distance algorithm;
wherein the total similarity value is a sum of at least one similarity value among the first similarity value, the second similarity value, the third similarity value, and the fourth similarity value; the first similarity value, the second similarity value, the third similarity value and the fourth similarity value are respectively larger than a corresponding preset similarity threshold value.
6. The address location method according to claim 5, wherein analyzing, by a preset edit distance algorithm, any one of the first similarity value, the second similarity value, and the third similarity value includes:
determining a first length of the first string and determining a second length of the second string; the first character string is any one of the AOI information, the building information and the room information, and the second character string is any one of the AOI description information, the building description information and the room description information;
calculating the editing distance between the first character string and the second character string through a preset editing distance algorithm, and acquiring the maximum value of the lengths in the first length and the second length;
And analyzing the quotient value between the editing distance and the length maximum value to obtain any one similarity value among the first similarity value, the second similarity value and the third similarity value.
7. The address location method of claim 5, wherein obtaining the fourth similarity value through the preset Jaccard distance algorithm comprises:
determining a first character string and a second character string; the first character string is the street information, and the second character string is the street description information;
and analyzing the Jaccard distance between the first character string and the second character string through the preset Jaccard distance algorithm to obtain the fourth similarity value.
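As a rough illustration of this step, the Jaccard measure can be computed over the sets of characters in the two street strings. Treating each string as a character set, and using one minus the Jaccard distance as the similarity value, are both assumptions; the claim does not say how the strings are tokenized or how the distance is converted.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the character sets of the two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(union)

def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance is the complement of the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)
```

Because the sets ignore character order, this measure tolerates reordered or partially abbreviated street names better than an edit distance, which may be why the claims apply it to the street field.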
8. The address location method of any of claims 3 to 7, wherein outputting the positioning result according to the total similarity value calculated by the matching comprises:
arranging the total similarity values calculated by the matching to obtain a total similarity value sequence; wherein the arrangement is in ascending order or descending order;
when the sequence is arranged in ascending order, outputting the corresponding matched address positioning results from high to low;
and when the sequence is arranged in descending order, outputting the corresponding matched address positioning results from low to high.
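A small sketch of the arrangement step, under the assumption that the goal is to present the best-matching address first, as the device claim's "from high to low" output suggests. The `(address, total_similarity)` pair layout and both function names are hypothetical. An ascending sequence is simply read in reverse.

```python
def arrange(candidates, ascending=False):
    """Sort (address, total_similarity) pairs by total similarity."""
    return sorted(candidates, key=lambda pair: pair[1], reverse=not ascending)

def output_high_to_low(sequence, ascending):
    """Return addresses best-first from a sequence produced by arrange()."""
    ordered = reversed(sequence) if ascending else sequence
    return [address for address, _ in ordered]
```

Either storage order yields the same best-first output, so the choice between ascending and descending arrangement only affects how the sequence is traversed.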
9. The address location method of claim 1, wherein constructing an index table from the plurality of different types of address elements, and constructing a standard address library by placing the original address information in one-to-one correspondence with the index table, comprises:
normalizing the address elements to obtain standard address elements;
establishing an index by using the standard address elements, and linking the corresponding original address information to construct the standard address library; wherein the normalization process includes:
performing normalized recognition on the address elements based on a standard address lexicon.
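The construction described here might look roughly like the following. The lexicon contents, the whitespace tokenization, and the `(element type, standard text)` index key are illustrative assumptions; the claim specifies only that elements are normalized against a standard address lexicon and then used as index keys pointing at the original address information.

```python
from collections import defaultdict

# Hypothetical standard address lexicon mapping variants to canonical forms.
STANDARD_LEXICON = {"Rd": "Road", "Rd.": "Road", "Bldg": "Building", "St": "Street"}

def normalize(element: str) -> str:
    """Replace known variant tokens with their standard form."""
    return " ".join(STANDARD_LEXICON.get(tok, tok) for tok in element.split())

def build_address_library(segmented):
    """Build the index from (original_address, {element_type: element_text}) pairs."""
    index = defaultdict(list)
    for original, elements in segmented:
        for etype, text in elements.items():
            # Each normalized element becomes a key that links back to the original address.
            index[(etype, normalize(text))].append(original)
    return dict(index)
```

Normalizing before indexing means that "Keji Rd" and "Keji Road" land under the same key, so a query written either way reaches both records.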
10. An address locating device based on a word segmentation model, comprising:
the address library construction module is used for dividing the original address information into a plurality of address elements of different types based on the word segmentation model, constructing an index table according to the plurality of address elements of different types, and constructing a standard address library by placing the original address information in one-to-one correspondence with the index table;
the positioning acquisition module is used for acquiring at least one address text to be retrieved;
the positioning matching module is used for dividing the address text to be retrieved into a plurality of search elements based on the word segmentation model, matching the plurality of search elements respectively with the information in the standard address library, and outputting positioning results from high to low according to the total similarity value calculated by the matching.
CN202111658539.2A 2021-12-30 2021-12-30 Address positioning method and device based on word segmentation model Pending CN116414823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111658539.2A CN116414823A (en) 2021-12-30 2021-12-30 Address positioning method and device based on word segmentation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111658539.2A CN116414823A (en) 2021-12-30 2021-12-30 Address positioning method and device based on word segmentation model

Publications (1)

Publication Number Publication Date
CN116414823A true CN116414823A (en) 2023-07-11

Family

ID=87058281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111658539.2A Pending CN116414823A (en) 2021-12-30 2021-12-30 Address positioning method and device based on word segmentation model

Country Status (1)

Country Link
CN (1) CN116414823A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117240825A (en) * 2023-11-09 2023-12-15 北京火山引擎科技有限公司 Address library construction method, device, equipment and medium applied to CDN
CN117240825B (en) * 2023-11-09 2024-02-02 北京火山引擎科技有限公司 Address library construction method, device, equipment and medium applied to CDN
CN117312476A (en) * 2023-11-15 2023-12-29 金田产业发展(山东)集团有限公司 Territorial space planning method and system based on GIS
CN117349451A (en) * 2023-12-01 2024-01-05 广东中思拓大数据研究院有限公司 Data processing method, data processing apparatus, computer device, and storage medium

Similar Documents

Publication Publication Date Title
CN108388559B (en) Named entity identification method and system under geographic space application and computer program
CN108628811B (en) Address text matching method and device
CN116414823A (en) Address positioning method and device based on word segmentation model
WO2018177316A1 (en) Information identification method, computing device, and storage medium
CN107145577A (en) Address standardization method, device, storage medium and computer
CN110020433B (en) Industrial and commercial high-management name disambiguation method based on enterprise incidence relation
CN107203526B (en) Query string semantic demand analysis method and device
CN106909611B (en) Hotel automatic matching method based on text information extraction
CN112347222A (en) Method and system for converting non-standard address into standard address based on knowledge base reasoning
CN109933797A (en) Geocoding and system based on Jieba participle and address dictionary
CN113127506B (en) Target query statement construction method and device, storage medium and electronic device
CN106874287A (en) A kind of processing method and processing device of point of interest POI geocodings
CN108733810B (en) Address data matching method and device
CN111291099B (en) Address fuzzy matching method and system and computer equipment
Christen et al. A probabilistic geocoding system based on a national address file
CN116414824A (en) Administrative division information identification and standardization processing method, device and storage medium
CN108345662A (en) A kind of microblog data weighted statistical method of registering considering user distribution area differentiation
CN105159885A (en) Point-of-interest name identification method and device
CN114091454A (en) Method for extracting place name information and positioning space in internet text
CN113761137B (en) Method and device for extracting address information
CN116431625A (en) Positioning analysis method and device for geographic entity and computer equipment
CN117010373A (en) Recommendation method for category and group to which asset management data of power equipment belong
CN116414808A (en) Method, device, computer equipment and storage medium for normalizing detailed address
CN110175219A (en) A kind of K12 stage repeats school's recognition methods, device, equipment and storage medium
CN116303854A (en) Positioning method and device based on address knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination