WO2021000831A1 - Address matching method and apparatus, computer device, and storage medium - Google Patents

Address matching method and apparatus, computer device, and storage medium

Info

Publication number
WO2021000831A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
matching
word segmentation
segments
segmentation
Prior art date
Application number
PCT/CN2020/098804
Other languages
English (en)
Chinese (zh)
Inventor
申超波
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021000831A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24564: Applying rules; Deductive queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468: Fuzzy queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29: Geographical information databases

Definitions

  • This application relates to the field of big data, in particular to address matching methods, devices, computer equipment and storage media.
  • The main purpose of this application is to provide an address matching method that addresses the defects of existing address matching approaches.
  • The first address is the address to be retrieved, as input by the user, and the second address is stored in the index server.
  • The method includes:
  • invoking a preset matching algorithm to segment the first address and the second address according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
  • This application also provides an address matching device, where the first address is the address to be retrieved, as input by the user, and the second address is stored in the index server. The device includes:
  • a word segmentation module, used to invoke a preset matching algorithm and segment the first address and the second address according to a first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
  • a dividing module, configured to divide the first address into a plurality of first segments according to the first word segmentation group, and divide the second address into a plurality of second segments according to the second word segmentation group;
  • a second acquiring module, configured to acquire the matching results of all the first segments and all the second segments according to a second preset rule;
  • a judgment module, configured to judge whether the first address and the second address are the same according to the matching result.
  • This application also provides a computer device including a memory and a processor. The memory stores a computer program, and the processor implements an address matching method when executing the computer program. The first address is the address to be retrieved, as input by the user, and the second address is stored in the index server. The method includes the step, stated above, of invoking a preset matching algorithm to segment both addresses according to the first preset rule.
  • This application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the same address matching method, with the first address being the address to be retrieved, as input by the user, and the second address stored in the index server.
  • The first four administrative levels of the segmented addresses are matched exactly against the tree-shaped national address database of provinces, cities, counties, and towns.
  • Partially missing address levels are effectively completed, and the massive amount of data pre-stored in the index server is built into an index structure. Combined with the Elasticsearch component's own computing architecture and distributed computing capabilities, this achieves real-time, fast querying of the first address in the preset index structure.
  • Fig. 1 is a schematic flowchart of an address matching method according to an embodiment of the present application.
  • Fig. 2 is a schematic structural diagram of an address matching device according to an embodiment of the present application.
  • Fig. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
  • In this embodiment, the first address is an address to be retrieved that is input by a user, and the second address is stored in an index server.
  • S1: Invoke a preset matching algorithm and segment the first address and the second address according to the first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address. The preset matching algorithm includes word segmentation calculation and matching calculation.
  • The first address and the second address are written in order of administrative level from high to low, that is, from general to specific.
  • The first preset rule of this embodiment applies different word segmentation rules to different administrative levels in the address.
  • The four administrative levels of province, city, district/county, and township/town are segmented using the general nationwide address database. For example, "Guicheng Town, Nanhai District, Foshan City, Guangdong Province" is segmented as: Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
  • The remainder of the address is segmented through semantic segmentation.
  • S2: Divide the first address into a plurality of first segments according to the first word segmentation group, and divide the second address into a plurality of second segments according to the second word segmentation group.
  • The address is divided into segments and/or administrative levels according to its word segmentation group, and each segment or administrative level corresponds to one or more word segments.
  • The terms "first", "second", etc. in this embodiment are used only for distinction, not for limitation. Similar terms elsewhere have the same effect and will not be explained again.
  • The word segmentation group is the arrangement of the word segments of the actual address, formed in the writing order of the original address.
  • For example, the long name "development zone of a certain city" corresponds to two word segments, "a certain city/development zone", but segments are formed by grouping word segments by administrative level, so "development zone of a certain city" belongs to a single segment.
  • The first segments and the second segments are matched one by one according to the correspondence of administrative levels to obtain the matching result.
  • For example, the first segment corresponding to the province level of the first address is compared with the second segment corresponding to the province level of the second address, which improves the symmetry and reliability of the comparison.
  • This embodiment compares the first address and the second address in one-to-one correspondence by administrative level.
  • If the matching rate of the first address and the second address falls within the preset range, the two addresses are determined to be the same; otherwise they are determined to be different.
  • The segment matching degree at a designated administrative level may also be required to reach 100% before the first address and the second address can be determined to be the same; otherwise they are determined to be different. This improves matching accuracy.
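The judgment described above can be sketched in a few lines of Python. This is only an illustration: the function name, the 0.9 threshold, and the `required_values` parameter are assumptions, not values given in the application.

```python
def is_same_address(matching_rate, threshold=0.9, required_values=()):
    """Decide whether two addresses count as the same.

    matching_rate   -- overall matching rate in [0, 1]
    threshold       -- hypothetical preset threshold (0.9 is an assumption)
    required_values -- matching values of designated levels that must be 1.0
    """
    # A designated level that is not fully matched vetoes the result.
    if any(value < 1.0 for value in required_values):
        return False
    return matching_rate >= threshold


print(is_same_address(0.95))                         # rate alone decides
print(is_same_address(0.95, required_values=[0.5]))  # designated level failed
```

Keeping the designated-level check separate from the threshold lets a business rule such as "the community name must match fully" override an otherwise high overall rate.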
  • The first address in this embodiment is the address to be queried, as entered by the user. Its data composition is not restricted: the matching calculation can be performed in every case, which gives the user more flexibility.
  • For example, the first address includes data arranged in sequence across six administrative levels (province; city; district/county; town/township/road; community; building and house number), or data in which one or several of these levels are missing.
  • The preset matching condition in this embodiment includes the matching rate reaching a preset threshold, or the mark data in the first address reaching a 100% match, and so on.
  • The mark data is the information in the first address that identifies a specific geographic location, such as the name of a community or a building. For example, "Rongyuan of Jiangnan Mingju Residential Quarter" in the first address is mark data.
  • In the first address, the mark data is the information after the "town, township" administrative level and before the "building and house number".
  • In one embodiment, the first address and the second address each include a range address and a mark address, and step S1 of invoking the preset matching algorithm and segmenting the first address and the second address according to the first preset rule, to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, includes:
  • S11: Perform word segmentation on the range addresses corresponding to the first address and the second address, respectively, according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address.
  • The range address of this embodiment includes at least one of the four administrative levels of province, city, district/county, and township/town.
  • The range address in this embodiment is segmented through the pre-associated address dictionary.
  • The address dictionary is the corresponding vocabulary of a national address database; address names are segmented by pre-associating the dictionary with the natural language processing model.
  • The preset matching algorithm in this embodiment includes word segmentation calculation and matching calculation.
  • When the open-source word segmentation package jieba is used for the word segmentation calculation, a crawled address library is added and used together with the national address library to correct the address to be segmented. Word segmentation is then performed by administrative level, which improves its accuracy.
  • The address dictionary is called for the word segmentation calculation.
  • For example, the word segmentation result is: Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Residential District Rongyuan Block 306. The first word segmentation part is Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
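As a rough illustration of dictionary-based segmentation of the range address, the sketch below peels administrative names off the front of an address written from high level to low. It does not use jieba; the tiny `ADDRESS_DICT`, the level names, and the romanized, space-separated input are all assumptions made for the example.

```python
# Hypothetical miniature address dictionary; a real one would be loaded
# from the national address database.
ADDRESS_DICT = {
    "province": ["Guangdong Province"],
    "city": ["Foshan City", "Guangzhou City"],
    "district": ["Nanhai District", "Chancheng District"],
    "town": ["Guicheng Town"],
}
LEVELS = ["province", "city", "district", "town"]


def segment_range_address(address):
    """Split leading administrative names off `address`, highest level first.

    Returns (range_segments, remainder). Levels that are absent from the
    address are simply skipped.
    """
    segments = []
    rest = address.strip()
    for level in LEVELS:
        # Try longer names first so a long entry wins over its own prefix.
        for name in sorted(ADDRESS_DICT[level], key=len, reverse=True):
            if rest.startswith(name):
                segments.append(name)
                rest = rest[len(name):].lstrip(" ,")
                break
    return segments, rest


segs, rest = segment_range_address(
    "Guangdong Province Foshan City Nanhai District Guicheng Town "
    "Jiangnan Mingju Residential District Rongyuan Block 306"
)
print("/".join(segs))
print(rest)
```

The remainder returned here is what the text hands to the grammar-model step for the mark address and detail address.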
  • S12: Perform word segmentation on the mark addresses corresponding to the first address and the second address, respectively, according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address.
  • The mark address in this embodiment includes information that identifies a specific geographic location, such as the name of a community or a building, for example "Jiangnan Mingju Community Rongyuan" in the above address.
  • The mark address is segmented according to the first grammar model in the natural language processing model.
  • The first grammar model includes, but is not limited to, "a certain community" and "a certain building". For example, for "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town", the corresponding second word segmentation part is "Guicheng/Jiangnan Mingju Community/Rongyuan".
  • In the first grammar model of another embodiment of the present application, the characters after "town, township" and before "building and house number" are extracted as the mark address.
  • The first address and the second address in this embodiment each include a range address and a mark address, arranged from left to right.
  • For example, the first address is "Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province". The first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan".
  • The first address and the second address each further include a detail address, and the mark addresses corresponding to the first address and the second address are segmented according to the grammar model in the natural language processing model.
  • The detail address in this embodiment is the specific "building and house number". It has little influence on the similarity of two addresses, and other embodiments may even ignore it. In some application scenarios, however, the detail address must be accurate to meet business needs.
  • The second grammar model of this embodiment includes, but is not limited to, "a certain building", "a certain building and a certain floor", "a certain building and a certain room", and so on.
  • S15: Combine the first, second, and third word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address, and combine the first, second, and third word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address in this embodiment each include a range address, a mark address, and a detail address, arranged from left to right.
  • For example, the first address is "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "502, Building 1, Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province". The first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan/1 Block/306", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan/1 Block/502".
  • In one embodiment, the range address includes the four administrative levels of province, city, district/county, and township/town, the mark address includes a community name or a building name, and step S3 of obtaining the matching results of all the first segments and all the second segments according to the second preset rule includes:
  • S31: Map all the first segments and all the second segments into two structure trees with the same structure, in order of administrative level from high to low. Each structure tree includes multiple nodes, and each node corresponds one-to-one to a first segment or a second segment.
  • Each node corresponds to at least one segment, or to multiple word segments of the same administrative level.
  • For example, the word segment "Guangdong Province", corresponding to the highest administrative level "province" in the address, is used as the root node; the word segment "Foshan City", corresponding to the next-level child node "city", is connected below it; and so on down to the end node "1 Block 502".
  • The root node and the end node correspond to different administrative levels. The tree can represent a full address covering all administrative levels or a short address covering only some of them.
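One way to picture step S31 is as a chain of nodes, one per level, built in the same order for both addresses. The sketch below is an assumption-laden illustration: the `Node` class, the level names, and the choice to keep an empty node for a missing level (so that the two trees always share a shape) are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical level names, highest administrative level first.
LEVELS = ["province", "city", "district", "town", "mark", "detail"]


@dataclass
class Node:
    level: str
    segments: List[str] = field(default_factory=list)
    child: Optional["Node"] = None


def build_tree(segments_by_level):
    """Chain one node per level, from the root (province) downward.

    Levels missing from the input get an empty node, so two addresses
    always map to trees with the same structure.
    """
    root = None
    previous = None
    for level in LEVELS:
        node = Node(level, list(segments_by_level.get(level, [])))
        if previous is None:
            root = node
        else:
            previous.child = node
        previous = node
    return root


def walk(node):
    """Yield nodes from the root to the end node."""
    while node is not None:
        yield node
        node = node.child


tree = build_tree({
    "province": ["Guangdong Province"],
    "city": ["Foshan City"],
    "detail": ["1 Block", "502"],
})
for n in walk(tree):
    print(n.level, n.segments)
```

Because both trees are walked in the same level order, pairing nodes for the matching calculation reduces to zipping the two walks.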
  • The matching calculation in this embodiment maps the nodes of the two structure trees onto each other according to the correspondence of administrative levels, and calculates the matching value of each node from that correspondence.
  • The matching value is the number of matching segments divided by the number of all segments corresponding to the node. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the nodes match; otherwise they do not.
  • Different weights are set according to how strongly the segments of each administrative level affect the address, which makes it easier to meet different business requirements.
  • The second weight, corresponding to the mark address, is higher than the first weight, corresponding to the range address.
  • S34: Calculate the matching rate by multiplying each matching value by its corresponding weight, to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address.
  • The formula for the matching rate in this embodiment is: the matching result of each segment multiplied by the configured weight of that segment equals the matching rate of the segment, and the matching rates of all segments are added to obtain the matching result between the first address and the second address.
  • S35: The sum of the first matching rate, the second matching rate, and the third matching rate is used as the matching result of all the first segments and all the second segments.
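Steps S34 and S35 can be written compactly as a weighted sum. The symbols below are introduced only for this illustration and do not appear in the application: $m_i$ is the matching value of node $i$, $w_i$ is its configured weight, and $R$ is the overall matching result.

```latex
\[
R \;=\; \sum_{i=1}^{n} m_i\, w_i
  \;=\; R_{\text{range}} + R_{\text{mark}} + R_{\text{detail}},
\qquad 0 \le m_i \le 1 ,
\]
\[
R_{\text{range}} = \sum_{i \in \text{range}} m_i w_i, \quad
R_{\text{mark}} = \sum_{i \in \text{mark}} m_i w_i, \quad
R_{\text{detail}} = \sum_{i \in \text{detail}} m_i w_i .
\]
```

Here the three partial sums are the first, second, and third matching rates of S34, and their total is the matching result of S35.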
  • In one embodiment, step S32 of obtaining the matching value corresponding to each node of the two structure trees includes:
  • S321: According to the node correspondence, perform exact full matching between each first segment corresponding to the range address in the first address and each second segment corresponding to the range address in the second address, to obtain each first matching value.
  • The matching methods for nodes corresponding to different administrative levels in this embodiment are different.
  • The four administrative levels of province, city, district/county, and township/town are matched by exact full matching: if the corresponding characters are 100% identical, the nodes match; otherwise they do not.
  • For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the two nodes match.
  • S322: According to the node correspondence, perform model keyword matching, one to one, between each first segment corresponding to the mark address in the first address and each second segment corresponding to the mark address in the second address, to obtain each second matching value.
  • The segments of the mark address are matched by NLP (Natural Language Processing) model matching, where a match holds if one segment contains the other or is contained in it.
  • For example, "Jiangnan Mingju Community/Rongyuan" and "Jiangnan Mingju/Rongyuan" do not match character for character, but "Jiangnan Mingju Community" contains "Jiangnan Mingju", so a one-to-one matching relationship still holds.
  • S323: According to the node correspondence, perform numeric matching, one to one, between each first segment corresponding to the detail address in the first address and each second segment corresponding to the detail address in the second address, to obtain each third matching value.
  • If the detail address in this embodiment contains a first specified number of segments, of which a second specified number satisfy the matching relationship, the matching value of the detail address is the second specified number divided by the first specified number.
  • S324: Collect each of the first matching values, each of the second matching values, and each of the third matching values to obtain the matching values corresponding to each node of the two structure trees.
  • For example, the word segmentation group corresponding to the first address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju Community/Rongyuan/1/306.
  • The word segmentation group corresponding to the second address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju/Rongyuan/1/502.
  • The first and second addresses are divided into six administrative levels (province; city; district/county; town/township/road; community; building and house number), corresponding to six nodes, and the default weights of the nodes are 0.1/0.1/0.1/0.1/0.5/0.1.
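To make the example concrete, the sketch below applies the three matching styles (exact, containment, numeric) to the two word segmentation groups and combines them with the default weights. The function names are hypothetical, and dividing by the larger segment count is an assumption for the case where the two sides differ in length. Under these assumptions the total works out to 0.4 + 0.5 + 0.05 = 0.95; that figure is derived here and is not stated in the application.

```python
WEIGHTS = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]  # default weights quoted in the text


def exact(a, b):
    """Range levels: full, character-for-character match."""
    return 1.0 if a == b else 0.0


def contains(a, b):
    """Mark level: share of paired segments where one contains the other."""
    hits = sum(1 for x, y in zip(a, b) if x in y or y in x)
    return hits / max(len(a), len(b), 1)


def numeric(a, b):
    """Detail level: matching segments divided by the number of segments."""
    hits = sum(1 for x, y in zip(a, b) if x == y)
    return hits / max(len(a), len(b), 1)


def matching_values(a, b):
    """One matching value per node: four range nodes, mark node, detail node."""
    values = [exact(x, y) for x, y in zip(a[:4], b[:4])]
    values.append(contains(a[4], b[4]))
    values.append(numeric(a[5], b[5]))
    return values


first = ["Guangdong", "Foshan City", "Nanhai", "Guicheng",
         ["Jiangnan Mingju Community", "Rongyuan"], ["1", "306"]]
second = ["Guangdong", "Foshan City", "Nanhai", "Guicheng",
          ["Jiangnan Mingju", "Rongyuan"], ["1", "502"]]

values = matching_values(first, second)
rate = sum(v * w for v, w in zip(values, WEIGHTS))
print(values, round(rate, 2))
```

Only the house number differs between the two addresses, so the detail node contributes half of its weight and every other node contributes its full weight.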
  • In one embodiment, for step S33 of separately acquiring the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address, the method includes:
  • S331: Input a specified number of training samples, pre-labeled with similarity values, into the natural language processing model for training.
  • S332: Adjust the training parameter to the first parameter so that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value.
  • The default weights in this embodiment are obtained by training the model. The training parameters are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value, or within a preset deviation range.
  • The training parameters include each weight value, so training determines each weight value.
  • Other embodiments of the present application may also adjust one or more of the default weights for a specific application scenario, so that the matching model fits that scenario better.
  • In one embodiment, before step S11 of segmenting the range addresses corresponding to the first address and the second address according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address, the method includes:
  • S10: Call the address database to perform address correction on the first address and the second address, respectively, according to the third preset rule.
  • The first address or the second address in this embodiment may be inconsistent with the address data in the national address database. Address correction can be performed by calling the address database and includes address completion, removal of qualifiers, and so on.
  • For address completion, the root node can be completed from its child nodes (for example, Nanhai District can be completed upward with Foshan City), or an intermediate node can be completed from the nodes before and after it (for example, Foshan City and Guicheng Town can be completed with Nanhai District in between).
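The completion rule above can be sketched with a child-to-parent lookup. The `PARENT` table below is a hypothetical stand-in for the tree-shaped address database, and the sketch assumes place names are unique; a real database would have to disambiguate identically named places using the higher levels that are present.

```python
# Hypothetical child -> parent links taken from a tree-shaped address database.
PARENT = {
    "Guicheng Town": "Nanhai District",
    "Nanhai District": "Foshan City",
    "Foshan City": "Guangdong Province",
}


def complete_address(segments):
    """Fill in missing ancestors of the lowest known level.

    `segments` is ordered from high level to low and may have gaps.
    """
    if not segments:
        return []
    # Walk upward from the lowest known level until the root is reached.
    chain = [segments[-1]]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    chain.reverse()
    return chain


print(complete_address(["Nanhai District"]))
print(complete_address(["Foshan City", "Guicheng Town"]))
```

Walking upward from the lowest level covers both cases in the text at once: missing ancestors above the first segment and missing intermediate nodes between two known segments.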
  • In one embodiment, before step S1 of obtaining the first word segmentation group and the second word segmentation group, the method includes:
  • S1a: Index a specified number of unstructured address data items pre-stored in the index server to obtain the preset index structure.
  • The data pre-stored in the index server of this embodiment is unstructured data, stored in column form as key-value pairs.
  • Unstructured data here refers to data such as text, images, and voice, held in column storage based on NoSQL storage technology.
  • The amount of data is very large, so the distributed architecture of NoSQL technology is needed for storage and calculation.
  • The index server combines NoSQL distributed storage with the index structure to achieve real-time, fast query and calculation over massive data.
  • NoSQL refers to non-relational database technology, much of which is open source.
  • Elasticsearch is based on key-value storage and inverted indexes, and its calculation is mainly memory-based, which makes fast real-time calculation possible.
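To show why an inverted index speeds up lookup, here is a toy version in plain Python. It is not how Elasticsearch is implemented and it uses no Elasticsearch API; the function names and the sample keys are invented for the illustration.

```python
from collections import defaultdict


def build_inverted_index(addresses):
    """Map each word segment to the set of address keys that contain it."""
    index = defaultdict(set)
    for key, segments in addresses.items():
        for segment in segments:
            index[segment].add(key)
    return index


def candidates_for(index, segments):
    """Return keys of stored addresses that contain every given segment."""
    postings = [index.get(segment, set()) for segment in segments]
    return set.intersection(*postings) if postings else set()


stored = {
    "addr-1": ["Guangdong Province", "Foshan City", "Nanhai District", "Guicheng Town"],
    "addr-2": ["Guangdong Province", "Guangzhou City"],
}
index = build_inverted_index(stored)
print(candidates_for(index, ["Guangdong Province", "Foshan City"]))
```

A query touches only the posting sets of its own segments instead of scanning every stored address, which is the property the text relies on for real-time queries over massive data.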
  • S1b: Receive the interface plug-in uploaded to the designated directory of the index server, where the interface plug-in is formed by packaging the preset matching algorithm.
  • The index server in this embodiment is an open-source component and supports a plug-in mode.
  • The interface plug-in can extend the index server's org.elasticsearch.plugins.Plugin class to customize and extend its behavior. The address matching algorithm plug-in developed this way is loaded and used after the index server is restarted.
  • S1d: Establish a calculation association between the preset index structure and the interface plug-in by running the configuration parameters.
  • After the preset matching algorithm is developed, it is packaged, uploaded to the specified directory of the index server, and configured with the related configuration parameters. Loading and running the configuration parameters establishes the calculation association between the preset index structure and the interface plug-in, so that the address matching algorithm in the plug-in can be called to perform the matching calculation for the first address in the preset index structure and carry out the address data query.
  • The index server in this embodiment is the open-source Elasticsearch component, a distributed full-text search engine with a RESTful web interface that can perform real-time, fast queries on massive data.
  • The query steps include: (1) Import the addresses of the massive address library into the underlying storage of Elasticsearch as key-value pairs through the Elasticsearch data import interface, and index the keys. (2) Transform the address matching model according to the Elasticsearch custom extended search model, add it to the extension module of the Elasticsearch master node, and restart Elasticsearch, so that the address matching model builds on the distributed storage and high-concurrency computing of Elasticsearch.
  • This embodiment uses different matching methods, different matching models, and different matching weights for the segments corresponding to different administrative levels of the first address.
  • The first address in this embodiment is divided into six segments, corresponding to six administrative levels and to six nodes in the tree structure.
  • The first four of the six administrative levels share the same matching model, in which characters are matched one by one; the fifth administrative level uses the fuzzy matching model, in which one segment contains the other or is contained in it; the sixth administrative level uses the numeric matching model.
  • A filtering mechanism is set in the matching calculation process. First, the target segments corresponding to the four administrative levels "province/city, district/county/town, township, and road" are matched exactly, character by character. When the matching result for these target segments is lower than a preset threshold, it is determined that no address data in the preset index structure meets the preset matching condition with the first address, and the matching conclusion is output directly. This reduces the amount of matching calculation and improves response speed.
  • With the filtering mechanism in place, at least 90% of addresses can be filtered out, so an address only needs to be fully matched against the remaining 10%, which greatly saves computing resources.
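The filtering mechanism amounts to a cheap exact comparison that runs before the full weighted matching. In the sketch below, the function names, the threshold of 1.0, and the sample candidates are assumptions for illustration only.

```python
def range_score(query, candidate):
    """Fraction of the four leading levels that match character for character."""
    hits = sum(1 for q, c in zip(query[:4], candidate[:4]) if q == c)
    return hits / 4


def prefilter(query, candidates, threshold=1.0):
    """Keep only candidates whose range levels reach the threshold."""
    return [c for c in candidates if range_score(query, c) >= threshold]


query = ["Guangdong Province", "Foshan City", "Nanhai District",
         "Guicheng Town", "Jiangnan Mingju"]
candidates = [
    ["Guangdong Province", "Foshan City", "Nanhai District",
     "Guicheng Town", "Jiangnan Mingju Community"],
    ["Guangdong Province", "Guangzhou City", "Tianhe District",
     "Shipai Street", "Some Community"],
]
survivors = prefilter(query, candidates)
print(len(survivors), "of", len(candidates), "candidates need full matching")
```

Only the survivors go on to the containment and numeric matching of the mark and detail levels, which is where the saving in computation comes from.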
  • In the address matching device of this embodiment, the first address is an address to be retrieved that is input by a user, the second address is stored in an index server, and the device includes:
  • The word segmentation module 1, used to invoke the preset matching algorithm and segment the first address and the second address according to the first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, where the preset matching algorithm includes word segmentation calculation and matching calculation.
  • the above-mentioned first address and the second address are written according to the administrative level from high to low, and from range to specific.
  • the first preset rule of this embodiment has different word segmentation rules according to the administrative level in the address.
  • For the four administrative levels of province, city/district, county, and township/town, word segmentation is performed using a general nationwide address database. For example, "Guicheng Town, Nanhai District, Foshan City, Guangdong Province" is segmented as: Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
  • Word segmentation of the other parts of the address is performed through semantic segmentation.
  • The dividing module 2 is configured to divide the first address into a plurality of first segments according to the first word segmentation group, and divide the second address into a plurality of second segments according to the second word segmentation group.
  • The address is divided into segments and/or administrative levels according to its word segmentation group, and each segment or administrative level corresponds to one or more segmented words.
  • The terms "first", "second", etc. in this embodiment are used only for distinction and not for limitation. Similar terms elsewhere have the same effect and will not be described again.
  • the word segmentation group is the word segmentation arrangement of the actual address, which is formed according to the writing order of the original address.
  • For example, the long name "development zone of a certain city" corresponds to two segmented words, "a certain city/development zone", but segments are formed from the segmented words according to administrative level, so "development zone of a certain city" belongs to a single segment.
  • the first obtaining module 3 is configured to obtain the matching results of all the first segments and all the second segments according to a second preset rule.
  • the first segment and the second segment are matched one by one according to the corresponding relationship of the administrative level to obtain the matching result.
  • the first segment corresponding to the province level of the first address is compared with the second segment corresponding to the province level of the second address, so as to improve the symmetry and reliability of information comparison.
  • the judging module 4 is configured to judge whether the first address and the second address are the same according to the matching result.
  • This embodiment compares the first address and the second address in a one-to-one correspondence through the correspondence of administrative levels.
  • When the matching rate of the first address and the second address reaches the preset range, it is determined that the first address and the second address are the same; otherwise they are determined to be different.
  • The segment matching degree corresponding to a designated administrative level may also be required to reach 100% before the first address and the second address can be determined to be the same; otherwise they are determined to be different. This improves matching accuracy.
  • The first address in this embodiment is the address to be queried entered by the user. The data composition of the first address is not restricted, and matching calculation can be performed on the address to be queried in any case, which gives the user more flexibility and freedom.
  • The first address includes data arranged in sequence according to six administrative levels: province, city/district/county/town, township/road, community, building/building, and house number, or consists of data in which one or several of these administrative levels are missing.
  • The preset matching condition in this embodiment includes the matching rate reaching a preset threshold, or the mark data in the first address reaching a 100% match, and so on.
  • The mark data refers to the data in the first address that can specify a geographic location, such as the name of a community or the name of a building.
  • The mark data of the first address is the data that follows the "town, township" administrative level and precedes the "building and house number".
  • word segmentation module 1 includes:
  • The first word segmentation unit is used to segment the range addresses corresponding to the first address and the second address respectively according to the pre-associated address dictionary in the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address.
  • The range address of this embodiment includes at least one of the four administrative levels of province, city/district, county, and township/town.
  • the range address in this embodiment is segmented through a pre-associated address dictionary.
  • The address dictionary is the corresponding vocabulary of a national address database; it is pre-associated with the natural language processing model and used to segment address names.
  • For example, this embodiment adds a crawled address library when performing word segmentation calculations with the open source word segmentation package jieba, and uses it together with the national address library to correct the address to be segmented; word segmentation is then performed according to administrative level, which improves the accuracy of word segmentation.
  • Whether to call the address dictionary is decided by judging whether the administrative level contained in the current address is one of the levels covered by the address dictionary; if so, the address dictionary is called for word segmentation.
  • For example, the result of word segmentation is: Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Residential District Rongyuan Block 306, in which the first word segmentation part is Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
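Dictionary-based segmentation of the range address can be illustrated with a small forward maximum matching routine. This is a stand-in written in plain Python, not the jieba-based segmentation the embodiment refers to; the function name, the tiny dictionary, and the space-separated input form are assumptions made for the sketch.

```python
from typing import List, Set, Tuple


def segment_range_address(address: str, dictionary: Set[str]) -> Tuple[List[str], str]:
    """Peel known administrative names off the front of the address.

    Returns the matched range words (high level to low level) and the
    unmatched remainder, which would be handled by the grammar models.
    """
    words: List[str] = []
    rest = address.strip()
    # Try longer names first so the longest dictionary entry wins.
    longest_first = sorted(dictionary, key=len, reverse=True)
    while rest:
        for name in longest_first:
            if rest.startswith(name):
                words.append(name)
                rest = rest[len(name):].lstrip()
                break
        else:
            # No dictionary entry matches: the range address has ended.
            break
    return words, rest


# Hypothetical excerpt of an address dictionary.
ADDRESS_DICTIONARY = {"Guangdong Province", "Foshan City", "Nanhai District", "Guicheng Town"}
RANGE_WORDS, REMAINDER = segment_range_address(
    "Guangdong Province Foshan City Nanhai District Guicheng Town "
    "Jiangnan Mingju Residential District Rongyuan Block 306",
    ADDRESS_DICTIONARY,
)
```

Here `RANGE_WORDS` plays the role of the first word segmentation part, and `REMAINDER` is what is left for the mark and detail addresses.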
  • The second word segmentation unit is used to segment the mark addresses corresponding to the first address and the second address respectively according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address.
  • The mark address in this embodiment includes information that can specify a geographic location, such as the name of a community or the name of a building, for example "Jiangnan Mingju Community Rongyuan" in the above address.
  • The mark address is segmented according to the first grammar model in the natural language processing model.
  • The first grammar model includes, but is not limited to, "a certain community" and "a certain building". For example, for "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town", the corresponding second word segmentation part is "Guicheng/Jiangnan Mingju Community/Rongyuan".
  • In another embodiment of the present application, the first grammar model extracts the characters after "town, township" and before "building and house number" as the mark address.
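The rule of taking whatever lies between the "town, township" level and the "building and house number" can be sketched on an already segmented address. The suffix list and the test that a detail word starts with a digit are assumptions chosen to fit the English renderings used in this text; they are not taken from the embodiment.

```python
from typing import List

# Assumed English suffixes marking the last range-level word.
TOWN_SUFFIXES = ("Town", "Township")


def extract_mark_address(words: List[str]) -> List[str]:
    """Return the words after the town/township word and before the detail words."""
    start = 0
    for i, word in enumerate(words):
        if word.endswith(TOWN_SUFFIXES):
            start = i + 1
    end = len(words)
    for i in range(start, len(words)):
        # Assumption: building and house-number words begin with a digit.
        if words[i][:1].isdigit():
            end = i
            break
    return words[start:end]


MARK_WORDS = extract_mark_address([
    "Guangdong Province", "Foshan City", "Nanhai District", "Guicheng Town",
    "Jiangnan Mingju Community", "Rongyuan", "1 Block", "306",
])
```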
  • The first component unit is configured to combine the first word segmentation part corresponding to the first address and the second word segmentation part corresponding to the first address into a first word segmentation group corresponding to the first address, and to combine the first word segmentation part corresponding to the second address and the second word segmentation part corresponding to the second address into a second word segmentation group corresponding to the second address.
  • The first address and the second address in this embodiment each include a range address and a mark address, arranged from left to right to form the address.
  • For example, the first address is "Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province"; the first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan".
  • The first address and the second address each further include a detail address, and the word segmentation module 1 includes:
  • The third word segmentation unit is used to segment the detail addresses corresponding to the first address and the second address respectively according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address.
  • The detail address in this embodiment is the specific "building and house number". It has little influence on the similarity of two addresses, and in other embodiments this part can even be ignored. For some specific application scenarios, however, the detail address must be exact to meet business needs.
  • The second grammar model of this embodiment includes, but is not limited to, "a certain building", "a certain building and a certain floor", "a certain building and a certain room", and so on.
  • The second component unit is used to combine the first word segmentation part, the second word segmentation part, and the third word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and to combine the first word segmentation part, the second word segmentation part, and the third word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address in this embodiment each include a range address, a mark address, and a detail address, arranged from left to right to form the address.
  • For example, the first address is "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "502, Building 1, Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province"; the first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan/1 Block/306", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan/1 Block/502", which are then used for dividing the addresses into segments.
  • the first acquisition module 3 includes:
  • The mapping unit is configured to map all the first segments and all the second segments, in order of administrative level from high to low, into two structure trees with the same structure, wherein each structure tree includes multiple nodes, and each node corresponds to one of the first segments or one of the second segments.
  • Each node corresponds to at least one segment, or to multiple segmented words of the same administrative level.
  • For example, the segmented word "Guangdong Province" corresponding to the highest administrative level "province" contained in the first address is used as the root node; the segmented word "Foshan City" corresponding to the next-level child node "city" is connected next, and so on down to the end node "1 Block 502".
  • The root node and the end node correspond to different administrative levels. The address can be a full address covering all administrative levels, or a short address covering only some of them.
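One way to picture the structure tree is as a chain of nodes, one per administrative level, built in the same order for both addresses so that nodes can be compared level by level. The level names, the `Node` class, and the choice to create empty nodes for missing levels are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Assumed names for the six levels, ordered from high to low.
LEVELS = ["province", "city", "district", "town", "community", "building_and_number"]


@dataclass
class Node:
    level: str
    words: List[str] = field(default_factory=list)
    child: Optional["Node"] = None


def build_tree(segments: Dict[str, List[str]]) -> Node:
    """Build a chain of nodes in LEVELS order.

    A level missing from a short address gets an empty node, so two trees
    built this way always have the same structure.
    """
    root: Optional[Node] = None
    tail: Optional[Node] = None
    for level in LEVELS:
        node = Node(level, list(segments.get(level, [])))
        if root is None:
            root = node
        else:
            tail.child = node
        tail = node
    return root


def walk(root: Node) -> List[Node]:
    """Return the nodes from the root down to the end node."""
    nodes = []
    node: Optional[Node] = root
    while node is not None:
        nodes.append(node)
        node = node.child
    return nodes


TREE = build_tree({
    "province": ["Guangdong Province"],
    "city": ["Foshan City"],
    "district": ["Nanhai District"],
    "town": ["Guicheng Town"],
    "community": ["Jiangnan Mingju", "Rongyuan"],
    "building_and_number": ["1 Block", "502"],
})
```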
  • the first obtaining unit is used to obtain the matching values corresponding to the respective nodes of the two structure trees.
  • the corresponding relationship between nodes and nodes between two structure trees is mapped according to the corresponding relationship of administrative levels, and the matching value corresponding to each node is obtained according to the above corresponding relationship.
  • The matching value is the number of matched segments of a node divided by the total number of segments corresponding to that node. For example, if the "province" node corresponding to the first address is assigned the value "Guangdong" and the "province" node corresponding to the second address is also assigned the value "Guangdong", the nodes match; otherwise they do not match.
  • the second obtaining unit is configured to obtain the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address, respectively.
  • different weights are set according to the different impacts of the corresponding segments of each administrative level on the address, so as to improve the flexibility of meeting business requirements.
  • The second weight corresponding to the mark address is higher than the first weight corresponding to the range address.
  • The calculation unit is configured to calculate the matching rate as the matching value multiplied by the corresponding weight, to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address, respectively.
  • The matching rate in this embodiment is calculated as follows: the matching result of each segment multiplied by the configured weight of that segment gives the matching rate of the segment, and the matching rates of all segments are added to obtain the matching result between the first address and the second address.
  • The summation unit is configured to sum the first matching rate, the second matching rate, and the third matching rate, and use the sum as the matching result of all the first segments and all the second segments.
  • the first obtaining unit includes:
  • The first matching subunit is used to compare each first segment corresponding to the range address in the first address with each second segment corresponding to the range address in the second address, performing exact full matching in one-to-one correspondence according to the node correspondence, to obtain each first matching value.
  • the matching methods for nodes corresponding to different administrative levels in this embodiment are different.
  • For example, the four administrative levels of province, city/district, county, and township/town are matched by exact full matching, that is, the segments match if their characters are 100% identical, and otherwise they do not match.
  • For example, the "province" node corresponding to the first address is assigned the value "Guangdong"; if the "province" node corresponding to the second address is also assigned the value "Guangdong", the two nodes match.
  • The second matching subunit is used to compare each first segment corresponding to the mark address in the first address with each second segment corresponding to the mark address in the second address, performing model keyword matching in one-to-one correspondence according to the node correspondence, to obtain each second matching value.
  • The segments corresponding to the mark address are matched through NLP (Natural Language Processing) model matching, and a matching relationship holds when one segment contains, or is contained in, the other.
  • For example, for "Jiangnan Mingju Community/Rongyuan" and "Jiangnan Mingju/Rongyuan", the characters are not fully identical, but "Jiangnan Mingju Community" contains the characters "Jiangnan Mingju", so a one-to-one matching relationship still exists.
  • The third matching subunit is used to compare each first segment corresponding to the detail address in the first address with each second segment corresponding to the detail address in the second address, performing numeric matching in one-to-one correspondence according to the node correspondence, to obtain each third matching value.
  • For example, if the detail address in this embodiment includes a first specified number of segments, and the number of segments that satisfy the matching relationship is a second specified number, the matching value corresponding to the detail address is the second specified number divided by the first specified number.
  • the summarizing subunit is used to summarize each of the first matching values, each of the second matching values, and each of the third matching values to obtain matching values corresponding to each node of the two structure trees.
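The three kinds of matching used by the subunits above (exact, containment, and numeric) can be sketched as three small functions. The function names are hypothetical, and the containment test is only a simple stand-in for the NLP model keyword matching named in the text.

```python
import re
from typing import List


def exact_match(a: str, b: str) -> float:
    """Range levels: match only if the characters are identical."""
    return 1.0 if a == b else 0.0


def containment_match(a: str, b: str) -> float:
    """Mark address: match if one segment contains, or is contained in, the other."""
    if not a or not b:
        return 0.0
    return 1.0 if (a in b or b in a) else 0.0


def numeric_match(a_words: List[str], b_words: List[str]) -> float:
    """Detail address: matched segments divided by the number of segments.

    Only the digits of each segment are compared (an assumption), so
    "1 Block" and "Building 1" would count as the same building number.
    """
    if not a_words:
        return 0.0
    digits_a = [re.sub(r"\D", "", w) for w in a_words]
    digits_b = [re.sub(r"\D", "", w) for w in b_words]
    hits = sum(1 for x, y in zip(digits_a, digits_b) if x and x == y)
    return hits / len(a_words)
```

With these definitions, `numeric_match(["1 Block", "306"], ["1 Block", "502"])` gives one matched segment out of two.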
  • For example, the word segmentation group corresponding to the first address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju Community/Rongyuan/1/306;
  • the word segmentation group corresponding to the second address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju/Rongyuan/1/502;
  • the first and second addresses are each divided into six administrative levels, namely province/city/district, county/town, township/road, community, building/building and house number, which correspond to six nodes respectively, and the default weights of the nodes are "0.1/0.1/0.1/0.1/0.5/0.1".
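The example above can be worked through numerically. The text does not state how the eight segmented words are assigned to the six nodes, so the grouping below (the two mark words share the fifth node, the two detail words share the sixth) is an assumption, as are the helper names.

```python
from typing import List

WEIGHTS = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]

# Assumed assignment of the segmented words to the six nodes.
FIRST = [["Guangdong"], ["Foshan City"], ["Nanhai"], ["Guicheng"],
         ["Jiangnan Mingju Community", "Rongyuan"], ["1", "306"]]
SECOND = [["Guangdong"], ["Foshan City"], ["Nanhai"], ["Guicheng"],
          ["Jiangnan Mingju", "Rongyuan"], ["1", "502"]]

MARK_NODE = 4  # index of the node that uses containment matching


def words_match(node_index: int, a: str, b: str) -> bool:
    """Containment for the mark node, equality for every other node."""
    if node_index == MARK_NODE:
        return a in b or b in a
    return a == b


def node_match_value(node_index: int, a_words: List[str], b_words: List[str]) -> float:
    """Matched words of a node divided by the number of words in that node."""
    if not a_words:
        return 0.0
    hits = sum(1 for a, b in zip(a_words, b_words) if words_match(node_index, a, b))
    return hits / len(a_words)


def matching_rate(first: List[List[str]], second: List[List[str]],
                  weights: List[float]) -> float:
    """Sum over nodes of (match value of the node) * (weight of the node)."""
    return sum(w * node_match_value(i, a, b)
               for i, (w, a, b) in enumerate(zip(weights, first, second)))


RATE = matching_rate(FIRST, SECOND, WEIGHTS)  # about 0.95 under these assumptions
```

Under this grouping the four range nodes and the mark node match fully, and the detail node contributes half of its weight because only the building number agrees.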
  • the first obtaining module 3 includes:
  • the input unit is used to input a specified number of training samples with pre-labeled similarity values into the natural language processing model for training.
  • the adjustment unit is configured to adjust the training parameter to the first parameter to make the similarity value output by the natural language processing model consistent with the pre-labeled similarity value.
  • the corresponding unit is configured to correspond the corresponding weight value in the first parameter to the first weight, the second weight, and the third weight according to the node correspondence relationship.
  • The default weights in this embodiment are obtained by training the model. During training, the training parameters are adjusted continuously until the similarity output by the model is consistent with the pre-labeled similarity value, or falls within a preset deviation range.
  • The training parameters include each of the weight values, so that the weight values are determined through training.
  • Other embodiments of the present application may also adjust one or more of the default weights according to specific application scenarios, so that the matching model is more in line with the current application scenarios.
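The idea of adjusting weights until the output similarity agrees with pre-labeled values can be illustrated with a plain least-squares fit by gradient descent. This swaps in a simple linear model for the natural language processing model named in the text; the samples, labels, learning rate, and step count are all made up for the illustration.

```python
from typing import List, Sequence


def predict(weights: Sequence[float], match_values: Sequence[float]) -> float:
    """Similarity as the weighted sum of per-node match values."""
    return sum(w * m for w, m in zip(weights, match_values))


def mean_squared_error(weights, samples, labels) -> float:
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(samples, labels)) / len(samples)


def fit_weights(samples: List[List[float]], labels: List[float],
                lr: float = 0.05, steps: int = 5000) -> List[float]:
    """Batch gradient descent on the mean squared error, starting from zero weights."""
    n = len(samples)
    weights = [0.0] * len(samples[0])
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for x, y in zip(samples, labels):
            err = predict(weights, x) - y
            for j, xj in enumerate(x):
                grads[j] += 2.0 * err * xj / n
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical training set: per-node match values and labeled similarities.
SAMPLES = [
    [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 0.5], [1, 1, 1, 1, 0, 1], [1, 1, 1, 0, 1, 1],
    [1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0.5],
]
LABELS = [1.0, 0.95, 0.5, 0.9, 0.2, 0.1, 0.0, 0.35]

FITTED = fit_weights(SAMPLES, LABELS)
```

The fitted values would then be assigned to the nodes as the first, second, and third weights, and could still be adjusted by hand for a particular business scenario.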
  • word segmentation module 1 includes:
  • the calling unit is configured to call the address database to perform address correction on the first address and the second address respectively according to a third preset rule.
  • The first address or the second address in this embodiment may be inconsistent with the address data in the national address database; in that case the address database can be called to perform address correction, including address completion, removal of qualifiers, and so on.
  • For example, an upper-level node is completed from its child node, so that Nanhai District can be completed upwards with Foshan City; or an intermediate node is completed from the nodes before and after it, so that Foshan City and Guicheng Town can be completed with Nanhai District in between. Address completion is performed in these ways.
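Address completion of this kind can be sketched with a parent lookup table. The table below is a hypothetical excerpt standing in for the national address database, and the function name is an assumption; the sketch fills in every missing ancestor, including the province.

```python
from typing import Dict, List

# Hypothetical excerpt: each administrative name maps to its parent.
PARENT: Dict[str, str] = {
    "Guicheng Town": "Nanhai District",
    "Nanhai District": "Foshan City",
    "Foshan City": "Guangdong Province",
}


def complete_range_address(words: List[str]) -> List[str]:
    """Fill in missing upper and intermediate levels.

    Assumes the words are ordered from high level to low level, so the last
    known administrative word is the most specific one.
    """
    known = [w for w in words if w in PARENT or w in PARENT.values()]
    if not known:
        return list(words)
    chain = [known[-1]]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    chain.reverse()
    # Words outside the table (mark or detail words) are kept after the chain.
    rest = [w for w in words if w not in chain]
    return chain + rest
```

For instance, `complete_range_address(["Foshan City", "Guicheng Town"])` inserts "Nanhai District" between the two given words.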
  • the address matching device further includes:
  • the index module is used for indexing a specified number of unstructured address data pre-stored in the index server to obtain the preset index structure.
  • the data pre-stored in the index server of this embodiment is unstructured data, and its storage method is the column storage form of key-value pairs.
  • Unstructured data refers to data such as text, images, and voice, held in column storage based on NoSQL storage technology.
  • the amount of data is very large, and the distributed architecture of NoSQL technology needs to be used for storage and calculation.
  • the index server combines the NoSQL distributed architecture storage and index structure to achieve real-time and fast query and calculation of massive data.
  • NoSQL is a non-relational, open source database technology.
  • Elasticsearch stores data as key-value pairs with inverted indexes, and performs calculation mainly in memory to achieve fast real-time computation.
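The combination of key-value storage and an inverted index can be pictured with a few lines of plain Python. This is a conceptual sketch only and does not reflect how Elasticsearch is implemented internally; the record keys and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(addresses: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every segmented word to the keys of the addresses that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for key, words in addresses.items():
        for word in words:
            index[word].add(key)
    return index


def candidate_keys(index: Dict[str, Set[str]], query_words: List[str]) -> Set[str]:
    """Keys of the stored addresses that contain every query word."""
    if not query_words:
        return set()
    sets = [index.get(word, set()) for word in query_words]
    return set.intersection(*sets)


# Hypothetical key-value store of segmented addresses.
STORE = {
    "a1": ["Guangdong Province", "Foshan City", "Nanhai District", "Guicheng Town"],
    "a2": ["Guangdong Province", "Shenzhen City", "Futian District"],
}
INDEX = build_inverted_index(STORE)
```

Looking up the query words in the index narrows the search to a few stored keys before any per-node matching is calculated.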
  • the receiving module is configured to receive the interface plug-ins uploaded to the designated directory of the index server, wherein the interface plug-ins are formed by packaging the preset matching algorithm.
  • the index server in this embodiment is an open source component and supports a plug-in mode.
  • The interface plug-in can inherit the org.elasticsearch.plugins.Plugin class of the index server to implement a custom, extended address matching algorithm plug-in, which is loaded and used after the index server is restarted.
  • the second acquiring module is used to acquire the configuration parameters of the interface plug-in.
  • The establishment module is used to establish a calculation association between the preset index structure and the interface plug-in by running the configuration parameters.
  • After the address matching algorithm is developed, it is packaged, uploaded to the specified directory of the index server, and configured with the relevant configuration parameters. Loading and running the configuration parameters establishes the calculation association between the preset index structure and the interface plug-in, so that the address matching algorithm in the plug-in is called to complete the matching calculation of the first address within the preset index structure and realize the address data query.
  • the index server in this embodiment is an open-source Elasticsearch component (Elasticsearch is used for distributed full-text search), which provides a full-text search engine with distributed computing capabilities based on a RESTful web interface, and can perform real-time and fast queries on massive data.
  • The query steps include: (1) Import the addresses of the massive address library into the underlying storage of Elasticsearch in the form of key-value pairs through the data import interface of Elasticsearch, and index the keys. (2) Adapt the address matching model to the custom extended search model of Elasticsearch, add it to the extension module of the Elasticsearch master node, and restart Elasticsearch, so that it becomes an address matching model that uses the distributed storage and high-concurrency computing of Elasticsearch.
  • This embodiment applies a different matching method, a different matching model, and a different matching weight to each segment of the first address, according to the administrative level to which the segment corresponds.
  • The first address in this embodiment is divided into six segments, which correspond to six administrative levels and to six nodes in the tree structure.
  • The first four of the six administrative levels share the same matching model, in which characters are matched one by one;
  • the fifth administrative level adopts a fuzzy matching model based on containing or being contained;
  • the sixth administrative level adopts a numeric matching model.
  • A filtering mechanism is set in the matching calculation process. First, the target segments corresponding to the four administrative levels of "province/city, district/county/town, township, and road" are matched exactly, character by character.
  • When the matching calculation result for the target segments corresponding to these four administrative levels is lower than a preset threshold, it is determined that no address data in the preset index structure meets the preset matching condition with the first address, and the matching conclusion is output directly, which reduces the amount of matching calculation and improves the response speed.
  • With the filtering mechanism in place, at least 90% of addresses can be filtered out. An address then needs to be fully matched only against the remaining 10% of addresses, which greatly saves computing resources.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus.
  • The processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer device is used to store all the data needed for the address matching process.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize the address matching method.
  • When the above-mentioned processor executes the above-mentioned address matching method, the first address is the address to be retrieved input by the user, and the second address is stored in the index server.
  • The method includes: invoking a preset matching algorithm, and segmenting the first address and the second address respectively according to a first preset rule to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address; dividing the first address into multiple first segments according to the first word segmentation group, and dividing the second address into multiple second segments according to the second word segmentation group; obtaining the matching results of all the first segments and all the second segments according to a second preset rule; and judging whether the first address and the second address are the same according to the matching results.
  • the data pre-stored in the index server is unstructured data, and its storage method is the column storage form of key-value pairs.
  • Unstructured data refers to data such as text, images, and voice, held in column storage based on NoSQL storage technology.
  • the amount of data is very large, and it is necessary to use the distributed architecture of NoSQL technology for storage and calculation.
  • The index server combines NoSQL distributed-architecture storage with an index structure to achieve real-time fast query and calculation over massive data.
  • In the scheme proposed here for multiple addresses, address names are segmented through the natural language processing model to form word segmentation groups, the word segmentation groups are divided into segments according to administrative levels, and the segments are mapped to nodes in a tree structure.
  • Because the addresses are divided into segments according to administrative levels, each administrative-level segment is matched with a different weight, and the weights can be fine-tuned in actual business scenarios.
  • The first address and the second address each include a range address and a mark address, and the step in which the processor invokes the preset matching algorithm and segments the first address and the second address respectively according to the first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, includes: segmenting the range addresses corresponding to the first address and the second address respectively according to the pre-associated address dictionary in the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address; segmenting the mark addresses corresponding to the first address and the second address respectively according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address; and combining the first word segmentation part and the second word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and combining the first word segmentation part and the second word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address each further include a detail address, and after the step in which the processor segments the mark addresses corresponding to the first address and the second address respectively according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address, the method includes: segmenting the detail addresses corresponding to the first address and the second address respectively according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address; and combining the first word segmentation part, the second word segmentation part, and the third word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and combining the first word segmentation part, the second word segmentation part, and the third word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The range address includes the four administrative levels of province, city/district, county, and township/town, the mark address includes the name of a community or a building, and the step in which the processor obtains the matching results of all the first segments and all the second segments according to the second preset rule includes: mapping all the first segments and all the second segments, in order of administrative level from high to low, into two structure trees with the same structure, wherein each structure tree includes a plurality of nodes, and each node corresponds to one of the first segments or one of the second segments; obtaining the matching values corresponding to the respective nodes of the two structure trees; obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address respectively; calculating the matching rate as the matching value multiplied by the corresponding weight, to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address respectively; and using the sum of the first matching rate, the second matching rate, and the third matching rate as the matching result of all the first segments and all the second segments.
  • the step in which the processor obtains the matching value corresponding to each node of the two structure trees includes: exactly matching each first segment corresponding to the range address in the first address against each second segment corresponding to the range address in the second address, one-to-one according to the node correspondence, to obtain each first matching value; performing model keyword matching between each first segment corresponding to the mark address in the first address and each second segment corresponding to the mark address in the second address, one-to-one according to the node correspondence, to obtain each second matching value; performing digit matching between each first segment corresponding to the detailed address in the first address and each second segment corresponding to the detailed address in the second address, one-to-one according to the node correspondence, to obtain each third matching value; and collecting the first matching values, the second matching values, and the third matching values as the matching values corresponding to the nodes of the two structure trees.
  • before the step in which the processor obtains the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detailed address, the method includes: inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training; adjusting the training parameters to a first parameter such that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value; and assigning the weight values in the first parameter to the first weight, the second weight, and the third weight according to the node correspondence.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • the computer-readable storage medium may be non-volatile or volatile.
  • when the computer program is executed by a processor, the address matching method is implemented.
  • the first address is the address to be retrieved input by the user, and the second address is stored in the index server.
  • the method includes: invoking a preset matching algorithm and performing word segmentation on the first address and the second address respectively according to the first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address; dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group; obtaining the matching result of all the first segments and all the second segments according to a second preset rule; and determining whether the first address and the second address are the same according to the matching result.
  • the data pre-stored in the index server is unstructured data, stored in a key-value column storage format.
  • Unstructured data refers to text, images, voice, and the like, formed based on NoSQL storage technology
  • With column storage the amount of data is very large, so the distributed architecture of NoSQL technology is needed for storage and computation.
  • the index server combines the NoSQL distributed architecture storage and index structure to achieve real-time fast query and calculation of massive data.
  • a configurable weight address matching model based on multi-level address division.
  • the address name is segmented by a natural language processing model into sub-phrases, the sub-phrases are divided into segments according to administrative level, and the segments are mapped to nodes in a tree structure. This takes full account of the tree structure of addresses: each administrative level is matched with a different weight, and the weights can be fine-tuned for actual business scenarios.
  • the first address and the second address each include a range address and a mark address
  • the step in which the processor invokes the preset matching algorithm and performs word segmentation on the first address and the second address respectively according to the first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, includes: segmenting the range addresses corresponding to the first address and the second address according to the pre-associated address dictionary in the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address; segmenting the mark addresses corresponding to the first address and the second address according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address; forming the first word segmentation group corresponding to the first address from the first and second word segmentation parts corresponding to the first address; and forming the second word segmentation group corresponding to the second address from the first and second word segmentation parts corresponding to the second address.
  • the first address and the second address further include detailed addresses
  • after the step in which the processor segments the mark addresses corresponding to the first address and the second address according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address, the method further includes: segmenting the detailed addresses corresponding to the first address and the second address according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address; forming the first word segmentation group corresponding to the first address from the first, second, and third word segmentation parts corresponding to the first address; and forming the second word segmentation group corresponding to the second address from the first, second, and third word segmentation parts corresponding to the second address.
  • the range address includes four administrative levels of province, city/district, county, and township/town
  • the mark address includes the name of a residential community or a building
  • the step in which the processor obtains the matching result of all the first segments and all the second segments according to the second preset rule includes: mapping all the first segments and all the second segments, in order of administrative level from high to low, into two structure trees with the same structure, wherein each structure tree includes a plurality of nodes and each node corresponds to one of the first segments or one of the second segments; obtaining the matching value corresponding to each node of the two structure trees; obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detailed address; multiplying each matching value by its corresponding weight to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detailed address; and taking the sum of the first matching rate, the second matching rate, and the third matching rate as the matching result of all the first segments and all the second segments.
  • the step in which the processor obtains the matching value corresponding to each node of the two structure trees includes: exactly matching each first segment corresponding to the range address in the first address against each second segment corresponding to the range address in the second address, one-to-one according to the node correspondence, to obtain each first matching value; performing model keyword matching between each first segment corresponding to the mark address in the first address and each second segment corresponding to the mark address in the second address, one-to-one according to the node correspondence, to obtain each second matching value; performing digit matching between each first segment corresponding to the detailed address in the first address and each second segment corresponding to the detailed address in the second address, one-to-one according to the node correspondence, to obtain each third matching value; and collecting the first matching values, the second matching values, and the third matching values as the matching values corresponding to the nodes of the two structure trees.
  • before the step in which the processor obtains the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detailed address, the method includes: inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training; adjusting the training parameters to a first parameter such that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value; and assigning the weight values in the first parameter to the first weight, the second weight, and the third weight according to the node correspondence.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
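
The weighted matching described in the embodiments above (two structure trees with the same nodes, a matching value per node, and one weight per address category) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the node names, the weight values, the use of substring containment as a stand-in for model keyword matching, and the averaging of node values within a category are all hypothetical choices that the source does not specify.

```python
import re

# Node layout shared by both structure trees, ordered from the highest
# administrative level down. Node names are illustrative, not from the source.
NODES = {
    "range": ("province", "city", "county", "town"),
    "mark": ("mark",),
    "detail": ("detail",),
}

# Hypothetical weights; the source obtains them by training on labeled samples.
WEIGHTS = {"range": 0.5, "mark": 0.3, "detail": 0.2}


def exact_match(a: str, b: str) -> float:
    """Range-address nodes: 1.0 only when both segments are present and equal."""
    return 1.0 if a and a == b else 0.0


def keyword_match(a: str, b: str) -> float:
    """Mark-address nodes: substring containment, an assumed stand-in
    for the model keyword matching named in the source."""
    return 1.0 if a and b and (a in b or b in a) else 0.0


def digit_match(a: str, b: str) -> float:
    """Detailed-address nodes: compare the sequences of numbers in each segment."""
    digits_a, digits_b = re.findall(r"\d+", a), re.findall(r"\d+", b)
    return 1.0 if digits_a and digits_a == digits_b else 0.0


MATCHERS = {"range": exact_match, "mark": keyword_match, "detail": digit_match}


def match_result(first: dict, second: dict, weights: dict = WEIGHTS) -> float:
    """Sum of the per-category matching rates.

    Each rate is the category weight times the mean node matching value;
    taking the mean is an assumption made for this sketch.
    """
    total = 0.0
    for category, nodes in NODES.items():
        matcher = MATCHERS[category]
        values = [matcher(first.get(n, ""), second.get(n, "")) for n in nodes]
        total += weights[category] * sum(values) / len(values)
    return total


# Example segments, as they might look after word segmentation.
first = {"province": "广东省", "city": "深圳市", "county": "福田区",
         "town": "莲花街道", "mark": "平安金融中心", "detail": "12栋3单元501"}
second = {"province": "广东省", "city": "深圳市", "county": "福田区",
          "town": "莲花街道", "mark": "平安金融中心大厦", "detail": "12栋3单元502"}

print(round(match_result(first, second), 2))
```

In this example the range and mark nodes agree while the room numbers differ, so only the detailed-address rate is lost. Deciding whether two addresses are "the same" would additionally need a cut-off on the summed result, which the source does not state.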

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of big data and concerns an address matching method and apparatus, a computer device, and a storage medium. In the address matching method, a first address is an address to be retrieved that is input by a user, and a second address is stored in an index server. The method comprises: invoking a preset matching algorithm and performing word segmentation on the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, the preset matching algorithm comprising a word segmentation calculation and a matching calculation; dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group; and obtaining, according to a second preset rule, a matching result of the first segments and the second segments, and determining whether the first address and the second address are the same. For the first four administrative-level addresses of the segmented address, exact matching is performed against a (tree-type) address database of provinces, municipalities, counties, and towns across the whole country, and partial omissions are effectively filled in.
PCT/CN2020/098804 2019-07-03 2020-06-29 Address matching method and apparatus, computer device, and storage medium WO2021000831A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910601364.8 2019-07-03
CN201910601364.8A CN110442603B (zh) 2019-07-03 2019-07-03 Address matching method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021000831A1 (fr) 2021-01-07

Family

ID=68428771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/098804 WO2021000831A1 (fr) 2019-07-03 2020-06-29 Address matching method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN110442603B (fr)
WO (1) WO2021000831A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935293A (zh) * 2021-12-16 2022-01-14 湖南四方天箭信息科技有限公司 Address splitting and completion method and apparatus, computer device, and storage medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
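CN110442603B (zh) * 2019-07-03 2024-01-19 平安科技(深圳)有限公司 Address matching method and apparatus, computer device, and storage medium
CN111144117B (zh) * 2019-12-26 2023-08-29 同济大学 Method for disambiguating Chinese addresses in a knowledge graph
CN111563806A (zh) * 2020-07-20 2020-08-21 平安国际智慧城市科技股份有限公司 Method and apparatus for identifying merchant compliance on a network platform, medium, and electronic device
CN114064827A (zh) * 2020-08-05 2022-02-18 北京四维图新科技股份有限公司 Location search method, apparatus, and device
CN112256821B (zh) * 2020-09-23 2024-05-17 北京捷通华声科技股份有限公司 Chinese address completion method, apparatus, device, and storage medium
CN112163070B (zh) * 2020-09-27 2024-02-27 杭州海康威视系统技术有限公司 Place name matching method and apparatus, electronic device, and machine-readable storage medium
CN112835897B (zh) * 2021-01-29 2024-03-15 上海寻梦信息技术有限公司 Geographic region division management method, data conversion method, and related device
CN113343688A (zh) * 2021-06-22 2021-09-03 南京星云数字技术有限公司 Address similarity determination method and apparatus, and computer device
CN113987114B (zh) * 2021-09-17 2023-04-07 上海燃气有限公司 Address matching method and apparatus based on semantic analysis, and electronic device
CN114756654A (zh) * 2022-04-25 2022-07-15 广州城市信息研究所有限公司 Dynamic place name and address matching method and apparatus, computer device, and storage medium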

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1487444A (zh) * 2002-09-13 2004-04-07 富士施乐株式会社 Text sentence comparison apparatus
CN101770499A (zh) * 2009-01-07 2010-07-07 上海聚力传媒技术有限公司 Information retrieval method in a search engine and corresponding search engine
CN102402533A (zh) * 2010-09-13 2012-04-04 方正国际软件有限公司 Address matching method and system
CN108763215A (zh) * 2018-05-30 2018-11-06 中智诚征信有限公司 Address storage method and apparatus based on address word segmentation, and computer device
US10216837B1 (en) * 2014-12-29 2019-02-26 Google Llc Selecting pattern matching segments for electronic communication clustering
CN110442603A (zh) * 2019-07-03 2019-11-12 平安科技(深圳)有限公司 Address matching method and apparatus, computer device, and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
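CN101257516B (zh) * 2008-03-11 2011-07-13 中兴通讯股份有限公司 Method for source address correction
KR20180057853A (ko) * 2016-11-23 2018-05-31 잠쉬딘 허지무하메도브 Address conversion method, system, and computer program
CN106874384B (zh) * 2017-01-10 2020-12-04 航天精一(广东)信息科技有限公司 Heterogeneous address standard conversion and matching method
CN109145169B (zh) * 2018-07-26 2021-03-26 浙江省测绘科学技术研究院 Address matching method based on statistical word segmentation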

Also Published As

Publication number Publication date
CN110442603B (zh) 2024-01-19
CN110442603A (zh) 2019-11-12

Similar Documents

Publication Publication Date Title
WO2021000831A1 (fr) Address matching method and apparatus, computer device, and storage medium
WO2021139283A1 (fr) Knowledge-graph-based question answering method and apparatus using deep learning technology, and device
CN104866593A (zh) Database search method based on knowledge graph
WO2020063092A1 (fr) Knowledge graph processing method and apparatus
US10783171B2 (en) Address search method and device
CN104636466B (zh) Entity attribute extraction method and system for open web pages
US8626681B1 (en) Training a probabilistic spelling checker from structured data
WO2022142613A1 (fr) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
US9773053B2 (en) Method and apparatus for processing electronic data
CN111291161A (zh) Legal case knowledge graph query method, apparatus, device, and storage medium
CN109033314B (zh) Real-time query method and system for large-scale knowledge graphs under memory constraints
CN106844380A (zh) Database operation method, information processing method, and corresponding apparatus
WO2021184627A1 (fr) R-tree-based pollutant traceability method and apparatus, and related device
CN105224622A (zh) Internet-oriented place name and address extraction and standardization method
CN104657439A (zh) Structured query statement generation system and method for precise natural language retrieval
WO2019169858A1 (fr) Data analysis method and system based on search engine technology
CN103838837B (zh) Remote sensing metadata integration method based on semantic templates
CN104657440A (zh) Structured query statement generation system and method
CN110134780B (zh) Document summary generation method, apparatus, device, and computer-readable storage medium
CN106951526B (zh) Entity set expansion method and apparatus
CN110347810B (zh) Conversational retrieval answering method and apparatus, computer device, and storage medium
CN103473224A (zh) Exercise semantization method based on problem-solving process
CN112528174A (zh) Address trimming and completion method based on knowledge graph and multiple matching, and application thereof
CN104794163A (zh) Entity set expansion method
CN116414823A (zh) Address positioning method and apparatus based on word segmentation model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 20835215; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 20835215; Country of ref document: EP; Kind code of ref document: A1