WO2021000831A1 - Address matching method and apparatus, computer device and storage medium - Google Patents

Info

Publication number
WO2021000831A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
matching
word segmentation
segments
segmentation
Prior art date
Application number
PCT/CN2020/098804
Other languages
French (fr)
Chinese (zh)
Inventor
申超波
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021000831A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2468 Fuzzy queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases

Definitions

  • This application relates to the field of big data, in particular to address matching methods, devices, computer equipment and storage media.
  • The main purpose of this application is to provide an address matching method that remedies the shortcomings of existing address matching techniques.
  • In this method, the first address is the address to be retrieved, as input by the user, and the second address is stored in the index server.
  • The method includes:
  • invoking a preset matching algorithm and segmenting the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
  • This application also provides an address matching device, in which the first address is the address to be retrieved, as input by the user, and the second address is stored in the index server. The device includes:
  • a word segmentation module, configured to invoke a preset matching algorithm and segment the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
  • a dividing module, configured to divide the first address into a plurality of first segments according to the first word segmentation group, and to divide the second address into a plurality of second segments according to the second word segmentation group;
  • a second acquiring module, configured to acquire the matching results of all the first segments and all the second segments according to a second preset rule;
  • a judgment module, configured to judge whether the first address and the second address are the same according to the matching results.
  • The present application also provides a computer device including a memory and a processor. The memory stores a computer program, and the processor implements an address matching method when executing the computer program. The first address is the address to be retrieved, as input by the user, and the second address is stored in the index server. The method includes:
  • invoking a preset matching algorithm and segmenting the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
  • This application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements an address matching method in which the first address is the address to be retrieved, as input by the user, and the second address is stored in the index server. The method includes:
  • invoking a preset matching algorithm and segmenting the first address and the second address respectively according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
  • In this application, the first four administrative levels of a segmented address are matched exactly against a tree-structured national address database of provinces, cities, counties and towns, so that partially missing levels are effectively completed.
  • The massive volume of data pre-stored in the index server is built into an index structure which, combined with the Elasticsearch component's own computing architecture and distributed computing capabilities, enables fast, real-time querying of the first address in the preset index structure.
  • Fig. 1 is a schematic flowchart of an address matching method according to an embodiment of the present application.
  • Fig. 2 is a schematic structural diagram of an address matching device according to an embodiment of the present application.
  • Fig. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
  • In the address matching method of this embodiment, the first address is the address to be retrieved, as input by a user, and the second address is stored in an index server. The method includes:
  • S1 Invoke a preset matching algorithm and segment the first address and the second address respectively according to the first preset rule, to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address. The preset matching algorithm includes word segmentation calculation and matching calculation.
  • The first address and the second address are both written in order of administrative level from high to low, that is, from the general area to the specific location.
  • The first preset rule of this embodiment applies different word segmentation rules to different administrative levels in the address.
  • The four administrative levels of province, city, district/county, and township/town are segmented using a general nationwide address database. For example, "Guicheng Town, Nanhai District, Foshan City, Guangdong Province" is segmented as Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
  • The remainder of the address is segmented semantically.
  • S2 Divide the first address into a plurality of first segments according to the first word segmentation group, and divide the second address into a plurality of second segments according to the second word segmentation group.
  • The address is divided into segments and/or administrative levels according to its word segmentation group, and each segment or administrative level corresponds to one or more word segments.
  • Terms such as "first" and "second" in this embodiment are used only for distinction, not for limitation. Similar terms elsewhere have the same effect and are not explained again.
  • The word segmentation group is the sequence of word segments of the actual address, arranged in the writing order of the original address.
  • For example, the long name "development zone of a certain city" yields two word segments, "a certain city/development zone", but because segments are formed from the word segments according to administrative level, "development zone of a certain city" belongs to a single segment.
  • The first segments and the second segments are matched one by one according to their corresponding administrative levels to obtain the matching result.
  • For example, the first segment at the province level of the first address is compared with the second segment at the province level of the second address, which keeps the comparison symmetric and reliable.
  • In this way, this embodiment compares the first address and the second address level by level, in one-to-one correspondence.
  • If the matching rate of the first address and the second address falls within the preset range, the two addresses are determined to be the same; otherwise they are determined to be different.
  • In some cases, to improve matching accuracy, the segment matching degree at a designated administrative level is required to reach 100% before the first address and the second address can be determined to be the same.
  • The first address in this embodiment is the address to be queried, as entered by the user. Its data composition is not restricted; the matching calculation can be carried out for any such address, which gives the user flexibility and freedom in how the address is entered.
  • For example, the first address may contain data arranged in order of six administrative levels (province; city; district/county; town/township/road; community; building and house number), or data in which one or several of these levels are missing.
  • The preset matching condition in this embodiment includes the matching rate reaching a preset threshold, or the mark data in the first address reaching a 100% match, and so on.
  • The mark data is the information in the first address that pins down the geographic location, such as the name of a community or of a building. For example, "Rongyuan of Jiangnan Mingju Residential Quarter" in the first address is mark data.
  • In the first address, the mark data is the information that follows the "town, township" administrative level and precedes the "building and house number".
  • Further, the first address and the second address each include a range address and a mark address, and step S1 of invoking the preset matching algorithm and segmenting the first address and the second address respectively according to the first preset rule, to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, includes:
  • S11 Segment the range addresses of the first address and the second address respectively according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address.
  • The range address of this embodiment includes at least one of the four administrative levels of province, city, district/county, and township/town.
  • The range address is segmented using a pre-associated address dictionary.
  • The address dictionary is the corresponding vocabulary of a national address database; it is pre-associated with the natural language processing model so that address names can be segmented.
  • The preset matching algorithm in this embodiment includes word segmentation calculation and matching calculation.
  • For the word segmentation calculation, a crawled address library is added to the open-source word segmentation package jieba and used together with the national address library to correct the address to be segmented; the address is then segmented by administrative level, which improves segmentation accuracy.
  • The address dictionary is called to perform the word segmentation calculation.
  • For example, the segmentation result is: Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Residential District Rongyuan Block 306. The first word segmentation part is Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
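  • The dictionary-driven segmentation of the range address can be sketched in plain Python. This is a minimal illustration, not the jieba-based implementation the embodiment describes: the toy dictionary, the level names, the function `segment_range_address`, and the Chinese rendering of the example address are all assumptions made here.

```python
# Toy address dictionary keyed by administrative level. A real system would load
# these entries from a national address database; the entries are illustrative.
ADDRESS_DICT = {
    "province": ["广东省"],
    "city": ["佛山市"],
    "district": ["南海区"],
    "town": ["桂城镇"],
}
LEVELS = ["province", "city", "district", "town"]


def segment_range_address(address):
    """Split off the leading administrative levels by longest dictionary match.

    Returns (segments, remainder): segments is a list of (level, name) pairs,
    remainder is the unsegmented tail (mark address and detail address).
    """
    segments = []
    rest = address
    for level in LEVELS:
        # Try longer names first so the longest dictionary entry wins.
        for name in sorted(ADDRESS_DICT[level], key=len, reverse=True):
            if rest.startswith(name):
                segments.append((level, name))
                rest = rest[len(name):]
                break
    return segments, rest


# Assumed Chinese form of the example address used in the text.
segments, remainder = segment_range_address("广东省佛山市南海区桂城镇江南名居小区荣园1座306")
print("/".join(name for _, name in segments))  # 广东省/佛山市/南海区/桂城镇
print(remainder)  # 江南名居小区荣园1座306
```

  • Levels that are absent from the input are simply skipped, so a short address that starts at the city level still yields the segments it does contain.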
  • S12 Segment the mark addresses of the first address and the second address respectively according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address.
  • The mark address in this embodiment contains information that pins down the geographic location, such as the name of a community or a building, for example "Jiangnan Mingju Community Rongyuan" in the address above.
  • The mark address is segmented according to the first grammar model in the natural language processing model.
  • The first grammar model includes, but is not limited to, patterns such as "a certain community" and "a certain building". For example, for "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town", the corresponding second word segmentation part is "Guicheng/Jiangnan Mingju Community/Rongyuan".
  • In another embodiment of the present application, the first grammar model extracts as the mark address the characters that follow "town, township" and precede "building and house number".
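  • The rule "everything after the town level and before the building and house number is the mark address" can be approximated with a regular expression. The sketch below is an assumption-laden illustration: the building markers in `DETAIL_PATTERN`, the function name, and the Chinese example string are chosen here and are not given in the source.

```python
import re

# Assumed rule: the detail address starts at the first run of digits that is
# followed by a building marker. The markers listed are illustrative only.
DETAIL_PATTERN = re.compile(r"\d+(?:座|栋|幢|号楼)")


def split_mark_and_detail(remainder):
    """Split the text after the town/township level into (mark, detail)."""
    match = DETAIL_PATTERN.search(remainder)
    if match is None:
        # No building number found: treat the whole remainder as mark address.
        return remainder, ""
    return remainder[:match.start()], remainder[match.start():]


mark, detail = split_mark_and_detail("江南名居小区荣园1座306")
print(mark)    # 江南名居小区荣园
print(detail)  # 1座306
```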
  • The first address and the second address in this embodiment each include a range address and a mark address, arranged from left to right.
  • For example, the first address is "Jiangnan Mingju Community Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province".
  • The first word segmentation group, corresponding to the first address, is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan".
  • The second word segmentation group, corresponding to the second address, is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan".
  • The first address and the second address each further include a detail address, alongside the mark addresses that are segmented according to the grammar model in the natural language processing model.
  • The detail address in this embodiment is the specific "building and house number". It has little influence on the similarity of two addresses, and other embodiments may ignore it altogether; in some application scenarios, however, the detail address must be exact to meet business needs.
  • The second grammar model of this embodiment includes, but is not limited to, patterns such as "a certain building", "a certain building and a certain floor", and "a certain building and a certain room".
  • S15 Combine the first, second, and third word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address, and combine the first, second, and third word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address in this embodiment each include a range address, a mark address, and a detail address, arranged from left to right.
  • For example, the first address is "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "502, Building 1, Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province".
  • The segmentation of the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan/1 Block/306".
  • The segmentation of the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan/1 Block/502".
  • Further, the range address includes the four administrative levels of province, city, district/county, and township/town, the mark address includes a community name or a building name, and step S3 of obtaining the matching results of all the first segments and all the second segments according to a second preset rule includes:
  • S31 Map all the first segments and all the second segments, in order of administrative level from high to low, into two structure trees with the same structure. Each structure tree includes multiple nodes, and each node corresponds one-to-one to a first segment or a second segment.
  • A node corresponds to at least one segment, or to multiple word segments of the same administrative level.
  • For example, the word segment "Guangdong Province", corresponding to the highest administrative level "province" in the address, is used as the root node; the word segment "Foshan City", corresponding to the next-level child node "city", is connected below it; and so on down to the end node, for example "1 Block 502".
  • The root node and the end node correspond to different administrative levels. The tree can represent a full address covering all administrative levels, or a short address covering only some of them.
  • The matching calculation in this embodiment pairs the nodes of the two structure trees according to their administrative levels and calculates a matching value for each pair of corresponding nodes.
  • The matching value is the number of matching segments divided by the total number of segments corresponding to the node. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the nodes match; otherwise they do not.
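  • The two structure trees and the per-node matching value can be sketched as follows. This is a simplified illustration under stated assumptions: each tree is modelled as a single chain with one node per level, the level names and helper functions are hypothetical, and every node is compared by exact equality only (the level-specific matchers described in the text are left out).

```python
LEVELS = ["province", "city", "district", "town", "community", "building"]


def build_tree(segments_by_level):
    """Build a chain-shaped tree with one node per administrative level."""
    root = None
    parent = None
    for level in LEVELS:
        node = {"level": level,
                "segments": segments_by_level.get(level, []),
                "child": None}
        if parent is None:
            root = node
        else:
            parent["child"] = node
        parent = node
    return root


def paired_nodes(tree_a, tree_b):
    """Walk both trees in parallel, yielding the nodes of the same level."""
    a, b = tree_a, tree_b
    while a is not None and b is not None:
        yield a, b
        a, b = a["child"], b["child"]


def node_match_value(node_a, node_b):
    """Matching segments divided by all segments of the node."""
    segs_a, segs_b = node_a["segments"], node_b["segments"]
    total = max(len(segs_a), len(segs_b))
    if total == 0:
        return 0.0  # assumption: a level missing from both addresses scores 0
    matched = sum(1 for x, y in zip(segs_a, segs_b) if x == y)
    return matched / total


first = build_tree({
    "province": ["Guangdong"], "city": ["Foshan City"], "district": ["Nanhai"],
    "town": ["Guicheng"], "community": ["Jiangnan Mingju Community", "Rongyuan"],
    "building": ["1", "306"],
})
second = build_tree({
    "province": ["Guangdong"], "city": ["Foshan City"], "district": ["Nanhai"],
    "town": ["Guicheng"], "community": ["Jiangnan Mingju", "Rongyuan"],
    "building": ["1", "502"],
})
values = [node_match_value(a, b) for a, b in paired_nodes(first, second)]
print(values)  # [1.0, 1.0, 1.0, 1.0, 0.5, 0.5]
```

  • With exact comparison only, the community node scores 0.5 because "Jiangnan Mingju Community" and "Jiangnan Mingju" differ; a containment-based matcher for that level would score it as a full match.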
  • Different weights are set according to how strongly the segments at each administrative level affect the address, which makes it easier to meet varying business requirements.
  • For example, the second weight, corresponding to the mark address, is higher than the first weight, corresponding to the range address.
  • S34 Multiply each matching value by its corresponding weight to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address.
  • The matching rate in this embodiment is calculated as follows: the matching result of each segment multiplied by the configured weight of that segment gives the matching rate of the segment, and the matching rates of all segments are added to give the matching result between the first address and the second address.
  • S35 Use the sum of the first matching rate, the second matching rate, and the third matching rate as the matching result of all the first segments and all the second segments.
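  • The calculation in S34 and S35 can be written compactly. The symbols are introduced here for illustration and do not appear in the source: m_i is the matching value of node i, w_i is its weight, and the three sums run over the nodes of the range, mark, and detail addresses respectively.

```latex
\[
R_{\text{range}} = \sum_{i \in \text{range}} w_i\, m_i, \qquad
R_{\text{mark}} = \sum_{i \in \text{mark}} w_i\, m_i, \qquad
R_{\text{detail}} = \sum_{i \in \text{detail}} w_i\, m_i
\]
\[
R = R_{\text{range}} + R_{\text{mark}} + R_{\text{detail}} = \sum_{i} w_i\, m_i,
\qquad
m_i = \frac{\text{number of matching segments at node } i}{\text{total number of segments at node } i}
\]
```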
  • Further, step S32 of obtaining the matching value corresponding to each node of the two structure trees includes:
  • S321 According to the node correspondence, perform exact full matching between each first segment of the range address in the first address and the corresponding second segment of the range address in the second address, to obtain each first matching value.
  • In this embodiment, nodes at different administrative levels are matched in different ways.
  • The four administrative levels of province, city, district/county, and township/town are matched by exact full matching: if the corresponding characters are 100% identical, the nodes match; otherwise they do not.
  • For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the two nodes match.
  • S322 According to the node correspondence, perform model keyword matching between each first segment of the mark address in the first address and the corresponding second segment of the mark address in the second address, to obtain each second matching value.
  • The segments of the mark address are matched by NLP (Natural Language Processing) model matching, and a match can be established when one segment contains, or is contained in, the other.
  • For example, "Jiangnan Mingju Community/Rongyuan" and "Jiangnan Mingju/Rongyuan" are not identical character for character, but because "Jiangnan Mingju Community" contains the characters "Jiangnan Mingju", the segments still match one-to-one.
  • S323 According to the node correspondence, perform digit matching between each first segment of the detail address in the first address and the corresponding second segment of the detail address in the second address, to obtain each third matching value.
  • For example, if the detail address contains a first specified number of segments, of which a second specified number satisfy the matching relationship, the matching value of the detail address is the second specified number divided by the first specified number.
  • S324 Collect the first matching values, the second matching values, and the third matching values to obtain the matching values corresponding to the nodes of the two structure trees.
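  • The three matching modes of S321 to S323 can be sketched as small functions. These are stand-ins written for illustration: the containment test replaces the NLP model matching with a plain substring check, and the function names and the digit-extraction rule are assumptions.

```python
import re


def exact_match(a, b):
    """S321-style full match: 1.0 only if the strings are identical."""
    return 1.0 if a == b else 0.0


def containment_match(a, b):
    """S322-style match: 1.0 if one segment contains the other."""
    return 1.0 if a and b and (a in b or b in a) else 0.0


def numbers(text):
    """All digit runs in a segment, e.g. 'Building 1' -> ['1']."""
    return re.findall(r"\d+", text)


def numeric_match(segs_a, segs_b):
    """S323-style match: fraction of detail segments whose numbers agree."""
    total = max(len(segs_a), len(segs_b))
    if total == 0:
        return 0.0
    matched = sum(
        1 for x, y in zip(segs_a, segs_b)
        if numbers(x) and numbers(x) == numbers(y)
    )
    return matched / total


print(exact_match("Guangdong", "Guangdong"))                              # 1.0
print(containment_match("Jiangnan Mingju Community", "Jiangnan Mingju"))  # 1.0
print(numeric_match(["1 Block", "306"], ["Building 1", "502"]))           # 0.5
```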
  • For example, the word segmentation group of the first address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju Community/Rongyuan/1/306;
  • the word segmentation group of the second address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju/Rongyuan/1/502;
  • The first and second addresses are each divided into six administrative levels (province; city; district/county; town/township/road; community; building and house number), which map to six nodes, and the default weights of the nodes are "0.1/0.1/0.1/0.1/0.5/0.1".
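  • Applying the default weights to this example gives a concrete matching rate. The per-node matching values below are worked out here from the rules described in the text (exact match for the first four nodes, containment for the community node, one of two numbers agreeing for the building node); the source does not state the resulting figure, and the function name is hypothetical.

```python
DEFAULT_WEIGHTS = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]


def weighted_matching_rate(match_values, weights):
    """Sum of each node's matching value multiplied by its weight."""
    return sum(w * m for w, m in zip(weights, match_values))


# province, city, district, town: identical in both addresses -> 1.0 each
# community: "Jiangnan Mingju Community" contains "Jiangnan Mingju" -> 1.0
# building and house number: "1" matches, "306" vs "502" does not -> 0.5
match_values = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]

rate = weighted_matching_rate(match_values, DEFAULT_WEIGHTS)
print(round(rate, 2))  # 0.95
```

  • Whether 0.95 counts as "the same address" depends on the preset threshold, which the source leaves to the application.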
  • step S33 of separately acquiring the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address the method includes:
  • S331 Input a specified number of training samples pre-labeled with similarity values into the natural language processing model for training.
  • S332 Adjust the training parameter to a first parameter at which the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value.
  • The default weights in this embodiment are obtained by training the model. The training parameters are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value, or within a preset deviation range.
  • The training parameters include the weight values, so training determines each weight value.
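  • The idea of tuning weights until the output similarity agrees with labelled similarity can be illustrated with plain gradient descent on a linear score. This is a substitute technique chosen for illustration, not the natural language processing model of the embodiment; the samples, the grouping into three weights (range, mark, detail), the learning rate, and the function names are all assumptions.

```python
def predict(weights, match_values):
    """Weighted sum of matching values, used as the predicted similarity."""
    return sum(w * m for w, m in zip(weights, match_values))


def mean_squared_error(weights, samples):
    return sum((predict(weights, x) - y) ** 2 for x, y in samples) / len(samples)


def fit_weights(samples, n_weights, learning_rate=0.05, epochs=2000):
    """Adjust the weights step by step to reduce the squared error."""
    weights = [1.0 / n_weights] * n_weights
    for _ in range(epochs):
        grads = [0.0] * n_weights
        for x, y in samples:
            error = predict(weights, x) - y
            for i in range(n_weights):
                grads[i] += 2 * error * x[i] / len(samples)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical labelled samples: (range, mark, detail) matching values
# paired with a pre-labelled similarity value.
samples = [
    ((1, 1, 1), 1.0), ((1, 1, 0), 0.9), ((1, 0, 1), 0.5),
    ((0, 1, 1), 0.6), ((1, 0, 0), 0.4), ((0, 0, 1), 0.1),
]
initial = [1.0 / 3] * 3
fitted = fit_weights(samples, n_weights=3)
print(mean_squared_error(initial, samples), mean_squared_error(fitted, samples))
```

  • The labels above were constructed to be consistent with group weights of 0.4, 0.5, and 0.1, so the fitted weights should move toward those values; real labelled data would only be matched within some deviation range.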
  • Other embodiments of the present application may also adjust one or more of the default weights according to specific application scenarios, so that the matching model is more in line with the current application scenarios.
  • Further, before step S11 of segmenting the range addresses of the first address and the second address respectively according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address, the method includes:
  • S10 Call the address database to perform address correction on the first address and the second address respectively according to a third preset rule.
  • The first address or the second address in this embodiment may be inconsistent with the address data in the national address database. Address correction, including address completion, removal of qualifiers, and so on, can be performed by calling the address database.
  • Address completion works in two ways: a higher-level node can be completed from its child nodes (for example, Nanhai District can be completed upwards with Foshan City), or an intermediate node can be completed from the nodes before and after it (for example, Foshan City and Guicheng Town together allow Nanhai District to be filled in between them).
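  • Both kinds of completion follow from a child-to-parent lookup in the tree-shaped address database. The sketch below assumes a tiny hypothetical `PARENT` table and a function `complete_address`; it rebuilds the chain from the lowest known level and does not check the supplied upper levels for consistency.

```python
# Toy child -> parent table standing in for the tree-shaped address database.
PARENT = {
    "Guicheng Town": "Nanhai District",
    "Nanhai District": "Foshan City",
    "Foshan City": "Guangdong Province",
}


def complete_address(segments):
    """Fill in missing ancestors so the chain runs unbroken from the root.

    segments are ordered from the highest to the lowest administrative level.
    """
    if not segments:
        return []
    # Walk upwards from the lowest known segment until no parent is recorded.
    chain = [segments[-1]]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    chain.reverse()
    return chain


# Upward completion: the district alone implies its city and province.
print(complete_address(["Nanhai District"]))
# ['Guangdong Province', 'Foshan City', 'Nanhai District']

# Intermediate completion: the missing district is filled in between.
print(complete_address(["Foshan City", "Guicheng Town"]))
# ['Guangdong Province', 'Foshan City', 'Nanhai District', 'Guicheng Town']
```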
  • Further, before step S1 of obtaining the second word segmentation group, the method includes:
  • S1a Index a specified number of unstructured address records pre-stored in the index server to obtain the preset index structure.
  • The data pre-stored in the index server of this embodiment is unstructured data, stored in column-oriented form as key-value pairs.
  • Unstructured data here refers to data such as text, images, and voice, held in column storage based on NoSQL storage technology.
  • The volume of data is very large, so the distributed architecture of NoSQL technology is needed for storage and calculation.
  • The index server combines NoSQL distributed storage with an index structure to achieve fast, real-time query and calculation over massive data.
  • NoSQL is a non-relational, open-source database technology.
  • Elasticsearch stores data as key-value pairs with inverted indexes, and performs its calculations mainly in memory to achieve fast real-time computation.
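  • The role of an inverted index in narrowing down candidate addresses can be shown with a toy in-memory version. This illustrates the general concept only, not Elasticsearch's internal data structures or API; the record ids, the sample records, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word segment to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, segments in documents.items():
        for segment in segments:
            index[segment].add(doc_id)
    return index


def candidates(index, query_segments):
    """Ids of the records that contain every query segment."""
    sets = [index.get(segment, set()) for segment in query_segments]
    return set.intersection(*sets) if sets else set()


stored = {
    1: ["Guangdong Province", "Foshan City", "Nanhai District", "Guicheng Town"],
    2: ["Guangdong Province", "Guangzhou City", "Tianhe District"],
}
index = build_inverted_index(stored)
print(candidates(index, ["Guangdong Province", "Foshan City"]))  # {1}
```

  • Looking up segments in the index avoids scanning every stored address, which is what makes real-time queries over a large address library feasible.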
  • S1b Receive the interface plug-in uploaded to the designated directory of the index server, where the interface plug-in is formed by packaging the preset matching algorithm.
  • The index server in this embodiment is an open-source component that supports a plug-in mode.
  • The interface plug-in can extend the server's org.elasticsearch.plugins.Plugin class to provide a custom address matching algorithm plug-in, which is loaded and used after the index server is restarted.
  • S1d Establish a calculation association between the preset index structure and the interface plug-in by running the configuration parameters.
  • In this embodiment, once the preset matching algorithm has been developed, it is packaged, uploaded to the specified directory of the index server, and given the relevant configuration parameters. Loading and running the configuration parameters establishes the calculation association between the preset index structure and the interface plug-in, so that the address matching algorithm in the plug-in can be called to carry out the matching calculation for the first address in the preset index structure, thereby implementing the address data query.
  • the index server in this embodiment is an open-source Elasticsearch component (Elasticsearch is used for distributed full-text search), which provides a full-text search engine with distributed computing capabilities based on a RESTful web interface, and can perform real-time and fast queries on massive data.
  • The query steps include: (1) Import the addresses of the massive address library into the underlying storage of Elasticsearch as key-value pairs through the Elasticsearch data import interface, and index the keys. (2) Adapt the address matching model to the Elasticsearch custom extended search model, add it to the extension module of the Elasticsearch master node, and restart Elasticsearch, so that it becomes an address matching model that runs on Elasticsearch's distributed storage and high-concurrency computing.
  • This embodiment uses different matching methods, different matching models, and different matching weights for the segments at different administrative levels of the first address.
  • For example, the first address in this embodiment is divided into six segments, corresponding to six administrative levels and to six nodes in the tree structure.
  • The first four of the six administrative levels use the same matching model, in which characters are matched one by one;
  • the fifth administrative level uses a fuzzy matching model based on containment (one segment containing, or being contained in, the other);
  • the sixth administrative level uses the digit matching model.
  • In this embodiment, a filtering mechanism is built into the matching calculation. First, the target segments of the four administrative levels "province; city; district/county; town, township, road" are matched exactly, character by character. When the result for these target segments is below a preset threshold, it is determined that the preset index structure contains no address data that satisfies the preset matching condition with the first address, and the matching conclusion is output directly, which reduces the amount of matching calculation and improves response speed.
  • With this filtering mechanism, at least 90% of addresses can be filtered out, so an address needs to be fully matched against only the remaining 10% or fewer, which greatly saves computing resources.
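  • The filtering step can be sketched as a cheap pre-check on the four range-level segments before any full matching is attempted. The record layout, the function names, and the default threshold of 1.0 are assumptions made for this illustration; the source only states that a preset threshold is used.

```python
def range_match_rate(range_a, range_b):
    """Fraction of range-level segments that are identical, position by position."""
    matched = sum(1 for x, y in zip(range_a, range_b) if x == y)
    return matched / max(len(range_a), len(range_b), 1)


def prefilter(query_range, stored_addresses, threshold=1.0):
    """Keep only the stored addresses whose range levels pass the threshold."""
    return [
        address for address in stored_addresses
        if range_match_rate(query_range, address["range"]) >= threshold
    ]


query_range = ["Guangdong", "Foshan City", "Nanhai", "Guicheng"]
stored_addresses = [
    {"id": 1, "range": ["Guangdong", "Foshan City", "Nanhai", "Guicheng"]},
    {"id": 2, "range": ["Guangdong", "Foshan City", "Chancheng", "Zumiao"]},
    {"id": 3, "range": ["Guangdong", "Guangzhou City", "Tianhe", "Shipai"]},
]
survivors = prefilter(query_range, stored_addresses)
print([address["id"] for address in survivors])  # [1]
```

  • Only the survivors go on to the more expensive mark-address and detail-address matching, which is where the saving in computing resources comes from.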
  • In the address matching device of an embodiment of the present application, the first address is the address to be retrieved, as input by a user, and the second address is stored in an index server. The device includes:
  • a word segmentation module 1, configured to invoke the preset matching algorithm and segment the first address and the second address respectively according to the first preset rule, to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation.
  • the above-mentioned first address and the second address are written according to the administrative level from high to low, and from range to specific.
  • the first preset rule of this embodiment has different word segmentation rules according to the administrative level in the address.
  • the word segmentation corresponding to the four administrative levels of province/city/district, county/township, and town is commonly used nationwide.
  • the general address database performs word segmentation. For example, in Guicheng Town, Nanhai District, Foshan City, Guangdong province, the word segmentation results are as follows: Guangdong province/Foshan City/Nanhai District/Guicheng Town.
  • word segmentation is performed through semantic segmentation.
  • The dividing module 2 is configured to divide the first address into a plurality of first segments according to the first word segmentation group, and divide the second address into a plurality of second segments according to the second word segmentation group.
  • The address is divided into segments and/or administrative levels according to its word segmentation group, and each segment or administrative level corresponds to one or more word segments.
  • The terms "first", "second", etc. in this embodiment are used only for distinction and not for limitation. Similar terms elsewhere serve the same purpose and are not explained again.
  • The word segmentation group is the arrangement of the word segments of the actual address, formed in the writing order of the original address.
  • For example, the long name "development zone of a certain city" corresponds to two word segments, "a certain city/development zone", whereas segments are formed from the word segments according to administrative level, so "development zone of a certain city" belongs to a single segment.
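  • As a rough illustration of dividing a word segmentation group into segments, the sketch below assigns each word segment to a range, mark, or detail segment. The suffix list and the rule that a word segment containing a digit is a detail segment are assumptions chosen for the English rendering of the example, not rules stated in the source.

```python
import re
from typing import Dict, List

# Hypothetical suffix rules for the English rendering used in this text.
RANGE_SUFFIXES = ("Province", "City", "District", "County", "Township", "Town")
DETAIL_PATTERN = re.compile(r"\d")  # assumed: a word segment with a digit is detail


def divide(word_group: str) -> Dict[str, List[str]]:
    """Assign each word segment of a slash-separated group to a segment class."""
    segments: Dict[str, List[str]] = {"range": [], "mark": [], "detail": []}
    for word in (w.strip() for w in word_group.split("/")):
        if word.endswith(RANGE_SUFFIXES):
            segments["range"].append(word)
        elif DETAIL_PATTERN.search(word):
            segments["detail"].append(word)
        else:
            segments["mark"].append(word)
    return segments


result = divide(
    "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/"
    "Jiangnan Mingju Community/Rongyuan/1 Block/306"
)
```

  • Here the two word segments "Jiangnan Mingju Community" and "Rongyuan" end up in the same segment, which mirrors the point that one segment may hold several word segments.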
  • The first obtaining module 3 is configured to obtain the matching results of all the first segments and all the second segments according to a second preset rule.
  • The first segments and the second segments are matched one by one according to the correspondence of administrative levels to obtain the matching result.
  • For example, the first segment corresponding to the province level of the first address is compared with the second segment corresponding to the province level of the second address, which improves the symmetry and reliability of the comparison.
  • The judging module 4 is configured to judge whether the first address and the second address are the same according to the matching result.
  • This embodiment compares the first address and the second address one to one through the correspondence of administrative levels.
  • If the matching rate of the first address and the second address reaches the preset range, it is determined that the first address and the second address are the same; otherwise they are determined to be different.
  • Alternatively, the segment matching degree of a designated administrative level is required to reach 100% before the first address and the second address can be determined to be the same, and otherwise they are determined to be different, in order to improve matching accuracy.
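  • One possible reading of the judging step is sketched below. Combining both conditions in a single function, the threshold of 0.9, and the choice of the fifth node as the designated level are all assumptions for illustration; the source leaves the preset range and the designated level open.

```python
from typing import Sequence


def same_address(match_rate: float, level_matches: Sequence[float],
                 threshold: float = 0.9, required_level: int = 4) -> bool:
    """Decide identity from the overall rate and one mandatory level.

    threshold and required_level are illustrative values, not from the source.
    """
    if level_matches[required_level] < 1.0:
        return False  # the designated level must match 100%
    return match_rate >= threshold


decision = same_address(0.95, [1.0, 1.0, 1.0, 1.0, 1.0, 0.5])
```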
  • The first address in this embodiment is the address to be queried entered by the user. Its data composition structure is not limited, and matching calculation can be performed on it in any case, which improves flexibility for the user.
  • For example, the first address includes data arranged in sequence according to six administrative levels: province, city/district/county/town, township/road, community, building, and house number, or consists of data in which one or several administrative levels are missing.
  • The preset matching condition in this embodiment includes the matching rate reaching a preset threshold, or the mark data in the first address being 100% matched, and so on.
  • The above-mentioned mark data refers to the data in the first address that can specify a geographic location, such as the name of a certain community or a certain building.
  • For example, the mark data of the first address is the data after the "town, township" administrative level and before the "building and house number".
  • The word segmentation module 1 includes:
  • The first word segmentation unit is configured to segment the range addresses of the first address and the second address, respectively, according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address.
  • The range address of this embodiment includes at least one of the four administrative levels of province, city/district, county, and township/town.
  • The range address in this embodiment is segmented through the pre-associated address dictionary.
  • The address dictionary is the corresponding vocabulary in a national address database, and address names are segmented by pre-associating it with the natural language processing model.
  • For example, this embodiment adds a crawled address library when performing word segmentation with the open-source word segmentation package jieba, uses it together with the national address library to correct the address to be segmented, and then performs word segmentation by administrative level to improve the accuracy of word segmentation.
  • Whether to call the address dictionary is decided by judging whether the administrative level contained in the current address is one that corresponds to the address dictionary; if so, the address dictionary is called for word segmentation.
  • For example, the word segmentation result is: Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community Rongyuan Block 306, and the first word segmentation part is Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
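  • Dictionary-based segmentation of the range address can be sketched with a greedy longest-match loop. This is a simplified stand-in and not the jieba-based procedure of the embodiment; the toy dictionary, the space-separated English input, and the `segment_range` helper are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Toy stand-in for the national address dictionary (hypothetical entries).
ADDRESS_DICT: Set[str] = {
    "Guangdong Province", "Foshan City", "Nanhai District", "Guicheng Town",
}
MAX_WORDS = 3  # upper bound on the length of a dictionary entry, in words


def segment_range(address: str) -> Tuple[List[str], str]:
    """Greedy longest-match segmentation of the leading range address.

    Returns the matched dictionary entries and the unmatched remainder.
    """
    words = address.split()
    parts: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ADDRESS_DICT:
                parts.append(phrase)
                i += n
                break
        else:
            break  # the first position with no dictionary entry ends the range address
    return parts, " ".join(words[i:])


parts, rest = segment_range(
    "Guangdong Province Foshan City Nanhai District Guicheng Town "
    "Jiangnan Mingju Community Rongyuan Block 1 306"
)
```

  • The remainder returned in `rest` is the part that would be handed to the grammar models for the mark address and the detail address.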
  • The second word segmentation unit is configured to segment the mark addresses of the first address and the second address, respectively, according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address.
  • The mark address in this embodiment includes information that can specify a geographic location, such as the name of a certain community or a certain building, for example "Jiangnan Mingju Community Rongyuan" in the above address.
  • The mark address is segmented according to the first grammar model in the natural language processing model.
  • The first grammar model includes, but is not limited to, "a certain community" and "a certain building". For example, for "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town", the corresponding second word segmentation part is "Guicheng/Jiangnan Mingju Community/Rongyuan".
  • In another embodiment of the present application, the first grammar model extracts the characters after "town, township" and before "building and house number" as the mark address.
  • The first component unit is configured to combine the first word segmentation part and the second word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and to combine the first word segmentation part and the second word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address in this embodiment each include a range address and a mark address, arranged from left to right.
  • For example, if the first address is "Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province", then the first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan".
  • Further, the first address and the second address each also include a detail address, and the word segmentation module 1 includes:
  • The third word segmentation unit is configured to segment the detail addresses of the first address and the second address, respectively, according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address.
  • The detail address in this embodiment is the specific "building and house number", which has little influence on the similarity of two addresses, and this part can even be ignored in other embodiments. However, in some specific application scenarios, the detail address needs to be accurate to meet business needs.
  • The second grammar model of this embodiment includes, but is not limited to, "a certain building", "a certain building and a certain floor", "a certain building and a certain room", and so on.
  • The second component unit is configured to combine the first, second, and third word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address, and to combine the first, second, and third word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address in this embodiment each include a range address, a mark address, and a detail address, arranged from left to right.
  • For example, if the first address is "306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province" and the second address is "502, Block 1, Jiangnan Mingju Rongyuan, Guicheng Town, Nanhai District, Foshan City, Guangdong Province", then the first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan/1 Block/306", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan/1 Block/502", so that segments can then be divided.
  • The first obtaining module 3 includes:
  • The mapping unit is configured to map all the first segments and all the second segments, in order of administrative level from high to low, into two structure trees with the same structure, wherein each structure tree includes multiple nodes, and each node corresponds to one of the first segments or one of the second segments.
  • Each node corresponds to at least one segment, or a node corresponds to multiple word segments of the same administrative level.
  • For example, the word segment "Guangdong Province" corresponding to the highest administrative level "province" contained in the address is used as the root node, the word segment "Foshan City" corresponding to the next-level child node "city" is connected to it, and so on down to the end node "1 Block 502".
  • The root node and the end node correspond to different administrative levels. The tree can represent a full address covering all administrative levels or a short address covering only some administrative levels.
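  • The mapping of segments to a structure tree can be sketched as a chain of nodes ordered by administrative level. The `Node` class, the six level names, and the choice to skip missing levels for short addresses are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical names for the six levels, ordered from high to low.
LEVELS = ["province", "city", "district", "town", "mark", "detail"]


@dataclass
class Node:
    level: str
    words: List[str]                # one or more word segments of this level
    child: Optional["Node"] = None  # next lower administrative level


def build_tree(segments: Dict[str, List[str]]) -> Optional[Node]:
    """Chain the available levels from high to low; missing levels are skipped."""
    root: Optional[Node] = None
    tail: Optional[Node] = None
    for level in LEVELS:
        words = segments.get(level)
        if not words:
            continue  # a short address may omit some levels
        node = Node(level, list(words))
        if tail is None:
            root = node
        else:
            tail.child = node
        tail = node
    return root


def levels_of(node: Optional[Node]) -> List[str]:
    """List the levels along the chain, starting at the given node."""
    out: List[str] = []
    while node is not None:
        out.append(node.level)
        node = node.child
    return out


tree = build_tree({
    "province": ["Guangdong Province"], "city": ["Foshan City"],
    "district": ["Nanhai District"], "town": ["Guicheng Town"],
    "mark": ["Jiangnan Mingju", "Rongyuan"], "detail": ["1 Block", "502"],
})
short_tree = build_tree({"district": ["Nanhai District"], "town": ["Guicheng Town"]})
```

  • Nodes of two such trees can then be paired by their `level` field, which is how the correspondence of administrative levels is used in the matching steps.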
  • The first obtaining unit is configured to obtain the matching values corresponding to the respective nodes of the two structure trees.
  • The correspondence between the nodes of the two structure trees is established according to the correspondence of administrative levels, and the matching value of each node is obtained according to this correspondence.
  • The matching value is the number of matching word segments of the node divided by the number of all word segments corresponding to the node. For example, if the "province" node of the first address is assigned the value "Guangdong" and the "province" node of the second address is also assigned the value "Guangdong", the nodes match; otherwise they do not match.
  • The second obtaining unit is configured to obtain the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address, respectively.
  • Different weights are set according to the different influence that the segment of each administrative level has on the address, which improves flexibility in meeting business requirements.
  • For example, the second weight corresponding to the mark address is higher than the first weight corresponding to the range address.
  • The calculation unit is configured to calculate matching rates by multiplying each matching value by its corresponding weight, to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address.
  • The matching rate in this embodiment is calculated as follows: the matching result of each segment multiplied by the configured weight of that segment equals the matching rate of that segment, and the matching rates of all segments are added to obtain the matching result between the first address and the second address.
  • The summation unit is configured to sum the first matching rate, the second matching rate, and the third matching rate, and to use the sum as the matching result of all the first segments and all the second segments.
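  • The calculation described above can be written as a single formula. The symbols are introduced here for illustration and do not appear in the source: m_k is the matching value of node k, w_k is its configured weight, and R is the matching result of the two addresses.

```latex
R \;=\; \underbrace{\sum_{k \in \text{range}} w_k\, m_k}_{\text{first matching rate}}
  \;+\; \underbrace{\sum_{k \in \text{mark}} w_k\, m_k}_{\text{second matching rate}}
  \;+\; \underbrace{\sum_{k \in \text{detail}} w_k\, m_k}_{\text{third matching rate}},
\qquad 0 \le m_k \le 1 .
```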
  • The first obtaining unit includes:
  • The first matching subunit is configured to perform exact full matching, one to one according to the node correspondence, between each first segment corresponding to the range address in the first address and each second segment corresponding to the range address in the second address, to obtain each first matching value.
  • Nodes corresponding to different administrative levels are matched in different ways in this embodiment.
  • For example, the four administrative levels of province, city/district, county, and township/town are matched by exact full matching, that is, the nodes match if the corresponding characters are 100% identical, and otherwise they do not match.
  • For example, if the "province" node corresponding to the first address is assigned the value "Guangdong" and the "province" node corresponding to the second address is also assigned the value "Guangdong", the nodes match.
  • The second matching subunit is configured to perform model keyword matching, one to one according to the node correspondence, between each first segment corresponding to the mark address in the first address and each second segment corresponding to the mark address in the second address, to obtain each second matching value.
  • The segments of the mark address are matched through NLP (Natural Language Processing) model matching, and the matching relationship can be established by one segment containing, or being contained in, the other.
  • For example, for "Jiangnan Mingju Community/Rongyuan" and "Jiangnan Mingju/Rongyuan", the characters do not fully match, but "Jiangnan Mingju Community" contains the characters "Jiangnan Mingju", so a one-to-one matching relationship still holds.
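  • The containment relationship for mark addresses can be sketched as a plain substring test. The helper names and the rule that groups of unequal length score 0.0 are simplifying assumptions; the source does not say how unequal groups are handled.

```python
from typing import List


def contains_match(a: str, b: str) -> bool:
    """True when one non-empty string contains the other."""
    return bool(a) and bool(b) and (a in b or b in a)


def mark_match_value(first: List[str], second: List[str]) -> float:
    """Share of position-aligned mark word segments related by containment."""
    if not first or len(first) != len(second):
        return 0.0  # assumed handling of empty or unequal groups
    hits = sum(contains_match(x, y) for x, y in zip(first, second))
    return hits / len(first)


value = mark_match_value(
    ["Jiangnan Mingju Community", "Rongyuan"],
    ["Jiangnan Mingju", "Rongyuan"],
)
```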
  • The third matching subunit is configured to perform numeric matching, one to one according to the node correspondence, between each first segment corresponding to the detail address in the first address and each second segment corresponding to the detail address in the second address, to obtain each third matching value.
  • For example, if the detail address in this embodiment includes a first specified number of segments, and the number of segments that satisfy the matching relationship is a second specified number, the matching value of the detail address is the second specified number divided by the first specified number.
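  • A minimal sketch of the numeric matching value follows. Extracting the first integer of each detail word segment and comparing by position are assumptions for illustration; the source only states the ratio of matching segments to all segments.

```python
import re
from typing import List


def numbers(words: List[str]) -> List[int]:
    """Extract the first integer from each detail word segment, e.g. "1 Block" gives 1."""
    found = (re.search(r"\d+", w) for w in words)
    return [int(m.group()) for m in found if m]


def detail_match_value(first: List[str], second: List[str]) -> float:
    """Matching segments divided by the number of detail segments in the first address."""
    a, b = numbers(first), numbers(second)
    if not a:
        return 0.0
    hits = sum(1 for x, y in zip(a, b) if x == y)
    return hits / len(a)


value = detail_match_value(["1 Block", "306"], ["1 Block", "502"])
```

  • For the example addresses, the block number agrees and the house number differs, so one of two detail segments matches.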
  • The summarizing subunit is configured to summarize each first matching value, each second matching value, and each third matching value, to obtain the matching values corresponding to the respective nodes of the two structure trees.
  • For example, the word segmentation group corresponding to the first address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju Community/Rongyuan/1/306; the word segmentation group corresponding to the second address is: Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju/Rongyuan/1/502; the first and second addresses are each divided into six nodes according to the six administrative levels (province, city/district/county/town, township/road, community, building, and house number), and the default weights of the nodes are "0.1/0.1/0.1/0.1/0.5/0.1".
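  • The example can be worked through numerically. The per-node matching values below are an assumed reading of the example: the first four nodes match exactly, the fifth matches by containment, and the sixth matches on one of its two numbers ("1" agrees, "306" and "502" differ).

```python
import math

weights = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]       # default weights from the text
match_values = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]  # assumed per-node matching values

# Weighted sum of the node matching values.
match_result = sum(w * m for w, m in zip(weights, match_values))
is_expected = math.isclose(match_result, 0.95)
```

  • Under these assumptions the matching result is 0.1 × 4 + 0.5 × 1 + 0.1 × 0.5 = 0.95, and most of it comes from the mark address because of its larger weight.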
  • The first obtaining module 3 includes:
  • The input unit is configured to input a specified number of training samples with pre-labeled similarity values into the natural language processing model for training.
  • The adjustment unit is configured to adjust the training parameters to first parameters such that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value.
  • The corresponding unit is configured to map the weight values in the first parameters to the first weight, the second weight, and the third weight, respectively, according to the node correspondence.
  • The default weights in this embodiment are obtained by training the model. The training parameters are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value or within a preset deviation range. The training parameters include each weight value, so each weight value is determined in this way.
  • Other embodiments of the present application may also adjust one or more of the default weights according to the specific application scenario, so that the matching model better fits the current application scenario.
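  • How weights might be fitted to pre-labeled similarity values can be sketched with a simple linear model trained by gradient descent. This is a stand-in for the training of the natural language processing model, whose details the source does not give; the `fit_weights` helper, the learning rate, the epoch count, and the synthetic samples are all assumptions.

```python
from typing import List, Sequence, Tuple

# (matching values for range, mark, detail; pre-labeled similarity)
Sample = Tuple[Sequence[float], float]


def fit_weights(samples: List[Sample], lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit one weight per group by gradient descent on a squared-error loss."""
    dim = len(samples[0][0])
    weights = [1.0 / dim] * dim  # start from equal weights
    for _ in range(epochs):
        grad = [0.0] * dim
        for values, label in samples:
            error = sum(w * v for w, v in zip(weights, values)) - label
            for k, v in enumerate(values):
                grad[k] += error * v / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Synthetic samples made up for this sketch.
samples: List[Sample] = [
    ([1, 1, 1], 1.0), ([1, 1, 0.5], 0.95), ([1, 0, 1], 0.5),
    ([0, 1, 1], 0.6), ([1, 0, 0], 0.4), ([0, 0, 1], 0.1),
]
range_w, mark_w, detail_w = fit_weights(samples)
```

  • The synthetic labels were generated from group weights of 0.4, 0.5, and 0.1, which correspond to the default node weights when the four range nodes are taken together, so the fit should recover values close to those.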
  • The word segmentation module 1 includes:
  • The calling unit is configured to call the address database to perform address correction on the first address and the second address, respectively, according to a third preset rule.
  • The first address or the second address in this embodiment may be inconsistent with the address data in the national address database. Address correction can be performed by calling the address database, and includes address completion, removal of qualifiers, and so on.
  • For example, address completion may complete the root node based on child nodes, as when Nanhai District is completed upward with Foshan City, or complete an intermediate node based on the nodes before and after it, as when Foshan City and Guicheng Town are completed with Nanhai District in between.
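  • Address completion can be sketched with a parent lookup table. The `PARENT` table is a toy stand-in for the national address database, and rebuilding the chain from the lowest level given, while trusting that level, is a simplifying assumption.

```python
from typing import Dict, List

# Toy parent table standing in for the national address database (hypothetical data).
PARENT: Dict[str, str] = {
    "Guicheng Town": "Nanhai District",
    "Nanhai District": "Foshan City",
    "Foshan City": "Guangdong Province",
}


def complete(parts: List[str]) -> List[str]:
    """Fill in missing ancestors so the chain runs unbroken from the top level down."""
    if not parts:
        return []
    chain = [parts[-1]]  # start from the lowest level that was given
    while chain[0] in PARENT:
        chain.insert(0, PARENT[chain[0]])
    return chain


upward = complete(["Nanhai District"])
middle = complete(["Foshan City", "Guicheng Town"])
```

  • The first call completes the levels above Nanhai District, and the second inserts Nanhai District between Foshan City and Guicheng Town.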
  • The address matching device further includes:
  • The index module is configured to index a specified number of unstructured address data items pre-stored in the index server to obtain the preset index structure.
  • The data pre-stored in the index server of this embodiment is unstructured data, stored in columnar form as key-value pairs.
  • Unstructured data refers to data such as text, images, and voice, held in column storage based on NoSQL storage technology.
  • Because the amount of data is very large, the distributed architecture of NoSQL technology is needed for storage and calculation.
  • The index server combines NoSQL distributed storage with an index structure to achieve real-time, fast query and calculation over massive data.
  • NoSQL refers to non-relational database technology, for which open-source implementations exist.
  • For example, Elasticsearch stores data as key-value pairs with inverted indexes, and computes mainly in memory to achieve fast real-time calculation.
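  • The idea of an inverted index can be shown with a small in-memory sketch. It is a generic illustration and does not use or reproduce the Elasticsearch API; the function names and the sample data are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(addresses: Dict[int, List[str]]) -> Dict[str, Set[int]]:
    """Map each word segment to the ids of the addresses that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, words in addresses.items():
        for word in words:
            index[word].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], words: List[str]) -> Set[int]:
    """Ids of the addresses that contain every query word segment."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()


# Hypothetical stored addresses, keyed by id.
addresses = {
    1: ["Guangdong", "Foshan City", "Nanhai", "Guicheng"],
    2: ["Guangdong", "Shenzhen City", "Futian"],
}
index = build_inverted_index(addresses)
hits = lookup(index, ["Guangdong", "Foshan City"])
```

  • A lookup touches only the posting sets of the query word segments instead of scanning every stored address, which is what makes queries over a large address library fast.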
  • The receiving module is configured to receive the interface plug-in uploaded to a designated directory of the index server, wherein the interface plug-in is formed by packaging the preset matching algorithm.
  • The index server in this embodiment is an open-source component and supports a plug-in mode.
  • The interface plug-in can inherit the index server's org.elasticsearch.plugins.Plugin class to implement a custom extension; the developed address matching algorithm plug-in is loaded and used after the index server is restarted.
  • The second obtaining module is configured to obtain the configuration parameters of the interface plug-in.
  • The establishment module is configured to establish a calculation association between the preset index structure and the interface plug-in by running the configuration parameters.
  • After the address matching algorithm is developed, it is packaged, uploaded to the designated directory of the index server, and configured with the relevant configuration parameters. Loading and running the configuration parameters establishes the calculation association between the preset index structure and the interface plug-in, so that the address matching algorithm in the plug-in is called to perform the matching calculation for the first address in the preset index structure and thereby query the address data.
  • The index server in this embodiment is the open-source Elasticsearch component, a distributed full-text search engine with a RESTful web interface, which can perform real-time, fast queries over massive data.
  • The query steps include: (1) Import the addresses of the massive address library into the underlying storage of Elasticsearch as key-value pairs through the Elasticsearch data import interface, and index the keys. (2) Adapt the address matching model to the Elasticsearch custom extended search model, add it to the extension module of the Elasticsearch master node, and restart Elasticsearch, so that the address matching model runs on the distributed storage and high-concurrency computing of Elasticsearch.
  • In this embodiment, the segments of the first address that correspond to different administrative levels use different matching methods and matching models, and each segment has its own matching weight.
  • The first address in this embodiment is divided into six segments, which correspond to six administrative levels and to six nodes in the tree structure.
  • The first four of the six administrative levels share the same matching model, in which characters are matched one by one;
  • the fifth administrative level uses a fuzzy matching model based on containing or being contained;
  • the sixth administrative level uses a numeric matching model.
  • A filtering mechanism is set in the matching calculation process. First, the target segments corresponding to the four administrative levels of "province/city, district/county/town, township, and road" are matched exactly, character by character.
  • When the matching result for the target segments of these four administrative levels is lower than a preset threshold, it is determined that the preset index structure contains no address data that satisfies the preset matching condition with the first address, and the matching conclusion is output directly, which reduces the amount of matching calculation and improves the response speed.
  • With this filtering mechanism, at least 90% of addresses can be filtered out, so an address only needs to be fully matched against the remaining 10% of the addresses, which greatly saves computing resources.
  • An embodiment of the present application also provides a computer device.
  • The computer device may be a server, and its internal structure may be as shown in FIG. 3.
  • The computer device includes a processor, a memory, a network interface, and a database connected through a system bus.
  • The processor of the computer device is used to provide calculation and control capabilities.
  • The memory of the computer device includes a non-volatile storage medium and an internal memory.
  • The non-volatile storage medium stores an operating system, a computer program, and a database.
  • The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • The database of the computer device is used to store all the data needed for the address matching process.
  • The network interface of the computer device is used to communicate with an external terminal through a network connection.
  • When executed by the processor, the computer program implements the address matching method.
  • When the processor executes the address matching method, the first address is the address to be retrieved input by the user, and the second address is stored in the index server. The method includes: calling a preset matching algorithm and segmenting the first address and the second address according to a first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address; dividing the first address into multiple first segments according to the first word segmentation group, and dividing the second address into multiple second segments according to the second word segmentation group; obtaining the matching results of all the first segments and all the second segments according to a second preset rule; and judging whether the first address and the second address are the same according to the matching results.
  • The data pre-stored in the index server is unstructured data, stored in columnar form as key-value pairs.
  • Unstructured data refers to data such as text, images, and voice, held in column storage based on NoSQL storage technology.
  • Because the amount of data is very large, the distributed architecture of NoSQL technology is needed for storage and calculation.
  • The index server combines NoSQL distributed storage with an index structure to achieve real-time, fast query and calculation over massive data.
  • For multiple addresses, the address names are segmented by the natural language processing model to form word segmentation groups, the word segmentation groups are divided into segments according to administrative levels, and the segments are mapped to nodes in a tree structure. This takes full account of the division of addresses by administrative level: the segment of each administrative level is given a different weight, and the weights can be fine-tuned in actual business scenarios.
  • The first address and the second address each include a range address and a mark address, and the step in which the processor calls the preset matching algorithm and segments the first address and the second address according to the first preset rule, to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, includes: segmenting the range addresses of the first address and the second address, respectively, according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address; segmenting the mark addresses of the first address and the second address, respectively, according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address; and combining the first word segmentation part and the second word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and combining the first word segmentation part and the second word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
  • The first address and the second address each further include a detail address, and after the step in which the processor segments the mark addresses of the first address and the second address according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address, the method includes: segmenting the detail addresses of the first address and the second address, respectively, according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address; and combining the first, second, and third word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address, and combining the first, second, and third word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
  • the range address includes four administrative levels of province, city/district, county, and township/town
  • the mark address includes the name of a cell or a building
  • the processor obtains all the addresses according to the second preset rule.
  • the step of matching results between the first segment and all the second segments includes: mapping all the first segments and all the second segments into two in the order of administrative level from high to low.
  • Structure trees with the same structure wherein the structure tree includes a plurality of nodes, and each node corresponds to each of the first segment or each of the second segment respectively; each node of the two structure trees is obtained Respectively corresponding matching values; respectively obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address; the matching rate is calculated according to the matching value multiplied by the corresponding weight, respectively Obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address; the first matching rate, the second matching rate, and the The sum of the third matching rate is used as a matching result of all the first segments and all the second segments.
  • The step in which the processor obtains the matching value corresponding to each node of the two structure trees includes: exactly matching each first segment of the range address in the first address against the corresponding second segment of the range address in the second address, one to one according to the node correspondence, to obtain the first matching values; performing model keyword matching between each first segment of the mark address in the first address and the corresponding second segment of the mark address in the second address, one to one according to the node correspondence, to obtain the second matching values; performing numeric matching between each first segment of the detailed address in the first address and the corresponding second segment of the detailed address in the second address, one to one according to the node correspondence, to obtain the third matching values; and collecting the first, second, and third matching values as the matching values corresponding to each node of the two structure trees.
  • Before the step in which the processor obtains the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detailed address, the method includes: inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training; adjusting the training parameters to a first parameter so that the similarity values output by the natural language processing model are consistent with the pre-labeled similarity values; and assigning the weight values in the first parameter to the first weight, the second weight, and the third weight according to the node correspondence.
  • FIG. 3 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • the computer-readable storage medium may be non-volatile or volatile.
  • When executed by a processor, the computer program implements the address matching method.
  • the first address is the address to be retrieved input by the user, and the second address is stored in the index server.
  • The method includes: invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address; dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group; obtaining matching results of all the first segments and all the second segments according to a second preset rule; and determining, according to the matching results, whether the first address and the second address are the same.
  • The data pre-stored in the index server is unstructured data, stored as key-value pairs in column storage form.
  • Unstructured data refers to text, images, voice, and the like held in column storage based on NoSQL storage technology; because the amount of data is very large, the distributed architecture of NoSQL technology is needed for storage and calculation.
  • the index server combines the NoSQL distributed architecture storage and index structure to achieve real-time fast query and calculation of massive data.
  • a configurable weight address matching model based on multi-level address division.
  • The address name is segmented by a natural language processing model into sub-phrases, the sub-phrases are grouped into segments by administrative level, and the segments are mapped to nodes of a tree structure. This takes full account of the tree structure of addresses. Each administrative level is matched with its own weight, and the weights can be fine-tuned for actual business scenarios.
  • the first address and the second address include a range address and a flag address, respectively
  • The step in which the processor invokes the preset matching algorithm and segments the first address and the second address according to the first preset rule, to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, includes: segmenting the range addresses of the first address and the second address according to the address dictionary pre-associated with the natural language processing model, to obtain the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address; segmenting the mark addresses of the first address and the second address according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address; and combining the first and second word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address, and the first and second word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
  • the first address and the second address further include detailed addresses
  • After the step of segmenting the mark addresses of the first address and the second address according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address, the processor further performs: segmenting the detailed addresses of the first address and the second address according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address; combining the first, second, and third word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address; and combining the first, second, and third word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
  • the range address includes the four administrative levels of province, city, district/county, and township/town
  • the mark address includes the name of a residential community or a building
  • The step in which the processor obtains the matching results of all the first segments and all the second segments according to the second preset rule includes: mapping all the first segments and all the second segments, in order of administrative level from high to low, into two structure trees with the same structure, wherein each structure tree includes a plurality of nodes and each node corresponds to one of the first segments or one of the second segments; obtaining the matching value corresponding to each node of the two structure trees; obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detailed address; calculating the matching rates by multiplying each matching value by its corresponding weight, to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detailed address; and taking the sum of the first matching rate, the second matching rate, and the third matching rate as the matching result of all the first segments and all the second segments.
  • The step in which the processor obtains the matching value corresponding to each node of the two structure trees includes: exactly matching each first segment of the range address in the first address against the corresponding second segment of the range address in the second address, one to one according to the node correspondence, to obtain the first matching values; performing model keyword matching between each first segment of the mark address in the first address and the corresponding second segment of the mark address in the second address, one to one according to the node correspondence, to obtain the second matching values; performing numeric matching between each first segment of the detailed address in the first address and the corresponding second segment of the detailed address in the second address, one to one according to the node correspondence, to obtain the third matching values; and collecting the first, second, and third matching values as the matching values corresponding to each node of the two structure trees.
  • Before the step in which the processor obtains the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detailed address, the method includes: inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training; adjusting the training parameters to a first parameter so that the similarity values output by the natural language processing model are consistent with the pre-labeled similarity values; and assigning the weight values in the first parameter to the first weight, the second weight, and the third weight according to the node correspondence.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Automation & Control Theory (AREA)
  • Remote Sensing (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of big data, and discloses an address matching method and apparatus, a computer device, and a storage medium, wherein in the address matching method a first address is an address to be retrieved inputted by a user, and a second address is stored in an index server, the method comprising: invoking a preset matching algorithm, and respectively performing word segmentation on a first address and a second address on the basis of a first preset rule to obtain a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, the preset matching algorithm comprising word segmentation calculation and matching calculation; on the basis of the first word segmentation group, dividing the first address into a plurality of first segments and, on the basis of the second word segmentation group, dividing the second address into a plurality of second segments; on the basis of a second preset rule, acquiring a matching result of the first segments and the second segments, and determining whether the first address and the second address are the same. For the first four administrative level addresses of the segmented address, precise matching is implemented on the basis of an address database (tree type) of nationwide provinces, municipalities, counties and towns, and partial omissions are effectively completed.

Description

Address matching method, apparatus, computer device, and storage medium
This application claims priority to Chinese patent application No. 201910601364.8, filed with the Chinese Patent Office on July 3, 2019 and entitled "Address matching method, apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of big data, and in particular to an address matching method, apparatus, computer device, and storage medium.
Background
Traditional fuzzy address matching usually treats an address as a single whole and performs fuzzy matching on it based on NLP. The inventors realized that this approach has the following defects: 1) An address is structured as a tree of address names, and two addresses are truly close only when they are similar near the bottom of the tree; matching the address as a whole compares the names as a flat, parallel structure, which does not reflect how address names are actually distributed. 2) Comparison works poorly for short addresses, even though most short addresses are valuable. 3) The address names within one address are treated as words of equal value, whereas in practice their value differs; for example, in Shenzhen City/Nanshan District/Tencent Building, the name "Tencent Building" is clearly the more valuable part as an effective address.
Technical Problem
The main purpose of this application is to provide an address matching method that addresses the defects of existing address matching.
Technical Solution
This application proposes an address matching method in which a first address is a to-be-retrieved address input by a user and a second address is stored in an index server. The method includes:
invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group;
obtaining matching results of all the first segments and all the second segments according to a second preset rule; and
determining, according to the matching results, whether the first address and the second address are the same.
This application also provides an address matching apparatus in which a first address is a to-be-retrieved address input by a user and a second address is stored in an index server. The apparatus includes:
a word segmentation module, configured to invoke a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
a dividing module, configured to divide the first address into a plurality of first segments according to the first word segmentation group, and to divide the second address into a plurality of second segments according to the second word segmentation group;
a second obtaining module, configured to obtain matching results of all the first segments and all the second segments according to a second preset rule; and
a judgment module, configured to determine, according to the matching results, whether the first address and the second address are the same.
This application also provides a computer device including a memory and a processor. The memory stores a computer program, and the processor implements an address matching method when executing the computer program. A first address is a to-be-retrieved address input by a user and a second address is stored in an index server. The method includes:
invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group;
obtaining matching results of all the first segments and all the second segments according to a second preset rule; and
determining, according to the matching results, whether the first address and the second address are the same.
This application also provides a computer-readable storage medium storing a computer program that implements an address matching method when executed by a processor. A first address is a to-be-retrieved address input by a user and a second address is stored in an index server. The method includes:
invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group;
obtaining matching results of all the first segments and all the second segments according to a second preset rule; and
determining, according to the matching results, whether the first address and the second address are the same.
Beneficial Effects
For the first four administrative levels of a segmented address, this application performs exact matching against a tree-structured national address database of provinces, cities, districts/counties, and towns, and effectively completes partially missing levels. In addition, an index structure is built over the massive data pre-stored in the index server; combined with the computing architecture and distributed computing capability of the Elasticsearch component, this enables fast, real-time querying of the first address in the preset index structure.
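The completion of missing administrative levels mentioned above can be illustrated with a small tree lookup. This is only a sketch under stated assumptions: the nested-dict layout, the names `TREE`, `iter_paths`, and `complete_levels`, and the "most names found in order" selection rule are all hypothetical, and the application does not say how its address database resolves omissions.

```python
# Toy address tree: province -> city -> district/county -> town.
# Only the first branch comes from the example address in this document;
# the "示例" ("example") branch is a made-up placeholder.
TREE = {
    "广东省": {
        "佛山市": {
            "南海区": {"桂城镇": {}},
            "示例区": {"示例镇": {}},
        }
    }
}


def iter_paths(tree, prefix=()):
    """Yield every root-to-leaf path of the nested-dict tree as a tuple."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from iter_paths(children, path)
        else:
            yield path


def complete_levels(address, tree=TREE):
    """Return the full path whose names appear most often, in order, in address.

    Assumed rule: ties are resolved in favour of the first path visited.
    Returns None when no level name occurs in the address at all.
    """
    best, best_hits = None, 0
    for path in iter_paths(tree):
        pos, hits = 0, 0
        for name in path:
            idx = address.find(name, pos)
            if idx >= 0:
                hits += 1
                pos = idx + len(name)
        if hits > best_hits:
            best, best_hits = path, hits
    return list(best) if best else None
```

Here an input that omits the province and the district, such as "佛山市桂城镇江南名居小区", is resolved to the full four-level path because only one branch of the toy tree contains both the city and the town.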
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an address matching method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an address matching apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal structure of a computer device according to an embodiment of the present application.
Best Mode for Carrying Out the Invention
Referring to FIG. 1, in an address matching method according to an embodiment of the present application, the first address is a to-be-retrieved address input by a user, the second address is stored in an index server, and the method includes:
S1: Invoke a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation.
In this embodiment, comparing the similarity of the first address and the second address is taken as an example. Both addresses are written in order of administrative level from high to low, from broad range to specific location. The first preset rule applies different segmentation rules depending on the administrative level of each part of the address. For the four nationally standardized administrative levels (province, city, district/county, township/town), segmentation usually relies on a national general address database. For example, "Guangdong Province Foshan City Nanhai District Guicheng Town" is segmented as: Guangdong Province/Foshan City/Nanhai District/Guicheng Town. Address information outside these four administrative levels is segmented by semantic word segmentation.
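As a rough illustration of dictionary-driven segmentation of the four administrative levels, the sketch below greedily strips known names from the front of an address and returns the unmatched tail for later semantic segmentation. The dictionary contents, the level labels, and the function name `segment_range` are assumptions made for this example; a real system would load the national address database rather than a four-entry dict.

```python
# Hypothetical miniature address dictionary: name -> administrative level.
ADDRESS_DICT = {
    "广东省": "province",        # Guangdong Province
    "佛山市": "city",            # Foshan City
    "南海区": "district_county",  # Nanhai District
    "桂城镇": "town",            # Guicheng Town
}


def segment_range(address, dictionary=ADDRESS_DICT):
    """Split leading administrative names off the address.

    Returns (tokens, remainder): tokens are the dictionary names found at the
    front of the address in writing order, remainder is the unmatched tail.
    """
    # Try longer names first so a short name never shadows a longer one.
    names = sorted(dictionary, key=len, reverse=True)
    tokens, pos = [], 0
    while pos < len(address):
        for name in names:
            if address.startswith(name, pos):
                tokens.append(name)
                pos += len(name)
                break
        else:
            break  # nothing in the dictionary matches here: stop
    return tokens, address[pos:]
```

Calling `segment_range("广东省佛山市南海区桂城镇")` reproduces the split given in the example above, with an empty remainder.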
S2: Divide the first address into a plurality of first segments according to the first word segmentation group, and divide the second address into a plurality of second segments according to the second word segmentation group.
In this embodiment, the address is divided into segments and/or administrative levels according to its word segmentation group, and each segment or administrative level corresponds to one or more words. For ease of distinction, the first address corresponds to the first segments and the second address corresponds to the second segments; "first", "second", and the like are used only for distinction and not for limitation, and similar terms elsewhere serve the same purpose and are not explained again. A word segmentation group is the sequence of words of the actual address, arranged in the writing order of the original address. For example, a longer name such as "development zone of a certain city" yields two words, "a certain city/development zone", whereas segmentation groups the words by administrative level, so "development zone of a certain city" forms a single segment.
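The difference between words and segments can be sketched as a grouping step: consecutive words that carry the same administrative level are merged into one segment. The level tags and the function name `group_into_segments` are hypothetical; the application does not say how each word is assigned its level, so the tags are simply supplied as input here.

```python
from itertools import groupby


def group_into_segments(tagged_words):
    """Merge consecutive (word, level) pairs sharing a level into one segment.

    Returns a list of (level, segment_text) in the original writing order.
    """
    return [
        (level, "".join(word for word, _ in group))
        for level, group in groupby(tagged_words, key=lambda pair: pair[1])
    ]


# "某市开发区" ("development zone of a certain city") is two words, one segment.
# The level labels below are assumed for illustration only.
words = [("广东省", "level_1"), ("某市", "level_2"), ("开发区", "level_2")]
segments = group_into_segments(words)
```

With this input, `segments` holds two entries: one for the province and one for the merged "某市开发区".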
S3: Obtain matching results of all the first segments and all the second segments according to a second preset rule.
This embodiment matches the first segments against the second segments one by one according to their corresponding administrative levels to obtain the matching result. For example, the first segment at the province level of the first address is compared with the second segment at the province level of the second address, which improves the symmetry and reliability of the comparison.
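A minimal sketch of level-by-level comparison follows, assuming each address has already been reduced to a mapping from administrative level to segment text. The level names, the 1.0/0.0 matching values, and the use of `None` for a level missing from either address are assumptions for illustration, not details given in the application.

```python
LEVELS = ("province", "city", "district_county", "town")


def match_by_level(first, second, levels=LEVELS):
    """Compare two {level: segment} mappings one level at a time.

    Returns {level: 1.0 | 0.0 | None}; None marks a level that is absent
    from at least one address and therefore cannot be compared.
    """
    result = {}
    for level in levels:
        a, b = first.get(level), second.get(level)
        if a is None or b is None:
            result[level] = None
        else:
            result[level] = 1.0 if a == b else 0.0
    return result


first = {"province": "广东省", "city": "佛山市",
         "district_county": "南海区", "town": "桂城镇"}
second = {"province": "广东省", "city": "佛山市", "town": "示例镇"}
values = match_by_level(first, second)
```

Comparing like with like in this way keeps a province name from ever being scored against a town name, which is the symmetry the paragraph above refers to.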
S4: Determine, according to the matching results, whether the first address and the second address are the same.
In this embodiment, the first address and the second address are compared level by level according to the correspondence of administrative levels. When the matching rate of the two addresses reaches a preset range, they are determined to be the same; otherwise they are determined to be different. In other embodiments, the matching rate must reach the preset range and the segment at a designated administrative level must also match 100% before the two addresses are determined to be the same; otherwise they are different. This improves matching accuracy.
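The decision rule can be sketched as a weighted sum compared against a threshold, with an optional list of levels that must match fully. The weights, the threshold, the part names (`range`, `mark`, `detail`), and the function name `is_same_address` are illustrative assumptions; the application obtains its weights by training and does not publish concrete values.

```python
def is_same_address(match_values, weights, threshold, required=()):
    """Decide whether two addresses are the same.

    match_values: {part: matching value in [0, 1]}
    weights:      {part: weight}, assumed to sum to 1
    threshold:    minimum weighted matching rate (the "preset range")
    required:     parts that must match 100% regardless of the total
    """
    # Stricter embodiment: a designated part must match completely.
    for part in required:
        if match_values.get(part) != 1.0:
            return False
    rate = sum(weights[part] * value for part, value in match_values.items())
    return rate >= threshold


# Hypothetical numbers: range and mark parts match, detail part does not.
values = {"range": 1.0, "mark": 1.0, "detail": 0.0}
weights = {"range": 0.3, "mark": 0.5, "detail": 0.2}
same = is_same_address(values, weights, threshold=0.75)
strict = is_same_address(values, weights, threshold=0.75, required=("detail",))
```

In this example the weighted rate clears the threshold, so `same` is true, while the stricter variant that also demands a full match on the detail part rejects the pair.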
The first address in this embodiment is the to-be-queried address input by the user. The data structure of the first address is not restricted; matching can be computed for any such address, which gives the user more flexibility and freedom. For example, the first address may consist of data arranged in order across six administrative levels (province, city/district/county/town, township/road, residential community, building/block, and house number), or it may be missing one or more of these levels. The preset matching condition in this embodiment includes the matching rate reaching a preset threshold, or the mark data in the first address matching 100%, and so on. Mark data is the data in the first address that identifies the geographic location in detail, such as the name of a residential community or of a building; for example, "Jiangnan Mingju Community Rongyuan" in the first address is mark data. In another embodiment of this application, the mark data of the first address is the data after the "town/township" administrative level and before the "building and house number".
Further, the first address and the second address each include a range address and a mark address, and step S1 of invoking the preset matching algorithm to segment the first address and the second address according to the first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, includes:
S11: Segment the range addresses of the first address and the second address according to the address dictionary pre-associated with the natural language processing model, obtaining the first word segmentation part corresponding to the first address and the first word segmentation part corresponding to the second address.
The range address of this embodiment includes at least one of four administrative levels: province, city, district/county, and township/town. The range address is segmented using a pre-associated address dictionary, which is the lexicon of a national address database and is associated in advance with the natural language processing model in order to segment address names. The preset matching algorithm of this embodiment includes word segmentation calculation and matching calculation. To improve address matching accuracy, a crawled address library is added when the open-source word segmentation package jieba performs the segmentation; it is used together with the national address library to correct the address to be segmented, after which segmentation is performed by administrative level, improving segmentation accuracy. It is determined whether the administrative levels contained in the current address are levels for which the address dictionary is called; if so, the address dictionary is called to perform the segmentation. For example, the address 广东省佛山市南海区桂城镇江南名居小区荣苑1座306 (306, Block 1, Rongyuan, Jiangnan Mingju Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province) contains the four administrative levels covered by the address dictionary, so those four levels are segmented according to the dictionary. The segmentation result is: Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community Rongyuan Block 1 306. The first word segmentation part is therefore Guangdong Province/Foshan City/Nanhai District/Guicheng Town.
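A minimal sketch of dictionary-based segmentation of the range address, assuming a tiny in-memory dictionary (`ADDRESS_DICT`, a hypothetical stand-in for the national address lexicon). It uses a greedy longest-prefix match instead of jieba, so it illustrates the idea only and is not the segmentation the embodiment actually runs.

```python
# Hypothetical miniature address dictionary: term -> administrative level.
ADDRESS_DICT = {
    "广东省": "province",
    "佛山市": "city",
    "南海区": "district",
    "桂城镇": "town",
}


def segment_range_address(address):
    """Split off leading dictionary terms; return (range_parts, remainder)."""
    parts = []
    rest = address
    while rest:
        # Pick the longest dictionary term that prefixes the remaining text.
        match = max(
            (term for term in ADDRESS_DICT if rest.startswith(term)),
            key=len,
            default=None,
        )
        if match is None:
            break  # no administrative term left; the rest is mark/detail text
        parts.append(match)
        rest = rest[len(match):]
    return parts, rest


parts, rest = segment_range_address("广东省佛山市南海区桂城镇江南名居小区荣苑1座306")
print("/".join(parts))  # first word segmentation part
print(rest)             # text left for the grammar models
```

The remainder returned here is what the later grammar-model steps would receive.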
S12: Segment the mark addresses of the first address and the second address, respectively, according to the first grammar model in the natural language processing model, to obtain the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address.
The mark address of this embodiment includes information that describes the geographic location in detail, such as the name of a residential community or of a building, for example "Jiangnan Mingju Community Rongyuan" in the address above. This embodiment segments the mark address according to the first grammar model in the natural language processing model; the first grammar model includes, but is not limited to, patterns such as "XX Community" and "XX Building". For example, for "Guicheng Town, Jiangnan Mingju Community, Rongyuan, Block 1, 306", the corresponding second word segmentation part is "Guicheng/Jiangnan Mingju Community/Rongyuan". In another embodiment of the present application, the first grammar model extracts the characters after "town/township" and before "block and door number" as the mark address.
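The alternative rule described above (take the characters after "town/township" and before "block and door number") can be sketched with regular expressions. Both patterns are hypothetical and cover only the example address; unlike the text's segmentation, this sketch does not split a name that lacks the 小区 ("Community") suffix.

```python
import re

# Hypothetical pattern: text after the first 镇/乡 (town/township) and before
# the first digit, which is assumed to start the block and door number.
MARK_PATTERN = re.compile(r"[镇乡](?P<mark>[^\d]+)")
# Hypothetical "XX Community" rule: split after the suffix 小区 when present.
COMMUNITY_SUFFIX = re.compile(r"(.+?小区)(.*)")


def extract_mark_address(address):
    """Return the mark-address segments, or an empty list if none is found."""
    found = MARK_PATTERN.search(address)
    if not found:
        return []
    mark = found.group("mark")
    split = COMMUNITY_SUFFIX.fullmatch(mark)
    if split:
        return [part for part in split.groups() if part]
    return [mark]


print(extract_mark_address("桂城镇江南名居小区荣苑1座306"))
```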
S13: Combine the first word segmentation part and the second word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and combine the first word segmentation part and the second word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
In this embodiment, the first address and the second address each include a range address and a mark address, arranged in that order from left to right. For example, the first address is "Guangdong Province, Foshan City, Nanhai District, Guicheng Town, Jiangnan Mingju Community, Rongyuan" and the second address is "Guangdong Province, Foshan City, Nanhai District, Guicheng Town, Jiangnan Mingju, Rongyuan". The first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan".
Further, the first address and the second address each also include a detail address, and after step S13, following the segmentation of the mark addresses of the first address and the second address according to the grammar model in the natural language processing model to obtain the respective second word segmentation parts, the method includes:
S14: Segment the detail addresses of the first address and the second address, respectively, according to the second grammar model in the natural language processing model, to obtain the third word segmentation part corresponding to the first address and the third word segmentation part corresponding to the second address.
The detail address of this embodiment is the specific "block and door number". It has only a minor effect on the similarity of two addresses and may even be ignored in other embodiments. Some application scenarios, however, require precision down to the detail address in order to meet business needs. The second grammar model of this embodiment includes, but is not limited to, patterns such as "block X", "block X, floor Y" and "block X, floor Y, room Z".
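As an illustration of a "block / floor / room" grammar, the sketch below parses the tail of an address with one regular expression. The pattern and the set of unit characters (栋, 座, 幢, 层, 楼, 室) are assumptions chosen to fit the example, not the second grammar model itself.

```python
import re

# Hypothetical pattern for "<number><block unit>[<floor>层][<room>[室]]"
# anchored at the end of the address.
DETAIL_PATTERN = re.compile(
    r"(?P<building>\d+[栋座幢])(?:(?P<floor>\d+)[层楼])?(?:(?P<room>\d+)室?)?$"
)


def segment_detail_address(address):
    """Return the detail-address segments found at the end of the address."""
    found = DETAIL_PATTERN.search(address)
    if not found:
        return []
    # Keep only the groups that actually matched, in block/floor/room order.
    return [value for value in found.group("building", "floor", "room") if value]


print(segment_detail_address("江南名居小区荣苑1座306"))
```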
S15: Combine the first, second and third word segmentation parts corresponding to the first address into the first word segmentation group corresponding to the first address, and combine the first, second and third word segmentation parts corresponding to the second address into the second word segmentation group corresponding to the second address.
In this embodiment, the first address and the second address each include a range address, a mark address and a detail address, arranged in that order from left to right. For example, the first address is "Guangdong Province, Foshan City, Nanhai District, Guicheng Town, Jiangnan Mingju Community, Rongyuan, Block 1, 306" and the second address is "Guangdong Province, Foshan City, Nanhai District, Guicheng Town, Jiangnan Mingju, Rongyuan, Block 1, 502". The first word segmentation group corresponding to the first address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju Community/Rongyuan/Block 1/306", and the second word segmentation group corresponding to the second address is "Guangdong Province/Foshan City/Nanhai District/Guicheng Town/Jiangnan Mingju/Rongyuan/Block 1/502". The first address and the second address can then be divided into segments or administrative levels according to these word segmentation groups.
Further, the range address includes the four administrative levels of province, city, district/county and township/town, the mark address includes a community name or a building name, and step S3 of obtaining the matching result of all the first segments and all the second segments according to the second preset rule includes:
S31: Map all the first segments and all the second segments, each in descending order of administrative level, into two structure trees of identical structure, where each structure tree includes a plurality of nodes and the nodes correspond one-to-one to the first segments or to the second segments.
This embodiment maps all first segments of the first address and all second segments of the second address, in descending order of administrative level, into two structure trees with the same structure. A node corresponds to at least one segment, or to several segmented words of the same administrative level. For example, the segmented word "Guangdong Province", which corresponds to the highest administrative level "province" contained in the first address, is taken as the root node; the segmented word "Foshan City", corresponding to the next-level child node "city", is connected to it, and so on down to the end node "Block 1, 502". Depending on the specific address information, the root node and the end node may correspond to different administrative levels: the address may be a full address covering all administrative levels or a short address covering only some of them.
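One way to picture the structure tree is as a chain of nodes ordered from the highest administrative level to the lowest. The sketch below assumes six level names (hypothetical identifiers) and keeps an empty node for any level an address lacks, so that two addresses always yield trees of the same shape.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical level identifiers, highest administrative level first.
LEVELS = ["province", "city", "district", "town", "community", "detail"]


@dataclass
class Node:
    level: str
    segments: List[str] = field(default_factory=list)  # may hold several words
    child: Optional["Node"] = None


def build_tree(segments_by_level):
    """Chain one node per level; levels missing from the address stay empty."""
    root = None
    previous = None
    for level in LEVELS:
        node = Node(level, list(segments_by_level.get(level, [])))
        if previous is None:
            root = node
        else:
            previous.child = node
        previous = node
    return root


def walk(root):
    """Yield the nodes from the root down to the end node."""
    node = root
    while node is not None:
        yield node
        node = node.child


first_tree = build_tree({
    "province": ["广东省"], "city": ["佛山市"], "district": ["南海区"],
    "town": ["桂城镇"], "community": ["江南名居小区", "荣苑"],
    "detail": ["1座", "306"],
})
for node in walk(first_tree):
    print(node.level, node.segments)
```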
S32: Obtain the matching value corresponding to each node of the two structure trees.
The matching calculation of this embodiment maps the node-to-node correspondence between the two structure trees according to the correspondence of administrative levels, and computes the matching value of each node from that correspondence. The matching value is the number of matching segments divided by the number of all segments at the node. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", the nodes match; otherwise they do not.
S33: Obtain the first weight corresponding to the range address, the second weight corresponding to the mark address, and the third weight corresponding to the detail address.
Because the segments of different administrative levels affect the address to different degrees, this embodiment assigns them different weights, which makes it more flexible in meeting business requirements. For example, the second weight corresponding to the mark address may be higher than the first weight corresponding to the range address.
S34: Calculate the matching rates by multiplying each matching value by its corresponding weight, to obtain the first matching rate corresponding to the range address, the second matching rate corresponding to the mark address, and the third matching rate corresponding to the detail address.
The matching rate in this embodiment is calculated as follows: the matching result of each segment multiplied by the weight configured for that segment gives the matching rate of that segment, and the matching rates of all segments are summed to give the matching result of the first address and the second address.
S35: Take the sum of the first matching rate, the second matching rate and the third matching rate as the matching result of all the first segments and all the second segments.
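Steps S32 to S35 can be written compactly as follows. The symbols are introduced here for illustration and do not appear in the original: $m_k$ is the matching value of node $k$, $n_k$ the number of segments at that node, $n_k^{\text{match}}$ the number of those that match, $w_k$ the configured weight, and $R_1$, $R_2$, $R_3$ the matching rates of the range, mark and detail addresses.

```latex
m_k = \frac{n_k^{\text{match}}}{n_k}, \qquad
R_1 = \sum_{k \in \text{range}} w_k\, m_k, \quad
R_2 = \sum_{k \in \text{mark}} w_k\, m_k, \quad
R_3 = \sum_{k \in \text{detail}} w_k\, m_k, \qquad
R = R_1 + R_2 + R_3
```

If the weights sum to 1, $R$ lies between 0 and 1, and $R = 1$ means that every node matches completely.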
Further, step S32 of obtaining the matching value corresponding to each node of the two structure trees includes:
S321: Perform exact full matching, one-to-one according to the node correspondence, between the first segments of the range address in the first address and the second segments of the range address in the second address, to obtain the first matching values.
In this embodiment, nodes of different administrative levels are matched by different methods. The four administrative levels of province, city, district/county and township/town are matched by exact full matching: the nodes match only if the corresponding characters are 100% identical. For example, if the "province" node of the first address has the value "Guangdong" and the "province" node of the second address also has the value "Guangdong", they match.
S322: Perform model keyword matching, one-to-one according to the node correspondence, between the first segments of the mark address in the first address and the second segments of the mark address in the second address, to obtain the second matching values.
In this embodiment, the segments of the mark address are matched by NLP (Natural Language Processing) model matching, under which a match holds as soon as one segment includes or contains the other. For example, "Jiangnan Mingju Community/Rongyuan" and "Jiangnan Mingju/Rongyuan" are not identical character for character, but "Jiangnan Mingju Community" contains the characters "Jiangnan Mingju", so the segments still match one-to-one.
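The containment rule for mark segments reduces to a substring check in either direction. The function name below is hypothetical, and treating an empty segment as a non-match is an assumption.

```python
def contains_match(first_segment, second_segment):
    """Treat two mark segments as matching when either contains the other."""
    if not first_segment or not second_segment:
        return False  # assumption: an empty segment never matches
    return first_segment in second_segment or second_segment in first_segment


print(contains_match("江南名居小区", "江南名居"))
```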
S323: Perform numeric matching, one-to-one according to the node correspondence, between the first segments of the detail address in the first address and the second segments of the detail address in the second address, to obtain the third matching values.
If the detail address of this embodiment includes a first specified number of segments, of which a second specified number satisfy the matching relationship, the matching value of the detail address is the second specified number divided by the first specified number.
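A short sketch of the detail-address matching value follows. Comparing segments position by position and using the longer segment list as the denominator are assumptions; the text only states that the value is the matched count divided by the total count.

```python
def detail_match_value(first_segments, second_segments):
    """Matched positions divided by the number of detail segments compared."""
    total = max(len(first_segments), len(second_segments))
    if total == 0:
        return 0.0
    matched = sum(1 for a, b in zip(first_segments, second_segments) if a == b)
    return matched / total


print(detail_match_value(["1", "306"], ["1", "502"]))  # 0.5
```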
S324: Collect the first matching values, the second matching values and the third matching values to obtain the matching value corresponding to each node of the two structure trees.
For example, the word segmentation group of the first address is Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju Community/Rongyuan/1/306, and that of the second address is Guangdong/Foshan City/Nanhai/Guicheng/Jiangnan Mingju/Rongyuan/1/502. After segmentation, the first address and the second address are each divided into six administrative levels (province; city; district/county; town/township; road, community or building; block and door number), corresponding to six nodes whose default weights are 0.1/0.1/0.1/0.1/0.5/0.1. The first four levels use 100% character matching: for Guangdong/Foshan City/Nanhai/Guicheng the matching results are 0.1*1, 0.1*1, 0.1*1 and 0.1*1. The fifth level uses model matching based on character containment: Jiangnan Mingju Community/Rongyuan against Jiangnan Mingju/Rongyuan gives 0.5*1. The sixth level uses fuzzy matching: between 1/306 and 1/502 only one of the two fields matches (306 and 502 do not), so the matching value is 0.5 and the matching result is 0.5*0.1 = 0.05. The matching rate of the first address and the second address is therefore 0.1+0.1+0.1+0.1+0.5+0.05=0.95.
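The arithmetic of this worked example can be reproduced in a few lines. The weights and matching values are those given in the text; the function name is hypothetical.

```python
# Default per-node weights from the example, highest level first.
WEIGHTS = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
# Matching value of each node for the two example addresses:
# four exact matches, one containment match, one half-matched detail node.
MATCH_VALUES = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]


def match_rate(weights, match_values):
    """Sum of weight * matching value over corresponding nodes."""
    if len(weights) != len(match_values):
        raise ValueError("weights and matching values must align node by node")
    return sum(w * m for w, m in zip(weights, match_values))


rate = match_rate(WEIGHTS, MATCH_VALUES)
print(round(rate, 2))  # 0.95
```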
Further, before step S33 of obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address and the third weight corresponding to the detail address, the method includes:
S331: Input a specified number of training samples pre-labeled with similarity values into the natural language processing model for training.
S332: Adjust the training parameters to a first parameter such that the similarity values output by the natural language processing model are consistent with the pre-labeled similarity values.
S333: Assign the weight values contained in the first parameter, according to the node correspondence, as the first weight, the second weight and the third weight.
The default weights of this embodiment are obtained by training the model. The training parameters, which include the weight values, are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value or falls within a preset deviation range, which determines each weight value. Other embodiments of the present application may also adjust one or more of the default weights for a specific application scenario, so that the matching model better fits that scenario.
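The text does not specify the training procedure, so the following is only a sketch under a strong assumption: the similarity is modelled as a weighted sum of per-node matching values, and the weights are fitted by plain gradient descent on the mean squared error against the labeled similarities. The function name, step count and learning rate are all hypothetical.

```python
def fit_weights(samples, steps=5000, learning_rate=0.05):
    """Fit per-node weights by gradient descent on the mean squared error.

    samples: list of (match_values, labeled_similarity) pairs.
    """
    size = len(samples[0][0])
    weights = [1.0 / size] * size  # start from uniform weights
    for _ in range(steps):
        gradient = [0.0] * size
        for match_values, label in samples:
            error = sum(w * m for w, m in zip(weights, match_values)) - label
            for k, m in enumerate(match_values):
                gradient[k] += 2.0 * error * m / len(samples)
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


# Toy samples for three nodes (range, mark, detail) with made-up labels.
samples = [
    ([1.0, 0.0, 0.0], 0.2),
    ([0.0, 1.0, 0.0], 0.5),
    ([0.0, 0.0, 1.0], 0.3),
    ([1.0, 1.0, 1.0], 1.0),
]
fitted = fit_weights(samples)
print([round(w, 3) for w in fitted])
```

With these toy samples the fitted weights give the mark node the largest share, consistent with the remark that the mark address may carry a higher weight than the range address.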
Further, before step S11 of segmenting the range addresses of the first address and the second address according to the address dictionary pre-associated with the natural language processing model to obtain the respective first word segmentation parts, the method includes:
S10: Call the address database to perform address correction on the first address and the second address, respectively, according to a third preset rule.
The first address or the second address of this embodiment may not conform to the address data in the national address database, in which case the address database can be called to correct it, including completing the address and removing qualifiers. Address completion in this embodiment either fills in a parent node from a child node (for example, Nanhai District can be completed upward with Foshan City) or fills in an intermediate node from the nodes before and after it (for example, Foshan City and Guicheng Town can be completed with Nanhai District in between).
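Both completion cases can be sketched with a child-to-parent lookup. The `PARENT` table is a hypothetical stand-in for the address database, and the sketch assumes place names are unique, which real data would not guarantee. It also climbs all the way to the province, which goes slightly beyond the examples in the text.

```python
# Hypothetical parent lookup taken from an address database: child -> parent.
PARENT = {"南海区": "佛山市", "佛山市": "广东省", "桂城镇": "南海区"}


def complete_address(parts):
    """Insert missing ancestors in front of each part, highest level first."""
    completed = []
    for part in parts:
        chain = [part]
        # Climb towards the root, collecting every ancestor of this part.
        while chain[0] in PARENT:
            chain.insert(0, PARENT[chain[0]])
        for item in chain:
            if item not in completed:
                completed.append(item)
    return completed


print("/".join(complete_address(["南海区"])))
print("/".join(complete_address(["佛山市", "桂城镇"])))
```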
Further, before step S1 of calling the preset matching algorithm to segment the first address and the second address according to the first preset rule to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, the method includes:
S1a: Index a specified number of unstructured address data items pre-stored in the index server to obtain the preset index structure.
The data pre-stored in the index server of this embodiment is unstructured data, stored in a key-value column-store form. Unstructured data here refers to text, images, speech and the like held in column storage built on NoSQL storage technology. Because the data volume is very large, NoSQL technology with a distributed architecture is needed for storage and computation; the index server combines NoSQL distributed storage with an index structure to achieve real-time, fast querying and computation over massive data. NoSQL refers to non-relational databases and is an open-source technology. Elasticsearch stores data as key-value pairs with an inverted index and performs most of its computation in memory, which enables fast real-time computation.
S1b: Receive an interface plug-in uploaded to a specified directory of the index server, where the interface plug-in is formed by packaging the preset matching algorithm.
The index server of this embodiment is an open-source component that supports a plug-in mode. The interface plug-in can extend its org.elasticsearch.plugins.Plugin class to implement the custom address matching algorithm plug-in, which is loaded and ready for use once the index server is restarted.
S1c: Obtain the configuration parameters of the interface plug-in.
S1d: Establish a computational association between the preset index structure and the interface plug-in by running the configuration parameters.
In this embodiment, once the preset matching algorithm has been developed, it is packaged, uploaded to the specified directory of the index server, and configured with the relevant parameters. Loading and running the configuration parameters establishes the computational association between the preset index structure and the interface plug-in, so that calling the address matching algorithm in the plug-in completes the matching calculation for the first address within the preset index structure and thereby performs the address data query.
The index server of this embodiment is the open-source Elasticsearch component, a distributed full-text search engine exposed through a RESTful web interface that can query massive data quickly and in real time. The query steps include: (1) import the addresses of the massive address library into the underlying storage of Elasticsearch as key-value pairs through the Elasticsearch data import interface, and index the keys; (2) adapt the address matching model to the Elasticsearch custom extended search model, add it to the extension module of the Elasticsearch master node, and restart Elasticsearch, so that it becomes an address matching model that runs on Elasticsearch's distributed storage and highly concurrent computation; (3) use this custom model to develop a one-to-many mass address matching interface on Elasticsearch; (4) develop an upper-level interface on Elasticsearch so that a new address can be entered and the mass address library and custom model to be matched can be selected, whereby the new address is compared quickly and in real time against the addresses in the mass address library and the top N most similar addresses are returned, where N is passed as a configurable parameter. By building an index structure over the massive data pre-stored in the index server, and combining it with the computing architecture and distributed computing capability of the Elasticsearch component, this embodiment achieves real-time, fast querying of the first address in the preset index structure.
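The "return the top N most similar addresses" step can be illustrated in memory, without Elasticsearch. The sketch below is not the plug-in interface: `top_n_matches` is a hypothetical helper, and `shared_prefix_score` is a toy similarity standing in for the address matching model.

```python
import heapq


def shared_prefix_score(a, b):
    """Toy similarity: common-prefix length divided by the longer length."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def top_n_matches(query, candidates, score, n=3):
    """Return the n candidates that score highest against the query."""
    return heapq.nlargest(n, candidates, key=lambda c: score(query, c))


library = [
    "北京市海淀区",
    "广东省佛山市禅城区",
    "广东省佛山市南海区桂城镇江南名居荣苑",
]
query = "广东省佛山市南海区桂城镇江南名居小区荣苑"
print(top_n_matches(query, library, shared_prefix_score, n=2))
```

Passing `n` as an argument mirrors the statement that N is supplied as a parameter.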
In this embodiment, the segments of the first address that correspond to different administrative levels are matched by different methods and different matching models, and each segment has its own matching weight. The first address is divided into six segments corresponding to six administrative levels and to six nodes in the tree structure. The first four of the six levels share the same matching model, namely one-to-one character matching; the fifth level uses a fuzzy matching model based on inclusion or containment; and the sixth level uses a numeric matching model. This embodiment sets a filtering mechanism within the matching calculation. The target segments of the first four administrative levels (province, city, district/county, town/township) are first matched exactly, character by character. When the matching result for these target segments is below a preset threshold, it is determined that the preset index structure contains no address data satisfying the preset matching condition with the first address, and the matching conclusion is output directly, which reduces the amount of matching computation and improves response speed. With this filtering mechanism, this embodiment can filter out at least 90% of the addresses, so that an address ultimately needs to be fully matched against only the remaining roughly 10% of addresses, which greatly saves computing resources.
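The filtering mechanism amounts to a cheap exact comparison of the first four levels before any full matching. In the sketch below, the function name and the default threshold of 1.0 (all four levels must agree) are assumptions; the text speaks only of a preset threshold.

```python
def passes_prefilter(first_levels, second_levels, threshold=1.0):
    """Exact-match the four range levels; reject the pair below the threshold."""
    compared = list(zip(first_levels[:4], second_levels[:4]))
    if not compared:
        return False
    matched = sum(1 for a, b in compared if a == b)
    return matched / len(compared) >= threshold


query_levels = ["广东省", "佛山市", "南海区", "桂城镇", "江南名居小区荣苑", "1座306"]
library_levels = [
    ["广东省", "佛山市", "南海区", "桂城镇", "江南名居荣苑", "1座502"],
    ["广东省", "佛山市", "禅城区", "祖庙街道", "某小区", "2座101"],
]
# Only candidates that survive the prefilter go on to full matching.
survivors = [c for c in library_levels if passes_prefilter(query_levels, c)]
print(len(survivors))
```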
Referring to FIG. 2, in an address matching apparatus according to an embodiment of the present application, the first address is the address to be retrieved entered by a user, the second address is stored in an index server, and the apparatus includes:
A word segmentation module 1, configured to call the preset matching algorithm and segment the first address and the second address, respectively, according to the first preset rule, to obtain the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, where the preset matching algorithm includes word segmentation calculation and matching calculation.
This embodiment takes the comparison of the similarity of a first address and a second address as an example. The first address and the second address are written from the highest administrative level to the lowest, that is, from the general area to the specific location. The first preset rule of this embodiment applies different segmentation rules depending on the administrative level within the address. The four nationally uniform administrative levels of province, city, district/county and township/town are usually segmented with the help of a national general address database. For example, "Guicheng Town, Nanhai District, Foshan City, Guangdong Province" is segmented as Guangdong Province/Foshan City/Nanhai District/Guicheng Town. Address information outside these four administrative levels is segmented by semantic word segmentation.
A dividing module 2, configured to divide the first address into a plurality of first segments according to the first word segmentation group, and to divide the second address into a plurality of second segments according to the second word segmentation group.
This embodiment divides an address into segments and/or administrative levels according to its word segmentation group, with each segment or administrative level corresponding to one or more segmented words. The first address corresponds to the first segments and the second address to the second segments; terms such as "first" and "second" in this embodiment serve only to distinguish and not to limit, and similar terms elsewhere have the same function and are not described again. A word segmentation group is the sequence of segmented words of the actual address, in the writing order of the original address. For example, a longer name such as "Development Zone of City X" corresponds to the two segmented words "City X/Development Zone", whereas segments are formed on top of the segmented words according to administrative level, so "Development Zone of City X" belongs to a single segment.
A first obtaining module 3, configured to obtain the matching result of all the first segments and all the second segments according to the second preset rule.
This embodiment matches the first segments against the second segments one by one according to the correspondence of administrative levels to obtain the matching result. For example, the first segment at the province level of the first address is compared with the second segment at the province level of the second address, which improves the symmetry and reliability of the comparison.
判断模块4,用于根据所述匹配结果判断所述第一地址和所述第二地址是否相同。The judging module 4 is configured to judge whether the first address and the second address are the same according to the matching result.
本实施例通过行政级别的对应关系,一一对应比较第一地址和第二地址,当第一地址和第二地址的匹配率达到预设范围,则判定第一地址和第二地址相同,否则不同。本申请其他实施例中,不仅要求匹配率达到预设范围,且要求指定行政级别对应的分段匹配度达到100%,方可判定第一地址和第二地址相同,否则不同,以便提高匹配准确度。This embodiment compares the first address and the second address in a one-to-one correspondence through the correspondence of administrative levels. When the matching rate of the first address and the second address reaches the preset range, it is determined that the first address and the second address are the same, otherwise different. In other embodiments of the present application, not only is the matching rate required to reach the preset range, but also the segment matching degree corresponding to the designated administrative level is required to reach 100%, before it can be determined that the first address and the second address are the same, otherwise different, in order to improve matching accuracy degree.
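The two judgment rules described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the 0.9 threshold, and the level names are assumptions, and the "preset range" is simplified to a lower bound.

```python
def addresses_match(match_rate, level_scores, threshold=0.9, required_levels=()):
    """Decide whether two addresses count as the same.

    match_rate      -- overall match rate of the two addresses (0.0 to 1.0)
    level_scores    -- match value per administrative level, e.g. {"landmark": 1.0}
    threshold       -- assumed lower bound of the preset range
    required_levels -- levels that must match 100% (the stricter embodiment)
    """
    if match_rate < threshold:
        return False
    # Stricter variant: every designated level must match completely.
    return all(level_scores.get(level, 0.0) == 1.0 for level in required_levels)


scores = {"landmark": 1.0, "detail": 0.5}
print(addresses_match(0.95, scores))                               # True
print(addresses_match(0.95, scores, required_levels=("detail",)))  # False
```

With an empty `required_levels` the function reduces to the first rule (match rate only).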
The first address in this embodiment is the address to be queried, entered by the user. No particular data structure is required of the first address; matching can be computed for any query address, which gives the user more flexibility. For example, the first address may consist of data arranged in order across six administrative levels (province; city; district or county; town or township; road, residential community, or building complex; and block and house number), or it may omit one or more of these levels. The preset matching condition in this embodiment includes the match rate reaching a preset threshold, or the landmark data in the first address matching 100%, and so on. Landmark data is the data in the first address that pins down the geographic location, such as the name of a residential community or of a building. For example, "江南名居小区荣苑" (Rongyuan, Jiangnan Mingju Residential Community) in the first address is landmark data. In another embodiment of the present application, the landmark data of the first address is the data located after the "town/township" administrative level and before the "block and house number".
Further, the word segmentation module 1 includes:
A first word segmentation unit, configured to segment the range addresses of the first address and the second address according to an address dictionary pre-associated with the natural language processing model, obtaining the first word segmentation part of the first address and the first word segmentation part of the second address.
The range address in this embodiment includes at least one of four administrative levels: province, city, district or county, and township or town. The range address is segmented using the pre-associated address dictionary, which is the lexicon derived from the national address database and is associated in advance with the natural language processing model that segments address names. To improve address matching precision, this embodiment adds a crawled address library to the open-source segmentation package jieba, uses it together with the national address library to correct the address to be segmented, and then segments the address by administrative level, which improves segmentation accuracy. The embodiment checks whether the administrative levels contained in the current address are levels covered by the address dictionary; if so, the address dictionary is called for segmentation. For example, the address "广东省佛山市南海区桂城镇江南名居小区荣苑1座306" (Room 306, Block 1, Rongyuan, Jiangnan Mingju Residential Community, Guicheng Town, Nanhai District, Foshan City, Guangdong Province) contains the four administrative levels covered by the address dictionary, so those four levels are segmented with the dictionary, giving "广东省/佛山市/南海区/桂城镇/江南名居小区荣苑1座306". The first word segmentation part is therefore "广东省/佛山市/南海区/桂城镇" (Guangdong Province / Foshan City / Nanhai District / Guicheng Town).
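The dictionary-driven split of the range address can be illustrated with a small stand-in. The sketch below uses plain forward maximum matching rather than jieba, and its four-entry dictionary is hypothetical; with jieba, a user dictionary could instead be supplied through `jieba.load_userdict`.

```python
# Hypothetical miniature address dictionary; a real one would be derived
# from a national address database.
ADDRESS_DICT = {"广东省", "佛山市", "南海区", "桂城镇"}


def segment_range_address(address, dictionary=ADDRESS_DICT):
    """Split the leading administrative part of an address by forward
    maximum matching and return (segments, remaining text)."""
    max_len = max(len(word) for word in dictionary)
    segments, pos = [], 0
    while pos < len(address):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(address) - pos), 0, -1):
            candidate = address[pos:pos + length]
            if candidate in dictionary:
                segments.append(candidate)
                pos += length
                break
        else:
            break  # no dictionary word starts here: the range address has ended
    return segments, address[pos:]


segments, remainder = segment_range_address("广东省佛山市南海区桂城镇江南名居小区荣苑1座306")
print("/".join(segments))  # 广东省/佛山市/南海区/桂城镇
print(remainder)           # 江南名居小区荣苑1座306
```

The remainder is what the later grammar models would handle (landmark address and detail address).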
A second word segmentation unit, configured to segment the landmark addresses of the first address and the second address according to a first grammar model in the natural language processing model, obtaining the second word segmentation part of the first address and the second word segmentation part of the second address.
The landmark address in this embodiment contains the information that pins down the geographic location, such as the name of a residential community or of a building, for example "江南名居小区荣苑" in the address above. The landmark address is segmented according to the first grammar model in the natural language processing model, which includes but is not limited to patterns such as "XX community" and "XX building". For example, for "桂城镇江南名居小区荣苑1座306", the corresponding second word segmentation part is "桂城/江南名居小区/荣苑". In another embodiment of the present application, the first grammar model extracts the characters after "town/township" and before "block and house number" as the landmark address.
A first composition unit, configured to combine the first and second word segmentation parts of the first address into the first segmentation group of the first address, and to combine the first and second word segmentation parts of the second address into the second segmentation group of the second address.
In this embodiment, the first address and the second address each include a range address and a landmark address, arranged from left to right. For example, the first address is "广东省佛山市南海区桂城镇江南名居小区荣苑" and the second address is "广东省佛山市南海区桂城镇江南名居荣苑" (the second address omits "小区", "residential community"). The first segmentation group is "广东省/佛山市/南海区/桂城镇/江南名居小区/荣苑", and the second segmentation group is "广东省/佛山市/南海区/桂城镇/江南名居/荣苑".
Further, the first address and the second address each also include a detail address, and the word segmentation module 1 includes:
A third word segmentation unit, configured to segment the detail addresses of the first address and the second address according to a second grammar model in the natural language processing model, obtaining the third word segmentation part of the first address and the third word segmentation part of the second address.
The detail address in this embodiment is the specific "block and house number". It has only a minor effect on the similarity of two addresses and may even be ignored in other embodiments. Some application scenarios, however, require precision down to the detail address to meet business needs. The second grammar model in this embodiment includes but is not limited to patterns such as "a certain block", "a certain block, a certain floor", and "a certain block, a certain floor, a certain room".
A second composition unit, configured to combine the first, second, and third word segmentation parts of the first address into the first segmentation group of the first address, and to combine the first, second, and third word segmentation parts of the second address into the second segmentation group of the second address.
In this embodiment, the first address and the second address each include a range address, a landmark address, and a detail address, arranged from left to right. For example, the first address is "广东省佛山市南海区桂城镇江南名居小区荣苑1座306" and the second address is "广东省佛山市南海区桂城镇江南名居荣苑1座502". The first segmentation group is "广东省/佛山市/南海区/桂城镇/江南名居小区/荣苑/1座/306" and the second segmentation group is "广东省/佛山市/南海区/桂城镇/江南名居/荣苑/1座/502". The first address and the second address are then divided into segments or administrative levels according to these segmentation groups.
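A regular expression is one simple way to approximate the second grammar model for the detail address. The sketch below is only illustrative: the set of unit suffixes (座, 栋, 幢, 楼, 层, 室, 号) is an assumption, since the text names the patterns only informally.

```python
import re

# One token = a run of digits optionally followed by an assumed unit suffix
# (block, building, floor, room, number).
DETAIL_TOKEN = re.compile(r"\d+[座栋幢楼层室号]?")


def segment_detail_address(detail):
    """Split a detail address such as '1座306' into its numeric fields."""
    return DETAIL_TOKEN.findall(detail)


print("/".join(segment_detail_address("1座306")))          # 1座/306
print("/".join(segment_detail_address("3栋12层1203室")))    # 3栋/12层/1203室
```

The first output reproduces the "1座/306" split used in the segmentation groups above.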
Further, the range address includes four administrative levels (province, city, district or county, and township or town), and the first acquisition module 3 includes:
A mapping unit, configured to map all the first segments and all the second segments, in descending order of administrative level, into two structure trees of identical structure, where each structure tree includes multiple nodes and each node corresponds one-to-one to a first segment or a second segment.
In this embodiment, all first segments of the first address and all second segments of the second address are mapped, from the highest administrative level to the lowest, into two structure trees of identical structure. Each node corresponds to at least one segment, or to multiple word segments of the same administrative level. For example, the word segment "广东省" (Guangdong Province) at the highest level contained in the first address, "province", becomes the root node; the word segment "佛山市" (Foshan City) at the next level, "city", is attached as its child; and so on down to the end node "1座502" (Block 1, 502). Depending on the address, the root and end nodes may sit at different administrative levels: the address may be a full address covering all levels or a short address covering only some.
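One possible in-memory form of such a structure tree is a chain of nodes ordered from the highest level down. The sketch below is an assumption about representation only: the level names, the `Node` class, and the choice to skip absent levels (so that a short address starts at a lower-level root) are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed level names, ordered from highest to lowest administrative level.
LEVELS = ["province", "city", "district", "town", "landmark", "detail"]


@dataclass
class Node:
    level: str
    segments: List[str]            # one node may hold several word segments
    child: Optional["Node"] = None


def build_tree(segments_by_level: Dict[str, List[str]]) -> Optional[Node]:
    """Chain one node per present level, highest level first."""
    root: Optional[Node] = None
    tail: Optional[Node] = None
    for level in LEVELS:
        segments = segments_by_level.get(level)
        if not segments:
            continue  # short address: this level is simply absent
        node = Node(level, list(segments))
        if tail is None:
            root = node
        else:
            tail.child = node
        tail = node
    return root


def levels_of(root: Optional[Node]) -> List[str]:
    """Walk the chain and list the levels it covers."""
    out = []
    node = root
    while node is not None:
        out.append(node.level)
        node = node.child
    return out


full = build_tree({
    "province": ["广东省"], "city": ["佛山市"], "district": ["南海区"],
    "town": ["桂城镇"], "landmark": ["江南名居", "荣苑"], "detail": ["1座", "502"],
})
short = build_tree({"district": ["南海区"], "town": ["桂城镇"]})
print(levels_of(full))   # ['province', 'city', 'district', 'town', 'landmark', 'detail']
print(levels_of(short))  # ['district', 'town']
```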
A first acquisition unit, configured to obtain the match value of each pair of corresponding nodes in the two structure trees.
In this embodiment, nodes of the two structure trees are paired according to administrative level, and a match value is obtained for each pair. The match value is the number of matching segments divided by the total number of segments at that node. For example, if the "province" node of the first address has the value "广东" (Guangdong) and the "province" node of the second address also has the value "广东", the nodes match; otherwise they do not.
A second acquisition unit, configured to obtain the first weight for the range address, the second weight for the landmark address, and the third weight for the detail address.
Because segments at different administrative levels affect the address to different degrees, this embodiment assigns them different weights, which makes it easier to adapt to business requirements. For example, the second weight for the landmark address is higher than the first weight for the range address.
A calculation unit, configured to compute match rates by multiplying each match value by its weight, obtaining the first match rate for the range address, the second match rate for the landmark address, and the third match rate for the detail address.
The match rate in this embodiment is computed as follows: each segment's match value multiplied by its configured weight gives that segment's match rate, and the sum of all segment match rates is the matching result between the first address and the second address.
A summation unit, configured to take the sum of the first, second, and third match rates as the matching result between all the first segments and all the second segments.
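The calculation and summation steps can be written compactly. The symbols below are assumptions introduced for illustration: $m_i$ is the match value of node $i$, $w_i$ its weight, and $R_1$, $R_2$, $R_3$ are the match rates of the range, landmark, and detail addresses.

```latex
\[
m_i = \frac{\text{number of matching segments at node } i}{\text{number of segments at node } i},
\qquad
R = \sum_{i} w_i\, m_i = R_1 + R_2 + R_3,
\]
\[
R_1 = \sum_{i \in \text{range}} w_i\, m_i,\quad
R_2 = \sum_{i \in \text{landmark}} w_i\, m_i,\quad
R_3 = \sum_{i \in \text{detail}} w_i\, m_i .
\]
```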
Further, the first acquisition unit includes:
A first matching subunit, configured to match each first segment of the range address in the first address against the corresponding second segment of the range address in the second address, one by one according to node correspondence, by exact full matching, obtaining the first match values.
Nodes at different administrative levels are matched by different methods in this embodiment. The four levels of province, city, district or county, and township or town are matched exactly: the nodes match only if their characters are 100% identical. For example, if the "province" node of the first address has the value "广东" (Guangdong) and the "province" node of the second address also has the value "广东", the nodes match.
A second matching subunit, configured to match each first segment of the landmark address in the first address against the corresponding second segment of the landmark address in the second address, one by one according to node correspondence, by model keyword matching, obtaining the second match values.
For segments of the landmark address, this embodiment matches by means of an NLP (Natural Language Processing) model: two segments match if one contains the other. For example, "江南名居小区/荣苑" and "江南名居/荣苑" are not identical character for character, but "江南名居小区" contains the characters "江南名居", so the segments still match one to one.
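The containment rule for landmark segments can be approximated with plain substring tests. This is a simplified stand-in for the NLP model keyword matching named in the text; the function name is hypothetical.

```python
def contains_match(first_segment, second_segment):
    """Return 1.0 when either segment contains the other, else 0.0."""
    if not first_segment or not second_segment:
        return 0.0  # an empty segment never counts as a match
    if first_segment in second_segment or second_segment in first_segment:
        return 1.0
    return 0.0


print(contains_match("江南名居小区", "江南名居"))  # 1.0
print(contains_match("荣苑", "荣苑"))              # 1.0
print(contains_match("江南名居", "碧桂园"))        # 0.0
```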
A third matching subunit, configured to match each first segment of the detail address in the first address against the corresponding second segment of the detail address in the second address, one by one according to node correspondence, by numeric matching, obtaining the third match values.
If the detail address in this embodiment includes a first specified number of segments, of which a second specified number satisfy the matching relationship, the match value of the detail address is the second specified number divided by the first specified number.
A summarizing subunit, configured to collect the first, second, and third match values, obtaining the match value of each pair of corresponding nodes in the two structure trees.
For example, the segmentation group of the first address is "广东/佛山市/南海/桂城/江南名居小区/荣苑/1/306", and that of the second address is "广东/佛山市/南海/桂城/江南名居/荣苑/1/502". After segmentation, each address is divided into six administrative levels (province; city; district or county; town or township; road, residential community, or building complex; and block and house number), which map to six nodes with default weights 0.1/0.1/0.1/0.1/0.5/0.1. The first four levels (广东/佛山市/南海/桂城) are matched by 100% character equality, giving 0.1*1, 0.1*1, 0.1*1, and 0.1*1. The fifth level is matched by the containment-based model: "江南名居小区/荣苑" against "江南名居/荣苑" gives 0.5*1. The sixth level is matched fuzzily: comparing "1/306" with "1/502", only one of the two fields matches (306 and 502 do not), so the match value is 0.5 and the result is 0.5*0.1 = 0.05. The match rate between the first address and the second address is therefore 0.1 + 0.1 + 0.1 + 0.1 + 0.5 + 0.05 = 0.95.
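The arithmetic of this worked example can be reproduced in a few lines. The sketch assumes each address has already been split into six nodes; the helper names and the per-level rule table are illustrative, while the weights and segment values are those of the example.

```python
# Default node weights from the worked example (six administrative levels).
WEIGHTS = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]


def exact(a, b):
    return a == b


def contains(a, b):
    return a in b or b in a


# Assumed rule per level: exact for levels 1-4, containment for level 5,
# field-by-field equality for level 6.
RULES = [exact, exact, exact, exact, contains, exact]


def node_match_value(first_segs, second_segs, rule):
    """Matching segment pairs divided by the number of segments at the node."""
    total = max(len(first_segs), len(second_segs))
    if total == 0:
        return 0.0
    matched = sum(1 for a, b in zip(first_segs, second_segs) if rule(a, b))
    return matched / total


def match_rate(first_nodes, second_nodes):
    """Weighted sum of the node match values."""
    return sum(
        weight * node_match_value(f, s, rule)
        for weight, f, s, rule in zip(WEIGHTS, first_nodes, second_nodes, RULES)
    )


first = [["广东"], ["佛山市"], ["南海"], ["桂城"], ["江南名居小区", "荣苑"], ["1", "306"]]
second = [["广东"], ["佛山市"], ["南海"], ["桂城"], ["江南名居", "荣苑"], ["1", "502"]]
print(round(match_rate(first, second), 2))  # 0.95
```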
Further, the first acquisition module 3 includes:
An input unit, configured to input a specified number of training samples, each pre-labeled with a similarity value, into the natural language processing model for training.
An adjustment unit, configured to adjust the training parameters to first parameters such that the similarity values output by the natural language processing model agree with the pre-labeled similarity values.
A correspondence unit, configured to assign the weight values in the first parameters, according to node correspondence, as the first weight, the second weight, and the third weight.
The default weights in this embodiment are obtained by training. The training parameters, which include the weight values, are adjusted repeatedly during training until the similarity output by the model agrees with the pre-labeled similarity value or falls within a preset deviation range, which determines the weight values. Other embodiments of the present application may adjust one or more of the default weights for a specific application scenario so that the matching model fits that scenario better.
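How weights might be fitted to pre-labeled similarity values can be illustrated with ordinary batch gradient descent on a linear model. This is a swapped-in, deliberately simple technique, not the training procedure of the embodiment; the synthetic samples, learning rate, and epoch count are all assumptions.

```python
import random


def fit_weights(samples, labels, lr=0.3, epochs=2000):
    """Fit one weight per level so that sum(w * match_value) tracks the labels.

    samples -- list of per-level match-value vectors
    labels  -- pre-labeled similarity value for each sample
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0] * n_features
        for x, y in zip(samples, labels):
            error = sum(w * v for w, v in zip(weights, x)) - y
            for j, v in enumerate(x):
                grad[j] += error * v
        # Mean-squared-error gradient step.
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Synthetic training data: labels are generated from known weights so the
# fit can be checked; real labels would come from annotated address pairs.
rng = random.Random(0)
true_weights = [0.1, 0.1, 0.1, 0.1, 0.5, 0.1]
samples = [[rng.choice((0.0, 0.5, 1.0)) for _ in true_weights] for _ in range(60)]
labels = [sum(w * v for w, v in zip(true_weights, x)) for x in samples]

learned = fit_weights(samples, labels)
print([round(w, 3) for w in learned])
```

On this noise-free data the learned weights should land very close to the generating weights; with real labels, the stopping rule would be the preset deviation range mentioned above.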
Further, the word segmentation module 1 includes:
A calling unit, configured to call the address database and correct the first address and the second address respectively according to a third preset rule.
The first address or the second address in this embodiment may not conform to the address data in the national address database. In that case the address can be corrected by calling the address database, including completing the address and removing qualifiers. Address completion fills in a parent node from its child node (for example, "南海区" (Nanhai District) can be completed upward with "佛山市" (Foshan City)) or fills in an intermediate node from the nodes before and after it (for example, "佛山市" and "桂城镇" (Guicheng Town) allow "南海区" to be filled in between them).
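Both completion cases can be sketched with a child-to-parent lookup. The three-entry table below is a hypothetical stand-in for the national address database, and the function name is illustrative; note that this version completes all missing ancestors, up to the province.

```python
# Hypothetical child -> parent table standing in for the address database.
PARENT = {"佛山市": "广东省", "南海区": "佛山市", "桂城镇": "南海区"}


def complete_address(segments, parent=PARENT):
    """Insert missing ancestors so every known segment follows its parent."""
    completed = []
    for segment in segments:
        missing = []
        ancestor = parent.get(segment)
        # Climb until an ancestor is already present or the table runs out.
        while ancestor is not None and ancestor not in completed:
            missing.append(ancestor)
            ancestor = parent.get(ancestor)
        completed.extend(reversed(missing))
        completed.append(segment)
    return completed


# Upward completion from a child node.
print("/".join(complete_address(["南海区", "桂城镇"])))  # 广东省/佛山市/南海区/桂城镇
# Filling an intermediate node between two known nodes.
print("/".join(complete_address(["佛山市", "桂城镇"])))  # 广东省/佛山市/南海区/桂城镇
```

Segments that are not in the table, such as landmark names, pass through unchanged.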
Further, the address matching apparatus also includes:
An indexing module, configured to index a specified amount of unstructured address data pre-stored in the index server, obtaining the preset index structure.
The data pre-stored in the index server in this embodiment is unstructured data, stored in columnar form as key-value pairs. Unstructured data here means text, images, speech, and the like held in column stores built on NoSQL storage technology. Because the data volume is very large, distributed NoSQL technology is needed for storage and computation; the index server combines distributed NoSQL storage with an index structure to query and compute over massive data quickly and in real time. NoSQL refers to non-relational databases and is an open-source technology. Elasticsearch stores data as key-value pairs with an inverted index and performs most computation in memory, which enables fast real-time computation.
A receiving module, configured to receive an interface plug-in uploaded to a designated directory of the index server, where the interface plug-in is formed by packaging the preset matching algorithm.
The index server in this embodiment is an open-source component that supports a plug-in mode. The interface plug-in can extend its org.elasticsearch.plugins.Plugin class to implement a custom address matching algorithm plug-in, which is loaded and ready for use once the index server is restarted.
A second acquisition module, configured to obtain the configuration parameters of the interface plug-in.
An establishing module, configured to establish a computational association between the preset index structure and the interface plug-in by running the configuration parameters.
In this embodiment, once the address matching algorithm has been developed, it is packaged, uploaded to the designated directory of the index server, and configured with the relevant parameters. Loading and running the configuration parameters establishes the computational association between the preset index structure and the interface plug-in, so that the address matching algorithm in the plug-in can be called to perform the matching calculation for the first address within the preset index structure, thereby realizing address data queries.
The index server in this embodiment is the open-source Elasticsearch component (Elasticsearch is used for distributed full-text retrieval), a full-text search engine that offers distributed computing capability through a RESTful web interface and can query massive data quickly and in real time. The query steps are: (1) Import the addresses of the massive address library into Elasticsearch's underlying storage as key-value pairs through Elasticsearch's data import interface, and index the keys. (2) Adapt the address matching model to Elasticsearch's custom search extension model, add it to the extension module of the Elasticsearch master node, and restart Elasticsearch, so that the address matching model can draw on Elasticsearch's distributed storage and high-concurrency computation. (3) Use this custom model to develop a one-to-many interface on Elasticsearch for matching against massive numbers of addresses. (4) Develop an upper-layer interface on Elasticsearch through which a new address can be entered and the massive address library and custom model to match against can be selected; Elasticsearch then computes, quickly and in real time, the similarity between the new address and the addresses in the library and returns the top N most similar addresses, where N is passed as a programmable parameter. By building an index structure over the massive data pre-stored in the index server and combining it with the Elasticsearch component's own computing architecture and distributed computing capability, this embodiment queries the first address against the preset index structure quickly and in real time.
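The final step, returning the top N most similar addresses, can be shown independently of Elasticsearch. The sketch below ranks an in-memory candidate list with the standard-library `heapq.nlargest`; the scoring function is a toy positional comparison and the candidate addresses are made up for illustration.

```python
import heapq


def toy_score(query, candidate):
    """Toy similarity: share of positions whose segments are identical."""
    same = sum(1 for q, c in zip(query, candidate) if q == c)
    return same / max(len(query), len(candidate))


def top_n_matches(query, candidates, n=3, score=toy_score):
    """Return the n candidates most similar to the query, best first."""
    return heapq.nlargest(n, candidates, key=lambda c: score(query, c))


query = ["广东省", "佛山市", "南海区", "桂城镇"]
candidates = [
    ["湖南省", "长沙市", "岳麓区", "橘子洲街道"],
    ["广东省", "佛山市", "南海区", "大沥镇"],
    ["广东省", "广州市", "天河区", "石牌街道"],
    ["广东省", "佛山市", "南海区", "桂城镇"],
]
for address in top_n_matches(query, candidates, n=2):
    print("/".join(address))
# 广东省/佛山市/南海区/桂城镇
# 广东省/佛山市/南海区/大沥镇
```

In the embodiment the scoring would be the weighted match rate computed by the plug-in inside the index server rather than this toy function.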
In this embodiment, the segments of the first address at different administrative levels are matched by different methods and different matching models, and each segment has its own matching weight. The first address is divided into six segments corresponding to six administrative levels and to six nodes in the tree structure. The first four levels share the same matching model, one-to-one character matching; the fifth level uses a fuzzy matching model based on containment; and the sixth level uses a numeric matching model. A filtering mechanism is applied during the matching calculation: the target segments at the four administrative levels "province; city; district or county; town, township, or road" are first matched exactly, character by character. When the result for these target segments is below a preset threshold, it is determined that the preset index structure contains no address data satisfying the preset matching condition with the first address, and the matching conclusion is output directly, which reduces the matching workload and improves response time. With this filtering mechanism, at least 90% of addresses can be filtered out, so an address ultimately needs a full match against only the remaining 10% or so, which greatly saves computing resources.
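The filtering mechanism amounts to a cheap exact comparison of the leading segments before any full matching is attempted. A minimal sketch follows; the function names, the default threshold of 1.0 (all four levels must agree), and the sample candidates are assumptions.

```python
def passes_prefilter(query_range, candidate_range, threshold=1.0):
    """Exact, position-wise comparison of the range-level segments."""
    if not query_range:
        return True  # nothing to filter on
    matched = sum(1 for q, c in zip(query_range, candidate_range) if q == c)
    return matched / len(query_range) >= threshold


def prefilter(query_range, candidates, threshold=1.0):
    """Keep only candidates worth a full (weighted) match."""
    width = len(query_range)
    return [c for c in candidates if passes_prefilter(query_range, c[:width], threshold)]


query_range = ["广东省", "佛山市", "南海区", "桂城镇"]
candidates = [
    ["广东省", "佛山市", "南海区", "桂城镇", "江南名居", "荣苑"],
    ["广东省", "佛山市", "南海区", "大沥镇", "江南名居", "荣苑"],
    ["广东省", "广州市", "天河区", "石牌街道", "江南名居", "荣苑"],
]
kept = prefilter(query_range, candidates)
print(len(kept))  # 1
```

Lowering the threshold (for example to 0.75) would let candidates that differ at one of the four levels through to the full match.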
参照图3,本申请实施例中还提供一种计算机设备,该计算机设备可以是服务器,其内部结构可以如图3所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。该计算机设计的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统、计算机程序和数据库。该内存器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计 算机设备的数据库用于存储地址匹配过程需要的所有数据。该计算机设备的网络接口用于与外部的端通过网络连接通信。该计算机程序被处理器执行时以实现地址匹配方法。3, an embodiment of the present application also provides a computer device. The computer device may be a server, and its internal structure may be as shown in FIG. 3. The computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. The computer designed processor is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store all the data needed for the address matching process. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to realize the address matching method.
The processor executes the address matching method described above, in which the first address is an address to be retrieved that is input by a user, and the second address is stored in an index server. The method includes: invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address; dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group; obtaining a matching result between all the first segments and all the second segments according to a second preset rule; and determining, according to the matching result, whether the first address and the second address are the same.
In the above computer device, the data pre-stored in the index server is unstructured data, stored in a key-value columnar form. Here, unstructured data refers to text, images, speech and the like, held in column storage built on NoSQL storage technology. Because the data volume is very large, NoSQL technology with a distributed architecture is needed for storage and computation. The index server combines NoSQL distributed storage with an index structure to provide real-time, fast query and computation over massive data. On this basis, a weight-configurable address matching model based on multi-level address division is proposed. A natural language processing model first segments the address name into a word segmentation group; the words are then divided into segments by administrative level, and the segments are mapped to nodes in a tree structure. This takes full account of the tree structure of addresses: the address is divided level by level according to administrative level, each administrative-level segment is assigned a different weight, and the weights can be fine-tuned for the actual business scenario. By building an index structure over the massive data pre-stored in the index server, and combining it with the computing architecture and strong distributed computing capability of the Elasticsearch component, the first address can be queried in the preset index structure quickly and in real time. The first four administrative-level segments of an address are matched exactly against the national, tree-structured address database of provinces, cities, districts, counties and towns, and partially missing parts are effectively completed. The default weights are obtained by model training: the training parameters, which include the weight values, are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value or falls within a preset deviation range. This determines the weight values and makes the weight settings more reliable.
In one embodiment, the first address and the second address each include a range address and a landmark address. The step in which the processor invokes the preset matching algorithm to segment the first address and the second address according to the first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, includes: segmenting the range address of the first address and the range address of the second address according to a pre-associated address dictionary in the natural language processing model, obtaining a first word segmentation part corresponding to the first address and a first word segmentation part corresponding to the second address; segmenting the landmark address of the first address and the landmark address of the second address according to a first grammar model in the natural language processing model, obtaining a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address; and combining the first and second word segmentation parts of the first address into the first word segmentation group, and combining the first and second word segmentation parts of the second address into the second word segmentation group.
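The dictionary-based segmentation of the range address could look like the following sketch. The embodiment does not say which dictionary algorithm is used; forward maximum matching is assumed here, and `segment_by_dictionary` and the sample dictionary are hypothetical names and data.

```python
from typing import Iterable, List


def segment_by_dictionary(text: str, dictionary: Iterable[str]) -> List[str]:
    """Split `text` with forward maximum matching (an assumed algorithm).

    At each position the longest dictionary entry is taken; a character
    not covered by any entry is emitted as a single-character token.
    """
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible slice first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


# Example with a tiny, made-up address dictionary.
address_dictionary = {"广东省", "深圳市", "福田区"}
range_tokens = segment_by_dictionary("广东省深圳市福田区", address_dictionary)
```

The landmark address would then be handled separately by the first grammar model, which this sketch does not attempt to reproduce.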
In one embodiment, the first address and the second address each further include a detail address. After the step in which the processor segments the landmark address of the first address and the landmark address of the second address according to the grammar model in the natural language processing model, obtaining the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address, the method includes: segmenting the detail address of the first address and the detail address of the second address according to a second grammar model in the natural language processing model, obtaining a third word segmentation part corresponding to the first address and a third word segmentation part corresponding to the second address; and combining the first, second and third word segmentation parts of the first address into the first word segmentation group, and combining the first, second and third word segmentation parts of the second address into the second word segmentation group.
In one embodiment, the range address includes four administrative levels: province, city/district, county, and township/town, and the landmark address includes a residential community name or a building name. The step in which the processor obtains the matching result between all the first segments and all the second segments according to the second preset rule includes: mapping all the first segments and all the second segments, in descending order of administrative level, into two structure trees of identical structure, where each structure tree includes a plurality of nodes and each node corresponds one-to-one to a first segment or a second segment; obtaining the matching value for each pair of corresponding nodes of the two structure trees; obtaining a first weight corresponding to the range address, a second weight corresponding to the landmark address, and a third weight corresponding to the detail address; calculating matching rates by multiplying each matching value by its corresponding weight, obtaining a first matching rate corresponding to the range address, a second matching rate corresponding to the landmark address, and a third matching rate corresponding to the detail address; and taking the sum of the first matching rate, the second matching rate and the third matching rate as the matching result between all the first segments and all the second segments.
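The weighted sum described above can be sketched as follows. The embodiment does not state how several node values within one group are combined before weighting, so averaging them is an assumption here; the function names and the example weights 0.5, 0.3 and 0.2 are likewise hypothetical.

```python
from statistics import mean
from typing import Sequence


def group_rate(values: Sequence[float], weight: float) -> float:
    """Matching rate of one group: weight times the mean node value.

    Averaging the node values is an assumption made for this sketch.
    """
    return weight * mean(values) if values else 0.0


def match_result(range_values: Sequence[float],
                 landmark_values: Sequence[float],
                 detail_values: Sequence[float],
                 w_range: float = 0.5,
                 w_landmark: float = 0.3,
                 w_detail: float = 0.2) -> float:
    """Sum of the first, second and third matching rates."""
    return (group_rate(range_values, w_range)
            + group_rate(landmark_values, w_landmark)
            + group_rate(detail_values, w_detail))
```

With weights that sum to 1, the result stays between 0 and 1 and can be compared with a threshold to decide whether the two addresses are the same.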
In one embodiment, the step in which the processor obtains the matching value for each pair of corresponding nodes of the two structure trees includes: performing exact full matching, node by node according to the node correspondence, between the first segments of the range address in the first address and the second segments of the range address in the second address, obtaining the first matching values; performing model keyword matching, node by node according to the node correspondence, between the first segments of the landmark address in the first address and the second segments of the landmark address in the second address, obtaining the second matching values; performing numeric matching, node by node according to the node correspondence, between the first segments of the detail address in the first address and the second segments of the detail address in the second address, obtaining the third matching values; and collecting the first matching values, the second matching values and the third matching values to obtain the matching values for the nodes of the two structure trees.
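The three kinds of node matching might be sketched as below. These functions are illustrative assumptions: keyword matching is approximated by substring containment and numeric matching by comparing the digit runs extracted with a regular expression, neither of which is specified in this form by the embodiment.

```python
import re


def exact_match(a: str, b: str) -> float:
    """Range-address nodes: 1.0 only when the two strings are identical."""
    return 1.0 if a == b else 0.0


def keyword_match(a: str, b: str) -> float:
    """Landmark nodes: assumed containment test in either direction."""
    if not a or not b:
        return 0.0
    return 1.0 if a in b or b in a else 0.0


def number_match(a: str, b: str) -> float:
    """Detail nodes: compare the ordered digit runs found in each string."""
    nums_a = re.findall(r"\d+", a)
    nums_b = re.findall(r"\d+", b)
    return 1.0 if nums_a and nums_a == nums_b else 0.0
```

For example, `number_match("No. 8, Unit 2", "8号2单元")` compares the digit runs `["8", "2"]` on both sides and ignores the surrounding wording.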
In one embodiment, before the step in which the processor obtains the first weight corresponding to the range address, the second weight corresponding to the landmark address, and the third weight corresponding to the detail address, the method includes: inputting a specified number of training samples with pre-labeled similarity values into the natural language processing model for training; adjusting the training parameters to a first parameter so that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value; and assigning the weight values in the first parameter, according to the node correspondence, as the first weight, the second weight and the third weight.
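One simple way to picture this weight training is a least-squares fit by gradient descent, sketched below. This is an assumption for illustration only: the embodiment does not disclose the model form or the optimizer, and `fit_weights`, the learning rate and the synthetic samples are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (group matching values, labeled similarity)


def fit_weights(samples: Sequence[Sample], lr: float = 0.1,
                epochs: int = 2000) -> List[float]:
    """Fit one weight per group by minimizing the mean squared error
    between the weighted sum and the pre-labeled similarity."""
    n_features = len(samples[0][0])
    weights = [1.0 / n_features] * n_features
    for _ in range(epochs):
        grads = [0.0] * n_features
        for features, label in samples:
            error = sum(w * x for w, x in zip(weights, features)) - label
            for k, x in enumerate(features):
                grads[k] += 2.0 * error * x / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Synthetic samples labeled with assumed "true" weights 0.5, 0.3, 0.2.
true_weights = (0.5, 0.3, 0.2)
samples = [(f, sum(w * x for w, x in zip(true_weights, f)))
           for f in product((0.0, 1.0), repeat=3)]
learned = fit_weights(samples)
```

On labeled data of this kind the fitted values approach the weights used to generate the labels; real training data would also need a stopping rule tied to the preset deviation range.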
Those skilled in the art will understand that the structure shown in FIG. 3 is only a block diagram of the part of the structure related to the solution of this application, and does not limit the computer device to which the solution of this application is applied.
An embodiment of this application further provides a computer-readable storage medium on which a computer program is stored. The computer-readable storage medium may be non-volatile or volatile. When executed by a processor, the computer program implements an address matching method, in which the first address is an address to be retrieved that is input by a user, and the second address is stored in an index server. The method includes: invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address; dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group; obtaining a matching result between all the first segments and all the second segments according to a second preset rule; and determining, according to the matching result, whether the first address and the second address are the same.
In the above computer-readable storage medium, the data pre-stored in the index server is unstructured data, stored in a key-value columnar form. Here, unstructured data refers to text, images, speech and the like, held in column storage built on NoSQL storage technology. Because the data volume is very large, NoSQL technology with a distributed architecture is needed for storage and computation. The index server combines NoSQL distributed storage with an index structure to provide real-time, fast query and computation over massive data. On this basis, a weight-configurable address matching model based on multi-level address division is proposed. A natural language processing model first segments the address name into a word segmentation group; the words are then divided into segments by administrative level, and the segments are mapped to nodes in a tree structure. This takes full account of the tree structure of addresses: the address is divided level by level according to administrative level, each administrative-level segment is assigned a different weight, and the weights can be fine-tuned for the actual business scenario. By building an index structure over the massive data pre-stored in the index server, and combining it with the computing architecture and strong distributed computing capability of the Elasticsearch component, the first address can be queried in the preset index structure quickly and in real time. The first four administrative-level segments of an address are matched exactly against the national, tree-structured address database of provinces, cities, districts, counties and towns, and partially missing parts are effectively completed. The default weights are obtained by model training: the training parameters, which include the weight values, are adjusted continuously during training until the similarity output by the model is consistent with the pre-labeled similarity value or falls within a preset deviation range. This determines the weight values and makes the weight settings more reliable.
In one embodiment, the first address and the second address each include a range address and a landmark address. The step in which the processor invokes the preset matching algorithm to segment the first address and the second address according to the first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, includes: segmenting the range address of the first address and the range address of the second address according to a pre-associated address dictionary in the natural language processing model, obtaining a first word segmentation part corresponding to the first address and a first word segmentation part corresponding to the second address; segmenting the landmark address of the first address and the landmark address of the second address according to a first grammar model in the natural language processing model, obtaining a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address; and combining the first and second word segmentation parts of the first address into the first word segmentation group, and combining the first and second word segmentation parts of the second address into the second word segmentation group.
In one embodiment, the first address and the second address each further include a detail address. After the step in which the processor segments the landmark address of the first address and the landmark address of the second address according to the grammar model in the natural language processing model, obtaining the second word segmentation part corresponding to the first address and the second word segmentation part corresponding to the second address, the method includes: segmenting the detail address of the first address and the detail address of the second address according to a second grammar model in the natural language processing model, obtaining a third word segmentation part corresponding to the first address and a third word segmentation part corresponding to the second address; and combining the first, second and third word segmentation parts of the first address into the first word segmentation group, and combining the first, second and third word segmentation parts of the second address into the second word segmentation group.
In one embodiment, the range address includes four administrative levels: province, city/district, county, and township/town, and the landmark address includes a residential community name or a building name. The step in which the processor obtains the matching result between all the first segments and all the second segments according to the second preset rule includes: mapping all the first segments and all the second segments, in descending order of administrative level, into two structure trees of identical structure, where each structure tree includes a plurality of nodes and each node corresponds one-to-one to a first segment or a second segment; obtaining the matching value for each pair of corresponding nodes of the two structure trees; obtaining a first weight corresponding to the range address, a second weight corresponding to the landmark address, and a third weight corresponding to the detail address; calculating matching rates by multiplying each matching value by its corresponding weight, obtaining a first matching rate corresponding to the range address, a second matching rate corresponding to the landmark address, and a third matching rate corresponding to the detail address; and taking the sum of the first matching rate, the second matching rate and the third matching rate as the matching result between all the first segments and all the second segments.
In one embodiment, the step in which the processor obtains the matching value for each pair of corresponding nodes of the two structure trees includes: performing exact full matching, node by node according to the node correspondence, between the first segments of the range address in the first address and the second segments of the range address in the second address, obtaining the first matching values; performing model keyword matching, node by node according to the node correspondence, between the first segments of the landmark address in the first address and the second segments of the landmark address in the second address, obtaining the second matching values; performing numeric matching, node by node according to the node correspondence, between the first segments of the detail address in the first address and the second segments of the detail address in the second address, obtaining the third matching values; and collecting the first matching values, the second matching values and the third matching values to obtain the matching values for the nodes of the two structure trees.
In one embodiment, before the step in which the processor obtains the first weight corresponding to the range address, the second weight corresponding to the landmark address, and the third weight corresponding to the detail address, the method includes: inputting a specified number of training samples with pre-labeled similarity values into the natural language processing model for training; adjusting the training parameters to a first parameter so that the similarity value output by the natural language processing model is consistent with the pre-labeled similarity value; and assigning the weight values in the first parameter, according to the node correspondence, as the first weight, the second weight and the third weight.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database or other media provided in this application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "including a..." does not exclude the presence of additional identical elements in the process, device, article or method that includes the element.
The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims (20)

  1. An address matching method, wherein a first address is an address to be retrieved that is input by a user, and a second address is stored in an index server, the method comprising:
    invoking a preset matching algorithm to segment the first address and the second address into words according to a first preset rule, obtaining a first word segmentation group corresponding to the first address and a second word segmentation group corresponding to the second address, wherein the preset matching algorithm includes word segmentation calculation and matching calculation;
    dividing the first address into a plurality of first segments according to the first word segmentation group, and dividing the second address into a plurality of second segments according to the second word segmentation group;
    obtaining a matching result between all the first segments and all the second segments according to a second preset rule;
    determining, according to the matching result, whether the first address and the second address are the same.
  2. The address matching method according to claim 1, wherein the first address and the second address each include a range address and a landmark address, and the step of invoking the preset matching algorithm to segment the first address and the second address according to the first preset rule, obtaining the first word segmentation group corresponding to the first address and the second word segmentation group corresponding to the second address, comprises:
    segmenting the range address of the first address and the range address of the second address according to a pre-associated address dictionary in a natural language processing model, obtaining a first word segmentation part corresponding to the first address and a first word segmentation part corresponding to the second address;
    segmenting the landmark address of the first address and the landmark address of the second address according to a first grammar model in the natural language processing model, obtaining a second word segmentation part corresponding to the first address and a second word segmentation part corresponding to the second address;
    combining the first word segmentation part and the second word segmentation part corresponding to the first address into the first word segmentation group corresponding to the first address, and combining the first word segmentation part and the second word segmentation part corresponding to the second address into the second word segmentation group corresponding to the second address.
  3. The address matching method according to claim 2, wherein the first address and the second address each further comprise a detail address, and after the step of performing word segmentation on the mark addresses respectively corresponding to the first address and the second address according to the grammar model in the natural language processing model, to obtain the second word-segmentation part corresponding to the first address and the second word-segmentation part corresponding to the second address, the method comprises:
    performing word segmentation on the detail addresses respectively corresponding to the first address and the second address according to a second grammar model in the natural language processing model, to obtain a third word-segmentation part corresponding to the first address and a third word-segmentation part corresponding to the second address, respectively;
    combining the first, second and third word-segmentation parts corresponding to the first address into the first word-segmentation group corresponding to the first address, and combining the first, second and third word-segmentation parts corresponding to the second address into the second word-segmentation group corresponding to the second address.
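The three-part segmentation recited in claims 2 and 3 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `RANGE_DICT` is a hypothetical stand-in for the pre-associated address dictionary, and the two regular expressions are simple substitutes for the first and second grammar models, whose actual form the claims do not specify.

```python
import re

# Hypothetical gazetteer standing in for the "pre-associated address dictionary".
RANGE_DICT = ["广东省", "深圳市", "福田区", "沙头街道"]

# Regex stand-ins for the first and second grammar models (assumptions only).
MARK_PATTERN = re.compile(r".+?(?:小区|大厦|花园|广场)")
DETAIL_PATTERN = re.compile(r"\d+(?:栋|号楼|单元|层|室|号)")


def segment_range(text):
    """Greedy forward maximum matching against RANGE_DICT.

    Returns the matched range tokens and the unconsumed remainder.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        # Longest dictionary word that starts at the current position.
        word = max(
            (w for w in RANGE_DICT if text.startswith(w, pos)),
            key=len,
            default=None,
        )
        if word is None:
            break
        tokens.append(word)
        pos += len(word)
    return tokens, text[pos:]


def segment_address(text):
    """Split an address into range, mark and detail word-segmentation parts."""
    range_part, rest = segment_range(text)
    m = MARK_PATTERN.match(rest)
    mark_part = [m.group(0)] if m else []
    rest = rest[m.end():] if m else rest
    detail_part = DETAIL_PATTERN.findall(rest)
    return {"range": range_part, "mark": mark_part, "detail": detail_part}


group = segment_address("广东省深圳市福田区平安大厦3栋1201室")
# group == {"range": ["广东省", "深圳市", "福田区"],
#           "mark": ["平安大厦"],
#           "detail": ["3栋", "1201室"]}
```

The returned dictionary plays the role of a word-segmentation group: one part per address category, in the order range, mark, detail.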
  4. The address matching method according to claim 3, wherein the range address comprises four administrative levels, namely province, city/district, county and township/town, the mark address comprises a residential community name or a building name, and the step of obtaining the matching result of all the first segments and all the second segments according to the second preset rule comprises:
    mapping all the first segments and all the second segments, in descending order of administrative level, into two structure trees of identical structure, wherein each structure tree comprises a plurality of nodes, and the nodes are in one-to-one correspondence with the first segments or the second segments;
    obtaining the matching values respectively corresponding to the nodes of the two structure trees;
    obtaining a first weight corresponding to the range address, a second weight corresponding to the mark address and a third weight corresponding to the detail address, respectively;
    calculating matching rates by multiplying each matching value by its corresponding weight, to obtain a first matching rate corresponding to the range address, a second matching rate corresponding to the mark address and a third matching rate corresponding to the detail address, respectively;
    taking the sum of the first matching rate, the second matching rate and the third matching rate as the matching result of all the first segments and all the second segments.
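The weighted sum in claim 4 can be sketched as follows. The claim states only that matching values are multiplied by the corresponding weight and the three matching rates are added; how several node values within one category are combined is not stated, so this sketch assumes a simple mean. The class name, function names and the example weights (0.5, 0.3, 0.2) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CategoryScore:
    node_values: List[float]  # per-node matching values, each in [0, 1]
    weight: float             # weight of this address category


def match_rate(score: CategoryScore) -> float:
    """Matching rate of one category: weight times the mean node value (assumed)."""
    if not score.node_values:
        return 0.0
    return score.weight * sum(score.node_values) / len(score.node_values)


def matching_result(range_score, mark_score, detail_score) -> float:
    """Sum of the first, second and third matching rates."""
    return sum(match_rate(s) for s in (range_score, mark_score, detail_score))


total = matching_result(
    CategoryScore([1.0, 1.0, 1.0, 1.0], weight=0.5),  # province, city, county, town
    CategoryScore([0.5], weight=0.3),                 # community or building name
    CategoryScore([1.0, 0.0], weight=0.2),            # e.g. building no., room no.
)
# total is approximately 0.5 + 0.15 + 0.1 = 0.75
```

When the weights sum to 1 and every node value lies in [0, 1], the result also lies in [0, 1], which makes it convenient to compare against a fixed decision threshold.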
  5. The address matching method according to claim 4, wherein the step of obtaining the matching values respectively corresponding to the nodes of the two structure trees comprises:
    performing exact full matching, node by node according to the node correspondence, between the first segments corresponding to the range address in the first address and the second segments corresponding to the range address in the second address, to obtain first matching values;
    performing model keyword matching, node by node according to the node correspondence, between the first segments corresponding to the mark address in the first address and the second segments corresponding to the mark address in the second address, to obtain second matching values;
    performing numeric matching, node by node according to the node correspondence, between the first segments corresponding to the detail address in the first address and the second segments corresponding to the detail address in the second address, to obtain third matching values;
    aggregating the first matching values, the second matching values and the third matching values to obtain the matching values respectively corresponding to the nodes of the two structure trees.
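A minimal sketch of the three per-node comparisons in claim 5 follows. Each structure tree is simplified here to a flat mapping from node name to segment, which is an assumption. The claims do not define "model keyword matching", so a character-set Jaccard overlap is swapped in as a plain substitute; `LEVELS` and all function names are hypothetical.

```python
import re

# Range-address nodes, ordered from the highest administrative level down.
LEVELS = ["province", "city", "county", "town"]


def exact_match(a, b):
    """Exact full matching: 1.0 only when both segments exist and are identical."""
    return 1.0 if a is not None and a == b else 0.0


def keyword_match(a, b):
    """Stand-in for model keyword matching: Jaccard overlap of character sets."""
    if not a or not b:
        return 0.0
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def numeric_match(a, b):
    """Numeric matching: compare digit runs only, so '3栋' equals '3号楼'."""
    na = re.findall(r"\d+", a or "")
    nb = re.findall(r"\d+", b or "")
    return 1.0 if na and na == nb else 0.0


def node_match_values(tree1, tree2):
    """Return one matching value per corresponding node of the two trees."""
    values = {lvl: exact_match(tree1.get(lvl), tree2.get(lvl)) for lvl in LEVELS}
    values["mark"] = keyword_match(tree1.get("mark"), tree2.get("mark"))
    values["detail"] = numeric_match(tree1.get("detail"), tree2.get("detail"))
    return values


first = {"province": "广东省", "city": "深圳市", "county": "福田区",
         "town": "沙头街道", "mark": "平安大厦", "detail": "3栋1201室"}
second = {"province": "广东省", "city": "深圳市", "county": "福田区",
          "mark": "平安金融大厦", "detail": "3号楼1201"}
values = node_match_values(first, second)
# province, city, county -> 1.0; town -> 0.0 (missing in `second`);
# mark -> 4/6 (four shared characters out of six distinct); detail -> 1.0
```

Using a different comparison per category reflects the claim's intent: administrative names either agree or do not, building names tolerate small wording differences, and detail addresses are decided by their numbers.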
  6. The address matching method according to claim 5, wherein before the step of obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address and the third weight corresponding to the detail address, respectively, the method comprises:
    inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training;
    adjusting the training parameters to first parameters such that the similarity values output by the natural language processing model are consistent with the pre-labeled similarity values;
    mapping the weight values in the first parameters, according to the node correspondence, to the first weight, the second weight and the third weight, respectively.
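One way to picture the weight learning in claim 6 is a least-squares fit of the three weights to pre-labeled similarity values. This is a substitute technique chosen for illustration: the claims do not say how the natural language processing model is trained, and the function name, learning rate, epoch count and synthetic samples below are all assumptions.

```python
def fit_weights(samples, labels, lr=0.1, epochs=2000):
    """Fit (range, mark, detail) weights by gradient descent on mean squared error.

    samples: list of (range_value, mark_value, detail_value) tuples in [0, 1]
    labels:  pre-labeled similarity value for each sample
    """
    w = [1 / 3, 1 / 3, 1 / 3]  # start from equal weights
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(samples, labels):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(3):
                grad[i] += 2 * err * x[i] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Synthetic, noiseless samples generated from hypothetical weights (0.5, 0.3, 0.2).
samples = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0), (1, 0, 1)]
labels = [0.5 * a + 0.3 * b + 0.2 * c for a, b, c in samples]
first_weight, second_weight, third_weight = fit_weights(samples, labels)
```

On this synthetic data the fitted values converge to the weights used to generate the labels; with real labeled address pairs the fit would only approximate the labels.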
  7. The address matching method according to claim 2, wherein before the step of calling the preset matching algorithm to perform word segmentation on the first address and the second address respectively according to the first preset rule, to obtain the first word-segmentation group corresponding to the first address and the second word-segmentation group corresponding to the second address, the method comprises:
    indexing a specified quantity of unstructured address data pre-stored in the index server to obtain a preset index structure;
    receiving an interface plug-in uploaded to a designated directory of the index server, wherein the interface plug-in is formed by packaging and encapsulating the preset matching algorithm;
    obtaining configuration parameters of the interface plug-in;
    establishing a computational association between the preset index structure and the interface plug-in by running the configuration parameters.
  8. An address matching apparatus, wherein a first address is a to-be-retrieved address input by a user and a second address is stored in an index server, the apparatus comprising:
    a word segmentation module, configured to call a preset matching algorithm to perform word segmentation on the first address and the second address respectively according to a first preset rule, to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, wherein the preset matching algorithm comprises a word segmentation calculation and a matching calculation;
    a dividing module, configured to divide the first address into a plurality of first segments according to the first word-segmentation group, and to divide the second address into a plurality of second segments according to the second word-segmentation group;
    a second obtaining module, configured to obtain a matching result of all the first segments and all the second segments according to a second preset rule;
    a judging module, configured to determine, according to the matching result, whether the first address and the second address are the same.
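The four modules of claim 8 can be pictured as a small pipeline of pluggable functions. This is a structural sketch only: the class name, the toy module implementations and the decision threshold are hypothetical, and the claims do not state that the final judgment is a threshold comparison.

```python
from typing import Callable, List


class AddressMatcher:
    """Chains segmentation, division, scoring and judgment (illustrative only)."""

    def __init__(
        self,
        segment: Callable[[str], List[str]],            # word segmentation module
        divide: Callable[[str, List[str]], List[str]],  # dividing module
        score: Callable[[List[str], List[str]], float], # second obtaining module
        threshold: float = 0.9,                         # assumed decision threshold
    ):
        self.segment = segment
        self.divide = divide
        self.score = score
        self.threshold = threshold

    def is_same(self, first: str, second: str) -> bool:
        """Judging module: compare the matching result with the threshold."""
        group1, group2 = self.segment(first), self.segment(second)
        segs1 = self.divide(first, group1)
        segs2 = self.divide(second, group2)
        return self.score(segs1, segs2) >= self.threshold


# Toy modules so the sketch runs end to end.
def toy_segment(address):
    return address.split()


def toy_divide(address, group):
    return list(group)


def toy_score(a, b):
    if not a or not b:
        return 0.0
    hits = sum(1 for x, y in zip(a, b) if x == y)
    return hits / max(len(a), len(b))


matcher = AddressMatcher(toy_segment, toy_divide, toy_score)
matcher.is_same("Guangdong Shenzhen Futian Tower 3", "Guangdong Shenzhen Futian Tower 3")
```

Passing the modules in as callables mirrors the claim's separation of word segmentation calculation from matching calculation, so either can be replaced without touching the other.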
  9. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements an address matching method in which a first address is a to-be-retrieved address input by a user and a second address is stored in an index server, the method comprising:
    calling a preset matching algorithm to perform word segmentation on the first address and the second address respectively according to a first preset rule, to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, wherein the preset matching algorithm comprises a word segmentation calculation and a matching calculation;
    dividing the first address into a plurality of first segments according to the first word-segmentation group, and dividing the second address into a plurality of second segments according to the second word-segmentation group;
    obtaining a matching result of all the first segments and all the second segments according to a second preset rule;
    determining, according to the matching result, whether the first address and the second address are the same.
  10. The computer device according to claim 9, wherein the first address and the second address each comprise a range address and a mark address, and the step of calling the preset matching algorithm to perform word segmentation on the first address and the second address respectively according to the first preset rule, to obtain the first word-segmentation group corresponding to the first address and the second word-segmentation group corresponding to the second address, comprises:
    performing word segmentation on the range addresses respectively corresponding to the first address and the second address according to a pre-associated address dictionary in a natural language processing model, to obtain a first word-segmentation part corresponding to the first address and a first word-segmentation part corresponding to the second address, respectively;
    performing word segmentation on the mark addresses respectively corresponding to the first address and the second address according to a first grammar model in the natural language processing model, to obtain a second word-segmentation part corresponding to the first address and a second word-segmentation part corresponding to the second address, respectively;
    combining the first word-segmentation part and the second word-segmentation part corresponding to the first address into the first word-segmentation group corresponding to the first address, and combining the first word-segmentation part and the second word-segmentation part corresponding to the second address into the second word-segmentation group corresponding to the second address.
  11. The computer device according to claim 10, wherein the first address and the second address each further comprise a detail address, and after the step of performing word segmentation on the mark addresses respectively corresponding to the first address and the second address according to the grammar model in the natural language processing model, to obtain the second word-segmentation part corresponding to the first address and the second word-segmentation part corresponding to the second address, the method comprises:
    performing word segmentation on the detail addresses respectively corresponding to the first address and the second address according to a second grammar model in the natural language processing model, to obtain a third word-segmentation part corresponding to the first address and a third word-segmentation part corresponding to the second address, respectively;
    combining the first, second and third word-segmentation parts corresponding to the first address into the first word-segmentation group corresponding to the first address, and combining the first, second and third word-segmentation parts corresponding to the second address into the second word-segmentation group corresponding to the second address.
  12. The computer device according to claim 11, wherein the range address comprises four administrative levels, namely province, city/district, county and township/town, the mark address comprises a residential community name or a building name, and the step of obtaining the matching result of all the first segments and all the second segments according to the second preset rule comprises:
    mapping all the first segments and all the second segments, in descending order of administrative level, into two structure trees of identical structure, wherein each structure tree comprises a plurality of nodes, and the nodes are in one-to-one correspondence with the first segments or the second segments;
    obtaining the matching values respectively corresponding to the nodes of the two structure trees;
    obtaining a first weight corresponding to the range address, a second weight corresponding to the mark address and a third weight corresponding to the detail address, respectively;
    calculating matching rates by multiplying each matching value by its corresponding weight, to obtain a first matching rate corresponding to the range address, a second matching rate corresponding to the mark address and a third matching rate corresponding to the detail address, respectively;
    taking the sum of the first matching rate, the second matching rate and the third matching rate as the matching result of all the first segments and all the second segments.
  13. The computer device according to claim 12, wherein the step of obtaining the matching values respectively corresponding to the nodes of the two structure trees comprises:
    performing exact full matching, node by node according to the node correspondence, between the first segments corresponding to the range address in the first address and the second segments corresponding to the range address in the second address, to obtain first matching values;
    performing model keyword matching, node by node according to the node correspondence, between the first segments corresponding to the mark address in the first address and the second segments corresponding to the mark address in the second address, to obtain second matching values;
    performing numeric matching, node by node according to the node correspondence, between the first segments corresponding to the detail address in the first address and the second segments corresponding to the detail address in the second address, to obtain third matching values;
    aggregating the first matching values, the second matching values and the third matching values to obtain the matching values respectively corresponding to the nodes of the two structure trees.
  14. The computer device according to claim 13, wherein before the step of obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address and the third weight corresponding to the detail address, respectively, the method comprises:
    inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training;
    adjusting the training parameters to first parameters such that the similarity values output by the natural language processing model are consistent with the pre-labeled similarity values;
    mapping the weight values in the first parameters, according to the node correspondence, to the first weight, the second weight and the third weight, respectively.
  15. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements an address matching method in which a first address is a to-be-retrieved address input by a user and a second address is stored in an index server, the method comprising:
    calling a preset matching algorithm to perform word segmentation on the first address and the second address respectively according to a first preset rule, to obtain a first word-segmentation group corresponding to the first address and a second word-segmentation group corresponding to the second address, wherein the preset matching algorithm comprises a word segmentation calculation and a matching calculation;
    dividing the first address into a plurality of first segments according to the first word-segmentation group, and dividing the second address into a plurality of second segments according to the second word-segmentation group;
    obtaining a matching result of all the first segments and all the second segments according to a second preset rule;
    determining, according to the matching result, whether the first address and the second address are the same.
  16. The computer-readable storage medium according to claim 15, wherein the first address and the second address each comprise a range address and a mark address, and the step of calling the preset matching algorithm to perform word segmentation on the first address and the second address respectively according to the first preset rule, to obtain the first word-segmentation group corresponding to the first address and the second word-segmentation group corresponding to the second address, comprises:
    performing word segmentation on the range addresses respectively corresponding to the first address and the second address according to a pre-associated address dictionary in a natural language processing model, to obtain a first word-segmentation part corresponding to the first address and a first word-segmentation part corresponding to the second address, respectively;
    performing word segmentation on the mark addresses respectively corresponding to the first address and the second address according to a first grammar model in the natural language processing model, to obtain a second word-segmentation part corresponding to the first address and a second word-segmentation part corresponding to the second address, respectively;
    combining the first word-segmentation part and the second word-segmentation part corresponding to the first address into the first word-segmentation group corresponding to the first address, and combining the first word-segmentation part and the second word-segmentation part corresponding to the second address into the second word-segmentation group corresponding to the second address.
  17. The computer-readable storage medium according to claim 16, wherein the first address and the second address each further comprise a detail address, and after the step of performing word segmentation on the mark addresses respectively corresponding to the first address and the second address according to the grammar model in the natural language processing model, to obtain the second word-segmentation part corresponding to the first address and the second word-segmentation part corresponding to the second address, the method comprises:
    performing word segmentation on the detail addresses respectively corresponding to the first address and the second address according to a second grammar model in the natural language processing model, to obtain a third word-segmentation part corresponding to the first address and a third word-segmentation part corresponding to the second address, respectively;
    combining the first, second and third word-segmentation parts corresponding to the first address into the first word-segmentation group corresponding to the first address, and combining the first, second and third word-segmentation parts corresponding to the second address into the second word-segmentation group corresponding to the second address.
  18. The computer-readable storage medium according to claim 17, wherein the range address comprises four administrative levels, namely province, city/district, county and township/town, the mark address comprises a residential community name or a building name, and the step of obtaining the matching result of all the first segments and all the second segments according to the second preset rule comprises:
    mapping all the first segments and all the second segments, in descending order of administrative level, into two structure trees of identical structure, wherein each structure tree comprises a plurality of nodes, and the nodes are in one-to-one correspondence with the first segments or the second segments;
    obtaining the matching values respectively corresponding to the nodes of the two structure trees;
    obtaining a first weight corresponding to the range address, a second weight corresponding to the mark address and a third weight corresponding to the detail address, respectively;
    calculating matching rates by multiplying each matching value by its corresponding weight, to obtain a first matching rate corresponding to the range address, a second matching rate corresponding to the mark address and a third matching rate corresponding to the detail address, respectively;
    taking the sum of the first matching rate, the second matching rate and the third matching rate as the matching result of all the first segments and all the second segments.
  19. The computer-readable storage medium according to claim 18, wherein the step of obtaining the matching values respectively corresponding to the nodes of the two structure trees comprises:
    performing exact full matching, node by node according to the node correspondence, between the first segments corresponding to the range address in the first address and the second segments corresponding to the range address in the second address, to obtain first matching values;
    performing model keyword matching, node by node according to the node correspondence, between the first segments corresponding to the mark address in the first address and the second segments corresponding to the mark address in the second address, to obtain second matching values;
    performing numeric matching, node by node according to the node correspondence, between the first segments corresponding to the detail address in the first address and the second segments corresponding to the detail address in the second address, to obtain third matching values;
    aggregating the first matching values, the second matching values and the third matching values to obtain the matching values respectively corresponding to the nodes of the two structure trees.
  20. The computer-readable storage medium according to claim 19, wherein before the step of obtaining the first weight corresponding to the range address, the second weight corresponding to the mark address and the third weight corresponding to the detail address, respectively, the method comprises:
    inputting a specified number of training samples pre-labeled with similarity values into the natural language processing model for training;
    adjusting the training parameters to first parameters such that the similarity values output by the natural language processing model are consistent with the pre-labeled similarity values;
    mapping the weight values in the first parameters, according to the node correspondence, to the first weight, the second weight and the third weight, respectively.
PCT/CN2020/098804 2019-07-03 2020-06-29 Address matching method and apparatus, computer device and storage medium WO2021000831A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910601364.8 2019-07-03
CN201910601364.8A CN110442603B (en) 2019-07-03 2019-07-03 Address matching method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021000831A1 true WO2021000831A1 (en) 2021-01-07

Family

ID=68428771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/098804 WO2021000831A1 (en) 2019-07-03 2020-06-29 Address matching method and apparatus, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110442603B (en)
WO (1) WO2021000831A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935293A (en) * 2021-12-16 2022-01-14 湖南四方天箭信息科技有限公司 Address splitting and complementing method and device, computer equipment and storage medium

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442603B (en) * 2019-07-03 2024-01-19 平安科技(深圳)有限公司 Address matching method, device, computer equipment and storage medium
CN111144117B (en) * 2019-12-26 2023-08-29 同济大学 Method for disambiguating Chinese address of knowledge graph
CN111563806A (en) * 2020-07-20 2020-08-21 平安国际智慧城市科技股份有限公司 Method, device, medium and electronic equipment for identifying merchant compliance in network platform
CN114064827A (en) * 2020-08-05 2022-02-18 北京四维图新科技股份有限公司 Position searching method, device and equipment
CN112256821B (en) * 2020-09-23 2024-05-17 北京捷通华声科技股份有限公司 Chinese address completion method, device, equipment and storage medium
CN112163070B (en) * 2020-09-27 2024-02-27 杭州海康威视系统技术有限公司 Place name matching method, place name matching device, electronic equipment and machine-readable storage medium
CN112835897B (en) * 2021-01-29 2024-03-15 上海寻梦信息技术有限公司 Geographic area division management method, data conversion method and related equipment
CN112835899A (en) * 2021-01-29 2021-05-25 上海寻梦信息技术有限公司 Address library indexing method, address matching method and related equipment
CN113343688A (en) * 2021-06-22 2021-09-03 南京星云数字技术有限公司 Address similarity determination method and device and computer equipment
CN113987114B (en) * 2021-09-17 2023-04-07 上海燃气有限公司 Address matching method and device based on semantic analysis and electronic equipment
CN114756654A (en) * 2022-04-25 2022-07-15 广州城市信息研究所有限公司 Dynamic place name and address matching method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1487444A (en) * 2002-09-13 2004-04-07 富士施乐株式会社 Text statement comparing unit
CN101770499A (en) * 2009-01-07 2010-07-07 上海聚力传媒技术有限公司 Information retrieval method in search engine and corresponding search engine
CN102402533A (en) * 2010-09-13 2012-04-04 方正国际软件有限公司 Address matching method and system
CN108763215A (en) * 2018-05-30 2018-11-06 中智诚征信有限公司 A kind of address storage method, device and computer equipment based on address participle
US10216837B1 (en) * 2014-12-29 2019-02-26 Google Llc Selecting pattern matching segments for electronic communication clustering
CN110442603A (en) * 2019-07-03 2019-11-12 平安科技(深圳)有限公司 Address matching method, apparatus, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101257516B (en) * 2008-03-11 2011-07-13 中兴通讯股份有限公司 Method for correcting source address
KR20180057853A (en) * 2016-11-23 2018-05-31 잠쉬딘 허지무하메도브 Method, system and computer program for converting addresses
CN106874384B (en) * 2017-01-10 2020-12-04 航天精一(广东)信息科技有限公司 Heterogeneous address standard conversion and matching method
CN109145169B (en) * 2018-07-26 2021-03-26 浙江省测绘科学技术研究院 Address matching method based on statistical word segmentation

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113935293A (en) * 2021-12-16 2022-01-14 湖南四方天箭信息科技有限公司 Address splitting and complementing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110442603B (en) 2024-01-19
CN110442603A (en) 2019-11-12

Similar Documents

Publication Publication Date Title
WO2021000831A1 (en) Address matching method and apparatus, computer device and storage medium
WO2020007224A1 (en) Knowledge graph construction and smart response method and apparatus, device, and storage medium
WO2021139283A1 (en) Knowledge graph question-answer method and apparatus based on deep learning technology, and device
WO2020063092A1 (en) Knowledge graph processing method and apparatus
US10783171B2 (en) Address search method and device
CN104636466B (en) Entity attribute extraction method and system for open webpage
WO2022142613A1 (en) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
CN111291161A (en) Legal case knowledge graph query method, device, equipment and storage medium
CN109033314B (en) Real-time query method and system for large-scale knowledge graph under condition of limited memory
CN106844380A (en) Database operation method, information processing method and related device
WO2021184627A1 (en) R-tree-based pollutant traceability method and apparatus, and related device therefor
CN105224622A (en) Internet-oriented place name address extraction and standardization method
CN104657439A (en) Generation system and method for structured query sentence used for precise retrieval of natural language
CN103838837B (en) Remote sensing Metadata integration method based on semantic template
CN105893551A (en) Method and device for processing data and knowledge graph
CN103440312A (en) System and terminal for inquiring zip code for mailing address
CN104657440A (en) Structured query statement generating system and method
CN106951526B (en) Entity set extension method and device
CN110347810B (en) Dialogue type search answering method, device, computer equipment and storage medium
CN103646019A (en) Method and device for fusing multiple machine translation systems
CN110134780B (en) Method, device, equipment and computer readable storage medium for generating document abstract
CN112528174A (en) Address finishing and complementing method based on knowledge graph and multiple matching and application
CN110532358A (en) Automatic template generation method for knowledge base question answering
CN104794163A (en) Entity set extension method
CN108595437B (en) Text query error correction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20835215; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 20835215; Country of ref document: EP; Kind code of ref document: A1