CN114003812A

CN114003812A - Address matching method, system, device and storage medium

Info

Publication number: CN114003812A
Application number: CN202111274139.1A
Authority: CN
Inventors: 李洁
Original assignee: OneConnect Financial Technology Co Ltd Shanghai
Current assignee: OneConnect Financial Technology Co Ltd Shanghai
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-02-01

Abstract

The invention provides an address matching method, system, device and storage medium. The method comprises the following steps: acquiring a preprocessed target address; inputting the preprocessed target address into a trained CRF splitting model to obtain an optimal labeling address sequence, wherein the trained CRF splitting model is obtained by training based on a preset feature template and training data; and acquiring an alternative matching address according to the current search index of the optimal labeling address sequence and a preset ElasticSearch search engine. The embodiment of the invention reduces the workload of manually processing address information, quickly locates and matches complete address information, greatly increases the processing speed of address information, and shortens the waiting time of the client. Because the client can be quickly located to a specific cell through the address information, a quick response can be achieved and the client can be served better.

Description

Address matching method, system, device and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to an address matching method, system, device, and storage medium.

Background

At present, address splitting and matching tasks mostly rely on regular expressions to split regularly formatted addresses. Because address data differ across all parts of the country, regular expressions can hardly cover every case. Addresses are highly irregular, and disordered address labeling and arbitrary ways of expressing addresses make address resolution very difficult, so irregular address information is hard to split accurately.

Because accurate splitting cannot be carried out, the split address information cannot be automatically matched to complete and correct address information, and manual intervention is required. When irregular address information arrives in batches, a large amount of manual matching is needed to obtain the correct detailed addresses.

Disclosure of Invention

The invention provides an address matching method, system, equipment and storage medium, and mainly aims to accurately divide irregular or unrefined address information, effectively improve the address splitting precision and accuracy and improve the subsequent address matching precision.

In a first aspect, an embodiment of the present invention provides an address matching method, including:

acquiring a preprocessed target address;

inputting the preprocessed target address into a trained CRF splitting model to obtain an optimal labeling address sequence, wherein the trained CRF splitting model is obtained by training based on a preset characteristic template and training data;

and acquiring an alternative matching address according to the current search index of the optimal labeling address sequence and a preset ElasticSearch search engine.

Preferably, the method further comprises the following steps:

acquiring a reference matching address according to a preset address element of the optimal labeling address sequence and the preset ElasticSearch search engine;

and acquiring the best matching address according to the confidence degree between the reference matching address and the alternative matching address.

Preferably, the training data is obtained by:

S211, acquiring a labeled address library and a preprocessed unlabeled address library from an original corpus, wherein the labeled address library is obtained by labeling according to a preset classification labeling system;

S212, training the initial CRF splitting model according to the labeled address library to obtain a target CRF splitting model;

S213, labeling part of the unlabeled addresses in the preprocessed unlabeled address library according to the target CRF splitting model, to obtain labeled address sequences corresponding to the part of unlabeled addresses;

S214, updating the labeled address library by using the part of unlabeled addresses and the labeled address sequences corresponding to the part of unlabeled addresses, using the updated labeled address library as the labeled address library again, and using the target CRF splitting model as the initial CRF splitting model again;

S215, repeating steps S212 to S214 until the number of remaining unlabeled addresses in the unlabeled address library is less than a preset number threshold, and using the addresses in the labeled address library as training data.

Preferably, the updating the labeled address library by using the part of unlabeled addresses and the labeled address sequences corresponding to the part of unlabeled addresses includes:

deleting, from the unlabeled address library, those of the part of unlabeled addresses whose confidence is greater than a preset confidence threshold, according to the confidence between the part of unlabeled addresses and the corresponding labeled address sequences;

and adding the corresponding labeled address sequences into the labeled address library to obtain an updated labeled address library.

Preferably, the confidence between the part of the unlabeled addresses and the corresponding labeled address sequence is obtained as follows:

wherein C_x represents the confidence between the unlabeled address and the corresponding labeled address sequence, i represents the current position, X = (x₁, x₂, …, x_n) is the unlabeled address, and Y = (y₁, y₂, …, y_n) is the predicted labeled address sequence, the input variable X taking the value x and the output variable Y taking the value y.

Preferably, the trained CRF split model is obtained by training based on a preset feature template and training data, and is obtained through the following steps:

acquiring a characteristic function according to the preset characteristic template;

and extracting features of the training data according to the feature function, training the initial CRF splitting model by combining the weight of each feature, and obtaining the trained CRF splitting model.

Preferably, the obtaining a best matching address according to the confidence between the reference matching address and the candidate matching address includes:

for any alternative matching address, if the confidence between the reference matching address and the alternative matching address is greater than a first preset matching threshold, matching the cell information of the reference matching address against the cell information of the alternative matching address, and matching the route number information of the reference matching address against the route number information of the alternative matching address, and taking the alternative matching address as the best matching address if the cell matching result and the route number matching result are both greater than a second preset matching threshold;

if the confidence between the reference matching address and the alternative matching address is smaller than the first preset matching threshold, combining the route number information and the cell information of the reference matching address, combining the route number information and the cell information of the alternative matching address, and taking the alternative matching address as the best matching address if the degree of matching between the two combinations is greater than the first preset matching threshold.

In a second aspect, an embodiment of the present invention provides an address matching system, including:

the acquisition module is used for acquiring the preprocessed target address;

the sequence module is used for inputting the preprocessed target address into a trained CRF splitting model to obtain an optimal labeling address sequence, and the trained CRF splitting model is obtained by training based on a preset characteristic template and training data;

and the matching module is used for acquiring the alternative matching address according to the current search index of the optimal labeling address sequence and a preset ElasticSearch search engine.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the address matching method when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the steps of the address matching method.

According to the address matching method, the system, the equipment and the storage medium, irregular information in the target address is removed after the target address is preprocessed, and then the information is input into the trained CRF splitting model, and the CRF splitting model can accurately split irregular or non-detailed address information, so that the splitting precision and the accuracy of the target address are improved, and the subsequent address matching precision is improved; and then matching in a preset ElasticSearch search engine according to the optimal labeling address sequence, and quickly matching regular, detailed and accurate address information by fully utilizing the self-contained search function of the preset ElasticSearch search engine.

The embodiment of the invention reduces the workload of all manually processed address information, can quickly position and match complete address information, greatly improves the processing speed of the address information, reduces the waiting time of a client, can quickly position the client to a specific cell through the address information, can realize quick response and better serve the client.

Drawings

Fig. 1 is an application scenario diagram of an address matching method according to an embodiment of the present invention;

Fig. 2 is a flowchart of an address matching method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an address matching system according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 is an application scenario diagram of an address matching method according to an embodiment of the present invention. As shown in Fig. 1, a user inputs a target address in a client, and the client extracts the target address and sends it to a server. After receiving the target address, the server executes the address matching method to match the target address.

It should be noted that the server may be implemented by an independent server or a server cluster composed of a plurality of servers. The client may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The client and the server may be connected through bluetooth, USB (Universal Serial Bus), or other communication connection manners, which is not limited in this embodiment of the present invention.

Fig. 2 is a flowchart of an address matching method according to an embodiment of the present invention. As shown in Fig. 2, the method includes:

S210, acquiring a preprocessed target address;

First, a preprocessed target address is obtained. The target address is the address that needs to be matched, such as a mailing address or an everyday handwritten address. Such addresses generally contain many non-standard expressions, and the redundant and meaningless address components in them need to be identified. Common non-standard conditions include inconsistent number formats, special symbols, address data that is too short, and missing key route number or cell information.

For example, in the address "the small golden village and small thunder road in the thunder field, near the north of 2 kilometers of the intersection with the ai tai road", words such as "and" and "with" are meaningless components and need to be labeled as such. The spatial relationship refers to the topological relationship among address elements and mainly includes adjacency, association and inclusion. The corresponding spatial descriptions include the distance relationship "2 kilometers", the direction relationship "north", the crossing relationship "intersection" and the fuzzy description "near". Identifying these components adds spatial constraints to the geographic labeling and improves positioning accuracy. Geographic named entities such as place names and organization names are the main components of an address and are also difficult to identify. A structured address is easy to split using place-name suffixes such as "county" and "town", whereas in a non-standard address the common omission of the administrative division suffix, for example "thunder town" shortened to "thunder field", needs to be labeled. An organization name present in the address, such as "east garden", is attributed to the cell in embodiments of the present invention.

The preprocessing of the target address comprises the steps of unifying the number format, removing special symbols, and filtering invalid addresses to prevent errors generated by noise points. By preprocessing the target address, some irregular expressions in the address can be roughly filtered out, but the irregular address cannot be completely converted into the regular address, so that subsequent processing is required.
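The three preprocessing operations named above can be sketched as follows. This is a minimal illustration only: the function name, the minimum-length rule and the exact set of characters treated as "special symbols" are assumptions, not details taken from the embodiment.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical cut-off: shorter strings are treated as invalid addresses.
MIN_ADDRESS_LENGTH = 4


def preprocess_address(raw: str) -> Optional[str]:
    """Return a cleaned address, or None when the address is filtered out."""
    # NFKC folds full-width digits and letters into one format ('１２' -> '12').
    text = unicodedata.normalize("NFKC", raw)
    # Keep letters, digits, CJK characters and hyphens; drop all other symbols.
    text = re.sub(r"[^\w\-]", "", text).replace("_", "")
    # Filter addresses that are too short to carry useful information.
    if len(text) < MIN_ADDRESS_LENGTH:
        return None
    return text
```

As the paragraph above notes, such filtering only removes the coarsest irregularities; the cleaned string still has to be split by the CRF model.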

S220, inputting the preprocessed target address into a trained CRF splitting model to obtain an optimal labeling address sequence, wherein the trained CRF splitting model is obtained by training based on a preset characteristic template and training data;

The preprocessed target address is input into the trained CRF splitting model to obtain the optimal labeling sequence, which consists of the administrative divisions of each level and the points of interest in the split target address. The administrative divisions of each level are the administrative places corresponding to the target address, such as province, city, district and town, and a point of interest refers to the final residential building or room.

For example, for a target address in Wuhan City, Hubei Province, "the mountainous area of wuhan city and the acute creation center of level one 1267", the administrative divisions "province: Hubei Province, city: Wuhan City, district/county: mountainous area, road/street office: level one 1267" and the point of interest "house number/unit number: acute creation center 2107" may be labeled.

In the embodiment of the invention, the trained CRF splitting model is obtained by training with a preset feature template and training data. The preset feature template determines the feature functions; the corresponding features of the training data are extracted with these feature functions and used to train the CRF splitting model.

The CRF splitting model is a model based on a Conditional Random Field (CRF for short). A conditional random field can take spatial context features into account and compute global statistics over them and, compared with other sequence labeling models, can obtain a globally optimal labeling result.
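For a linear-chain CRF, the optimal labeling sequence is conventionally found with the Viterbi algorithm. The source does not state which decoder is used, so the sketch below is an assumption; the score tables `emissions` and `transitions` are hypothetical stand-ins for the weighted feature sums of a trained model, and a non-empty input is assumed.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label path.

    emissions:   one dict per position, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    Missing entries are treated as a score of 0.0.
    """
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            # Best predecessor for label `cur` at position i.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                           + emissions[i].get(cur, 0.0))
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Because the score of a path depends on adjacent label pairs, the decoder evaluates the whole sequence jointly instead of picking the best label at each position independently.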

In addition, in the embodiment of the invention, the training data is obtained by an iterative method that mixes self-training with manual labeling. Some addresses are labeled manually, the labeled addresses are used to train the CRF splitting model, and the trained CRF splitting model is then used to label the other unlabeled addresses, so that all labeled address corpora serve as the training data. The CRF splitting model is then retrained with this training data.

And S230, acquiring a candidate matching address according to the current search index of the optimal labeling address sequence and a preset ElasticSearch search engine.

As can be seen from the above, the optimal labeling address sequence includes a plurality of address elements, such as province, city, district, and house number or unit number. These elements are searched in a pre-established ElasticSearch search engine. The search may use keywords such as cell ID, city, administrative district, street, cell name, address name, cell alias and address alias; that is, the current search index may be the cell ID, city, administrative district, street, cell name, address name and so on, as determined by the actual situation. The response time can reach the millisecond level, and both fuzzy search and exact search are supported.

The optimal address labeling sequence can be used as a parameter in a search statement, and when the city field is accurately matched, the corresponding street name and the corresponding community building field are respectively matched in a fuzzy mode. The search statement is input into the ElasticSearch engine, and one or more matching results can be returned, that is, one or more alternative matching addresses can be selected, and the determination is specifically performed according to the actual situation. When a plurality of search results are found out according to different search conditions, the result data need to be merged according to the cell ID, repeated data are removed, and the obtained matching result can be used as input data of a confidence degree scoring algorithm.
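A search statement of this kind, together with the merge by cell ID, might look like the sketch below. The field names `city`, `street`, `community_name` and `cell_id` are hypothetical, and the exact match assumes `city` is indexed as a keyword field; the request body uses the standard Elasticsearch `bool`, `term` and `match` queries.

```python
def build_query(city, street, community):
    """Exact match on city, fuzzy match on street and community name."""
    return {
        "query": {
            "bool": {
                # Exact (non-scoring) match on the city field.
                "filter": [{"term": {"city": city}}],
                # Fuzzy matches; at least one of them has to hit.
                "should": [
                    {"match": {"street": {"query": street, "fuzziness": "AUTO"}}},
                    {"match": {"community_name": {"query": community, "fuzziness": "AUTO"}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }


def merge_by_cell_id(*result_lists):
    """Merge hits from several searches, keeping the first hit per cell ID."""
    merged = {}
    for results in result_lists:
        for hit in results:
            merged.setdefault(hit["cell_id"], hit)
    return list(merged.values())
```

The merged list contains each cell once and can then be passed to the confidence scoring step.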

It should be noted that the preset ElasticSearch search engine is a Lucene-based search server. ElasticSearch is a distributed, highly scalable, highly real-time search and data analysis engine that makes it convenient to search, analyze and explore large amounts of data. Once the massive cell information has been integrated, the cell information data is first stored into ElasticSearch in batches, and inverted indexes are built with cells and addresses as separate index libraries.

According to the address matching method provided by the invention, after the target address is preprocessed, irregular information in the target address is removed, and then the irregular information is input into a trained CRF splitting model, and the CRF splitting model can accurately split irregular or non-detailed address information, so that the splitting precision and the accuracy of the target address are improved, and the subsequent address matching precision is improved; and then matching in a preset ElasticSearch search engine according to the optimal labeling address sequence, and quickly matching regular, detailed and accurate address information by fully utilizing the self-contained search function of the preset ElasticSearch search engine.

On the basis of the above embodiment, it is preferable to further include:

When there are a plurality of candidate matching addresses, the best matching address needs to be selected from among them. Specifically, a preset address element of the optimal labeling address sequence is used as the search parameter for a search in the preset ElasticSearch search engine. The preset address element may be a cell, a house number or a route number, as determined by the actual situation.

Generally, when cell information and route number information are searched in an ElasticSearch, unique address information can be determined, and therefore, a reference matching address obtained by using a cell as a search parameter can be used as a confidence calculation reference in the embodiment of the present invention.

The confidence between the reference matching address and each candidate matching address is calculated with the reference matching address as the reference. The higher the confidence, the more accurate the candidate matching address; the lower the confidence, the less accurate it is. The best matching address is then selected from all candidate matching addresses according to the confidence.

It should be noted that the confidence scoring algorithm is implemented by wrapping and modifying the FuzzyWuzzy string matching tool. Its principle is the edit distance, that is, the minimum number of editing operations required to convert one string into another. The editing operations include replacing, inserting and deleting characters; generally, the smaller the edit distance, the greater the similarity between the two strings.
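The edit-distance principle can be illustrated with a short dynamic-programming sketch. The 0 to 100 normalization in `confidence` is an assumption chosen for illustration; it is a simplified stand-in and not the exact ratio computed by FuzzyWuzzy or by the wrapped tool of the embodiment.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of replacements, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        prev = cur
    return prev[-1]


def confidence(a: str, b: str) -> float:
    """Similarity on a 0-100 scale: smaller edit distance gives a higher score."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

With this scale, two identical strings score 100 and two strings that share no characters score 0, which matches the point-based thresholds used later in the description.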

According to the address matching method provided by the embodiment of the invention, the best matching address is screened out from a plurality of candidate matching addresses by taking the confidence coefficient as an index, so that the accuracy of address matching is further improved.

On the basis of the above embodiment, preferably, the training data is obtained by:

Before the CRF splitting model is trained, the training data needs to be determined. To obtain the training data, an address element classification and labeling system is first designed to define how addresses are labeled and how analysis results are expressed. To accommodate both standard and non-standard addresses, the components and meanings of all addresses in the training data are defined as shown in Table 1, the preset classification and labeling system for address elements. With reference to related address models and existing systems, the system builds on multi-level administrative divisions and adds further categories, including spatial relationship descriptions (south side, north side, nearby and the like) and redundant address elements such as punctuation marks and conjunctions.

To make the processed target address conform to the CRF input format, the tagged corpus is converted into a standard format in which each line contains exactly one character and its tag, separated by a tab character. A 3-tag set is adopted in which the letters B, M and E denote the first, middle and last character of an address element, respectively.
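The conversion to one character per line can be sketched as follows. Combining the position letter with the element tag as `B-PROV` is an assumption about the exact tag layout, and treating a single-character element as `B` is likewise an illustrative choice; the source only specifies the tab separator and the B, M, E letters.

```python
def to_crf_lines(segments):
    """Convert [(text, tag), ...] into 'character<TAB>position-tag' lines."""
    lines = []
    for text, tag in segments:
        for idx, ch in enumerate(text):
            if idx == 0:
                pos = "B"   # first character of the element
            elif idx == len(text) - 1:
                pos = "E"   # last character of the element
            else:
                pos = "M"   # any character in between
            lines.append(f"{ch}\t{pos}-{tag}")
    return lines
```

For instance, a segment tagged `PROV` with three characters yields three lines tagged `B-PROV`, `M-PROV` and `E-PROV` in that order.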

TABLE 1

Tag	Address element type	Examples
PROV	Province	Provinces, municipalities directly under the central government, autonomous regions, etc.
CITY	City	Cities, autonomous prefectures, etc.
DIST	District/county	Counties, county-level cities, etc.
TOWN	Town/township	Towns, townships, etc.
VILL	Village/community	Villages, communities, etc.
ROAD	Road/street office	Roads, streets, neighborhood committees, etc.
DOOR	House number/unit number	Numbers, floors, buildings, blocks, etc.
POI	Point of interest	Buildings, squares, companies, etc.
SCENE	Natural feature	Canals, rivers, lakes, mountains, etc.
CONJ	Conjunction	"and" and similar conjunctions
PUNC	Punctuation mark	Commas, brackets, etc.
NOR	Spatial relationship description	South, north, near, beside, etc.

First, an original corpus is obtained. The original corpus contains various corpus addresses, none of which is labeled at the beginning. To obtain training data, the corpus addresses need to be labeled according to the preset classification labeling system shown in Table 1. The labeling may be performed on the corpus by a relevant machine learning model or manually according to the preset classification labeling system, as determined by the actual situation. In the embodiment of the invention, part of the corpus addresses are labeled manually according to the address element classification labeling system. After this part has been labeled, all labeled addresses form the labeled address library and all unlabeled addresses form the unlabeled address library.

Then, the initial CRF splitting model is trained with the labeled address library, and the model parameters are iterated continuously during training to obtain the target CRF splitting model. Training the CRF splitting model mainly means training the weight parameters of the feature functions. Each feature template corresponds to a plurality of feature functions, and the value of each feature function is 0 or 1. A weight may be positive, zero or negative: a positive weight increases the contribution of the feature function, a zero weight means the feature function makes no contribution, and a negative weight decreases its contribution. Finally, the optimal solution is found by maximizing the likelihood function.
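For reference, the standard linear-chain CRF form that this description corresponds to is given below. The source does not print the model equations, so the symbols \lambda_k (weights), f_k (binary feature functions) and Z(x) (normalization term) are notation introduced here, and the expression should be read as the textbook formulation rather than as the exact model of the embodiment.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, x, i) \Big)

% Training chooses the weights that maximize the log-likelihood of the labeled corpus:
\hat{\lambda} = \arg\max_{\lambda} \sum_{j} \log P\big(y^{(j)} \mid x^{(j)}\big)
```

A feature function with f_k = 1 and a positive \lambda_k raises the score of the corresponding labeling, a negative \lambda_k lowers it, and \lambda_k = 0 leaves it unchanged.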

Then, the target CRF splitting model is used to predict labels for the address corpora in the unlabeled address library, yielding the labeled address sequences corresponding to those address corpora. The labeled address library is updated with the newly labeled addresses, and the target CRF splitting model is trained again with the updated labeled address library. Steps S212 to S214 are repeated until the number of unlabeled addresses in the unlabeled address library is less than a preset number threshold. The addresses in the finally updated labeled address library and their corresponding labeled sequences are taken as the training data, and the target CRF splitting model obtained in the last iteration is retained as the target CRF splitting model.
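The iteration of steps S212 to S214 can be sketched as a generic self-training loop. The callables `train` and `predict`, the `max_rounds` cap and the early exit when no address clears the threshold are assumptions added for illustration; for simplicity the sketch also predicts every remaining address in each round rather than only a part of them.

```python
def self_train(labeled, unlabeled, train, predict,
               conf_threshold, count_threshold, max_rounds=20):
    """Grow the labeled library by repeatedly labeling confident addresses.

    train(labeled) -> model
    predict(model, address) -> (tag_sequence, confidence)
    """
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)                       # S212
    for _ in range(max_rounds):
        if len(unlabeled) < count_threshold:     # S215 stop condition
            break
        remaining = []
        for address in unlabeled:                # S213
            tags, conf = predict(model, address)
            if conf > conf_threshold:            # S214: move to labeled library
                labeled.append((address, tags))
            else:
                remaining.append(address)
        if len(remaining) == len(unlabeled):
            break  # no progress; the rest would need manual labeling
        unlabeled = remaining
        model = train(labeled)                   # retrain on the updated library
    return model, labeled, unlabeled
```

Addresses that never reach the confidence threshold stay in the unlabeled library, which is where the manual part of the mixed method would come in.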

It should be noted that the preset number threshold may be determined according to actual situations, and the embodiment of the present invention is not specifically limited herein.

On the basis of the foregoing embodiment, preferably, the updating the labeled address library by using the part of unlabeled addresses and the labeled address sequences corresponding to the part of unlabeled addresses includes:

For any address in the unlabeled address library, the confidence between the address and the labeled address sequence predicted by the target CRF splitting model is calculated. If the confidence is greater than a preset confidence threshold, the predicted labeled address sequence is considered accurate, and the address is moved from the unlabeled address library to the labeled address library. If the confidence is smaller than the preset confidence threshold, the accuracy of the predicted labeled address sequence is considered low, and the address remains in the unlabeled address library.

It should be noted that the preset confidence threshold may be determined according to actual situations, and the embodiment of the present invention is not specifically limited herein.

Specifically, the confidence between the address corpus and the tagged address sequence predicted by the target CRF splitting model is calculated by the following formula:

wherein C_x represents the confidence between the address corpus and the labeled address sequence predicted by the target CRF splitting model, i represents the current position, X = (x₁, x₂, …, x_n) is the unlabeled address corpus, and Y = (y₁, y₂, …, y_n) is the predicted labeled address sequence, the input variable X taking the value x and the output variable Y taking the value y.

In the embodiment of the invention, the conditional random field is used for analyzing the target address, comprehensive, accurate and large-scale labeled corpora are quickly obtained according to a self-training semi-supervised learning and manual mixing method, a corpus training model is selected to form a feature set and a feature template, and the conditional random field model is fused to analyze the Chinese address, so that the address splitting precision and the accuracy are improved, and the subsequent address matching precision is improved.

On the basis of the above embodiment, preferably, the trained CRF split model is obtained by training based on a preset feature template and training data, and is obtained through the following steps:

Specifically, the feature template configures the positional relationships of the features, and the model selects features within a context window around the current item. Generally, a context window of 2 to 3 is chosen for Chinese named entity recognition. If the window is too large, the number of features grows and running efficiency suffers; if the window is too small, context information of the address elements is lost and analysis precision suffers.

In the embodiment of the invention, in view of the windows commonly used in natural language processing and the analysis of a large amount of data, a context window of 2 is selected, unigram features are constructed, and the features in the forward and backward directions are combined and compared.
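Unigram features over a context window of 2 can be sketched as below. The feature naming `U[offset]` and the boundary markers `<BOS>` and `<EOS>` are hypothetical conventions chosen for this example; the source does not list its template entries.

```python
def window_features(chars, i, window=2):
    """Unigram features for position i: the characters at offsets -window..+window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0:
            value = "<BOS>"      # before the start of the address
        elif j >= len(chars):
            value = "<EOS>"      # past the end of the address
        else:
            value = chars[j]
        feats[f"U[{offset}]"] = value
    return feats
```

Each position therefore contributes five unigram features with a window of 2; widening the window to 3 would raise this to seven, which illustrates the efficiency trade-off mentioned above.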

In the implementation, the feature function is used to extract the corresponding feature.

The features to be extracted are determined according to the feature functions and include features that predict the address component and features that represent context constraints. Each feature has a corresponding weight, and the trained CRF splitting model is obtained by training the target CRF splitting model in combination with the weights corresponding to address component prediction and the weights corresponding to context constraints.

It should be noted that the trained CRF splitting model is obtained once the training of the finally obtained target CRF splitting model is completed.

On the basis of the foregoing embodiment, preferably, the obtaining a best matching address according to the confidence between the reference matching address and the candidate matching address includes:

Specifically, the embodiment of the present invention is described by taking any one candidate matching address as an example. The first preset matching threshold may be determined according to the actual situation; in the embodiment of the present invention, its value is 90. When the confidence between the candidate matching address and the reference matching address is greater than 90 points, the route number information of the reference matching address and the candidate matching address, and the cell information of the reference matching address and the candidate matching address, need to be fuzzy-matched separately. If the matching degree between the candidate matching address and the reference matching address is greater than 80 points, the candidate matching address is taken as the best matching address.

The second preset matching threshold may likewise be determined according to the actual situation; in the embodiment of the present invention, its value is 80. When the confidence is less than 80 points, the route number information and the cell information of the reference matching address need to be merged, as do the route number information and the cell information of the candidate matching address, and the two merged results are matched. The address data with the maximum confidence is then fuzzy-matched, and if the score is greater than 90 points, the candidate matching address is taken as the best matching address.
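The two-threshold decision described above might be sketched as follows. This is a minimal sketch under stated assumptions: the embodiment does not name a fuzzy-matching algorithm, so `difflib.SequenceMatcher` from the Python standard library is swapped in and scaled to 0 to 100 points; the dictionary keys `route` and `cell` and the function names are hypothetical; and a confidence between 80 and 90 points, which the text does not address, is treated here as no match.

```python
from difflib import SequenceMatcher
from typing import Dict

FIRST_THRESHOLD = 90.0   # first preset matching threshold in the embodiment
SECOND_THRESHOLD = 80.0  # second preset matching threshold in the embodiment


def fuzzy_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for the unspecified fuzzy matcher)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def is_best_match(confidence: float, ref: Dict[str, str], cand: Dict[str, str]) -> bool:
    """Decide whether a candidate matching address is the best matching address."""
    if confidence > FIRST_THRESHOLD:
        # High confidence: match route number and cell information separately.
        return (
            fuzzy_score(ref["route"], cand["route"]) > SECOND_THRESHOLD
            and fuzzy_score(ref["cell"], cand["cell"]) > SECOND_THRESHOLD
        )
    if confidence < SECOND_THRESHOLD:
        # Low confidence: merge route number and cell information, then match.
        merged_ref = ref["route"] + ref["cell"]
        merged_cand = cand["route"] + cand["cell"]
        return fuzzy_score(merged_ref, merged_cand) > FIRST_THRESHOLD
    # Between the two thresholds: not specified in the text (assumed no match).
    return False
```

A caller would pass the search confidence together with the route number and cell fields of the reference and candidate addresses, and keep the candidate when the function returns `True`.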

To sum up, the embodiment of the present invention provides an address matching method in which the target address is first preprocessed to remove irregular information and is then input into a trained CRF splitting model. The CRF splitting model can accurately split irregular or non-detailed address information, which improves the splitting precision and accuracy of the target address and, in turn, the precision of subsequent address matching. Matching is then performed in a preset ElasticSearch search engine according to the optimal labeling address sequence, so that regular, detailed and accurate address information is quickly matched by making full use of the built-in search function of the preset ElasticSearch search engine.

In addition, in the embodiment of the invention, a conditional random field is used to analyze the target address. Comprehensive, accurate and large-scale labeled corpora are quickly obtained by combining self-training semi-supervised learning with manual labeling, a feature set and a feature template are formed from the selected training corpus, and the conditional random field model is used to analyze the Chinese address. This improves the address splitting precision and accuracy and, in turn, the precision of subsequent address matching.

Fig. 3 is a schematic structural diagram of an address matching system according to an embodiment of the present invention. As shown in Fig. 3, the system includes an obtaining module 310, a sequence module 320, and a matching module 330, where:

the obtaining module 310 is configured to obtain a preprocessed target address;

the sequence module 320 is configured to input the preprocessed target address into a trained CRF splitting model to obtain an optimal tagging address sequence, where the trained CRF splitting model is obtained by training based on a preset feature template and training data;

the matching module 330 is configured to obtain an alternative matching address according to the current search index of the optimal labeling address sequence and a preset ElasticSearch search engine.
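As an illustrative sketch only, the matching module might turn a labeled address sequence into an ElasticSearch request body along the following lines. The body uses the `bool`, `must`, and `match` clauses of the ElasticSearch Query DSL; the function name `build_query`, the field names, and the choice of `must` rather than another clause are assumptions, and how the search index is selected and how the request is sent are not shown.

```python
from typing import Any, Dict, List, Tuple


def build_query(labeled: List[Tuple[str, str]], size: int = 10) -> Dict[str, Any]:
    """Build a search request body from (label, text) pairs.

    Each non-empty address component becomes one `match` clause on a field
    named after its label (hypothetical field names).
    """
    must = [{"match": {label: text}} for label, text in labeled if text]
    return {"size": size, "query": {"bool": {"must": must}}}


# Hypothetical labeled sequence produced by the splitting model.
body = build_query([("city", "Shanghai"), ("road", "Nanjing Road"), ("cell", "")])
```

The hits returned for such a request would serve as the alternative matching addresses.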

On the basis of the above embodiment, it is preferable to further include: a reference module and an optimization module, wherein:

the reference module is used for acquiring a reference matching address according to a preset address element of the optimal labeling address sequence and the preset ElasticSearch search engine;

the optimization module is used for obtaining the best matching address according to the confidence degree between the reference matching address and the alternative matching address.

On the basis of the foregoing embodiment, preferably, the sequence module includes a labeling unit, a training unit, a prediction unit, an updating unit, and an iteration unit, and the training data is obtained through the labeling unit, the training unit, the prediction unit, the updating unit, and the iteration unit, where:

the labeling unit is used for acquiring a labeled address library and a preprocessed unlabeled address library in an original corpus, and the labeled address library is obtained by labeling according to a preset classification labeling system;

the training unit is used for training an initial CRF splitting model according to the labeled address library to obtain a target CRF splitting model;

the prediction unit is used for labeling part of the unlabeled addresses in the preprocessed unlabeled address library according to the target CRF splitting model to obtain a labeled address sequence corresponding to the part of unlabeled addresses;

the updating unit is used for updating the labeled address library by using the part of unlabeled addresses and the labeled address sequence corresponding to the part of unlabeled addresses, taking the updated labeled address library as the labeled address library again, and taking the target CRF splitting model as the initial CRF splitting model again;

the iteration unit is used for repeating the above steps until the number of remaining unlabeled addresses in the unlabeled address library is smaller than a preset number threshold, and taking the addresses in the final labeled address library as training data.
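The cooperation of these units can be sketched, under stated assumptions, as a simple self-training loop. The names `self_train`, `train`, `batch_size`, and `min_remaining` are hypothetical; `train` stands in for CRF training and returns a callable that labels one address. For brevity the sketch retrains on the enlarged library in each round rather than continuing from the previous model, and it moves every predicted address into the labeled library without any confidence check.

```python
from typing import Callable, List, Tuple

Labeled = Tuple[str, List[str]]          # (address, label per character)
Model = Callable[[str], List[str]]       # labels one address


def self_train(
    labeled: List[Labeled],
    unlabeled: List[str],
    train: Callable[[List[Labeled]], Model],
    batch_size: int,
    min_remaining: int,
) -> List[Labeled]:
    """Grow the labeled library until few unlabeled addresses remain."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    # Stop once the number of remaining unlabeled addresses is below the threshold.
    while unlabeled and len(unlabeled) >= min_remaining:
        model = train(labeled)                       # training unit
        batch = unlabeled[:batch_size]               # part of the unlabeled addresses
        unlabeled = unlabeled[batch_size:]
        predicted = [(addr, model(addr)) for addr in batch]  # prediction unit
        labeled.extend(predicted)                    # updating unit
    return labeled                                   # final training data


def dummy_train(data: List[Labeled]) -> Model:
    """Toy stand-in for CRF training: labels every character as 'O'."""
    return lambda addr: ["O"] * len(addr)


result = self_train([("ab", ["O", "O"])], ["c", "de", "f", "g", "h"], dummy_train, 2, 2)
```

With the toy inputs above, two rounds run and one address remains unlabeled when the loop stops.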

On the basis of the foregoing embodiment, preferably, the updating unit includes a confidence subunit and an updating subunit, where:

the confidence subunit is used for deleting, from the unlabeled address library, those of the part of unlabeled addresses whose confidence is greater than a preset confidence threshold, according to the confidence between each of the part of unlabeled addresses and its corresponding labeled address sequence;

and the updating subunit is used for adding the corresponding labeled address sequences into the labeled address library to obtain the updated labeled address library.
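A minimal sketch of this confidence-based update is given below. The function name `update_libraries` and the tuple layout of `predictions` are hypothetical, and the confidence values are taken as given inputs rather than computed.

```python
from typing import List, Tuple

Labeled = Tuple[str, List[str]]
Prediction = Tuple[str, List[str], float]  # (address, labeled sequence, confidence)


def update_libraries(
    labeled: List[Labeled],
    unlabeled: List[str],
    predictions: List[Prediction],
    threshold: float,
) -> Tuple[List[Labeled], List[str]]:
    """Move confidently labeled addresses from the unlabeled to the labeled library."""
    accepted = [(addr, seq) for addr, seq, conf in predictions if conf > threshold]
    accepted_addrs = {addr for addr, _ in accepted}
    # Confidence subunit: delete accepted addresses from the unlabeled library.
    new_unlabeled = [addr for addr in unlabeled if addr not in accepted_addrs]
    # Updating subunit: add their labeled sequences to the labeled library.
    new_labeled = list(labeled) + accepted
    return new_labeled, new_unlabeled


new_labeled, new_unlabeled = update_libraries(
    [], ["a", "b", "c"], [("a", ["X"], 0.95), ("b", ["Y"], 0.40)], 0.9
)
```

Addresses whose predictions fall at or below the threshold stay in the unlabeled library and can be labeled again in a later round.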

On the basis of the foregoing embodiment, in the confidence subunit, preferably, the confidence between the part of unlabeled addresses and the corresponding labeled address sequence is obtained as follows:

wherein C_x represents the confidence between the unlabeled address and the corresponding labeled address sequence, i represents the current position, X = (x₁, x₂, …, x_n) is the unlabeled address, and Y = (y₁, y₂, …, y_n) is the predicted labeled address sequence, with X taken as the input variable and Y as the output variable.

On the basis of the foregoing embodiment, preferably, the sequence module further includes a feature unit and a splitting unit, where:

the characteristic unit is used for acquiring a characteristic function according to the preset characteristic template;

and the splitting unit is used for extracting features of the training data according to the feature function, training a finally obtained target CRF splitting model by combining the weight of each feature, and obtaining the trained CRF splitting model.

On the basis of the foregoing embodiment, preferably, the optimization module includes a first optimization subunit and a second optimization subunit, and for any one candidate matching address:

the first optimization subunit is configured to, if the confidence between the reference matching address and the candidate matching address is greater than a first preset matching threshold, separately match the route number information of the reference matching address with that of the candidate matching address and the cell information of the reference matching address with that of the candidate matching address, and, if both matching results are greater than a second preset matching threshold, take the candidate matching address as the best matching address;

the second optimization subunit is configured to, if the confidence between the reference matching address and the candidate matching address is smaller than the first preset matching threshold, merge the route number information and the cell information of the reference matching address, merge the route number information and the cell information of the candidate matching address, and, if the matching degree between the two merged results is greater than the first preset matching threshold, take the candidate matching address as the best matching address.

The various modules in the address matching system described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

The present embodiment is a system embodiment corresponding to the method embodiment. Its specific implementation process is the same as that of the method embodiment; please refer to the method embodiment for details, which are not repeated here.

Fig. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present invention, where the computer device may be a server, and an internal structural diagram of the computer device may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a computer storage medium and an internal memory. The computer storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the computer storage media. The database of the computer device is used for storing data generated or acquired during the execution of the address matching method, such as a preprocessed target address, a trained CRF splitting model, training data, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an address matching method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the address matching method in the above embodiments are implemented. Alternatively, the processor implements the functions of the modules/units in this embodiment of the address matching system when executing the computer program.

In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the steps of the address matching method in the above embodiments. Alternatively, the computer program realizes the functions of the modules/units in the embodiment of the address matching system described above when executed by the processor.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. An address matching method, comprising:

acquiring a preprocessed target address;

2. The address matching method according to claim 1, further comprising:

3. The address matching method according to claim 1 or 2, wherein the training data is obtained by:

4. The address matching method of claim 3, wherein the updating the labeled address library by using the partial unlabeled address and the labeled address sequence corresponding to the partial unlabeled address comprises:

5. The address matching method of claim 4, wherein the confidence between the partially unlabeled address and the corresponding labeled address sequence is obtained by:

6. The address matching method according to claim 3, wherein the trained CRF split model is obtained by training based on a preset feature template and training data, and is obtained by the following steps:

7. The address matching method according to claim 2, wherein the obtaining a best matching address according to the confidence between the reference matching address and the candidate matching address comprises:

8. An address matching system, comprising:

the acquisition module is used for acquiring the preprocessed target address;

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the address matching method according to any one of claims 1 to 7 when executing the computer program.

10. A computer storage medium storing a computer program, the computer program implementing the steps of the address matching method according to any one of claims 1 to 7 when executed by a processor.