CN115658837A - Address data processing method and device, electronic equipment and storage medium - Google Patents

Address data processing method and device, electronic equipment and storage medium

Info

Publication number
CN115658837A
Authority
CN
China
Prior art keywords
address
address data
data
structured
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211412583.XA
Other languages
Chinese (zh)
Inventor
路兴
张天宇
王轼皓
胡泽婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing E Hualu Information Technology Co Ltd
Original Assignee
Beijing E Hualu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing E Hualu Information Technology Co Ltd
Priority to CN202211412583.XA
Publication of CN115658837A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an address data processing method and device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring initial structured data of a current address; performing information completion on the initial structured data through a standard structured address library to obtain enhanced address data, wherein the standard structured address library contains standard address data; carrying out error correction processing on the enhanced address data to obtain error-corrected address data; judging whether the address elements in the address data after error correction are complete; and under the condition that the address elements in the error-corrected address data are complete, performing similarity matching on the error-corrected address data and the standard address data in the standard structured address library to obtain target address data. Through the method and the device, the problems that the utilization rate of address data is low, data processing is inaccurate, and further the address standardization and unification requirements cannot be met in the related technology are solved.

Description

Address data processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an address data processing method and apparatus, an electronic device, and a storage medium.
Background
The field of address resolution currently lacks a unified and efficient address matching standard. Mainstream address resolution techniques tend to construct complex multivariate relationships among the five-level administrative divisions, road components, and local areas, build a new system of address element division capable of uniquely determining a target location, and convert differently described addresses into standard addresses that both computers and people can readily recognize.
However, in the prior art, any address with missing address elements is treated as an invalid address, so the valid information it contains cannot be mined and the data utilization rate is low. In addition, the prior-art address error correction mechanism relies only on the context information of the address, so it fails when the context itself is missing information. Moreover, an address or address element usually has a set of multiple descriptions that are similar to a certain degree. Current mainstream methods either abandon similarity calculation and use string overlap matching alone, or calculate similarity with low accuracy.
Therefore, the prior art has the problems of low utilization rate of address data, inaccurate data processing and incapability of meeting the requirements of address standardization and unification.
Disclosure of Invention
The application provides an address data processing method and device, electronic equipment and a storage medium, and aims to at least solve the problems that the utilization rate of address data is low, data processing is inaccurate, and further address standardization and unification requirements cannot be met in the related technology.
According to an aspect of an embodiment of the present application, there is provided an address data processing method, including:
acquiring initial structured data of a current address;
completing information of the initial structured data through a standard structured address library to obtain enhanced address data, wherein the standard structured address library comprises standard address data;
carrying out error correction processing on the enhanced address data to obtain error-corrected address data;
judging whether the address elements in the address data after error correction are complete;
and under the condition that the address elements in the error-corrected address data are complete, performing similarity matching on the error-corrected address data and the standard address data in the standard structured address library to obtain target address data.
According to another aspect of the embodiments of the present application, there is also provided an address data processing apparatus, including:
the first acquisition module is used for acquiring the initial structured data of the current address;
a completion module, configured to perform information completion on the initial structured data through a standard structured address library to obtain enhanced address data, where the standard structured address library includes standard address data;
the error correction module is used for carrying out error correction processing on the enhanced address data to obtain error-corrected address data;
the judging module is used for judging whether the address elements in the corrected address data are complete or not;
and the matching module is used for carrying out similarity matching on the address data after error correction and the standard address data in the standard structured address library under the condition that the address elements in the address data after error correction are complete to obtain target address data.
Optionally, the apparatus further comprises:
a second obtaining module, configured to obtain standardized address elements before information completion is performed on the initial structured data through the standard structured address library, where the standardized address elements include region division addresses and preset entities;
a third obtaining module, configured to obtain the hierarchical relationship among the region division addresses;
a fourth obtaining module, configured to obtain the association relationship between the region division addresses and the preset entities;
and an establishing module, configured to establish the standard structured address library according to the standardized address elements, the hierarchical relationship, and the association relationship.
Optionally, the matching module comprises:
a matching unit, configured to perform overlap ratio matching between the error-corrected address data and the standard address data, and to output similar address data whose overlap ratio with the error-corrected address data is higher than a preset threshold;
a first operation unit configured to take the similar address data as the target address data if the similar address data is unique;
the calculation unit is used for respectively calculating the semantic similarity between the similar address data and the address data after error correction if the similar address data is not unique;
a second operation unit, configured to use the similar address data with the highest semantic similarity value as the target address data.
Optionally, the first obtaining module includes:
an obtaining unit configured to obtain the current address;
the preprocessing unit is used for preprocessing the data of the current address to obtain preprocessed address data;
the labeling unit is used for carrying out sequence labeling on the address text in the preprocessed address data through a first model;
and the splitting unit is used for splitting the address elements in the preprocessed address data according to the sequence labels to obtain the initial structured data.
Optionally, the completion module comprises:
a first judging unit, configured to judge missing address elements in the structured data;
and the first completion unit is used for completing the address element information of the missing address elements through the standard structured address library to obtain the enhanced address data.
Optionally, the error correction module comprises:
a mask unit, configured to mask the address elements in the enhanced address data to obtain a preset number of mask addresses;
the first obtaining unit is used for obtaining the predicted address data corresponding to each mask address according to the mask addresses;
a determining unit, configured to determine reference address data according to the predicted address data and a preset mechanism;
a second obtaining unit, configured to obtain reference structured data according to the reference address data;
and the second completion unit is used for performing information completion on the reference structured data through the standard structured address library to obtain the address data after error correction.
Optionally, the judging module includes:
an input unit, configured to input the error-corrected address data into a second model to obtain a supplemental address element when an address element in the error-corrected address data is incomplete, where the second model is used to predict the supplemental address element;
a combining unit, configured to combine the corrected address data and the supplemental address element to obtain combined address data;
a second judging unit, configured to judge whether third intermediate address data can be obtained according to the combined address data within a preset number of times, where the third intermediate address data is obtained by performing error correction processing on second intermediate address data, where the second intermediate address data is obtained by performing information completion on first intermediate address data by using the standard structured address library, and the first intermediate address data is obtained by splitting the combined address data;
a determining unit, configured to determine that the error-corrected address data is invalid data if the third intermediate address data cannot be obtained.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus; wherein the memory is used for storing the computer program; a processor for performing the method steps in any of the above embodiments by running the computer program stored on the memory.
According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the method steps of any of the above embodiments when the computer program is executed.
In the embodiments of the application, initial structured data of a current address is acquired; information completion is performed on the initial structured data through a standard structured address library to obtain enhanced address data, where the standard structured address library contains standard address data; error correction processing is performed on the enhanced address data to obtain error-corrected address data; whether the address elements in the error-corrected address data are complete is judged; and, when the address elements in the error-corrected address data are complete, similarity matching is performed between the error-corrected address data and the standard address data in the standard structured address library to obtain target address data. On the one hand, because the method predicts the address, the expressive capability of the model is greatly improved, and addresses that are entirely wrong, contain homophone errors, or are abbreviated can be corrected or restored; incomplete addresses can also be completed to a certain extent, which improves the data utilization rate. On the other hand, dynamic updating and self-repair of the address library are supported, and similarity matching mines the valid information in an address as fully as possible. This solves the problems in the related art that the utilization rate of address data is low, data processing is inaccurate, and the requirements for address standardization and unification cannot be met.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
FIG. 1 is a schematic flow chart diagram illustrating an alternative address data processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating an alternative address data processing method according to an embodiment of the present application;
FIG. 3 is a block diagram of an alternative address data processing apparatus according to an embodiment of the present application;
FIG. 4 is a block diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of an embodiment of the present application, there is provided an address data processing method, as shown in fig. 1, a flow of the method may include the following steps:
Step S101, acquiring initial structured data of the current address.
Optionally, special characters irrelevant to the address information are removed from the current address data, the address text is labeled, and the address elements are split to obtain the initial structured data of the current address.
Step S102, performing information completion on the initial structured data through a standard structured address library to obtain enhanced address data, wherein the standard structured address library contains standard address data.
Optionally, for addresses with incomplete elements, element information is completed to a certain extent by using the established standard structured address library to obtain the enhanced address data.
Step S103, performing error correction processing on the enhanced address data to obtain error-corrected address data.
Optionally, because some elements in the address may be expressed incorrectly, each element is masked and predicted in turn to obtain a predicted complete address, and a voting mechanism determines the final result, yielding the error-corrected address data.
Step S104, judging whether the address elements in the error-corrected address data are complete.
Optionally, if an address still lacks the key 'province' information after the preceding parsing steps, the address is input into an ERNIE model to mine its deep semantic representation and predict the province to which it may belong.
Step S105, when the address elements in the error-corrected address data are complete, performing similarity matching between the error-corrected address data and the standard address data in the standard structured address library to obtain target address data.
Optionally, the structured address (i.e., the error-corrected address data) is matched against the standard address data in the standard address library by exact matching or fuzzy matching to find a standard structured unified address (i.e., the target address data), so as to meet the need to match addresses from different regions and sources with different descriptions, thereby implementing address similarity determination.
In the embodiments of the application, initial structured data of a current address is acquired; information completion is performed on the initial structured data through a standard structured address library to obtain enhanced address data, where the standard structured address library contains standard address data; error correction processing is performed on the enhanced address data to obtain error-corrected address data; whether the address elements in the error-corrected address data are complete is judged; and, when the address elements in the error-corrected address data are complete, similarity matching is performed between the error-corrected address data and the standard address data in the standard structured address library to obtain target address data. On the one hand, because the method predicts the address, the expressive capability of the model is greatly improved, and addresses that are entirely wrong, contain homophone errors, or are abbreviated can be corrected or restored; incomplete addresses can also be completed to a certain extent, which improves the data utilization rate. On the other hand, dynamic updating and self-repair of the address library are supported, and similarity matching mines the valid information in an address as fully as possible. This solves the problems in the related art that the utilization rate of address data is low, data processing is inaccurate, and the requirements for address standardization and unification cannot be met.
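The flow of steps S101 to S105 can be sketched as a small driver function. This is only an illustrative sketch, not the applicant's implementation: the function name `process_address`, the dictionary representation of structured data, the `required` parameter, and the three callables standing in for completion, error correction, and matching are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Optional

# Structured address data as a mapping from element name to value (assumed representation).
Address = Dict[str, Optional[str]]


def process_address(
    initial: Address,
    complete: Callable[[Address], Address],
    correct: Callable[[Address], Address],
    match: Callable[[Address], Optional[Address]],
    required: List[str],
) -> Optional[Address]:
    """Run S102 to S105 on the initial structured data produced by S101."""
    enhanced = complete(initial)       # S102: information completion
    corrected = correct(enhanced)      # S103: error correction
    # S104: the elements listed in `required` (hypothetical) must all be present.
    is_complete = all(corrected.get(name) for name in required)
    if not is_complete:
        return None                    # the incomplete branch is handled elsewhere
    return match(corrected)            # S105: similarity matching
```

Passing the three stages in as callables keeps the sketch independent of any particular model or address library; returning `None` for an incomplete address is a placeholder for the separate handling of incomplete addresses.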
As an alternative embodiment, before completing the information of the initial structured data by the standard structured address library, the method further comprises:
obtaining standardized address elements, wherein the standardized address elements include region division addresses and preset entities;
acquiring the hierarchical relationship among the region division addresses;
acquiring the association relationship between the region division addresses and the preset entities;
and establishing the standard structured address library according to the standardized address elements, the hierarchical relationship, and the association relationship.
Optionally, a unified underlying structured database is constructed, and addresses from different systems are mapped into the same standard address library, so that standardization and similarity matching (lookup of identical addresses) of address data from multiple systems can be processed in parallel.
The standardized address elements are acquired, namely the five-level addresses of province, city, district, town, and village, together with special entities (i.e., the preset entities) such as roads, residential quarters, and buildings.
A standard structured address library containing the standardized address elements is then established. The address library maintains the hierarchical relationship among the five-level division address elements and the association relationship between the five-level division address elements and other special entities such as roads and residential quarters, and thus serves as a bridge for matching and comparing addresses against one another. Based on the address library, a universal standard form of an address can finally be given, which ensures the uniqueness of the address.
In the embodiments of the application, addresses from multiple systems are fused, the lack of a unified and efficient address matching standard in the current address resolution field is overcome, and the problem that the prior art cannot meet the requirement for address standardization is solved.
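One way to picture such a library is a parent map for the division hierarchy plus an entity-to-division map for the association relationship. The sketch below is hypothetical: the class name `StandardAddressLibrary`, its methods, and the simplifying assumption that division names are unique are not taken from the application.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


class StandardAddressLibrary:
    """Toy library: division hierarchy plus entity associations (assumed design)."""

    def __init__(self) -> None:
        self.divisions: Set[str] = set()
        self.parent: Dict[str, str] = {}                         # division -> parent division
        self.entities: Dict[str, Set[str]] = defaultdict(set)    # entity -> divisions

    def add_division(self, name: str, parent: Optional[str] = None) -> None:
        # Records the hierarchical relationship among region division addresses.
        self.divisions.add(name)
        if parent is not None:
            self.parent[name] = parent

    def add_entity(self, entity: str, division: str) -> None:
        # Records the association between a preset entity (road, quarter, ...) and a division.
        self.entities[entity].add(division)

    def chain(self, division: str) -> List[str]:
        """Return the division and all of its ancestors, highest level first."""
        path = [division]
        while path[-1] in self.parent:
            path.append(self.parent[path[-1]])
        return path[::-1]
```

For instance, after registering "HH province", "ZZ City" under it, and "TT district" under "ZZ City", `chain("TT district")` returns the three names from province down to district. A real library would need identifiers rather than bare names, since division names can repeat across regions.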
As an alternative embodiment, performing similarity matching on the address data after error correction and standard address data in a standard structured address library to obtain target address data, including:
performing overlap ratio matching between the error-corrected address data and the standard address data, and outputting similar address data whose overlap ratio with the error-corrected address data is higher than a preset threshold;
if the similar address data is unique, the similar address data is taken as target address data;
if the similar address data is not unique, semantic similarity between the similar address data and the address data after error correction is respectively calculated;
and taking the similar address data with the highest semantic similarity value as target address data.
Optionally, the purpose of matching the structured address (i.e., the error-corrected address data) against the standard address library is to find a standard structured unified address, so as to meet the practical need to match addresses from different regions and sources with different descriptions, and to implement address similarity determination. Address elements are matched by combining string overlap ratio matching with semantic similarity matching based on BERT, a model pre-trained on a large-scale unsupervised text corpus.
First, string overlap ratio matching is performed between the structured address (i.e., the error-corrected address data) and the standard address library. If a unique similar value (i.e., similar address data) is obtained, exact matching is achieved and the structured unified data (i.e., the target address data) is obtained. If the similar value is not unique, semantic similarity matching is performed using the pre-trained BERT model: the semantic similarity between each similar address and the error-corrected address data is calculated, and the similar address data with the highest semantic similarity value is taken as the structured unified data (i.e., the target address data), completing fuzzy matching. Fuzzy matching addresses the case where some address elements still cannot be correctly identified after the preceding steps: similarity is calculated through BERT, and the element with the highest semantic similarity is taken as the matching target.
In the embodiments of the application, address elements are matched by combining string overlap ratio matching with semantic similarity matching based on the pre-trained BERT model, which improves the accuracy of address matching, avoids omitting the semantic information of irregular addresses, and mines the valid information in addresses as fully as possible. This solves the problems of low utilization rate of address data and inaccurate data processing in the prior art.
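The two-stage matching can be sketched as follows. This is a simplified illustration under stated assumptions: the overlap ratio is approximated with `difflib.SequenceMatcher` from the Python standard library (the application does not specify the overlap measure), the BERT-based scorer is represented by a caller-supplied `semantic_similarity` function rather than a real model, and returning `None` when no candidate exceeds the threshold is a choice made here, not something the text prescribes.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def overlap(a: str, b: str) -> float:
    """Character-level overlap ratio in [0, 1] (stand-in for the overlap measure)."""
    return SequenceMatcher(None, a, b).ratio()


def match_address(
    query: str,
    standards: List[str],
    threshold: float,
    semantic_similarity: Callable[[str, str], float],
) -> Optional[str]:
    """Overlap matching first; semantic similarity only when candidates are not unique."""
    candidates = [s for s in standards if overlap(query, s) > threshold]
    if not candidates:
        return None                      # assumed behavior: no similar address found
    if len(candidates) == 1:
        return candidates[0]             # unique candidate: exact matching
    # Several candidates: pick the one the semantic scorer rates highest (fuzzy matching).
    return max(candidates, key=lambda s: semantic_similarity(query, s))
```

In practice `semantic_similarity` would wrap a sentence-embedding comparison from a BERT-style model; any function that maps two strings to a score fits the sketch.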
As an alternative embodiment, obtaining the initial structured data of the current address includes:
acquiring a current address;
preprocessing data of a current address to obtain preprocessed address data;
carrying out sequence marking on address texts in the address data after preprocessing through a first model;
and splitting the address elements in the preprocessed address data according to the sequence labels to obtain initial structured data.
Optionally, the current address data is cleaned to remove special characters that are irrelevant to the address information. Sequence labeling is then performed on the address text using a BiLSTM + CRF named entity recognition model (i.e., the first model), so that address elements including province, city, district, town, village, road, and residential quarter are split out and the initial structured data is obtained. For example, the address "HH province ZZ City TT district CC road street XX community LL road No. 666 ZZ district building 7 room 602" is parsed after sequence labeling as {'province': 'HH province', 'city': 'ZZ City', 'county': 'TT district', 'town': 'CC road street', 'village': 'XX community neighborhood committee', 'road': 'LL road No. 666', 'quarters': 'ZZ district', 'building': '7'}; the splitting of each address element is thus completed through named entity recognition.
In the embodiment of the application, address data is preprocessed and split, and a basis is provided for the following steps of error correction, enhancement, completion, matching and the like.
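The cleaning and splitting steps can be illustrated without the model itself. In the sketch below the BiLSTM + CRF model is not implemented; its output is assumed to be a list of BIO-style labels (such as `B-city`, `I-city`, `O`), which is a common sequence labeling convention but is not stated in the application. The function names, the cleaning rule, and the `sep` parameter are likewise assumptions.

```python
import re
from typing import Dict, List


def clean(address: str) -> str:
    """Remove characters that are neither word characters nor whitespace (assumed rule)."""
    return re.sub(r"[^\w\s]", "", address).strip()


def split_by_labels(tokens: List[str], labels: List[str], sep: str = " ") -> Dict[str, str]:
    """Group tokens into address elements according to BIO labels from a tagger."""
    grouped: Dict[str, List[str]] = {}
    for token, label in zip(tokens, labels):
        if label == "O":                 # token is not part of any address element
            continue
        _, element = label.split("-", 1)  # "B-city" -> "city"
        grouped.setdefault(element, []).append(token)
    # For character-level Chinese text, sep="" would be the natural choice.
    return {element: sep.join(parts) for element, parts in grouped.items()}
```

Keeping the tagger outside the function means any sequence labeling model that emits one label per token can supply the `labels` argument.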
As an alternative embodiment, performing information completion on the initial structured data through the standard structured address library to obtain enhanced address data, including:
identifying the missing address elements in the initial structured data;
and completing the address element information of the missing address elements through a standard structured address library to obtain the enhanced address data.
Optionally, for address data with incomplete address elements, one or more of the province, city, district, town, and village elements are missing after structuring, and element information is completed to a certain extent through the established standard structured address library to obtain the enhanced address data. For example, after data enhancement the address 'BJBJBJ' yields the structured address {'province': 'BJ city', 'city': 'BJ city', 'county': 'SJS area', 'town': 'BJ street', 'village': None}: BJ street is identified within the five-level divisions, and the region information above it is completed.
In the embodiments of the application, address element division is expanded from five levels to seven levels by adding coding levels for roads and entities. This greatly simplifies the complex multivariate relationships among the five-level divisions, road components, and local areas that the prior art tends to construct. Buildings are managed at a finer granularity, and the hit rate of address model matching is improved. This solves the problem of inaccurate data processing in the prior art.
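A minimal sketch of the completion step, assuming the library can be read as a list of five-level records: the levels above the deepest known element are filled in only when the library agrees on a single chain of ancestors. The function name `complete`, the record layout, and the rule of leaving ambiguous addresses unchanged are assumptions made for illustration; the village names in the usage note are hypothetical.

```python
from typing import Dict, List, Optional

LEVELS = ["province", "city", "county", "town", "village"]
Address = Dict[str, Optional[str]]


def complete(address: Address, rows: List[Address]) -> Address:
    """Fill missing levels above the deepest known element from library rows."""
    known = {k: address[k] for k in LEVELS if address.get(k)}
    if not known:
        return dict(address)
    deepest = max(LEVELS.index(k) for k in known)
    upper = LEVELS[: deepest + 1]
    # Collect the distinct ancestor chains of all rows consistent with the known elements.
    chains = {
        tuple(row.get(k) for k in upper)
        for row in rows
        if all(row.get(k) == v for k, v in known.items())
    }
    if len(chains) != 1:
        return dict(address)             # unknown or ambiguous: leave unchanged
    filled = dict(address)
    filled.update(zip(upper, chains.pop()))
    return filled
```

With library rows for "A village" and "B village" that both sit under BJ city, SJS area, BJ street, the input `{"town": "BJ street"}` is completed with province, city, and county, while the village level is left untouched because the input gives no evidence for it.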
As an alternative embodiment, performing error correction processing on the enhanced address data to obtain error-corrected address data includes:
respectively masking the address elements in the enhanced address data to obtain a preset number of mask addresses;
respectively obtaining the predicted address data corresponding to each mask address according to the mask addresses;
determining reference address data according to the predicted address data and a preset mechanism;
obtaining reference structured data according to the reference address data;
and performing information completion on the reference structured data through a standard structured address library to obtain the address data after error correction.
Optionally, because some elements in the address may be expressed incorrectly, each element in the enhanced address data is masked in turn, a complete address is predicted from each masked address, and the final result is determined from these predictions by a voting mechanism (i.e., the preset mechanism) in which the minority yields to the majority. This completes the correction of erroneous address elements and yields the error-corrected address data.
For example, in the address "BB city HH area CC village area JJ garden first community", "HH area" is an erroneous address element; the correct element is CC area. Following the earlier element division, the model masks the elements in turn to obtain "[mask] HH area CC village area JJ garden first community", "BB city [mask] CC village area JJ garden first community", "BB city HH area [mask] JJ garden first community", "BB city HH area CC village area [mask] first community", and "BB city HH area CC village area JJ garden [mask]". It then predicts the address corresponding to each masked address and finally determines the result through the voting mechanism.
In the embodiments of the application, a mask mechanism is used in the address parsing model to predict the address, which greatly improves the expressive capability of the model and allows addresses that are entirely wrong, contain homophone errors, or are abbreviated to be corrected or restored, thereby solving the problems of low utilization rate of address data and inaccurate data processing in the prior art.
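The mask-and-vote procedure can be sketched with the predictor left abstract. Here `predict` is a hypothetical callable that takes a masked element list and returns a full predicted address; in the application this role is played by the address parsing model, which is not reproduced. The function names and the list-of-elements representation are assumptions.

```python
from collections import Counter
from typing import Callable, List

MASK = "[mask]"


def masked_variants(elements: List[str]) -> List[List[str]]:
    """Mask each address element in turn, giving one masked address per element."""
    return [elements[:i] + [MASK] + elements[i + 1:] for i in range(len(elements))]


def correct_by_voting(
    elements: List[str],
    predict: Callable[[List[str]], List[str]],
) -> List[str]:
    """Predict a full address from every masked variant and keep the majority result."""
    predictions = [tuple(predict(variant)) for variant in masked_variants(elements)]
    winner, _count = Counter(predictions).most_common(1)[0]
    return list(winner)
```

For the five-element example above, `masked_variants` yields five masked addresses; if four of the five predictions restore "CC area" and one keeps "HH area", the vote selects the address containing "CC area". Ties are broken by first occurrence, which is a property of `Counter.most_common` rather than something the application specifies.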
As an alternative embodiment, the determining whether the address elements in the address data after error correction are complete includes:
inputting the address data after error correction into a second model under the condition that the address elements in the address data after error correction are incomplete to obtain supplementary address elements, wherein the second model is used for predicting the supplementary address elements;
combining the corrected address data and the supplementary address elements to obtain combined address data;
judging whether third intermediate address data can be obtained according to the combined address data within preset times, wherein the third intermediate address data is obtained by carrying out error correction processing on second intermediate address data, the second intermediate address data is obtained by carrying out information completion on the first intermediate address data through a standard structured address library, and the first intermediate address data is obtained by splitting the combined address data;
and if the third intermediate address data cannot be obtained, judging the address data after error correction as invalid data.
Optionally, after the foregoing parsing steps, if part of the error-corrected address data still lacks the key "province" information, the address data is input into an ERNIE model (i.e., the second model) to mine its deep semantic representation and predict the province to which the address data may belong (i.e., the supplementary address element). The predicted province is added to the address data, and the combined address data goes through the above splitting step again to obtain the first intermediate address data; the first intermediate address data is processed through the above completion step to obtain the second intermediate address data; and the second intermediate address data is processed through the above error-correction step to obtain the third intermediate address data. If the address data with the added province element cannot be validly parsed (i.e., the third intermediate address data cannot be obtained), the above steps are repeated. If, after multiple repetitions (the preset number of times denotes "multiple"; no specific number is prescribed here), the address data with the newly added province element still cannot be validly parsed, the address data is regarded as an invalid address. For example, for the address "XX path XX garden 10 101 houses", no relevant area division address information can be found after the preceding steps, and the provinces that may be associated with the address can be given by the ERNIE classification model.
In the embodiment of the application, the ERNIE classification model is used to obtain the predicted province information, and the new address formed from the province plus the original address is parsed again through the model, so that incomplete addresses can be completed to a certain extent and the data utilization rate is improved. This solves the prior-art problem that an address lacking area division address elements is regarded as an invalid address, so that the effective information in it cannot be mined.
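The retry logic for province supplementation can be sketched as below. All names are hypothetical: `predict_provinces` stands in for the ERNIE classification model and is assumed to return candidate provinces in ranked order, `parse_pipeline` stands in for the splitting, completion and error-correction steps and is assumed to return `None` when no valid result can be obtained, and `max_attempts` plays the role of the preset number of times.

```python
from typing import Callable, Dict, List, Optional


def supplement_and_reparse(
    address: str,
    predict_provinces: Callable[[str], List[str]],
    parse_pipeline: Callable[[str], Optional[Dict[str, str]]],
    max_attempts: int = 3,
) -> Optional[Dict[str, str]]:
    """Return the third intermediate address data, or None for an invalid address."""
    for province in predict_provinces(address)[:max_attempts]:
        combined = f"{province} {address}"   # combined address data
        result = parse_pipeline(combined)    # split -> complete -> error-correct
        if result is not None:
            return result
    return None                              # judged to be invalid data


# Toy stand-ins, for illustration only.
def toy_provinces(address: str) -> List[str]:
    return ["AA province", "DD province"]


def toy_pipeline(address: str) -> Optional[Dict[str, str]]:
    if address.startswith("DD province"):
        return {"province": "DD province", "detail": address[len("DD province "):]}
    return None


parsed = supplement_and_reparse("XX road XX garden", toy_provinces, toy_pipeline)
invalid = supplement_and_reparse("XX road XX garden", toy_provinces, toy_pipeline, max_attempts=1)
```

Trying the ranked candidates one after another is one possible reading of "repeated multiple times"; the text does not say how a new province is chosen on each repetition.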
As an alternative embodiment, fig. 2 is a schematic flowchart of another alternative address data processing method according to an embodiment of the present application, where the method includes:
preprocessing the data;
performing full-element named entity recognition with BiLSTM + CRF;
splitting the address elements;
structuring the initial data;
performing data enhancement (address information completion);
applying the mask mechanism (address error correction) and, after voting, executing the subsequent steps again starting from the splitting of the address elements;
judging whether the address elements are complete; if not, predicting the province address element through the province ERNIE classification model and then executing the subsequent steps again starting from the splitting of the address elements; if so, obtaining structured pure data;
matching against the address element database;
judging whether exact matching yields a unique similarity value; if so, obtaining structured unified data; if not, performing address similarity calculation to obtain the structured unified data;
and pushing the data.
Optionally, after the address data has been error-corrected in the mask mechanism (address error correction) step, the address elements need to be split again, the initial data structured again, and data enhancement (address information completion) performed again. The mask mechanism (address error correction) is then not repeated, and the subsequent judgment on whether the address elements are complete is performed directly, as shown by line M in Fig. 2.
For specific implementation manners of other steps in the embodiments of the present application, reference may be made to detailed descriptions of other embodiments of the present application, which are not described herein again.
The method comprises the steps of constructing a standardized address library, splitting address elements, complementing address elements, correcting address errors and matching address similarity. The method meets the actual needs of address matching of different sources and different description modes, and realizes the standardization and unification of addresses by constructing a standard address library and judging the similarity of addresses.
According to another aspect of the embodiments of the present application, there is also provided an address data processing apparatus for implementing the above address data processing method. Fig. 3 is a block diagram of an alternative address data processing apparatus according to an embodiment of the present application, and as shown in fig. 3, the apparatus may include:
a first obtaining module 301, configured to obtain initial structured data of a current address;
a completion module 302, configured to perform information completion on the initial structured data through a standard structured address library to obtain enhanced address data, where the standard structured address library includes standard address data;
the error correction module 303 is configured to perform error correction processing on the enhanced address data to obtain error-corrected address data;
a judging module 304, configured to judge whether an address element in the error-corrected address data is complete;
the matching module 305 is configured to, in a case that an address element in the error-corrected address data is complete, perform similarity matching between the error-corrected address data and standard address data in the standard structured address library to obtain target address data.
It should be noted that the first obtaining module 301 in this embodiment may be configured to execute step S101, the completion module 302 may be configured to execute step S102, the error correction module 303 may be configured to execute step S103, the judging module 304 may be configured to execute step S104, and the matching module 305 may be configured to execute step S105.
Through the above modules, on one hand, the address is predicted, which greatly improves the expressive capability of the model, and addresses containing outright errors, homophone errors, abbreviated (short) forms and the like can be corrected or restored; incomplete addresses can also be completed to a certain degree, which improves the utilization rate of the data. On the other hand, dynamic updating and self-repair of the address library are guaranteed, and similarity matching is used to mine the effective information in the address to the maximum extent. This solves the problems in the related art that the utilization rate of address data is low, data processing is inaccurate, and the requirements for address standardization and unification therefore cannot be met.
As an alternative embodiment, the apparatus further comprises:
a second obtaining module, configured to obtain standardized address elements before information completion is performed on the initial structured data through the standard structured address library, where the standardized address elements include: area division addresses and preset entities;
a third obtaining module, configured to obtain the hierarchical relationship between the area division addresses;
a fourth obtaining module, configured to obtain the association relationship between the area division addresses and the preset entities;
and an establishing module, configured to establish the standard structured address library according to the standardized address elements, the hierarchical relationship and the association relationship.
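One possible in-memory shape for such a library is sketched below. It is an assumption for illustration: the text does not specify a storage format, and the class, field and function names are hypothetical. The hierarchical relationship is stored as child-to-parent links between area division addresses, and the association relationship as links from a preset entity to its area.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class AddressLibrary:
    parent_of: Dict[str, str] = field(default_factory=dict)       # hierarchical relationship
    area_of_entity: Dict[str, str] = field(default_factory=dict)  # association relationship

    def chain_for_area(self, area: str) -> List[str]:
        """Walk up the hierarchy and return the areas from top level down."""
        chain = [area]
        while chain[-1] in self.parent_of:
            chain.append(self.parent_of[chain[-1]])
        return list(reversed(chain))

    def chain_for_entity(self, entity: str) -> List[str]:
        """Full standard address of a preset entity."""
        return self.chain_for_area(self.area_of_entity[entity]) + [entity]


def build_library(
    hierarchy: Iterable[Tuple[str, str]],      # (child area, parent area)
    associations: Iterable[Tuple[str, str]],   # (preset entity, area)
) -> AddressLibrary:
    lib = AddressLibrary()
    for child, parent in hierarchy:
        lib.parent_of[child] = parent
    for entity, area in associations:
        lib.area_of_entity[entity] = area
    return lib


library = build_library(
    hierarchy=[("BB city", "AA province"), ("HH area", "BB city")],
    associations=[("JJ garden", "HH area")],
)
standard_address = library.chain_for_entity("JJ garden")
```

Real area and entity names are not globally unique, so a practical library would key on identifiers rather than bare names; the sketch ignores this.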
As an alternative embodiment, the matching module comprises:
the matching unit is used for matching the coincidence degree of the error-corrected address data and the standard address data and outputting similar address data of which the coincidence degree with the error-corrected address data is higher than a preset threshold value;
a first operation unit configured to, if the similar address data is unique, take the similar address data as target address data;
the calculation unit is used for respectively calculating the semantic similarity between the similar address data and the address data after error correction if the similar address data is not unique;
and the second operation unit is used for taking the similar address data with the highest semantic similarity value as the target address data.
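The two-stage matching performed by these units can be sketched as follows. The measures are assumptions, since the text does not define them: the coincidence degree is taken here as the share of address elements two addresses have in common, and the semantic similarity is replaced by a plain string-similarity placeholder from the Python standard library (`difflib.SequenceMatcher`), which is not a semantic model.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Sequence

Address = Sequence[str]


def coincidence(a: Address, b: Address) -> float:
    """Assumed coincidence degree: shared elements / all distinct elements."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def text_similarity(a: Address, b: Address) -> float:
    """Placeholder for semantic similarity (character-level, not semantic)."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


def match(
    query: Address,
    standard_addresses: List[Address],
    threshold: float = 0.5,
    similarity: Callable[[Address, Address], float] = text_similarity,
) -> Optional[Address]:
    # Stage 1: keep standard addresses whose coincidence degree exceeds the threshold.
    similar = [s for s in standard_addresses if coincidence(query, s) > threshold]
    if not similar:
        return None
    if len(similar) == 1:        # unique: take it as the target address data
        return similar[0]
    # Stage 2: not unique, so pick the highest similarity value.
    return max(similar, key=lambda s: similarity(query, s))


S1 = ["BB city", "HH area", "JJ garden", "first community"]
S2 = ["BB city", "HH area", "JJ garden", "second community"]
S3 = ["BB city", "KK area", "LL garden", "third community"]
target = match(["BB city", "HH area", "JJ garden", "first community"], [S1, S2, S3])
```

Passing a different `similarity` callable is where an actual semantic model would be plugged in.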
As an alternative embodiment, the first obtaining module includes:
an acquisition unit configured to acquire a current address;
the preprocessing unit is used for preprocessing the data of the current address to obtain preprocessed address data;
the labeling unit is used for carrying out sequence labeling on the address text in the preprocessed address data through the first model;
and the splitting unit is used for splitting the address elements in the preprocessed address data according to the sequence labels to obtain the initial structured data.
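Splitting address elements from sequence labels can be sketched as below. The BIO tagging scheme and the label names (`city`, `district`, `community`) are assumptions; the text only states that the first model performs sequence labeling. The tags are supplied by hand here in place of model output.

```python
from typing import Dict, List, Sequence


def split_by_labels(tokens: Sequence[str], tags: Sequence[str]) -> Dict[str, str]:
    """Group tokens into address elements according to BIO sequence labels."""
    elements: List[List[str]] = []          # each item: [label, text]
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # beginning of a new element
            elements.append([tag[2:], token])
        elif tag.startswith("I-") and elements and elements[-1][0] == tag[2:]:
            elements[-1][1] += " " + token  # continuation of the current element
        # "O" tokens and stray "I-" tags are ignored
    return {label: text for label, text in elements}


tokens = ["BB", "city", "HH", "area", "at", "JJ", "garden"]
tags = ["B-city", "I-city", "B-district", "I-district", "O", "B-community", "I-community"]
initial_structured = split_by_labels(tokens, tags)
```

If the same label occurs twice, the later element overwrites the earlier one in this sketch.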
As an alternative embodiment, the completion module comprises:
a first judgment unit for judging missing address elements in the structured data;
and the first completion unit is used for completing the address element information of the missing address elements through the standard structured address library to obtain the enhanced address data.
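A minimal sketch of completing missing address elements from the library follows. The level names in `LEVELS`, the record layout and the rule "complete only when exactly one library record agrees with the known elements" are all assumptions made for illustration; the text does not prescribe them.

```python
from typing import Dict, List

LEVELS = ("province", "city", "district", "community")  # hypothetical element levels


def complete(structured: Dict[str, str], library: List[Dict[str, str]]) -> Dict[str, str]:
    """Fill in missing address elements from the single matching library record."""
    known = {k: v for k, v in structured.items() if v}
    candidates = [r for r in library if all(r.get(k) == v for k, v in known.items())]
    enhanced = dict(structured)
    if len(candidates) != 1:        # no match or ambiguous: leave the data unchanged
        return enhanced
    record = candidates[0]
    for level in LEVELS:
        if not enhanced.get(level) and level in record:
            enhanced[level] = record[level]
    return enhanced


toy_library = [
    {"province": "AA province", "city": "BB city", "district": "HH area", "community": "JJ garden"},
    {"province": "AA province", "city": "BB city", "district": "KK area", "community": "LL garden"},
]
enhanced = complete({"district": "HH area", "community": "JJ garden"}, toy_library)
```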
As an alternative embodiment, the error correction module comprises:
the mask unit is used for respectively masking the address elements in the enhanced address data to obtain a preset number of mask addresses;
the first obtaining unit is used for obtaining the prediction address data corresponding to each mask address according to the mask addresses;
a determination unit configured to determine reference address data according to the predicted address data and a preset mechanism;
a second obtaining unit, configured to obtain reference structured data according to the reference address data;
and the second completion unit is used for performing information completion on the reference structured data through the standard structured address library to obtain the address data after error correction.
As an alternative embodiment, the judging module includes:
the input unit is used for inputting the address data after error correction into a second model to obtain a supplementary address element under the condition that the address element in the address data after error correction is incomplete, wherein the second model is used for predicting the supplementary address element;
a combination unit for combining the corrected address data and the supplemental address elements to obtain combined address data;
the second judging unit is used for judging whether third intermediate address data can be obtained according to the combined address data within preset times, wherein the third intermediate address data is obtained by carrying out error correction processing on the second intermediate address data, the second intermediate address data is obtained by carrying out information completion on the first intermediate address data through a standard structured address library, and the first intermediate address data is obtained by splitting the combined address data;
and a determination unit configured to determine that the error-corrected address data is invalid data if the third intermediate address data is not available.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments.
According to another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the address data processing method, where the electronic device may be a server, a terminal, or a combination thereof.
Fig. 4 is a block diagram of an alternative electronic device according to an embodiment of the present application, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with each other through the communication bus 404, where,
a memory 403 for storing a computer program;
the processor 401, when executing the computer program stored in the memory 403, implements the following steps:
acquiring initial structured data of a current address;
performing information completion on the initial structured data through a standard structured address library to obtain enhanced address data, wherein the standard structured address library contains standard address data;
carrying out error correction processing on the enhanced address data to obtain error-corrected address data;
judging whether the address elements in the address data after error correction are complete;
and under the condition that the address elements in the error-corrected address data are complete, performing similarity matching on the error-corrected address data and the standard address data in the standard structured address library to obtain target address data.
Alternatively, in this embodiment, the communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include RAM, and may also include non-volatile memory, such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
As an example, as shown in Fig. 4, the memory 403 may include, but is not limited to, the first obtaining module 301, the completion module 302, the error correction module 303, the judging module 304, and the matching module 305 of the address data processing apparatus. In addition, it may further include, but is not limited to, other module units of the address data processing apparatus, which are not described in detail in this example.
The processor may be a general-purpose processor, including but not limited to a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Optionally, for a specific example in this embodiment, reference may be made to the example described in the foregoing embodiment, and this embodiment is not described herein again.
It can be understood by those skilled in the art that the structure shown in Fig. 4 is only illustrative. The device implementing the address data processing method may be a terminal device, such as a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 4 does not limit the structure of the electronic device. For example, the terminal device may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in Fig. 4, or have a configuration different from that shown in Fig. 4.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, and the like.
According to still another aspect of an embodiment of the present application, there is also provided a storage medium. Alternatively, in this embodiment, the storage medium may be configured to store a program code for executing the address data processing method.
Optionally, in this embodiment, the storage medium may be located on at least one of a plurality of network devices in a network shown in the above embodiment.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
acquiring initial structured data of a current address;
performing information completion on the initial structured data through a standard structured address library to obtain enhanced address data, wherein the standard structured address library contains standard address data;
carrying out error correction processing on the enhanced address data to obtain error-corrected address data;
judging whether the address elements in the address data after error correction are complete;
and under the condition that the address elements in the error-corrected address data are complete, performing similarity matching on the error-corrected address data and standard address data in a standard structured address library to obtain target address data.
Optionally, the specific example in this embodiment may refer to the example described in the above embodiment, which is not described again in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disk.
In the description of the present specification, reference to the terms "this embodiment," "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict each other. In the description of the present disclosure, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims (10)

1. An address data processing method, characterized in that the method comprises:
acquiring initial structured data of a current address;
completing information of the initial structured data through a standard structured address library to obtain enhanced address data, wherein the standard structured address library contains standard address data;
carrying out error correction processing on the enhanced address data to obtain error-corrected address data;
judging whether the address elements in the address data after error correction are complete;
and under the condition that the address elements in the error-corrected address data are complete, performing similarity matching on the error-corrected address data and the standard address data in the standard structured address library to obtain target address data.
2. The method of claim 1, wherein before the completing information of the initial structured data through the standard structured address library, the method further comprises:
obtaining standardized address elements, wherein the standardized address elements comprise: area division addresses and preset entities;
acquiring the hierarchical relationship among the area division addresses;
acquiring the association relationship between the area division addresses and the preset entities;
and establishing the standard structured address library according to the standardized address elements, the hierarchical relationship and the association relationship.
3. The method according to claim 2, wherein the performing similarity matching on the error-corrected address data and the standard address data in the standard structured address library to obtain target address data comprises:
matching the coincidence degree of the error-corrected address data and the standard address data, and outputting similar address data of which the coincidence degree with the error-corrected address data is higher than a preset threshold value;
if the similar address data is unique, taking the similar address data as the target address data;
if the similar address data is not unique, respectively calculating the semantic similarity of the similar address data and the address data after error correction;
and taking the similar address data with the highest semantic similarity value as the target address data.
4. The method of claim 1, wherein obtaining the initial structured data for the current address comprises:
acquiring the current address;
preprocessing the data of the current address to obtain preprocessed address data;
carrying out sequence marking on address texts in the preprocessed address data through a first model;
and splitting address elements in the preprocessed address data according to the sequence labels to obtain the initial structured data.
5. The method of claim 4, wherein the completing information of the initial structured data through a standard structured address library to obtain the enhanced address data comprises:
judging the address elements which are lacked in the structured data;
and completing the address element information of the missing address elements through the standard structured address library to obtain the enhanced address data.
6. The method of claim 5, wherein the performing error correction processing on the enhanced address data to obtain error-corrected address data comprises:
respectively masking the address elements in the enhanced address data to obtain a preset number of mask addresses;
obtaining the predicted address data corresponding to each mask address according to the mask addresses respectively;
determining reference address data according to the predicted address data and a preset mechanism;
obtaining reference structured data according to the reference address data;
and completing the information of the reference structured data through the standard structured address library to obtain the address data after error correction.
7. The method of claim 1, wherein the determining whether the address elements in the error-corrected address data are complete comprises:
inputting the address data after error correction into a second model to obtain a supplementary address element under the condition that the address element in the address data after error correction is incomplete, wherein the second model is used for predicting the supplementary address element;
combining the corrected address data and the supplementary address elements to obtain combined address data;
judging whether third intermediate address data can be obtained according to the combined address data within preset times, wherein the third intermediate address data is obtained by carrying out error correction processing on second intermediate address data, the second intermediate address data is obtained by carrying out information completion on first intermediate address data through the standard structured address library, and the first intermediate address data is obtained by splitting the combined address data;
and if the third intermediate address data cannot be obtained, judging the address data after error correction as invalid data.
8. An address data processing apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring the initial structured data of the current address;
a completion module, configured to perform information completion on the initial structured data through a standard structured address library to obtain enhanced address data, where the standard structured address library includes standard address data;
the error correction module is used for carrying out error correction processing on the enhanced address data to obtain error-corrected address data;
the judging module is used for judging whether the address elements in the address data after error correction are complete or not;
and the matching module is used for carrying out similarity matching on the address data after error correction and the standard address data in the standard structured address library under the condition that the address elements in the address data after error correction are complete to obtain target address data.
9. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein said processor, said communication interface and said memory communicate with each other via said communication bus,
the memory for storing a computer program;
the processor for performing the method steps of any one of claims 1 to 7 by running the computer program stored on the memory.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program realizes the method steps of any one of claims 1 to 7 when executed by a processor.
CN202211412583.XA 2022-11-11 2022-11-11 Address data processing method and device, electronic equipment and storage medium Pending CN115658837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211412583.XA CN115658837A (en) 2022-11-11 2022-11-11 Address data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211412583.XA CN115658837A (en) 2022-11-11 2022-11-11 Address data processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115658837A true CN115658837A (en) 2023-01-31

Family

ID=85020893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211412583.XA Pending CN115658837A (en) 2022-11-11 2022-11-11 Address data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115658837A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349451A (en) * 2023-12-01 2024-01-05 广东中思拓大数据研究院有限公司 Data processing method, data processing apparatus, computer device, and storage medium
CN117724985A (en) * 2024-02-08 2024-03-19 此芯科技(武汉)有限公司 Memory access behavior monitoring method and device, storage medium and electronic equipment
CN117724985B (en) * 2024-02-08 2024-04-30 此芯科技(武汉)有限公司 Memory access behavior monitoring method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
WO2022134592A1 (en) Address information resolution method, apparatus and device, and storage medium
CN115658837A (en) Address data processing method and device, electronic equipment and storage medium
CN110209830B (en) Entity linking method, apparatus, device, and computer readable storage medium
CN111695355A (en) Address text recognition method, device, medium and electronic equipment
CN110990520B (en) Address coding method and device, electronic equipment and storage medium
CN108369582B (en) Address error correction method and terminal
WO2021189977A1 (en) Address coding method and apparatus, and computer device and computer-readable storage medium
CN110515986B (en) Processing method and device of social network diagram and storage medium
CN108733810A (en) A kind of address date matching process and device
CN115292344A (en) Data dictionary construction method and device, electronic equipment and storage medium
CN116881430A (en) Industrial chain identification method and device, electronic equipment and readable storage medium
CN111126422B (en) Method, device, equipment and medium for establishing industry model and determining industry
CN112069824B (en) Region identification method, device and medium based on context probability and citation
CN111738290B (en) Image detection method, model construction and training method, device, equipment and medium
CN116992880A (en) Building name identification method, device, electronic equipment and storage medium
CN113221558B (en) Express address error correction method and device, storage medium and electronic equipment
CN112417812B (en) Address standardization method and system and electronic equipment
CN112819593B (en) Data analysis method, device, equipment and medium based on position information
CN114638308A (en) Method and device for acquiring object relationship, electronic equipment and storage medium
CN114792091A (en) Chinese address element analysis method and equipment based on vocabulary enhancement and storage medium
CN113128225B (en) Named entity identification method and device, electronic equipment and computer storage medium
CN114064927A (en) Address map construction method and device, computer equipment and readable storage medium
CN114003674A (en) Double-recording address determination method, device, equipment and storage medium
CN113515677A (en) Address matching method and device and computer readable storage medium
CN113535883A (en) Business place entity linking method, system, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination