CN110895651A - Address standardization processing method, device, equipment and computer readable storage medium

Info

Publication number: CN110895651A
Application number: CN201810965153.8A
Authority: CN
Inventors: 王翔; 张雯
Original assignee: Beijing Jingdong Financial Technology Holding Co Ltd
Current assignee: Beijing Jingdong Financial Technology Holding Co Ltd
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2020-03-20
Anticipated expiration: 2038-08-23
Also published as: CN110895651B

Abstract

The invention provides an address standardization processing method, apparatus, device and computer-readable storage medium. The method comprises: receiving an address text to be processed; labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed; and, for each sub-address in the labeled address text to be processed, processing the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed. The standard address corresponding to the address text to be processed can thus be determined quickly and accurately, which improves the accuracy of address standardization and reduces the manual maintenance cost of address text.

Description

Address standardization processing method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to an address standardization processing method, apparatus, device, and computer-readable storage medium.

Background

Address information suffers from problems such as varied descriptions of the same address, inaccurate or erroneous descriptions, homophone substitutions and abbreviations, which make downstream services that depend on address information very difficult. For example, as technology has developed, online shopping has become a mainstream way to shop. A user typically selects a commodity online, pays for it, and has it delivered by express to the user's mailing address, so the correctness of the delivery address is essential for the commodity to arrive accurately and quickly. If the address is written incorrectly or in abbreviated form, the commodity cannot be delivered to the user's mailing address. This gives the user a poor shopping experience and also causes the merchant some loss of customers. How to standardize user address information is therefore a technical problem that urgently needs to be solved.

In the prior art, generally, address information of a user is checked manually, and erroneous address information is corrected and missing information is supplemented.

However, because the volume of address information to be processed is large and manual processing speed is limited, standardizing address information in this way suffers from low processing efficiency, wasted human resources and high maintenance cost.

Disclosure of Invention

The invention provides an address standardization processing method, device and equipment and a computer readable storage medium, which are used for solving the technical problems of low address standardization efficiency and high manual maintenance cost caused by manually correcting and supplementing address text information in the prior art.

The first aspect of the present invention provides an address standardization processing method, including:

receiving an address text to be processed;

marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;

and, for each sub-address in the labeled address text to be processed, processing the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.

Another aspect of the present invention provides an address normalization processing apparatus, including:

the first receiving module is used for receiving the address text to be processed;

the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;

and the processing module is used for processing the subaddresses in the labeled address text to be processed according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.

Still another aspect of the present invention provides an address standardization processing device, including: a memory and a processor;

the memory is used for storing instructions executable by the processor;

wherein the processor is configured to execute the address standardization processing method described above.

Yet another aspect of the present invention is to provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the address normalization processing method as described above when the computer-executable instructions are executed by a processor.

The address standardization processing method, apparatus, device and computer-readable storage medium provided by the invention receive an address text to be processed; label the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed; and, for each sub-address in the labeled address text to be processed, process the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed. The standard address corresponding to the address text to be processed can thus be determined quickly and accurately, which improves the accuracy of address standardization and reduces the manual maintenance cost of address text.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.

Fig. 1 is a schematic flowchart of an address normalization processing method according to an embodiment of the invention;

fig. 2 is a schematic flowchart of an address normalization processing method according to a second embodiment of the present invention;

fig. 3 is a schematic flowchart of an address normalization processing method according to a third embodiment of the present invention;

fig. 4 is a schematic structural diagram of an address normalization processing apparatus according to a fourth embodiment of the present invention;

fig. 5 is a schematic structural diagram of an address normalization processing apparatus according to a fifth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other examples obtained based on the examples in the present invention are within the scope of the present invention.

Fig. 1 is a schematic flow chart of an address normalization processing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

step 101, receiving an address text to be processed.

In this embodiment, address information suffers from problems such as varied descriptions of the same address, inaccurate or erroneous descriptions, homophone substitutions and abbreviations, which make downstream services that depend on address information very difficult. For example, if the downstream service is a logistics service and the address information input by the user is inaccurate, the goods may not reach the user. If the downstream service is a navigation service and the address information input by the user is inaccurate, it may be impossible to plan a route from the address information, or a wrong route may be planned. Therefore, to improve service quality and user experience, the address text input by the user must be standardized after it is received; that is, the non-standard address text input by the user is converted into standard, correct address information.

Step 102, labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed.

In this embodiment, any address text to be processed includes a plurality of sub-addresses. For example, in the address "Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing", the parts Beijing, Daxing District, Yizhuang, Kechuang, 11th Street and Building A are different sub-addresses. Different sub-addresses have different levels: for example, Beijing is at the city level and Daxing District is at the district level. Therefore, to make the address text to be processed easier to standardize, after the address text input by the user is received, the level of each sub-address in it can be labeled through the preset neural network model to obtain a labeled address text to be processed whose sub-addresses carry their levels.

Step 103, for each sub-address in the labeled address text to be processed, processing the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.

In this embodiment, after the level of each sub-address in the address text to be processed has been labeled, the labeled address text can be processed against the preset standard address library to standardize it. Specifically, the address text to be processed input by the user is compared with the standard address library so that erroneous information in it is corrected and missing information is supplemented, which yields the standard address corresponding to the address text to be processed. The standard address library contains the standard names of all current address information, stores them by level, and stores the association relationships among the addresses of the different levels.

In the address standardization processing method provided by this embodiment, an address text to be processed is received; the level of each sub-address in the address text to be processed is labeled through a preset neural network model to obtain a labeled address text to be processed; and, for each sub-address in the labeled address text to be processed, the sub-address is processed according to a preset standard address library to obtain a standard address corresponding to the address text to be processed. The standard address corresponding to the address text to be processed can thus be determined quickly and accurately, which improves the accuracy of address standardization and reduces the manual maintenance cost of address text.
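The three steps of this embodiment can be pictured as a small pipeline. The following is only a minimal sketch: the function names, the toy dictionaries and the fallback rule are hypothetical stand-ins invented for illustration, and a real system would replace `toy_labeler` with the trained neural network model and `toy_standardizer` with the library comparison described in the later embodiments.

```python
from typing import Callable, Dict, List, Tuple

# (sub-address text, level label) pairs, e.g. ("Beijing", "P1")
Labeled = List[Tuple[str, str]]


def standardize_address(
    text: str,
    label_levels: Callable[[str], Labeled],
    standardize_sub: Callable[[str, str], str],
) -> str:
    """Steps 101-103: receive text, label levels, standardize each sub-address."""
    labeled = label_levels(text)                                  # step 102
    fixed = [standardize_sub(sub, lvl) for sub, lvl in labeled]   # step 103
    return " ".join(fixed)


# Hypothetical stand-ins, for illustration only.
TOY_LEVELS: Dict[str, str] = {"Beijing": "P1", "Daxing": "P3", "Dazing": "P3"}
TOY_LIBRARY: Dict[str, List[str]] = {"P1": ["Beijing"], "P3": ["Daxing"]}


def toy_labeler(text: str) -> Labeled:
    # A real system would call the trained neural network model here.
    return [(tok, TOY_LEVELS.get(tok, "P7")) for tok in text.split()]


def toy_standardizer(sub: str, level: str) -> str:
    # Keep the sub-address if the library contains it; otherwise fall back to
    # the first library entry of the same level (placeholder for similarity matching).
    names = TOY_LIBRARY.get(level, [])
    if sub in names or not names:
        return sub
    return names[0]


result = standardize_address("Beijing Dazing", toy_labeler, toy_standardizer)
```

Passing the labeler and the standardizer as callables keeps the two preset components (model and address library) separable, which mirrors how the text treats them as independent presets.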

Further, on the basis of the above embodiment, the method further includes:

receiving an address text to be processed;

training a preset model to be trained through the text to be trained after each sub-address is labeled to obtain the preset neural network model;

In this embodiment, after the address text to be processed sent by the user is received, the level of each sub-address in it must be identified through a preset neural network model, so the preset neural network model must be established before any levels are identified. Specifically, the text to be trained in which each sub-address has been labeled is obtained and randomly divided into a training set and a test set, and the parameters of the model to be trained are adjusted continuously until the recognition results it outputs are sufficiently accurate, at which point the preset neural network model is obtained. The address text to be processed can then be labeled by this neural network model, and the labeled text can be processed against the preset standard address library to standardize it.
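The random division into a training set and a test set might look like the sketch below. The 20% test ratio, the fixed seed and the function name `split_train_test` are assumptions made for illustration; the source does not state how the split is proportioned.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    samples: Sequence[T], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly divide labeled samples into a training set and a test set."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder samples standing in for labeled texts to be trained.
labeled_texts = [f"address-{i}" for i in range(10)]
train_set, test_set = split_train_test(labeled_texts, test_ratio=0.2)
```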

In the address standardization processing method provided by this embodiment, the preset to-be-trained model is trained through the to-be-trained text labeled on each sub-address, so as to obtain the preset neural network model, so that the neural network model can be used to label the to-be-processed address text, and a basis is provided for subsequent address standardization.

Further, on the basis of any of the above embodiments, the method further includes:

receiving an address text to be processed;

receiving a text to be trained;

removing useless punctuation marks in the text to be trained;

segmenting words of the text to be trained without useless punctuations to obtain sub-addresses corresponding to the text to be trained;

marking the level of each sub-address in the text to be trained;

In this embodiment, before the preset model to be trained is trained on the text to be trained in which each sub-address has been labeled, the text to be trained must be preprocessed. Specifically, in some cases the text to be trained input by the user may include useless punctuation marks, such as "/", so to improve the efficiency of subsequent standardization these marks are removed first. Further, during training the model could process each character individually, but because combinations of characters in the text to be trained carry specific meanings, the text to be trained is also segmented into words to improve recognition accuracy, yielding a plurality of sub-addresses corresponding to the text to be trained. For example, the address "Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing" may be segmented into its constituent parts, such as Beijing, Daxing District and Yizhuang, to obtain seven sub-addresses corresponding to the text to be trained. After the text to be trained has been segmented into a plurality of sub-addresses, the level of each sub-address can be labeled. The preset model to be trained is then trained on the labeled text to be trained, and the labeled address text to be processed is subsequently processed against the preset standard address library to standardize it.

It should be noted that there are various methods for removing useless punctuations in the text to be trained, and any method may be adopted to remove the useless punctuations, which is not limited herein.

According to the address standardization processing method provided by the embodiment, useless characters in the text to be trained are removed in advance, and the text to be trained is segmented, so that the efficiency of subsequent model training can be improved, and a basis is provided for address standardization.

Further, there are various methods for removing useless punctuations from the text to be trained, and specifically, the useless punctuations in the text to be trained can be removed by a regular matching method.
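Removal by regular matching could be sketched as follows. The character class is an assumed set of "useless" marks (the source gives only "/" as an example), and treating whitespace as removable is likewise an assumption.

```python
import re

# Assumed set of "useless" marks; the source only names "/" as an example.
USELESS_PUNCT = re.compile(r"[\s/\\,.;:!?，。；：！？、·]+")


def strip_useless_punctuation(text: str) -> str:
    """Delete every run of characters matched by USELESS_PUNCT."""
    return USELESS_PUNCT.sub("", text)


cleaned = strip_useless_punctuation("北京市/大兴区, 亦庄。")
```

Hyphens are deliberately left out of the pattern in this sketch, since they can be meaningful in building and door numbers.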

Fig. 2 is a schematic flow chart of an address normalization processing method according to a second embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 2, the method further includes:

step 201, receiving an address text to be processed;

step 202, receiving a text to be trained;

step 203, removing useless punctuation marks in the text to be trained;

step 204, segmenting the text to be trained without useless punctuations to obtain sub-addresses corresponding to the text to be trained;

step 205, encoding each sub-address according to a preset encoding mode;

step 206, converting each sub-address and the code corresponding to each sub-address into a text vector and a code vector through a preset vector conversion model, and storing the text vector and the code vector in a correlation manner;

step 207, for the text vector and the coding vector corresponding to each sub-address, establishing the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishment model;

step 208, marking the level of each subaddress after the association relationship is established according to the preset subaddress level;

step 209, training a preset model to be trained through the text to be trained after each sub-address is labeled, and obtaining the preset neural network model;

step 210, labeling the level of each sub-address in the address text to be processed through a preset neural network model, and obtaining a labeled address text to be processed;

and step 211, aiming at each sub-address in the labeled address text to be processed, processing the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.

In this embodiment, after useless characters have been removed from the text to be trained and the text has been segmented, each segmented sub-address may be encoded in a preset encoding manner to further strengthen the integrity of the sub-addresses. Specifically, the first character of a sub-address may be coded as 1, the last character as 3, and each character in between as 2. For example, the code for Beijing (北京, two characters) is 13, the code for Daxing District (大兴区, three characters) is 123, and the code for XX Building (XX大厦, four characters) is 1223. It should be noted that many encoding manners are possible, and any manner capable of strengthening the integrity of the sub-address may be chosen to encode the sub-addresses; the present invention is not limited in this respect.
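The 1/2/3 position coding described above can be written in a few lines. This is a sketch only; the function name is hypothetical, and the handling of a one-character sub-address is an assumption because the source does not cover that case.

```python
def encode_sub_address(sub_address: str) -> str:
    """Code the first character 1, the last 3, and every middle character 2."""
    n = len(sub_address)
    if n == 0:
        return ""
    if n == 1:
        return "1"  # assumption: the source does not cover one-character sub-addresses
    return "1" + "2" * (n - 2) + "3"


# The three examples given in the text: two, three and four characters.
codes = [encode_sub_address(s) for s in ["北京", "大兴区", "XX大厦"]]
```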

Furthermore, because the encoded text to be trained must be input into the model to be trained, the text to be trained and its corresponding coding information must first be converted into a form the model can process. The text to be trained and the corresponding coding information can therefore be converted into text vectors and coding vectors through a preset vector conversion model. Because the text vectors and coding vectors of the different sub-addresses correspond one to one, the text vector and the coding vector are stored in association to represent this correspondence, and the associated text and coding vectors are denoted (v₁₁, v₁₂, …, v₁ₙ). It should be noted that vector conversion can be implemented in many ways, and any method capable of converting the text to be trained and its coding information into vectors may be used; the present invention is not limited in this respect.

Further, a text to be trained includes at least one sub-address, and the sub-addresses are associated with one another. To strengthen the association structure of the text to be trained, the stored text and coding vectors (v₁₁, v₁₂, …, v₁ₙ) corresponding to each sub-address are input into a preset association relationship establishment model, which establishes the association relationship between the text and coding vectors of the current sub-address and those of its adjacent sub-addresses. The vectors for which the association relationship has been established are denoted (v₂₁, v₂₂, …, v₂ₙ). Subsequently, for each sub-address, the preceding and following sub-addresses can be determined from that sub-address. For example, still taking the address "Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing", for the sub-address Daxing District it can be determined from the association relationship that the preceding sub-address is Beijing and the following sub-address is Yizhuang. It should be noted that any association relationship establishment model may be used to strengthen the association between sub-addresses, and the present invention is not limited in this respect. For example, a Bi-LSTM model may be used.

Further, after the association relationship between the text vectors and coding vectors of adjacent sub-addresses has been established through the preset association relationship establishment model, each sub-address can be level-labeled according to the preset sub-address levels. Specifically, the vectors for which the association relationship has been established, denoted (v₂₁, v₂₂, …, v₂ₙ), are input into a preset labeling model, which labels the level of each sub-address according to the preset sub-address levels to obtain a labeling result. In particular, the labeling model may be a CRF model. The preset sub-address levels are shown in Table 1:

Sub-address type | Level label
Province | P1
City | P2
District | P3
Street / community / village / town | P4
Road / village | P5
Road number | P5_ID
Residential community | P6
Building number / door plate | P6_ID
Landmark | P7

TABLE 1

It should be noted that, to further strengthen the association among the characters within a sub-address, the sub-address levels may be labeled in the BIOES manner, where B represents begin, I represents inside, O represents outside, E represents end, and S represents single. Because every character of every sub-address is labeled, the association among the characters of a sub-address is strengthened in addition to its level being determined. Taking the same example address in Daxing District, Beijing: the level corresponding to Beijing (北京) is P1, so the character 北 is labeled B-P1, indicating that its level is P1 and that it is the first character of the sub-address, and the character 京 is labeled E-P1, indicating that its level is P1 and that it is the last character of the sub-address. Correspondingly, the level of Daxing District (大兴区) is P3: 大 is labeled B-P3 (level P3, first character of the sub-address), 兴 is labeled I-P3 (level P3, middle character), and 区 is labeled E-P3 (level P3, last character). Every sub-address is labeled in this manner.
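Per-character labeling with position prefixes and level labels can be sketched as below. The function name and the input shape (a list of already segmented, already levelled sub-addresses) are assumptions for illustration; the O tag, which marks characters outside any sub-address, does not arise in this sketch because every input character belongs to a sub-address.

```python
from typing import List, Tuple


def position_level_tags(sub_addresses: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (character, tag) pairs for sub-addresses given as (text, level)."""
    tagged = []
    for text, level in sub_addresses:
        last = len(text) - 1
        for i, ch in enumerate(text):
            if last == 0:
                prefix = "S"   # single-character sub-address
            elif i == 0:
                prefix = "B"   # first character
            elif i == last:
                prefix = "E"   # last character
            else:
                prefix = "I"   # any middle character
            tagged.append((ch, f"{prefix}-{level}"))
    return tagged


tags = position_level_tags([("北京", "P1"), ("大兴区", "P3")])
```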

The text to be trained marked by the method is used for training the model to be trained, and a neural network model is obtained. Therefore, accurate marking can be carried out on the input address text to be processed according to the neural network model. And processing the marked address text to be processed and a preset standard address library to realize the standardization of the address text to be processed.

In the address standardization processing method provided by this embodiment, each sub-address and the code corresponding to each sub-address are converted into a text vector and a coding vector through a preset vector conversion model, and the text vector and the coding vector are stored in association; for the text vector and the coding vector corresponding to each sub-address, the association relationship between the text vectors and coding vectors of adjacent sub-addresses is established through a preset association relationship establishment model; and each sub-address for which the association relationship has been established is level-labeled according to the preset sub-address levels to obtain a labeled text to be trained. The model to be trained can then be trained on this text, which provides a basis for subsequently labeling the address text to be processed and improves the labeling accuracy of the neural network model.

Fig. 3 is a schematic flow chart of an address normalization processing method according to a third embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 3, the method includes:

step 301, receiving an address text to be processed;

step 302, marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain a marked address text to be processed;

step 303, sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;

step 304, judging whether the first sub-address is in a preset standard address library or not;

step 305, if not, calculating the similarity between the first sub-address and at least one correct address in the standard address base, wherein the at least one correct address is consistent with the first sub-address in level;

step 306, for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;

step 307, if yes, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;

step 308, judging whether a second sub-address with the level larger than the first sub-address level exists in the address text to be processed;

step 309, if so, taking the second sub-address as the first sub-address, and returning to the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address with a level higher than that of the first sub-address exists in the address text to be processed.

In this embodiment, after the address text to be processed sent by the user has been received and labeled through the preset neural network model, the labeled address text must be processed according to the preset standard address library. Specifically, the first sub-address, that is, the sub-address with the lowest level in the address text to be processed, is determined first, where the levels increase step by step from P1 to P7. The first sub-address is compared with the preset standard address library to judge whether it exists there. If it does, the first sub-address contains no error. Otherwise, the first sub-address has been written incorrectly. In that case, the similarity between the first sub-address and each of a plurality of correct addresses in the standard address library whose level is consistent with that of the first sub-address is calculated, and each similarity is compared with a preset first threshold. A correct address whose similarity exceeds the threshold may be the standard address corresponding to the first sub-address. Therefore, to improve the accuracy of address standardization, the correct address with the highest similarity among those exceeding the preset first threshold is taken as the standard address corresponding to the first sub-address.
After the standard address corresponding to the first sub-address has been determined, it is judged whether the address text to be processed contains a second sub-address whose level is one level higher than that of the first sub-address. If so, the second sub-address is taken as the current first sub-address and the above steps are repeated until no second sub-address with a level higher than that of the first sub-address exists in the address text to be processed. At that point all sub-addresses in the address text have been standardized, and the standard address corresponding to the address text to be processed is obtained.
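The level-by-level loop of steps 303 to 309 might be sketched as follows. Several points are assumptions made for illustration: `difflib.SequenceMatcher` is used as a stand-in similarity measure, the threshold of 0.6 is arbitrary, the toy library is invented, and a sub-address with no candidate above the threshold is simply left unchanged because the source does not say what happens in that case.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

# Level order taken from Table 1, lowest level first.
LEVEL_ORDER = ["P1", "P2", "P3", "P4", "P5", "P5_ID", "P6", "P6_ID", "P7"]


def similarity(a: str, b: str) -> float:
    # Stand-in similarity measure (assumption), in the range [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def best_match(sub: str, candidates: List[str], threshold: float) -> Optional[str]:
    """Return the candidate most similar to sub among those above the threshold."""
    scored = [(similarity(sub, c), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] > threshold]
    return max(scored)[1] if scored else None


def standardize(labeled: Dict[str, str], library: Dict[str, List[str]],
                threshold: float = 0.6) -> Dict[str, str]:
    """Walk the levels from lowest to highest and fix each sub-address in turn."""
    result = {}
    for level in LEVEL_ORDER:
        if level not in labeled:
            continue
        sub = labeled[level]
        names = library.get(level, [])
        if sub in names:
            result[level] = sub            # already a standard name
        else:
            match = best_match(sub, names, threshold)
            result[level] = match if match is not None else sub
    return result


# Invented example data: one correct sub-address and one misspelled one.
labeled = {"P1": "Beijing", "P3": "Dazing District"}
library = {"P1": ["Beijing", "Tianjin"],
           "P3": ["Daxing District", "Haidian District"]}
result = standardize(labeled, library)
```

Only candidates of the same level are compared, which keeps the search small and follows the requirement that the correct address be consistent with the first sub-address in level.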

In the address standardization processing method provided in this embodiment, a first sub-address with the lowest level in a text to be processed is determined, the first sub-address is compared with a preset standard address library, whether the first sub-address exists in the standard address library or not is determined, if the first sub-address does not exist in the standard address library, a standard address corresponding to the first sub-address is determined according to a similarity between the first sub-address and a correct address in the standard address library, and the above steps are repeatedly performed for the sub-address of each level, so that the standard address corresponding to the text to be processed can be obtained. The accuracy and efficiency of address standardization are improved.

Further, on the basis of the above embodiment, the method includes:

receiving an address text to be processed;

sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;

judging whether the first sub-address is in a preset standard address library or not;

if not, calculating the font similarity and the pinyin similarity between the first sub-address and the correct address;

calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity;

for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;

if so, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;

judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;

if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until the second sub-address with the level higher than the first sub-address does not exist in the address text to be processed.

In this embodiment, the address text to be processed entered by the user may contain several kinds of errors. It may contain a font error, in which a character of "Haidian District" is replaced by a visually similar character, or a pinyin error, in which a character of "Haidian District" is replaced by a homophone. If the similarity between the first sub-address and the correct address were calculated from the font similarity alone, the result would be inaccurate for pinyin errors: for a homophone misspelling of "Haidian District", the font similarity to the correct address is low, but the pinyin similarity is high. Therefore, to improve the accuracy of address standardization, the similarity between the first sub-address and the correct addresses in the standard address library whose level is consistent with that of the first sub-address can be calculated in two ways. Specifically, the font similarity and the pinyin similarity between the first sub-address and each correct address are calculated, and the similarity between the first sub-address and that correct address is then calculated from the pinyin similarity and the font similarity. The standard address determined according to this similarity is therefore more accurate.

The address standardization processing method provided by this embodiment can improve the accuracy of address standardization by calculating the pinyin similarity and the font similarity between the first sub-address and the correct address.
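
The difference between the two measures can be illustrated with a short sketch. Everything below is an assumption made for illustration: the three-entry pinyin table is a toy stand-in for a full character-to-pinyin dictionary, the homophone typo 海店区 is a hypothetical input, `difflib.SequenceMatcher` stands in for whichever similarity method is actually configured, and the plain average is only one possible way of combining the two scores.

```python
from difflib import SequenceMatcher

# Toy pinyin table (assumption); a real system needs a complete dictionary.
PINYIN = {"海": "hai", "淀": "dian", "店": "dian", "区": "qu"}


def to_pinyin(text: str) -> str:
    """Convert each character to pinyin, leaving unknown characters as-is."""
    return " ".join(PINYIN.get(ch, ch) for ch in text)


def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def font_similarity(a: str, b: str) -> float:
    """Similarity computed on the written characters."""
    return ratio(a, b)


def pinyin_similarity(a: str, b: str) -> float:
    """Similarity computed on the pinyin transcription."""
    return ratio(to_pinyin(a), to_pinyin(b))


def combined_similarity(a: str, b: str) -> float:
    """Plain average of the two scores (one possible combination)."""
    return (font_similarity(a, b) + pinyin_similarity(a, b)) / 2


typed, correct = "海店区", "海淀区"  # hypothetical homophone typo
print(font_similarity(typed, correct))      # lower: one of three characters differs
print(pinyin_similarity(typed, correct))    # identical pinyin
print(combined_similarity(typed, correct))
```

Because the typo and the correct name share the same pinyin, the pinyin score compensates for the lower character-level score, which is the effect this embodiment relies on.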

Further, on the basis of any of the above embodiments, the method comprises:

receiving an address text to be processed;

if not, calculating the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;

calculating the pinyin similarity between the first sub-address and the correct address by at least one preset pinyin similarity calculation method;

In this embodiment, to further improve the accuracy of the similarity calculation, several font similarity and pinyin similarity calculation methods may be used to obtain the similarity between the first sub-address and the correct address. Specifically, the font similarity between the first sub-address and the correct address may be calculated by any of various font similarity calculation methods, which is not limited herein; for example, word-level Jaro Distance, word-level Jaro-Winkler Distance, word-level Edit Distance, and the like may be used. Correspondingly, the pinyin similarity between the first sub-address and the correct address may be calculated by any of various pinyin similarity calculation methods, which is likewise not limited herein; for example, pinyin-level Jaro Distance, pinyin-level Jaro-Winkler Distance, pinyin-level Edit Distance, and the like may be used.

The address standardization processing method provided by this embodiment calculates the pinyin similarity and the font similarity between the first sub-address and the correct address by using multiple methods, so as to improve the accuracy of address standardization.
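
For reference, the three measures named in this embodiment can be sketched in plain Python as below. These are textbook formulations written for illustration; the function names, the prefix scaling factor 0.1, and the normalization of edit distance into a similarity are assumptions, and the source does not state which variants or parameters are used. The same functions apply at the word level (strings of characters) and at the pinyin level (pinyin transcriptions).

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unmatched equal character within the match window.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boosts the Jaro score for a common prefix (up to 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def edit_similarity(s1: str, s2: str) -> float:
    """Edit distance rescaled to a similarity in [0, 1] (an assumed convention)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - edit_distance(s1, s2) / longest
```

On the classic pair "MARTHA" / "MARHTA", `jaro` gives about 0.944 and `jaro_winkler` about 0.961, and `edit_distance("kitten", "sitting")` is 3.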

Further, on the basis of any of the above embodiments, the method comprises:

receiving an address text to be processed;

setting different weights for the font similarity calculated by each font similarity calculation method;

setting different weights for the pinyin similarity calculated by the pinyin similarity calculation method;

calculating the similarity between the first sub-address and the correct address by a weighted average method according to the font similarity and the pinyin similarity;

In this embodiment, because the pinyin similarity and the font similarity between different sub-addresses and the standard address differ, the accuracy of the similarity between a sub-address and the standard address can be further improved as follows: different weights may be set for the font similarities calculated by the different font similarity calculation methods, different weights may be set for the pinyin similarities calculated by the different pinyin similarity calculation methods, and the similarity between the sub-address and the standard address may be calculated by a weighted average method according to the weight corresponding to each method. In general, since the pinyin similarity is higher than the font similarity, a higher weight may be set for the pinyin similarity.

In the address standardization processing method provided by this embodiment, different weights are set for the font similarities calculated by different font similarity calculation methods, different weights are set for the pinyin similarities calculated by the pinyin similarity calculation methods, and the similarity between the sub-address and the standard address is calculated by using a weighted average method according to the weights corresponding to the respective methods, so that the accuracy of calculating the similarity between the sub-address and the standard address can be improved.
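
A weighted average over the per-method scores can be sketched as follows. The method names, score values, and weights are hypothetical; the source states only that different weights are set and that the pinyin similarity may be given a higher weight.

```python
from typing import Dict


def weighted_similarity(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-method similarity scores."""
    total = sum(weights[name] for name in scores)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical scores from two font methods and two pinyin methods.
scores = {"font_jaro": 0.6, "font_edit": 0.5, "pinyin_jaro": 1.0, "pinyin_edit": 0.9}
# Assumed weights, with the pinyin methods weighted more heavily.
weights = {"font_jaro": 0.2, "font_edit": 0.2, "pinyin_jaro": 0.3, "pinyin_edit": 0.3}

overall = weighted_similarity(scores, weights)  # approximately 0.79
```

Dividing by the sum of the weights means the weights do not have to add up to 1, so individual methods can be re-weighted without renormalizing the rest.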

receiving an address text to be processed;

if not, calculating the similarity between the first sub-address and at least one correct address in the standard address base, wherein the at least one correct address is consistent with the first sub-address in level;

and if not, sending a service handling request to the user, wherein the service handling request comprises the first sub-address and the address text to be processed, so that the user can manually process the first sub-address according to the service handling request.

In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. Specifically, the first sub-address with the lowest level in the address text to be processed is determined, where the levels increase gradually from P1 to P7. The first sub-address is compared with the preset standard address library to judge whether it exists in the library. If it does, the first sub-address contains no error. If it does not, the first sub-address is miswritten; in this case, the similarity between the first sub-address and each of the correct addresses in the standard address library whose level is consistent with that of the first sub-address is calculated, and it is judged whether that similarity exceeds a preset first threshold. If the similarity is lower than the preset first threshold, no standard address corresponding to the sub-address exists in the standard address library. In this case, to standardize the address text to be processed, a service handling request needs to be sent to the user, where the service handling request includes the first sub-address and the address text to be processed, so that the user can manually process the first sub-address according to the service handling request. It can be understood that, once the standard address corresponding to the first sub-address has been determined manually, it may be added to the standard address library, thereby expanding the standard address library.

In the address standardization processing method provided in this embodiment, when the similarity between the first sub-address and the plurality of correct addresses in the standard address base, which are consistent with the first sub-address in level, is lower than the preset first threshold, a service handling request is sent to the user, so that the user manually processes the first sub-address according to the service handling request, and thus standardization of all sub-addresses can be achieved.
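
The fallback to manual handling, together with the optional expansion of the library, might be organized as in the sketch below. The class and field names are hypothetical, the request is simply queued in memory rather than actually sent, and the similarity function is injected so that any of the measures discussed earlier can be used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class ServiceHandlingRequest:
    """Carries the two items the source says the request includes."""
    first_sub_address: str
    address_text: str


@dataclass
class Standardizer:
    library: Dict[str, Set[str]]             # level -> correct addresses
    threshold: float                         # the preset first threshold
    similarity: Callable[[str, str], float]
    pending: List[ServiceHandlingRequest] = field(default_factory=list)

    def resolve(self, level: str, sub: str, address_text: str) -> Optional[str]:
        candidates = self.library.setdefault(level, set())
        if sub in candidates:
            return sub
        above = [(self.similarity(sub, c), c) for c in candidates]
        above = [pair for pair in above if pair[0] > self.threshold]
        if above:
            return max(above)[1]  # correct address with the highest similarity
        # Nothing clears the threshold: hand the sub-address over for manual handling.
        self.pending.append(ServiceHandlingRequest(sub, address_text))
        return None

    def accept_manual_result(self, level: str, standard_address: str) -> None:
        """Add a manually determined standard address to the library."""
        self.library.setdefault(level, set()).add(standard_address)
```

After `accept_manual_result` has been called for a sub-address, a later `resolve` call with the same input finds it in the library directly, which is how the library grows over time.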

receiving an address text to be processed;

if yes, judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;

In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. Specifically, the first sub-address with the lowest level in the address text to be processed is determined, where the levels increase gradually from P1 to P7. The first sub-address is compared with the preset standard address library to judge whether it exists in the library; if it does, the first sub-address contains no error and is taken as the standard address. It is then determined whether the address text to be processed includes a second sub-address whose level is one level higher than that of the first sub-address. If so, the second sub-address is taken as the current first sub-address, and the step of judging whether the first sub-address is in the preset standard address library is executed again, until no second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed. Accordingly, if there is no second sub-address whose level is one level higher than that of the first sub-address, the first sub-address may be output as the current standard address.

In the address standardization processing method provided by this embodiment, when the first sub-address exists in the standard address library, the first sub-address is used as the standard address, so that standardization of the address text to be processed can be achieved.

receiving an address text to be processed;

detecting whether the address text to be processed comprises a sub-address corresponding to a first level with the lowest level in the level sequence or not according to a preset level sequence;

if so, taking the sub-address corresponding to the first level as the first sub-address;

if not, taking a second level with the level greater than the first level as the current first level, marking the first level as vacant, returning to execute the step of detecting whether the address text to be processed comprises the sub-address corresponding to the first level with the lowest level in the level sequence according to a preset level sequence until the address text to be processed comprises the sub-address corresponding to the first level;

if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until no second sub-address with the level greater than the first sub-address by one level exists in the address text to be processed;

determining all levels marked as vacant in the address text to be processed;

and supplementing all levels marked as vacant in the address text to be processed according to the standard address library.

In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. When entering the address text to be processed, the user may omit part of the address; for example, of province, city and county, the user may leave out the city. The omitted information can be supplemented according to the other information. The search is performed in the preset order of the levels, and the default starting level may be P1. If there is currently no sub-address of level P1, the current level is vacant: the level is marked as vacant and the level being searched is automatically incremented by 1, that is, the search continues with a sub-address of level P2. If a sub-address of level P2 exists, its similarity to the correct addresses in the preset standard address library is calculated, and the correct address whose similarity exceeds the preset threshold and is the highest is taken as the current standard address. These steps are repeated for each level until every level has been processed. At that point, all vacant levels are determined; each vacant level corresponds to a sub-address that the user did not fill in, and the vacant sub-addresses can then be supplemented according to the standard address library.
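
Marking vacant levels and supplementing them from the library can be sketched as follows. The three-level order, the placeholder names, and the representation of the library as full rows are assumptions; the source does not describe how the standard address library is stored or how an ambiguous match is handled, so this sketch fills a gap only when exactly one row agrees with the levels that are present.

```python
from typing import Dict, List

LEVEL_ORDER = ["P1", "P2", "P3"]  # shortened for illustration

# Hypothetical library rows, one standard name per level.
LIBRARY_ROWS = [
    {"P1": "Province A", "P2": "City B", "P3": "County C"},
    {"P1": "Province A", "P2": "City D", "P3": "County E"},
]


def mark_vacant(labeled: Dict[str, str]) -> List[str]:
    """Return the levels for which the address text has no sub-address."""
    return [level for level in LEVEL_ORDER if level not in labeled]


def supplement(labeled: Dict[str, str]) -> Dict[str, str]:
    """Fill vacant levels from the single library row that matches, if any."""
    vacant = mark_vacant(labeled)
    matches = [
        row for row in LIBRARY_ROWS
        if all(row.get(level) == name for level, name in labeled.items())
    ]
    filled = dict(labeled)
    if len(matches) == 1:  # only fill when the match is unambiguous
        for level in vacant:
            filled[level] = matches[0][level]
    return filled
```

With province and county given but the city omitted, `supplement({"P1": "Province A", "P3": "County C"})` fills in "City B"; with only the province given, two rows match and nothing is filled.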

As an implementable manner, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. When entering the address text to be processed, the user may omit part of the address; for example, of province, city and county, the user may leave out the city, so the omitted information can be supplemented according to the other information. Specifically, it is determined whether the level difference between any two sub-addresses in the address text to be processed exceeds a preset second threshold, where the second threshold may be set by the user or by system default. If so, the address text to be processed can be compared with the preset standard address library so as to supplement it. For example, if the received address text to be processed is "Beijing Yizhuang Kechuang 11th Street, Building A", where the level of "Beijing" is P1 and the level of "Yizhuang" is P4, the level difference between the two is 3, which exceeds the preset second threshold, and the address text to be processed can be compared with the preset standard address library to supplement it.
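
The level-difference check can be sketched as below. Two points are interpretations rather than statements of the source: the difference is taken between neighbouring labeled levels (comparing every pair would always flag any address that spans many levels), and the second threshold is set to 1, since the source gives no value and says only that a difference of 3 exceeds it.

```python
from typing import Iterable

SECOND_THRESHOLD = 1  # assumed value


def level_number(level: str) -> int:
    """Turn a label such as 'P4' into the integer 4."""
    return int(level[1:])


def has_level_gap(levels: Iterable[str], threshold: int = SECOND_THRESHOLD) -> bool:
    """True if two neighbouring labeled levels differ by more than the threshold."""
    numbers = sorted(level_number(level) for level in levels)
    return any(b - a > threshold for a, b in zip(numbers, numbers[1:]))


# P1 followed directly by P4: a difference of 3, so supplementation is triggered.
needs_supplement = has_level_gap(["P1", "P4", "P5"])
```

An address labeled P1, P2, P3 has no gap under this check and would skip the comparison with the library.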

According to the address standardization processing method provided by the embodiment, the sub-addresses are processed according to the preset level sequence, so that the vacant sub-addresses which are not filled by the user can be supplemented on the basis of correcting the error sub-addresses, the filling of the address text to be processed can be further realized, the accuracy of the address text to be processed is improved, and a basis is provided for the subsequent service development.

Fig. 4 is a schematic structural diagram of an address normalization processing apparatus according to a fourth embodiment of the present invention, as shown in fig. 4, the apparatus includes:

a first receiving module 41, configured to receive the address text to be processed;

a first labeling module 42, configured to label, through a preset neural network model, the level of each sub-address in the address text to be processed, so as to obtain a labeled address text to be processed;

a processing module 43, configured to, for each sub-address in the labeled address text to be processed, process the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.

The address standardization processing device provided by the embodiment receives the address text to be processed; marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed; and processing the sub-addresses according to a preset standard address library aiming at each sub-address in the marked address text to be processed to obtain a standard address corresponding to the address text to be processed. Therefore, the standard address information corresponding to the address text to be processed can be quickly and accurately determined, the accuracy of address standardization can be improved, and in addition, the manual maintenance cost of the address text can be reduced.

Further, on the basis of the above embodiment, the apparatus further includes:

the training module is used for training a preset model to be trained through the text to be trained after the sub-addresses are labeled, so as to obtain the preset neural network model;

The address standardization processing device provided by this embodiment trains a preset model to be trained through a text to be trained after labeling each sub-address, so as to obtain the preset neural network model, and thus, the neural network model can be used to label the address text to be processed, thereby providing a basis for subsequent address standardization.

Further, on the basis of any one of the above embodiments, the apparatus further includes:

the second receiving module is used for receiving the text to be trained;

the removing module is used for removing useless punctuation marks in the text to be trained;

the segmentation module is used for segmenting the text to be trained without the useless punctuations to obtain each sub-address corresponding to the text to be trained;

the second labeling module is used for labeling the level of each sub-address in the text to be trained;

The address standardization processing device provided by the embodiment can improve the efficiency of subsequent model training and provide a basis for address standardization by removing useless characters in the text to be trained in advance and performing word segmentation on the text to be trained.

Further, there are various methods for removing useless punctuation marks in a text to be trained, and specifically, the removing module includes:

and the removing unit is used for removing useless punctuation marks in the text to be trained by a regular matching method.
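
Removal by regular matching can be sketched as follows. Which marks count as "useless" is not specified in the source, so the character set below is an assumption; hyphens are deliberately kept because they may carry meaning in building or room numbers.

```python
import re

# Assumed set of punctuation to strip (ASCII and full-width forms).
PUNCTUATION = ",.;:!?，。；：！？、()（）[]【】\"'“”‘’"
USELESS = re.compile("[" + re.escape(PUNCTUATION) + r"\s]+")


def remove_useless_punctuation(text: str) -> str:
    """Delete every run of the listed punctuation marks and whitespace."""
    return USELESS.sub("", text)


cleaned = remove_useless_punctuation("北京市，海淀区。（A座）")
# cleaned == "北京市海淀区A座"
```

Whitespace is stripped as well, which suits unsegmented Chinese address text; for space-delimited text the `\s` part of the pattern would need to be dropped.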

the second receiving module is used for receiving the text to be trained;

the coding module is used for coding each sub-address according to a preset coding mode;

the vector conversion module is used for converting each sub-address and the code corresponding to each sub-address into a text vector and a coding vector through a preset vector conversion model, and storing the text vector and the coding vector in association;

the association module is used for establishing, for the text vector and the coding vector corresponding to each sub-address, the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishing model;

the second labeling module comprises:

the marking unit is used for marking the level of each subaddress after the association relationship is established according to the preset subaddress level;

The address standardization processing device provided in this embodiment converts each sub-address and the code corresponding to each sub-address into a text vector and a coding vector through a preset vector conversion model, and stores the text vector and the coding vector in association; establishes, for the text vector and the coding vector corresponding to each sub-address, the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishing model; and labels the level of each sub-address after the association relationship is established according to the preset sub-address levels, so as to obtain a labeled text to be trained. The model to be trained can then be trained on this text, which provides a basis for the subsequent labeling of the address text to be processed and can improve the labeling accuracy of the neural network model.

Further, on the basis of any of the above embodiments, the apparatus comprises:

the processing module comprises:

the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;

the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;

the similarity calculation unit is used for, if not, calculating the similarity between the first sub-address and at least one correct address in the standard address library whose level is consistent with that of the first sub-address;

a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;

the standard address determining unit is used for, if so, determining the correct address with the highest similarity, and taking the correct address with the highest similarity as the standard address corresponding to the first sub-address;

a third judging unit, configured to judge whether a second sub-address whose level is greater than the first sub-address by one level exists in the address text to be processed;

and the first circulation unit is used for, if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed.

The address standardization processing apparatus provided in this embodiment determines a first sub-address with a lowest level in a text to be processed, compares the first sub-address with a preset standard address library, determines whether the first sub-address exists in the standard address library, determines a standard address corresponding to the first sub-address according to a similarity between the first sub-address and a correct address in the standard address library if the first sub-address does not exist in the standard address library, and repeats the above steps for each level of sub-address, so that a standard address corresponding to the text to be processed can be obtained. The accuracy and efficiency of address standardization are improved.

Further, on the basis of the above embodiment, the apparatus includes:

the processing module comprises:

the similarity calculation unit is specifically configured to: if not, calculating the font similarity and the pinyin similarity between the first sub-address and the correct address;

The address standardization processing device provided by the embodiment can improve the accuracy of address standardization by calculating the pinyin similarity and the font similarity between the first sub-address and the correct address.

Further, on the basis of any of the above embodiments, the apparatus comprises:

the processing module comprises:

the similarity calculation unit is specifically configured to:

The address standardization processing device provided by this embodiment calculates the pinyin similarity and the font similarity between the first sub-address and the correct address by using multiple methods, so as to improve the accuracy of address standardization.

Further, on the basis of any of the above embodiments, the apparatus comprises:

the processing module comprises:

the similarity calculation unit is used for calculating the pinyin similarity between the first sub-address and the correct address through at least one preset pinyin similarity calculation method;

the similarity calculation unit is specifically configured to: setting different weights for the font similarity calculated by each font similarity calculation method;

In the address normalization processing apparatus provided in this embodiment, different weights are set for the font similarities calculated by different font similarity calculation methods, different weights are set for the pinyin similarities calculated by the pinyin similarity calculation methods, and the similarity between the sub-address and the standard address is calculated by using a weighted average method according to the weights corresponding to the respective methods, so that the accuracy of calculating the similarity between the sub-address and the standard address can be improved.

the processing module comprises:

The address standardization processing apparatus provided in this embodiment sends a service transaction request to a user when the similarity between a first sub-address and a plurality of correct addresses in a standard address base, which are consistent with the first sub-address in level, is lower than a preset first threshold, so that the user manually processes the first sub-address according to the service transaction request, thereby implementing standardization of all sub-addresses.

the processing module comprises:

a fifth judging unit, configured to, if so, judge whether a second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed;

and if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed.

The address standardization processing apparatus provided in this embodiment can realize standardization of the address text to be processed by using the first sub-address as the standard address when the first sub-address exists in the standard address library.

the processing module comprises:

the first sub-address determining unit is specifically configured to: detecting whether the address text to be processed comprises a sub-address corresponding to a first level with the lowest level in the level sequence or not according to a preset level sequence;

a standard address determining unit, configured to, if so, determine the correct address with the highest similarity, and use the correct address with the highest similarity as the standard address corresponding to the first sub-address;

a first circulation unit, configured to, if yes, use the second sub-address as the first sub-address, and return to perform the step of determining whether the first sub-address is in a preset standard address library, until there is no second sub-address in the to-be-processed address text whose level is one level greater than that of the first sub-address;

the processing module further comprises:

a vacancy level determination unit, configured to determine all levels labeled as vacancies in the address text to be processed;

and the supplement unit is used for supplementing all levels marked as vacant in the address text to be processed according to the standard address library.

The address standardization processing device provided by this embodiment processes the sub-addresses according to the preset rank order, so that the vacant sub-addresses that are not filled in by the user can be supplemented on the basis of correcting the wrong sub-addresses, the filling of the address text to be processed can be further realized, the accuracy of the address text to be processed is improved, and a basis is provided for the subsequent service development.

Fig. 5 is a schematic structural diagram of an address normalization processing apparatus according to a fifth embodiment of the present invention, and as shown in fig. 5, the apparatus includes: a memory 51, a processor 52;

the memory 51 is configured to store instructions executable by the processor 52;

wherein the processor 52 is configured to execute the address normalization processing method as described above.

Yet another embodiment of the present invention provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the address normalization processing method as described above when the computer-executable instructions are executed by a processor.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An address standardization processing method, comprising:

receiving an address text to be processed;

2. The method according to claim 1, wherein before labeling the level of each sub-address in the address text to be processed through a preset neural network model and obtaining the labeled address text to be processed, the method further comprises:

and training a preset model to be trained through the text to be trained after the sub-addresses are labeled, so as to obtain the preset neural network model.

3. The method according to claim 2, wherein before the training of the preset model to be trained through the text to be trained after the sub-addresses are labeled, and the obtaining of the preset neural network model, the method further comprises:

receiving a text to be trained;

removing useless punctuation marks in the text to be trained;

and marking the level of each sub-address in the text to be trained.

4. The method according to claim 3, wherein the removing useless punctuation marks in the text to be trained comprises:

and removing useless punctuation marks in the text to be trained by a regular matching method.

5. The method according to claim 3, wherein after the segmenting the text to be trained from which the useless punctuation marks are removed to obtain the sub-addresses corresponding to the text to be trained, the method further comprises:

coding each sub-address according to a preset coding mode;

converting each sub-address and the code corresponding to each sub-address into a text vector and a coding vector through a preset vector conversion model, and storing the text vector and the coding vector in association;

establishing, for the text vector and the coding vector corresponding to each sub-address, the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishing model;

the level labeling of each sub-address in the text to be trained includes:

and marking the level of each subaddress after the association relationship is established according to the preset subaddress level.

6. The method according to claim 1, wherein the processing, according to a preset standard address library, each sub-address in the labeled address text to be processed to obtain a standard address corresponding to the address text to be processed, comprises:

7. The method of claim 6, wherein calculating the similarity between the first sub-address and at least one correct address in the standard address library whose level is consistent with that of the first sub-address comprises:

calculating the font similarity and the pinyin similarity between the first sub-address and the correct address;

and calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity.

8. The method of claim 7, wherein the calculating the font similarity and pinyin similarity between the first sub-address and the correct address comprises:

calculating the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;

and calculating the pinyin similarity between the first sub-address and the correct address by at least one preset pinyin similarity calculation method.
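As a rough sketch of claim 8, the example below uses one possible "preset calculation method" for each similarity: a normalized edit distance. Character-level edit distance is only a stand-in for font similarity (the claim does not name a specific method), and `TOY_PINYIN` is a hypothetical two-entry-per-sound lookup table used so that the example is self-contained; a real system would use a full pinyin dictionary.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Hypothetical toy table; unknown characters fall back to themselves.
TOY_PINYIN = {"朝": "chao", "潮": "chao", "阳": "yang", "区": "qu"}


def to_pinyin(text):
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def font_similarity(sub_address, correct_address):
    # Assumption: character-level comparison stands in for glyph comparison.
    return edit_similarity(sub_address, correct_address)


def pinyin_similarity(sub_address, correct_address):
    # Compare syllable sequences so that homophones count as identical.
    return edit_similarity(to_pinyin(sub_address), to_pinyin(correct_address))
```

With these definitions, a homophone error such as "潮阳区" written for "朝阳区" differs in one of three characters but has an identical syllable sequence, which is the kind of case the pinyin score is meant to catch.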

9. The method of claim 8, wherein the calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity comprises:

and calculating the similarity between the first sub-address and the correct address by a weighted average method according to the font similarity and the pinyin similarity.
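The weighted average of claim 9 can be illustrated with a short sketch. The weights are assumptions: the claim does not state their values, so equal weights are used as a default, and the function name `combined_similarity` is hypothetical.

```python
def combined_similarity(font_sim, pinyin_sim, font_weight=0.5, pinyin_weight=0.5):
    """Weighted average of the font similarity and the pinyin similarity.

    Assumption: the weights are free parameters; they are normalized here
    so that the result stays in the same range as the inputs.
    """
    total = font_weight + pinyin_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (font_weight * font_sim + pinyin_weight * pinyin_sim) / total


score = combined_similarity(0.6, 1.0)  # equal weights
```

The resulting score is what would then be compared against the preset threshold referred to in claim 10; the threshold value itself is not given in the claims.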

10. The method according to claim 6, wherein after determining, for each of the correct addresses, whether the similarity between the correct address and the first sub-address is greater than a preset threshold, the method further comprises:

11. The method according to claim 6, wherein the sequentially determining the first sub-address with the lowest level in the address text to be processed according to a preset level order comprises:

if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library, until no second sub-address with a level greater than the first sub-address by one level exists in the address text to be processed, further comprising:

determining all levels marked as vacant in the address text to be processed;

12. The method of claim 6, wherein after determining whether the first sub-address is in the preset standard address library, the method further comprises:

13. An address standardization processing apparatus, comprising:

14. The apparatus of claim 13, further comprising:

and the training module is used for training a preset model to be trained through the text to be trained after the sub-addresses are labeled, so as to obtain the preset neural network model.

15. The apparatus of claim 14, further comprising:

the second receiving module is used for receiving the text to be trained;

and the second labeling module is used for labeling the level of each sub-address in the text to be trained.

16. The apparatus of claim 15, wherein the removal module comprises:

17. The apparatus of claim 15, further comprising:

the second labeling module comprises:

and the marking unit is used for marking, according to preset sub-address levels, the level of each sub-address after the association relationship is established.

18. The apparatus of claim 13, wherein the processing module comprises:

19. The apparatus according to claim 18, wherein the similarity calculation unit is specifically configured to:

20. The apparatus according to claim 19, wherein the similarity calculation unit is specifically configured to: calculate the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;

21. The apparatus according to claim 20, wherein the similarity calculation unit is specifically configured to:

22. The apparatus of claim 18, wherein the processing module further comprises:

23. The apparatus of claim 13, wherein the first sub-address determination unit is specifically configured to:

the processing module further comprises:

24. The apparatus of claim 18, wherein the processing module further comprises:

25. An address standardization processing apparatus, comprising: a memory and a processor;

the memory being used for storing instructions executable by the processor;

wherein the processor is configured to perform the address standardization processing method of any one of claims 1-12.

26. A computer-readable storage medium having computer-executable instructions stored therein, wherein the computer-executable instructions, when executed by a processor, implement the address standardization processing method according to any one of claims 1 to 12.