Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained on the basis of the embodiments in the present invention fall within the scope of protection of the present invention.
Fig. 1 is a schematic flow chart of an address normalization processing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 101, receiving an address text to be processed.
In this embodiment, address information for the same location can be described in many different ways, and the descriptions may be inaccurate or wrong, or may contain homophones, abbreviations and the like. This makes subsequent services based on the address information very difficult. For example, if the subsequent service is a logistics service and the address information input by the user is inaccurate, the goods may not be delivered to the user; if the subsequent service is a navigation service and the address information input by the user is inaccurate, it may be impossible to plan a correct route for the user, or a wrong route may be planned. Therefore, in order to improve the service quality and the user experience, after the address text input by the user is received, the address text needs to be standardized, that is, the non-standard address text input by the user is converted into standard and correct address information.
step 102, marking the level of each sub-address in the address text to be processed through a preset neural network model, and obtaining the marked address text to be processed.
In this embodiment, any address text to be processed includes a plurality of sub-addresses. For example, in the address Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing, the parts Beijing, Daxing District, Yizhuang, Kechuang 11th Street and Building A are different sub-addresses. It can be understood that different sub-addresses have different levels; for example, Beijing is at the city level and Daxing District is at the district level. Therefore, in order to conveniently standardize the address text to be processed, after the address text to be processed input by the user is received, the level of each sub-address in the address text to be processed can be labeled through the preset neural network model, and the address text to be processed labeled with the different levels can be obtained.
step 103, processing the sub-addresses in the labeled address text to be processed according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.
In this embodiment, after the level of each sub-address in the address text to be processed is labeled, the labeled address text to be processed may be processed against a preset standard address library, so as to standardize the address text to be processed. Specifically, the address text to be processed input by the user may be compared with the standard address library, so that erroneous information in the address text to be processed is corrected and missing information is supplemented. The standard address corresponding to the address text to be processed is thereby obtained, that is, the address text to be processed is standardized. The standard address library contains the standard names of all current address information, stores the standard names by level, and stores the association relationships among the addresses of the different levels.
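The standard address library described above can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class name `StandardAddressLibrary`, its methods, and the sample entries are hypothetical and are not taken from the embodiment; the level codes P1, P3 and P4 follow table 1.

```python
from collections import defaultdict


class StandardAddressLibrary:
    """Hypothetical in-memory library: standard names per level plus parent links."""

    def __init__(self):
        # level code (e.g. "P1") -> set of standard names stored at that level
        self.names_by_level = defaultdict(set)
        # (level, name) -> (parent level, parent name): the association relationship
        self.parent_of = {}

    def add(self, level, name, parent=None):
        self.names_by_level[level].add(name)
        if parent is not None:
            self.parent_of[(level, name)] = parent

    def contains(self, level, name):
        return name in self.names_by_level[level]

    def candidates(self, level):
        """All correct addresses whose level matches the given level."""
        return sorted(self.names_by_level[level])

    def complete(self, level, name):
        """Follow parent links to supplement the enclosing sub-addresses."""
        chain = [(level, name)]
        while chain[-1] in self.parent_of:
            chain.append(self.parent_of[chain[-1]])
        return list(reversed(chain))


library = StandardAddressLibrary()
library.add("P1", "Beijing")
library.add("P3", "Daxing District", parent=("P1", "Beijing"))
library.add("P4", "Yizhuang", parent=("P3", "Daxing District"))
```

Storing the parent link next to each name is one possible way to support both the correction of erroneous sub-addresses (lookup by level) and the supplementing of missing ones (walking the links).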
In the address standardization processing method provided by the embodiment, the address text to be processed is received; marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed; and processing the sub-addresses according to a preset standard address library aiming at each sub-address in the marked address text to be processed to obtain a standard address corresponding to the address text to be processed. Therefore, the standard address information corresponding to the address text to be processed can be quickly and accurately determined, the accuracy of address standardization can be improved, and in addition, the manual maintenance cost of the address text can be reduced.
Further, on the basis of the above embodiment, the method further includes:
receiving an address text to be processed;
training a preset model to be trained through the text to be trained after each sub-address is labeled to obtain the preset neural network model;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
and processing the sub-addresses according to a preset standard address library aiming at each sub-address in the marked address text to be processed to obtain a standard address corresponding to the address text to be processed.
In this embodiment, after the address text to be processed sent by the user is received, the level of each sub-address in the address text to be processed needs to be identified through a preset neural network model. Therefore, before the levels of the sub-addresses in the address text to be processed are identified, the preset neural network model needs to be established first. Specifically, the text to be trained in which each sub-address has been labeled can be obtained, the labeled text to be trained is randomly divided into a training set and a test set, and the parameters of the model to be trained are adjusted continuously until the identification result output by the model to be trained is sufficiently accurate, whereupon the preset neural network model is obtained. The address text to be processed can then be labeled by the neural network model, and the labeled address text to be processed can be processed against the preset standard address library, so that the address text to be processed is standardized.
In the address standardization processing method provided by this embodiment, the preset to-be-trained model is trained through the to-be-trained text labeled on each sub-address, so as to obtain the preset neural network model, so that the neural network model can be used to label the to-be-processed address text, and a basis is provided for subsequent address standardization.
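The random division of the labeled text into a training set and a test set can be sketched as follows. This is an illustrative sketch only: the function name `split_train_test`, the 20% test ratio and the fixed seed are assumptions, since the embodiment does not specify a ratio or a splitting procedure.

```python
import random


def split_train_test(samples, test_ratio=0.2, seed=0):
    """Randomly split labeled samples; ratio and seed are illustrative assumptions."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


labeled_texts = [f"labeled address {i}" for i in range(10)]
train_set, test_set = split_train_test(labeled_texts)
```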
Further, on the basis of any of the above embodiments, the method further includes:
receiving an address text to be processed;
receiving a text to be trained;
removing useless punctuation marks in the text to be trained;
segmenting words of the text to be trained without useless punctuations to obtain sub-addresses corresponding to the text to be trained;
marking the level of each sub-address in the text to be trained;
training a preset model to be trained through the text to be trained after each sub-address is labeled to obtain the preset neural network model;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
and processing the sub-addresses according to a preset standard address library aiming at each sub-address in the marked address text to be processed to obtain a standard address corresponding to the address text to be processed.
In this embodiment, before the preset model to be trained is trained through the text to be trained in which each sub-address has been labeled, the text to be trained needs to be preprocessed. Specifically, in some cases the text to be trained input by the user may include useless punctuation marks, for example "/" and the like. Therefore, in order to improve the efficiency of subsequent standardization, the useless punctuation marks in the text to be trained need to be removed first. Further, in the process of training the model, the model to be trained could process each character individually; however, since combinations of characters in the text to be trained carry specific meanings, in order to improve the accuracy of model identification, the text to be trained also needs to be segmented into words to obtain a plurality of sub-addresses corresponding to the text to be trained. For example, the address Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing may be segmented into seven sub-addresses, such as Beijing, Daxing District and Yizhuang. After the text to be trained is segmented into a plurality of sub-addresses, the level of each sub-address can be labeled. The preset model to be trained is then trained through the text to be trained in which each sub-address has been labeled, and the labeled address text to be processed is processed against the preset standard address library, so that the address text to be processed is standardized.
It should be noted that there are various methods for removing useless punctuations in the text to be trained, and any method may be adopted to remove the useless punctuations, which is not limited herein.
According to the address standardization processing method provided by the embodiment, useless characters in the text to be trained are removed in advance, and the text to be trained is segmented, so that the efficiency of subsequent model training can be improved, and a basis is provided for address standardization.
Further, among the various methods for removing useless punctuation marks from the text to be trained, the useless punctuation marks can specifically be removed by regular expression matching.
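As an illustration of removing useless punctuation marks by regular expression matching, the following minimal sketch may help. Which marks count as "useless" is not fixed by the embodiment; the pattern below (drop everything except letters, digits, Chinese characters, whitespace and hyphens) and the function name are assumptions made for the example.

```python
import re

# Assumption: anything that is not a word character, whitespace or a hyphen
# is treated as useless; the underscore is also dropped.
USELESS_PUNCTUATION = re.compile(r"[^\w\s\-]|_")


def remove_useless_punctuation(text):
    """Return the text with the assumed useless punctuation marks removed."""
    return USELESS_PUNCTUATION.sub("", text)


cleaned = remove_useless_punctuation("北京市/大兴区,亦庄!")
```

Hyphens are kept here because they often appear in building and room numbers; a different deployment could choose a different set.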
Fig. 2 is a schematic flow chart of an address normalization processing method according to a second embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 2, the method further includes:
step 201, receiving an address text to be processed;
step 202, receiving a text to be trained;
step 203, removing useless punctuation marks in the text to be trained;
step 204, segmenting the text to be trained without useless punctuations to obtain sub-addresses corresponding to the text to be trained;
step 205, encoding each sub-address according to a preset encoding mode;
step 206, converting each sub-address and the code corresponding to each sub-address into a text vector and a code vector through a preset vector conversion model, and storing the text vector and the code vector in a correlation manner;
step 207, for the text vector and the coding vector corresponding to each sub-address, establishing the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishment model;
step 208, marking the level of each subaddress after the association relationship is established according to the preset subaddress level;
step 209, training a preset model to be trained through the text to be trained after each sub-address is labeled, and obtaining the preset neural network model;
step 210, labeling the level of each sub-address in the address text to be processed through a preset neural network model, and obtaining a labeled address text to be processed;
step 211, for each sub-address in the labeled address text to be processed, processing the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.
In this embodiment, after removing useless characters in the text to be trained in advance and performing word segmentation on the text to be trained, in order to further enhance the integrity of the sub-addresses, each sub-address after word segmentation may be encoded in a preset encoding manner. Specifically, the first character in the sub-address may be coded as 1, the last character in the sub-address may be coded as 3, and any number of characters in the middle may be coded as 2, for example, the code corresponding to beijing is 13; the corresponding code of the great happy area is 123; the XX building corresponds to code 1223. It should be noted that, there may be multiple encoding manners, and any manner capable of enhancing the integrity of the sub-address may be selected to implement the encoding of the sub-address, which is not limited herein.
Furthermore, because the coded text to be trained needs to be input into the model to be trained in order to train the model, the text to be trained and its corresponding coding information also need to be converted into a form that the model can recognize. Therefore, the text to be trained and its corresponding coding information can be converted into a text vector and a coding vector through a preset vector conversion model. Because the text vectors and the coding vectors of the different sub-addresses are in one-to-one correspondence, the text vectors and the coding vectors need to be stored in association in order to represent this correspondence, and the text vectors and coding vectors stored in association are denoted as $(v_{11}, v_{12}, \ldots, v_{1n})$. It should be noted that there may be a plurality of vector conversion manners, and any manner capable of implementing vector conversion may be selected to convert the text to be trained and its corresponding coding information into vectors, which is not limited herein.
Further, since one text to be trained includes at least one sub-address, and there is an association relationship between the sub-addresses, in order to strengthen the association structure of the text to be trained, the text vectors and coding vectors stored in association, $(v_{11}, v_{12}, \ldots, v_{1n})$, need to be input, for each sub-address, into a preset association relationship establishment model, which establishes the association relationship between the text vectors and coding vectors of the current sub-address and those of the adjacent sub-addresses. The vectors for which the association relationship has been established are denoted as $(v_{21}, v_{22}, \ldots, v_{2n})$. Subsequently, for each sub-address, the information of the preceding and following sub-addresses can be determined from that sub-address. For example, still taking Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing, for the sub-address Daxing District it can be determined from the association relationship that the sub-address before Daxing District is Beijing and the sub-address after Daxing District is Yizhuang. It should be noted that any association relationship establishment model may be adopted to strengthen the association relationship between the sub-addresses, and the present invention is not limited herein. For example, a Bi-LSTM model can be used to strengthen the association relationship between the sub-addresses.
Further, after the association relationship between the text vectors and the coding vectors of adjacent sub-addresses is established through the preset association relationship establishment model, the level of each sub-address for which the association relationship has been established can be labeled according to the preset sub-address levels. Specifically, the vectors for which the association relationship has been established, $(v_{21}, v_{22}, \ldots, v_{2n})$, are input into a preset labeling model, and the level of each sub-address is labeled according to the preset sub-address levels to obtain a labeling result. In particular, the labeling model may be a CRF model. The preset sub-address levels are shown in table 1:
| Province | City | District | Street / community / village / town | Road / village | Road number | Residential community | Building number plate | Landmark |
| P1 | P2 | P3 | P4 | P5 | P5_ID | P6 | P6_ID | P7 |
TABLE 1
It should be noted that, in order to further increase the relevance of the characters within a sub-address, the sub-address levels may be labeled in the BIOES manner, where B represents begin, I represents inside, O represents outside, E represents end, and S represents single. Because every character in every sub-address is labeled, the relevance of the characters within the sub-address can be increased in addition to determining the level of the sub-address. Still taking Building A, Kechuang 11th Street, Yizhuang, Daxing District, Beijing as an example, the level corresponding to Beijing (北京) is P1. Accordingly, the first character (北) is labeled B-P1, indicating that its level is P1 and that it is the first character in the sub-address, and the second character (京) is labeled E-P1, indicating that its level is P1 and that it is the last character in the sub-address. Likewise, Daxing District (大兴区) corresponds to level P3: the character 大 is labeled B-P3, indicating that its level is P3 and that it is the first character in the sub-address; the character 兴 is labeled I-P3, indicating that its level is P3 and that it is a middle character in the sub-address; and the character 区 is labeled E-P3, indicating that its level is P3 and that it is the last character in the sub-address. Each sub-address is labeled in the manner described above.
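The BIOES labeling of each character can be illustrated as follows. This is a sketch of how labeled training text might be produced, not of the labeling model itself; the function names and the convention of passing `None` for text that belongs to no sub-address (labeled O) are assumptions.

```python
def bioes_tags(sub_address, level):
    """Tag each character of one sub-address with B/I/E/S plus its level."""
    n = len(sub_address)
    if n == 0:
        return []
    if n == 1:
        return [f"S-{level}"]
    return [f"B-{level}"] + [f"I-{level}"] * (n - 2) + [f"E-{level}"]


def label_address(segments):
    """segments: list of (text, level); a level of None marks text outside any sub-address."""
    labeled = []
    for text, level in segments:
        tags = ["O"] * len(text) if level is None else bioes_tags(text, level)
        labeled.extend(zip(text, tags))
    return labeled


example = label_address([("北京", "P1"), ("大兴区", "P3")])
```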
The text to be trained marked by the method is used for training the model to be trained, and a neural network model is obtained. Therefore, accurate marking can be carried out on the input address text to be processed according to the neural network model. And processing the marked address text to be processed and a preset standard address library to realize the standardization of the address text to be processed.
In the address standardization processing method provided by this embodiment, each sub-address and a code corresponding to each sub-address are converted into a text vector and a code vector through a preset vector conversion model, and the text vector and the code vector are stored in an associated manner; aiming at the text vector and the coding vector corresponding to each sub-address, establishing a model through a preset incidence relation to establish the incidence relation between the text vector and the coding vector of the adjacent sub-address; and level labeling is carried out on each subaddress after the incidence relation is established according to the preset subaddress level, and a labeled text to be trained is obtained, so that a model to be trained can be trained subsequently according to the text to be trained, a foundation is provided for subsequent labeling of the address text to be processed, and the accuracy of labeling the neural network model can be improved.
Fig. 3 is a schematic flow chart of an address normalization processing method according to a third embodiment of the present invention, where on the basis of any of the foregoing embodiments, as shown in fig. 3, the method includes:
step 301, receiving an address text to be processed;
step 302, marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain a marked address text to be processed;
step 303, sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;
step 304, judging whether the first sub-address is in a preset standard address library or not;
step 305, if not, calculating the similarity between the first sub-address and at least one correct address in the standard address base, wherein the at least one correct address is consistent with the first sub-address in level;
step 306, for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;
step 307, if yes, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;
step 308, judging whether a second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed;
step 309, if so, taking the second sub-address as the first sub-address, and returning to the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed.
In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. Specifically, the first sub-address with the lowest level in the address text to be processed is determined, where the level increases gradually from P1 to P7. The first sub-address is compared with the preset standard address library to judge whether the first sub-address exists in the standard address library. If it exists, the first sub-address contains no error. Otherwise, the first sub-address is written incorrectly; in this case, the similarity between the first sub-address and each of a plurality of correct addresses in the standard address library whose level is consistent with that of the first sub-address is calculated, and it is judged whether each similarity exceeds a preset first threshold. If a similarity exceeds the first threshold, the corresponding correct address may be the standard address corresponding to the first sub-address. Therefore, in order to improve the accuracy of address standardization, among the correct addresses whose similarity exceeds the preset first threshold, the one with the highest similarity is taken as the standard address corresponding to the first sub-address.
After the standard address corresponding to the first sub-address is determined, it can be judged whether the current address text to be processed includes a second sub-address whose level is one level higher than that of the first sub-address. If so, the second sub-address is taken as the current first sub-address, and the above steps are repeated until no second sub-address whose level is higher than that of the first sub-address exists in the address text to be processed. At that point, all sub-addresses in the current address text to be processed have been standardized, and the standard address corresponding to the address text to be processed is obtained.
In the address standardization processing method provided in this embodiment, a first sub-address with the lowest level in a text to be processed is determined, the first sub-address is compared with a preset standard address library, whether the first sub-address exists in the standard address library or not is determined, if the first sub-address does not exist in the standard address library, a standard address corresponding to the first sub-address is determined according to a similarity between the first sub-address and a correct address in the standard address library, and the above steps are repeatedly performed for the sub-address of each level, so that the standard address corresponding to the text to be processed can be obtained. The accuracy and efficiency of address standardization are improved.
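The level-by-level processing described above might be sketched as follows. Several points are assumptions rather than part of the embodiment: `difflib.SequenceMatcher` is used only as a stand-in similarity measure, the threshold 0.6 is arbitrary, an input sub-address is left unchanged when no correct address exceeds the threshold, and levels absent from the input are simply skipped instead of ending the loop.

```python
from difflib import SequenceMatcher

# Level codes in increasing order, following table 1.
LEVEL_ORDER = ["P1", "P2", "P3", "P4", "P5", "P5_ID", "P6", "P6_ID", "P7"]


def similarity(a, b):
    # Stand-in measure, not the font/pinyin similarity of the embodiment.
    return SequenceMatcher(None, a, b).ratio()


def standardize(labeled, library, threshold=0.6):
    """labeled: level -> sub-address text; library: level -> set of correct names."""
    result = {}
    for level in LEVEL_ORDER:
        if level not in labeled:
            continue  # simplification: skip levels missing from the input
        sub_address = labeled[level]
        correct = library.get(level, set())
        if sub_address in correct:
            result[level] = sub_address  # already a standard name
            continue
        scored = [(similarity(sub_address, name), name) for name in correct]
        above = [pair for pair in scored if pair[0] > threshold]
        # Assumption: keep the input unchanged when nothing exceeds the threshold.
        result[level] = max(above)[1] if above else sub_address
    return result


library = {"P1": {"Beijing"}, "P3": {"Daxing District", "Haidian District"}}
standardized = standardize({"P1": "Beijing", "P3": "Daxin District"}, library)
```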
Further, on the basis of the above embodiment, the method includes:
receiving an address text to be processed;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;
judging whether the first sub-address is in a preset standard address library or not;
if not, calculating the font similarity and the pinyin similarity between the first sub-address and the correct address;
calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity;
for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;
if so, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;
judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;
if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until the second sub-address with the level higher than the first sub-address does not exist in the address text to be processed.
In this embodiment, the address text to be processed input by the user may contain several kinds of errors. For example, there may be a font error, in which Haidian District is entered with a wrong character of similar shape, or a pinyin error, in which Haidian District is entered with a wrong character that has the same pronunciation. If the similarity between the first sub-address and the correct address is calculated only from the font similarity, the result is inaccurate for pinyin errors: the font similarity between a homophone misspelling and Haidian District is low, but their pinyin similarity is high. Therefore, in order to improve the accuracy of address standardization, the similarity between the first sub-address and the plurality of correct addresses in the standard address library whose level is consistent with that of the first sub-address can be calculated in two ways. Specifically, the font similarity and the pinyin similarity between the first sub-address and each correct address may be calculated, and the similarity between the first sub-address and each correct address at the same level may then be calculated from the pinyin similarity and the font similarity. The standard address determined according to this similarity is therefore more accurate.
The address standardization processing method provided by this embodiment can improve the accuracy of address standardization by calculating the pinyin similarity and the font similarity between the first sub-address and the correct address.
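To make the difference between the two similarities concrete, the following sketch compares a homophone misspelling with the correct name. Everything specific here is an assumption for illustration: the misspelling 海店区, the toy pinyin table `TOY_PINYIN` (a real system would need a full pinyin dictionary), the use of `difflib.SequenceMatcher` as the string measure, and the plain average used to combine the two values.

```python
from difflib import SequenceMatcher

# Hypothetical toy table covering only the characters used in this example.
TOY_PINYIN = {"海": "hai", "淀": "dian", "店": "dian", "区": "qu"}


def to_pinyin(text):
    """Map each character to its pinyin; unknown characters are kept as-is."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in text)


def font_similarity(a, b):
    # Compares the written characters directly.
    return SequenceMatcher(None, a, b).ratio()


def pinyin_similarity(a, b):
    # Compares the pronunciations instead of the characters.
    return SequenceMatcher(None, to_pinyin(a), to_pinyin(b)).ratio()


def combined_similarity(a, b):
    # Assumption: a plain average of the two similarities.
    return (font_similarity(a, b) + pinyin_similarity(a, b)) / 2


font_score = font_similarity("海店区", "海淀区")
pinyin_score = pinyin_similarity("海店区", "海淀区")
combined_score = combined_similarity("海店区", "海淀区")
```

In this example the two names differ in one of three characters but are pronounced identically, so the pinyin similarity is higher than the font similarity and raises the combined value.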
Further, on the basis of any of the above embodiments, the method comprises:
receiving an address text to be processed;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;
judging whether the first sub-address is in a preset standard address library or not;
if not, calculating the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;
calculating the pinyin similarity between the first sub-address and the correct address by at least one preset pinyin similarity calculation method;
calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity;
for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;
if so, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;
judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;
if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until the second sub-address with the level higher than the first sub-address does not exist in the address text to be processed.
In this embodiment, in order to further improve the accuracy of the similarity calculation, a plurality of font similarity and pinyin similarity calculation methods may be selected to calculate the similarity between the first sub-address and the correct address. Specifically, the font similarity between the first sub-address and the correct address may be calculated by any of various font similarity calculation methods, which is not limited herein; for example, it may be calculated by using character-level Jaro Distance, character-level Jaro-Winkler Distance, character-level Edit Distance, and the like. Correspondingly, the pinyin similarity between the first sub-address and the correct address may be calculated by any of various pinyin similarity calculation methods, and the present invention is not limited herein; for example, it may be calculated by using pinyin-level Jaro Distance, pinyin-level Jaro-Winkler Distance, pinyin-level Edit Distance, and the like.
The address standardization processing method provided by this embodiment calculates the pinyin similarity and the font similarity between the first sub-address and the correct address by using multiple methods, so as to improve the accuracy of address standardization.
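For reference, the three measures named above can be implemented with the standard textbook definitions, as in the sketch below. The function names, the Jaro-Winkler scaling factor of 0.1 with a prefix limit of 4, and the normalization of the edit distance into a similarity are conventional choices assumed here, not details given by the embodiment. The functions accept any strings, so they can be applied either to the characters of a sub-address or to its pinyin.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance normalized to a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, scaling=0.1):
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    base = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)
```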
Further, on the basis of any of the above embodiments, the method comprises:
receiving an address text to be processed;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;
judging whether the first sub-address is in a preset standard address library or not;
if not, calculating the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;
calculating the pinyin similarity between the first sub-address and the correct address by at least one preset pinyin similarity calculation method;
setting different weights for the font similarity calculated by each font similarity calculation method;
setting different weights for the pinyin similarity calculated by the pinyin similarity calculation method;
calculating the similarity between the first sub-address and the correct address by a weighted average method according to the font similarity and the pinyin similarity;
for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;
if so, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;
judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;
if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until the second sub-address with the level higher than the first sub-address does not exist in the address text to be processed.
In this embodiment, because the pinyin similarity and the font similarity between different sub-addresses and the standard address differ, in order to further improve the accuracy of the similarity between a sub-address and the standard address, different weights may be set for the font similarities calculated by the different font similarity calculation methods, different weights may be set for the pinyin similarities calculated by the different pinyin similarity calculation methods, and the similarity between the sub-address and the standard address may be calculated by a weighted average method according to the weight corresponding to each method. In general, because the pinyin similarity tends to be higher than the font similarity, a higher weight may be set for the pinyin similarity.
In the address standardization processing method provided by this embodiment, different weights are set for the font similarities calculated by different font similarity calculation methods, different weights are set for the pinyin similarities calculated by the pinyin similarity calculation methods, and the similarity between the sub-address and the standard address is calculated by using a weighted average method according to the weights corresponding to the respective methods, so that the accuracy of calculating the similarity between the sub-address and the standard address can be improved.
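The weighted average described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `weighted_similarity` and the example weights are assumptions, and the embodiment does not prescribe specific weight values.

```python
def weighted_similarity(font_scores, pinyin_scores, font_weights, pinyin_weights):
    """Weighted average over all font and pinyin similarity scores.

    Each score list holds one value per calculation method (for example
    Jaro, Jaro-Winkler, Edit Distance); each weight list holds the weight
    assigned to the corresponding method.
    """
    scores = list(font_scores) + list(pinyin_scores)
    weights = list(font_weights) + list(pinyin_weights)
    if len(scores) != len(weights):
        raise ValueError("each similarity score needs exactly one weight")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Example with assumed weights: pinyin similarity weighted three times as
# heavily as font similarity.
combined = weighted_similarity([0.5], [1.0], [1], [3])
```

With these assumed inputs, `combined` is (0.5 × 1 + 1.0 × 3) / 4 = 0.875.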
Further, on the basis of any of the above embodiments, the method further includes:
receiving an address text to be processed;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;
judging whether the first sub-address is in a preset standard address library or not;
if not, calculating the similarity between the first sub-address and at least one correct address in the standard address library, wherein the at least one correct address is consistent with the first sub-address in level;
for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;
and if not, sending a service handling request to the user, wherein the service handling request comprises the first sub-address and the address text to be processed, so that the user can manually process the first sub-address according to the service handling request.
In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. Specifically, the first sub-address with the lowest level in the address text to be processed is determined, where the levels increase gradually from P1 to P7. The first sub-address is compared with the preset standard address library to judge whether it exists in the standard address library. If so, the first sub-address contains no error. If not, the first sub-address is written incorrectly; in this case, the similarity between the first sub-address and a plurality of correct addresses in the standard address library that are consistent with the first sub-address in level is calculated, and it is judged whether each similarity exceeds the preset first threshold. If none of the similarities exceeds the preset first threshold, no standard address corresponding to the sub-address exists in the standard address library. In this case, to standardize the address text to be processed, a service handling request needs to be sent to the user, where the service handling request includes the first sub-address and the address text to be processed, so that the user can manually process the first sub-address according to the service handling request. It can be understood that, if the standard address corresponding to the first sub-address is determined manually, that standard address may be added to the standard address library, thereby expanding the standard address library.
In the address standardization processing method provided in this embodiment, when the similarity between the first sub-address and each of the plurality of correct addresses in the standard address library that are consistent with the first sub-address in level is lower than the preset first threshold, a service handling request is sent to the user, so that the user manually processes the first sub-address according to the service handling request, and thus standardization of all sub-addresses can be achieved.
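The decision flow of this embodiment (library lookup, similarity against same-level correct addresses, first-threshold check, and fallback to a service handling request) can be sketched as below. This is a hedged illustration: the function `resolve_sub_address`, the dictionary layout of the library, the threshold value 0.8, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all assumptions, and the place names are sample data only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the embodiment would combine font and pinyin scores."""
    return SequenceMatcher(None, a, b).ratio()


def resolve_sub_address(sub, level, library, address_text, threshold=0.8):
    """Return the standard address for `sub`, or a manual handling request.

    `library` maps a level label (e.g. "P4") to the correct addresses of that level.
    """
    candidates = library.get(level, [])
    if sub in candidates:
        # Already present in the standard address library: no error.
        return {"status": "standard", "address": sub}
    # Keep only the correct addresses whose similarity exceeds the first threshold.
    scored = [(similarity(sub, c), c) for c in candidates]
    above = [pair for pair in scored if pair[0] > threshold]
    if above:
        best = max(above, key=lambda pair: pair[0])
        return {"status": "corrected", "address": best[1]}
    # No candidate is similar enough: hand the sub-address over for manual processing.
    return {
        "status": "manual",
        "request": {"sub_address": sub, "address_text": address_text},
    }


def add_to_library(library, level, address):
    """Expand the library with a manually confirmed standard address."""
    entries = library.setdefault(level, [])
    if address not in entries:
        entries.append(address)
```

A manually confirmed address can then be passed to `add_to_library`, mirroring the library expansion described above.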
Further, on the basis of any of the above embodiments, the method further includes:
receiving an address text to be processed;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
sequentially determining a first subaddress with the lowest level in the address text to be processed according to a preset level sequence;
judging whether the first sub-address is in a preset standard address library or not;
if yes, judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;
if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until the second sub-address with the level higher than the first sub-address does not exist in the address text to be processed.
In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. Specifically, the first sub-address with the lowest level in the address text to be processed is determined, where the levels increase gradually from P1 to P7. The first sub-address is compared with the preset standard address library to judge whether it exists in the standard address library; if so, the first sub-address contains no error and is taken as the standard address. It is then determined whether the address text to be processed includes a second sub-address whose level is greater than that of the first sub-address by one level. If so, the second sub-address is taken as the current first sub-address, and the step of judging whether the first sub-address is in the preset standard address library is executed again, until no second sub-address whose level is greater than that of the first sub-address by one level exists in the address text to be processed. Accordingly, if there is no second sub-address whose level is greater than that of the first sub-address by one level, the first sub-address may be output as the current standard address.
In the address standardization processing method provided by this embodiment, when the first sub-address exists in the standard address library, the first sub-address is used as the standard address, so that standardization of the address text to be processed can be achieved.
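The level-by-level loop of this embodiment can be sketched as follows. The representation of the labeled address text as a dictionary from level label to sub-address, the function name `walk_levels`, and the sample names are assumptions for illustration; sub-addresses not found in the library are only collected here, since their correction is handled by the other embodiments.

```python
LEVEL_ORDER = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]  # lowest to highest level


def walk_levels(labeled, library):
    """Check each sub-address against the library, one level at a time.

    Starts at the lowest level present and stops as soon as no sub-address
    exists at the next level, as in the loop described above.
    Returns (accepted, pending) lists of (level, sub_address) pairs.
    """
    accepted, pending = [], []
    present = [lv for lv in LEVEL_ORDER if lv in labeled]
    if not present:
        return accepted, pending
    idx = LEVEL_ORDER.index(present[0])
    while idx < len(LEVEL_ORDER) and LEVEL_ORDER[idx] in labeled:
        level = LEVEL_ORDER[idx]
        sub = labeled[level]
        if sub in library.get(level, set()):
            accepted.append((level, sub))   # taken as the standard address
        else:
            pending.append((level, sub))    # needs similarity-based correction
        idx += 1
    return accepted, pending
```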
Further, on the basis of any of the above embodiments, the method further includes:
receiving an address text to be processed;
marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed;
detecting whether the address text to be processed comprises a sub-address corresponding to a first level with the lowest level in the level sequence or not according to a preset level sequence;
if so, taking the sub-address corresponding to the first level as the first sub-address;
if not, taking a second level with the level greater than the first level as the current first level, marking the first level as vacant, returning to execute the step of detecting whether the address text to be processed comprises the sub-address corresponding to the first level with the lowest level in the level sequence according to a preset level sequence until the address text to be processed comprises the sub-address corresponding to the first level;
judging whether the first sub-address is in a preset standard address library or not;
if not, calculating the similarity between the first sub-address and at least one correct address in the standard address library, wherein the at least one correct address is consistent with the first sub-address in level;
for each correct address, judging whether the similarity between the correct address and the first sub-address is greater than a preset first threshold value;
if so, determining a correct address with the highest similarity, and taking the correct address with the highest similarity as a standard address corresponding to the first sub-address;
judging whether a second sub-address with the level larger than the first sub-address by one level exists in the address text to be processed;
if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in a preset standard address library or not until no second sub-address with the level greater than the first sub-address by one level exists in the address text to be processed;
determining all levels marked as vacant in the address text to be processed;
and supplementing all levels marked as vacant in the address text to be processed according to the standard address library.
In this embodiment, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. When inputting the address text to be processed, the user may omit part of it; for example, for an address consisting of a province, a city and a county, the user may leave out the city. The omitted information can then be supplemented according to the other information. The search is performed according to the preset order of the levels, and the default starting level may be P1. If there is currently no sub-address with level P1, the current level is vacant: the level is marked as vacant, and the level being searched is automatically increased by 1, that is, the search continues with the sub-address with level P2. If there is a sub-address with level P2, the similarity between that sub-address and the correct addresses in the preset standard address library is calculated, and the correct address whose similarity exceeds the preset threshold and is the highest is used as the current standard address. The above steps are repeated for each level until every level has been processed. At this point, all vacant levels are determined; each vacant level corresponds to a sub-address that the user did not fill in, and the vacant sub-addresses can be supplemented according to the standard address library.
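The marking and supplementing of vacant levels can be sketched as follows. This is a minimal illustration under several assumptions that the embodiment does not state: the standard address library is modeled as a list of complete records (one dictionary per standard address), only levels up to the highest level present in the input are treated as vacant, and a vacancy is filled only when exactly one record matches the sub-addresses that are present. All names are hypothetical.

```python
LEVEL_ORDER = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]


def supplement_vacant_levels(labeled, records):
    """Mark missing levels as vacant and fill them from the library records.

    `labeled` maps level labels to (already standardized) sub-addresses.
    Returns (filled_address, vacant_levels).
    """
    present = [lv for lv in LEVEL_ORDER if lv in labeled]
    filled = dict(labeled)
    if not present:
        return filled, []
    last = LEVEL_ORDER.index(present[-1])
    # Levels skipped by the user, up to the highest level that was supplied.
    vacant = [lv for lv in LEVEL_ORDER[: last + 1] if lv not in labeled]
    # Library records that agree with every sub-address the user did supply.
    matches = [
        rec for rec in records
        if all(rec.get(lv) == labeled[lv] for lv in present)
    ]
    if len(matches) == 1:
        for lv in vacant:
            if lv in matches[0]:
                filled[lv] = matches[0][lv]
    return filled, vacant
```

When several records match, the sketch leaves the vacancy unfilled rather than guessing; how such ambiguity should be resolved is not specified in this embodiment.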
As an implementable manner, after the address text to be processed sent by the user is received and labeled through the preset neural network model, the labeled address text to be processed needs to be processed according to the preset standard address library. When inputting the address text to be processed, the user may omit part of it; for example, for an address consisting of a province, a city and a county, the user may leave out the city. The omitted information can then be supplemented according to the other information. Specifically, it is determined whether the level difference between any two sub-addresses in the address text to be processed exceeds a preset second threshold, where the second threshold may be set by the user or set by default in the system. If so, the address text to be processed can be compared with the preset standard address library so as to supplement the address text to be processed. For example, if the received address text to be processed is "Beijing Yizhuang Kechuang 11th Street Building A", where the level of "Beijing" is P1 and the level of "Yizhuang" is P4, the level difference between the two is 3. If this difference exceeds the preset second threshold, the address text to be processed can be compared with the preset standard address library to supplement the address text to be processed.
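The level-difference check in this implementable manner can be sketched as below. The function name `has_level_gap`, the default second threshold of 1, and the interpretation that the difference is measured between neighbouring sub-addresses in level order are assumptions; the embodiment leaves the threshold to the user or the system default.

```python
def has_level_gap(levels, second_threshold=1):
    """Return True if two neighbouring sub-addresses are too far apart in level.

    `levels` is an iterable of level labels such as "P1" or "P4".
    """
    numbers = sorted(int(label[1:]) for label in levels)
    return any(
        higher - lower > second_threshold
        for lower, higher in zip(numbers, numbers[1:])
    )


# The example from the text: P1 followed directly by P4 gives a difference of 3.
needs_supplement = has_level_gap(["P1", "P4", "P5", "P6"])
```

With the assumed threshold of 1, `needs_supplement` is true for the example above, so the address text would be compared with the standard address library to fill the missing levels.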
According to the address standardization processing method provided by the embodiment, the sub-addresses are processed according to the preset level sequence, so that the vacant sub-addresses which are not filled by the user can be supplemented on the basis of correcting the error sub-addresses, the filling of the address text to be processed can be further realized, the accuracy of the address text to be processed is improved, and a basis is provided for the subsequent service development.
Fig. 4 is a schematic structural diagram of an address normalization processing apparatus according to a fourth embodiment of the present invention. As shown in Fig. 4, the apparatus includes:
the first receiving module 41 is configured to receive the address text to be processed.
And the first labeling module 42 is configured to label, through a preset neural network model, the level of each sub-address in the address text to be processed, so as to obtain a labeled address text to be processed.
And the processing module 43 is configured to, for each sub-address in the labeled address text to be processed, process the sub-address according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.
The address standardization processing device provided by the embodiment receives the address text to be processed; marking the level of each sub-address in the address text to be processed through a preset neural network model to obtain the marked address text to be processed; and processing the sub-addresses according to a preset standard address library aiming at each sub-address in the marked address text to be processed to obtain a standard address corresponding to the address text to be processed. Therefore, the standard address information corresponding to the address text to be processed can be quickly and accurately determined, the accuracy of address standardization can be improved, and in addition, the manual maintenance cost of the address text can be reduced.
Further, on the basis of the above embodiment, the apparatus further includes:
the first receiving module is used for receiving the address text to be processed;
the training module is used for training a preset model to be trained through the text to be trained after the sub-addresses are labeled, so as to obtain the preset neural network model;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
and the processing module is used for processing the subaddresses in the labeled address text to be processed according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.
The address standardization processing device provided by this embodiment trains a preset model to be trained through a text to be trained after labeling each sub-address, so as to obtain the preset neural network model, and thus, the neural network model can be used to label the address text to be processed, thereby providing a basis for subsequent address standardization.
Further, on the basis of any one of the above embodiments, the apparatus further includes:
the first receiving module is used for receiving the address text to be processed;
the second receiving module is used for receiving the text to be trained;
the removing module is used for removing useless punctuation marks in the text to be trained;
the segmentation module is used for segmenting the text to be trained without the useless punctuations to obtain each sub-address corresponding to the text to be trained;
the second labeling module is used for labeling the level of each sub-address in the text to be trained;
the training module is used for training a preset model to be trained through the text to be trained after the sub-addresses are labeled, so as to obtain the preset neural network model;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
and the processing module is used for processing the subaddresses in the labeled address text to be processed according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.
The address standardization processing device provided by the embodiment can improve the efficiency of subsequent model training and provide a basis for address standardization by removing useless characters in the text to be trained in advance and performing word segmentation on the text to be trained.
Further, there are various methods for removing useless punctuation marks in a text to be trained, and specifically, the removing module includes:
and the removing unit is used for removing useless punctuation marks in the text to be trained by a regular matching method.
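A regular matching (regular expression) approach to removing useless punctuation marks can be sketched as follows. Which characters count as useless is an assumption made here for illustration; characters such as the hyphen and "#" are deliberately kept because they often carry meaning in addresses (for example in unit numbers). The names `USELESS_CHARS` and `remove_useless_punctuation` are hypothetical.

```python
import re

# Assumed set of characters treated as useless in address text
# (ASCII and full-width punctuation, plus spaces and tabs).
USELESS_CHARS = ",.;:!?\"'()[]，。；：！？、（）【】“”‘’ \t"

# Build a character class from the set; re.escape makes every character literal.
USELESS_PATTERN = re.compile("[" + re.escape(USELESS_CHARS) + "]+")


def remove_useless_punctuation(text):
    """Delete every run of useless characters from the text to be trained."""
    return USELESS_PATTERN.sub("", text)
```

For instance, applying the function to the made-up input "北京市，大兴区。(A座)" yields "北京市大兴区A座", which can then be passed to the segmentation module.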
Further, on the basis of any one of the above embodiments, the apparatus further includes:
the first receiving module is used for receiving the address text to be processed;
the second receiving module is used for receiving the text to be trained;
the removing module is used for removing useless punctuation marks in the text to be trained;
the segmentation module is used for segmenting the text to be trained without the useless punctuations to obtain each sub-address corresponding to the text to be trained;
the coding module is used for coding each sub-address according to a preset coding mode;
the vector conversion module is used for converting each sub-address and the code corresponding to each sub-address into a text vector and a code vector through a preset vector conversion model, and storing the text vector and the code vector in a correlation manner;
the association module is used for establishing, for the text vector and the coding vector corresponding to each sub-address, the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishment model;
the second labeling module comprises:
the marking unit is used for marking the level of each subaddress after the association relationship is established according to the preset subaddress level;
the training module is used for training a preset model to be trained through the text to be trained after the sub-addresses are labeled, so as to obtain the preset neural network model;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
and the processing module is used for processing the subaddresses in the labeled address text to be processed according to a preset standard address library to obtain a standard address corresponding to the address text to be processed.
The address standardization processing device provided in this embodiment converts each sub-address and the code corresponding to each sub-address into a text vector and a coding vector through a preset vector conversion model, and stores the text vector and the coding vector in association; for the text vector and the coding vector corresponding to each sub-address, it establishes the association relationship between the text vectors and the coding vectors of adjacent sub-addresses through a preset association relationship establishment model; and it labels the level of each sub-address after the association relationship is established according to the preset sub-address levels, obtaining a labeled text to be trained. The model to be trained can then be trained on this text, which provides a basis for the subsequent labeling of the address text to be processed and can improve the labeling accuracy of the neural network model.
Further, on the basis of any of the above embodiments, the apparatus comprises:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
the similarity calculation unit is used for calculating, if not, the similarity between the first sub-address and at least one correct address in the standard address library that is consistent with the first sub-address in level;
a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;
the standard address determining unit is used for determining, if so, the correct address with the highest similarity, and taking the correct address with the highest similarity as the standard address corresponding to the first sub-address;
a third judging unit, configured to judge whether a second sub-address whose level is greater than the first sub-address by one level exists in the address text to be processed;
and the first circulation unit is used for, if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is greater than that of the first sub-address by one level exists in the address text to be processed.
The address standardization processing apparatus provided in this embodiment determines a first sub-address with a lowest level in a text to be processed, compares the first sub-address with a preset standard address library, determines whether the first sub-address exists in the standard address library, determines a standard address corresponding to the first sub-address according to a similarity between the first sub-address and a correct address in the standard address library if the first sub-address does not exist in the standard address library, and repeats the above steps for each level of sub-address, so that a standard address corresponding to the text to be processed can be obtained. The accuracy and efficiency of address standardization are improved.
Further, on the basis of the above embodiment, the apparatus includes:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
the similarity calculation unit is specifically configured to: if not, calculating the font similarity and the pinyin similarity between the first sub-address and the correct address;
calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity;
a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;
the standard address determining unit is used for determining, if so, the correct address with the highest similarity, and taking the correct address with the highest similarity as the standard address corresponding to the first sub-address;
a third judging unit, configured to judge whether a second sub-address whose level is greater than the first sub-address by one level exists in the address text to be processed;
and the first circulation unit is used for, if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is greater than that of the first sub-address by one level exists in the address text to be processed.
The address standardization processing device provided by the embodiment can improve the accuracy of address standardization by calculating the pinyin similarity and the font similarity between the first sub-address and the correct address.
Further, on the basis of any of the above embodiments, the apparatus comprises:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
the similarity calculation unit is specifically configured to:
if not, calculating the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;
calculating the pinyin similarity between the first sub-address and the correct address by at least one preset pinyin similarity calculation method;
calculating the similarity between the first sub-address and the correct address according to the font similarity and the pinyin similarity;
a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;
the standard address determining unit is used for determining, if so, the correct address with the highest similarity, and taking the correct address with the highest similarity as the standard address corresponding to the first sub-address;
a third judging unit, configured to judge whether a second sub-address whose level is greater than the first sub-address by one level exists in the address text to be processed;
and the first circulation unit is used for, if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is greater than that of the first sub-address by one level exists in the address text to be processed.
The address standardization processing device provided by this embodiment calculates the pinyin similarity and the font similarity between the first sub-address and the correct address by using multiple methods, so as to improve the accuracy of address standardization.
Further, on the basis of any of the above embodiments, the apparatus comprises:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
the similarity calculation unit is specifically configured to: if not, calculating the font similarity between the first sub-address and the correct address by at least one preset font similarity calculation method;
calculating the pinyin similarity between the first sub-address and the correct address by at least one preset pinyin similarity calculation method;
setting different weights for the font similarity calculated by each font similarity calculation method;
setting different weights for the pinyin similarity calculated by each pinyin similarity calculation method;
calculating the similarity between the first sub-address and the correct address by a weighted average method according to the font similarity and the pinyin similarity;
a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;
the standard address determining unit is used for determining, if so, the correct address with the highest similarity, and taking the correct address with the highest similarity as the standard address corresponding to the first sub-address;
a third judging unit, configured to judge whether a second sub-address whose level is greater than the first sub-address by one level exists in the address text to be processed;
and the first circulation unit is used for, if so, taking the second sub-address as the first sub-address, and returning to execute the step of judging whether the first sub-address is in the preset standard address library, until no second sub-address whose level is greater than that of the first sub-address by one level exists in the address text to be processed.
In the address normalization processing apparatus provided in this embodiment, different weights are set for the font similarities calculated by different font similarity calculation methods, different weights are set for the pinyin similarities calculated by the pinyin similarity calculation methods, and the similarity between the sub-address and the standard address is calculated by using a weighted average method according to the weights corresponding to the respective methods, so that the accuracy of calculating the similarity between the sub-address and the standard address can be improved.
Further, on the basis of any one of the above embodiments, the apparatus further includes:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
the similarity calculation unit is used for calculating, if the first sub-address is not in the standard address library, the similarity between the first sub-address and at least one correct address in the standard address library whose level is consistent with that of the first sub-address;
a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;
and if not, a service handling request is sent to the user, wherein the service handling request comprises the first sub-address and the address text to be processed, so that the user can manually process the first sub-address according to the service handling request.
The address normalization processing apparatus provided in this embodiment sends a service handling request to the user when the similarity between the first sub-address and each of the correct addresses in the standard address library whose level is consistent with that of the first sub-address is not greater than the preset first threshold, so that the user manually processes the first sub-address according to the service handling request, thereby standardizing all sub-addresses.
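The threshold check and the fallback to manual handling might look like the sketch below. It rests on stated assumptions: `difflib.SequenceMatcher` stands in for the font and pinyin similarity methods, which the embodiment does not specify at this point, and the function name `match_or_escalate`, the dictionary layout of the request, and the threshold value 0.8 are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    # Stand-in similarity measure (assumption); the embodiment combines
    # font and pinyin similarities instead.
    return SequenceMatcher(None, a, b).ratio()


def match_or_escalate(first_sub_address: str,
                      address_text: str,
                      same_level_addresses: List[str],
                      first_threshold: float = 0.8) -> Dict[str, str]:
    """Return the best correct address above the threshold, or a manual request."""
    scored = [(similarity(first_sub_address, c), c) for c in same_level_addresses]
    above = [pair for pair in scored if pair[0] > first_threshold]
    if above:
        # Take the correct address with the highest similarity.
        return {"status": "matched", "standard_address": max(above)[1]}
    # No candidate exceeds the threshold: build a service handling request
    # carrying the first sub-address and the full address text.
    return {"status": "manual",
            "first_sub_address": first_sub_address,
            "address_text": address_text}


districts = ["Haidian District", "Chaoyang District"]  # hypothetical library entries
ok = match_or_escalate("Haidan District", "Beijing Haidan District", districts)
manual = match_or_escalate("Xyz", "Beijing Xyz", districts)
```

Here `ok` selects a correct address, while `manual` carries the data a user would need to correct the sub-address by hand.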
Further, on the basis of any one of the above embodiments, the apparatus further includes:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is used for sequentially determining a first sub-address with the lowest level in the address text to be processed according to a preset level sequence;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
a fifth judging unit, configured to judge, if the first sub-address is in the standard address library, whether a second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed;
and if so, taking the second sub-address as the first sub-address, and returning to the step of judging whether the first sub-address is in a preset standard address library, until no second sub-address whose level is one level higher than that of the first sub-address exists in the address text to be processed.
The address standardization processing apparatus provided in this embodiment can realize standardization of the address text to be processed by using the first sub-address as the standard address when the first sub-address exists in the standard address library.
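The level-by-level membership check can be illustrated with a short loop. This is a sketch under assumptions: the level names, the direction of the level order, the per-level set layout of the library, and the function name `split_by_library` are hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, Set, Tuple


def split_by_library(labeled: Dict[str, str],
                     level_order: List[str],
                     library: Dict[str, Set[str]]) -> Tuple[Dict[str, str], List[str]]:
    """Walk the levels in order; keep sub-addresses found in the library as standard."""
    standard: Dict[str, str] = {}
    unresolved: List[str] = []
    for level in level_order:
        sub_address = labeled.get(level)
        if sub_address is None:
            continue  # no sub-address was labeled at this level
        if sub_address in library.get(level, set()):
            standard[level] = sub_address  # already a standard address
        else:
            unresolved.append(level)  # needs similarity matching instead
    return standard, unresolved


# Hypothetical level order and library contents (assumed for illustration).
level_order = ["district", "city"]
library = {"city": {"Shenzhen"}, "district": {"Nanshan"}}
labeled = {"city": "Shenzhen", "district": "Nanshann"}

standard, unresolved = split_by_library(labeled, level_order, library)
```

Sub-addresses listed in `unresolved` would then go through the similarity calculation described in the other embodiments.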
Further, on the basis of any one of the above embodiments, the apparatus further includes:
the first receiving module is used for receiving the address text to be processed;
the first labeling module is used for labeling the level of each sub-address in the address text to be processed through a preset neural network model to obtain a labeled address text to be processed;
the processing module comprises:
the first sub-address determining unit is specifically configured to: detect, according to a preset level sequence, whether the address text to be processed comprises a sub-address corresponding to a first level, the first level being the lowest level in the level sequence;
if so, take the sub-address corresponding to the first level as the first sub-address;
if not, mark the first level as vacant, take a second level higher than the first level as the current first level, and return to the step of detecting whether the address text to be processed comprises a sub-address corresponding to the first level, until the address text to be processed comprises a sub-address corresponding to the first level;
the first judging unit is used for judging whether the first sub-address is in a preset standard address library or not;
the similarity calculation unit is used for calculating, if the first sub-address is not in the standard address library, the similarity between the first sub-address and at least one correct address in the standard address library whose level is consistent with that of the first sub-address;
a second determining unit, configured to determine, for each correct address, whether a similarity between the correct address and the first sub-address is greater than a preset first threshold;
a standard address determining unit, configured to, if so, determine the correct address with the highest similarity, and use the correct address with the highest similarity as the standard address corresponding to the first sub-address;
a third judging unit, configured to judge whether a second sub-address whose level is greater than the first sub-address by one level exists in the address text to be processed;
a first circulation unit, configured to, if yes, use the second sub-address as the first sub-address, and return to perform the step of determining whether the first sub-address is in a preset standard address library, until there is no second sub-address in the to-be-processed address text whose level is one level greater than that of the first sub-address;
the processing module further comprises:
a vacancy level determination unit, configured to determine all levels labeled as vacancies in the address text to be processed;
and the supplement unit is used for supplementing all levels marked as vacant in the address text to be processed according to the standard address library.
The address normalization processing apparatus provided in this embodiment processes the sub-addresses according to the preset level sequence, so that, in addition to correcting wrong sub-addresses, it can supplement the vacant sub-addresses that the user did not fill in. The address text to be processed is thereby completed, its accuracy is improved, and a basis is provided for subsequent services.
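Vacancy marking and supplementation can be sketched as follows. This is a minimal illustration under assumptions: the embodiment does not say how the standard address library is organized, so the sketch assumes it holds complete address records keyed by level, and the level names and function names are hypothetical. A vacant level is filled only when every library record matching the known sub-addresses agrees on its value.

```python
from typing import Dict, List

LEVEL_ORDER = ["province", "city", "district"]  # hypothetical level names


def find_vacant_levels(labeled: Dict[str, str], level_order: List[str]) -> List[str]:
    """Return the levels for which the address text has no sub-address."""
    return [level for level in level_order if level not in labeled]


def supplement(labeled: Dict[str, str],
               library: List[Dict[str, str]],
               level_order: List[str]) -> Dict[str, str]:
    """Fill vacant levels from library records consistent with the known levels."""
    matches = [rec for rec in library
               if all(rec.get(level) == value for level, value in labeled.items())]
    filled = dict(labeled)
    for level in find_vacant_levels(labeled, level_order):
        values = {rec[level] for rec in matches if level in rec}
        if len(values) == 1:  # fill only when the library is unambiguous
            filled[level] = values.pop()
    return filled


# Hypothetical library of complete standard addresses.
library = [
    {"province": "Guangdong", "city": "Shenzhen", "district": "Nanshan"},
    {"province": "Guangdong", "city": "Guangzhou", "district": "Tianhe"},
]

completed = supplement({"city": "Shenzhen", "district": "Nanshan"}, library, LEVEL_ORDER)
ambiguous = supplement({"province": "Guangdong"}, library, LEVEL_ORDER)
```

In the second call the city and district stay vacant, because two library records match the known province and disagree on the lower levels.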
Fig. 5 is a schematic structural diagram of an address normalization processing apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the apparatus includes: a memory 51 and a processor 52;
the memory 51 is used for storing instructions executable by the processor 52;
wherein the processor 52 is configured to execute the address normalization processing method as described above.
Yet another embodiment of the present invention provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the address normalization processing method as described above when the computer-executable instructions are executed by a processor.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above-described method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.