CN111984748A

CN111984748A - Address information processing method and device, storage medium and electronic equipment

Info

Publication number: CN111984748A
Application number: CN201910430569.4A
Authority: CN
Inventors: 周立勇; 周立
Original assignee: Shenzhen Zhong Xing Credex Finance Technology Co ltd
Current assignee: Qianhai feisuan Technology (Shenzhen) Co.,Ltd.
Priority date: 2019-05-22
Filing date: 2019-05-22
Publication date: 2020-11-24

Abstract

The present disclosure relates to an address information processing method and apparatus, a storage medium, and an electronic device. The method includes: determining a first road name sentence segment and a first cell name sentence segment in a first address character string, and determining a second road name sentence segment and a second cell name sentence segment in a second address character string; calculating a first similarity between the first road name sentence segment and the second road name sentence segment, and calculating a second similarity between the first cell name sentence segment and the second cell name sentence segment; and if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold, determining that the first address character string and the second address character string correspond to the same cell.

Description

Address information processing method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of information processing, and in particular, to an address information processing method and apparatus, a storage medium, and an electronic device.

Background

In many service scenarios, address information filled in by a user needs to be analyzed and used, but the user may fill it in incorrectly or incompletely. For example, some users omit or misfill part of the information, such as leaving out a road number or entering a wrong character. In other application scenarios, the address information is generated by performing character recognition on text handwritten or photographed by the user, and misrecognition may introduce wrong, missing, or extra characters.

Incorrect and missing parts of the address information interfere with its analysis, which prevents the address it points to from being located accurately and adversely affects the use of the address information.

Disclosure of Invention

The present disclosure aims to provide an address information processing method and apparatus, a storage medium, and an electronic device, so as to solve the problem in the related art that address information containing errors or omissions is not analyzed and processed accurately enough.

In order to achieve the above object, a first aspect of the present disclosure provides an address information processing method, including: determining a first road name sentence segment and a first cell name sentence segment in a first address character string, and determining a second road name sentence segment and a second cell name sentence segment in a second address character string; calculating a first similarity between the first road name sentence segment and the second road name sentence segment, and calculating a second similarity between the first cell name sentence segment and the second cell name sentence segment; and if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold, determining that the first address character string and the second address character string correspond to the same cell.

Optionally, determining a road name sentence segment and a cell name sentence segment in an address character string includes: searching the address character string for clause words belonging to a preset clause word set, where the preset clause word set includes the following types of clause words: administrative district name clause words, road name clause words, cell name clause words, numeral clause words, and quantifier clause words;

taking the clause words as start words and/or end words of sentence segments, and dividing the address character string into one or more sentence segments that begin with a start word and/or end with an end word; and determining the road name sentence segment and the cell name sentence segment from the sentence segments according to the clause words included in each sentence segment and a correspondence between clause words and sentence segment types. The correspondence between clause words and sentence segment types includes: a sentence segment that includes a road name clause word is a road name sentence segment; and a sentence segment that includes a cell name clause word is a cell name sentence segment.

Optionally, the administrative district name clause words include any of the following characters or character combinations: district, town, street office, new district, industrial park, development district; the road name clause words include any of the following characters or character combinations: roads, avenues, communities, and blocks; the cell name clause words include any of the following characters or character combinations: village, new village, Estate, new Estate, district, apartment, garden, home, mansion, pub, villa, neighborhood; the quantifier clause words include any of the following characters or character combinations: number, frame, section, layer, room, unit, building, period; the numeral clause words include any of the following characters or character combinations: Arabic numerals, Chinese numerals, Roman numerals, uppercase English letters, lowercase English letters, and Chinese Heavenly Stem characters.

Optionally, the correspondence between clause words and sentence segment types includes: a sentence segment that includes a numeral clause word and a quantifier clause word with no intervening characters between them is a number sentence segment. Determining the cell name sentence segment in the address character string includes: judging whether a road number sentence segment exists in the address character string, where the road number sentence segment is a number sentence segment that follows the road name sentence segment with no intervening characters; if the road number sentence segment exists, determining the character string after the road number sentence segment and before the first numeral clause word following it as the cell name sentence segment; and if the road number sentence segment does not exist, determining the character string after the road name sentence segment and before the first numeral clause word following it as the cell name sentence segment.
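The cell name rule above can be sketched in Python as follows. This is only an illustrative sketch, not the claimed implementation: the clause word sets `ROAD_WORDS`, `QUANTIFIERS`, and `NUMERALS`, the function name `extract_cell_name`, and the sample address are assumptions introduced here, and the numerals are restricted to Arabic digits for simplicity.

```python
import re

# Assumed, heavily reduced clause word sets (hypothetical, for illustration only).
ROAD_WORDS = ("大道", "路")   # road name clause words
QUANTIFIERS = "号弄栋层室"     # quantifier clause words
NUMERALS = "0-9"              # numeral clause words (Arabic digits only here)


def extract_cell_name(address):
    """Return the text between the road (number) segment and the next numeral."""
    road = re.search("|".join(ROAD_WORDS), address)
    if road is None:
        return None  # no road name segment found
    rest = address[road.end():]
    # A road number segment is a numeral immediately followed by a quantifier,
    # directly after the road name segment with no intervening characters.
    road_number = re.match(rf"[{NUMERALS}]+[{QUANTIFIERS}]", rest)
    if road_number:
        rest = rest[road_number.end():]
    # The cell name runs up to the first numeral clause word that follows.
    first_numeral = re.search(rf"[{NUMERALS}]", rest)
    return rest[:first_numeral.start()] if first_numeral else rest
```

With these assumed word sets, an address with a road number and the same address without one both yield the same cell name, because the optional road number segment is skipped before the cell name is read.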

Optionally, after determining that the first address character string and the second address character string correspond to the same cell, the method further includes: acquiring a first numeral sequence and a second numeral sequence, where the first numeral sequence consists of the numeral clause words that follow the first cell name sentence segment in the first address character string, arranged in their original order, and the second numeral sequence consists of the numeral clause words that follow the second cell name sentence segment in the second address character string, arranged in their original order; judging whether the numeral clause words at the same positions in the first numeral sequence and the second numeral sequence are numerals with the same meaning; and if so, determining that the first address character string and the second address character string correspond to the same address.
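A minimal Python sketch of this position-by-position numeral comparison is given below. The names `normalize` and `same_numeral_sequence`, the small Chinese digit table, and the choice to treat sequences of different lengths as not matching are assumptions made for illustration; the sketch does not handle positional Chinese numerals, Roman numerals, or Heavenly Stem characters.

```python
# Assumed mapping of single Chinese digit characters to their values.
CHINESE_DIGITS = {"〇": 0, "零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                  "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}


def normalize(token):
    """Map a numeral clause word to a canonical form so equal meanings compare equal."""
    if token and token.isascii() and token.isdigit():
        return str(int(token))            # "01" and "1" mean the same number
    if token and all(ch in CHINESE_DIGITS for ch in token):
        digits = "".join(str(CHINESE_DIGITS[ch]) for ch in token)
        return str(int(digits))           # digit-by-digit reading only
    return token.casefold()               # "A" and "a" are treated as equal


def same_numeral_sequence(first, second):
    """True if the numerals at each position have the same meaning."""
    if len(first) != len(second):         # assumption: lengths must match
        return False
    return all(normalize(a) == normalize(b) for a, b in zip(first, second))
```

For instance, under these assumptions the sequences `["1", "A", "01"]` and `["一", "a", "1"]` compare as equal, while `["1", "2"]` and `["1", "3"]` do not.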

Optionally, after determining that the first address character string and the second address character string correspond to the same cell, the method further includes: storing, in association with one another in a cell name database, a first road name combination sentence segment formed by the first road name sentence segment and the first road number sentence segment that follows it, a second road name combination sentence segment formed by the second road name sentence segment and the second road number sentence segment that follows it, the first cell name sentence segment, and the second cell name sentence segment; the cell name database is used, when target address information input by a user is received, to search for a cell name or a road name matching the target address information.

Optionally, calculating the similarity of a first sentence segment and a second sentence segment includes: determining the similarity according to the number of identical characters in the first sentence segment and the second sentence segment and a target number of swap operations; the target number of swap operations is the minimum number of character swap operations required when a swap operation, which exchanges the positions of any two characters, is performed repeatedly until the order of the target characters in the first sentence segment matches the order of the target characters in the second sentence segment; a target character is a character present in both the first sentence segment and the second sentence segment.

Optionally, determining the similarity according to the number of identical characters and the target number of swap operations includes: determining the ratio of the number of identical characters in the first sentence segment and the second sentence segment to the total number of characters appearing in the first sentence segment and the second sentence segment as an intersection similarity; determining the ratio of the difference between the number of target characters and the target number of swap operations to the number of target characters as an order similarity; the similarity is the product of the intersection similarity and the order similarity.

In a second aspect of the present disclosure, an address information processing apparatus is provided, the apparatus including: a determining module, configured to determine a first road name sentence segment and a first cell name sentence segment in a first address character string and to determine a second road name sentence segment and a second cell name sentence segment in a second address character string; a calculation module, configured to calculate a first similarity between the first road name sentence segment and the second road name sentence segment and a second similarity between the first cell name sentence segment and the second cell name sentence segment; and a processing module, configured to determine that the first address character string and the second address character string correspond to the same cell if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold.

Optionally, the determining module includes: a searching submodule, configured to search the address character string for clause words belonging to a preset clause word set, where the preset clause word set includes the following types of clause words: administrative district name clause words, road name clause words, cell name clause words, numeral clause words, and quantifier clause words; a sentence segmentation submodule, configured to take the clause words as start words and/or end words of sentence segments and to divide the address character string into one or more sentence segments that begin with a start word and/or end with an end word; and a determining submodule, configured to determine the road name sentence segment and the cell name sentence segment from the sentence segments according to the clause words included in each sentence segment and a correspondence between clause words and sentence segment types. The correspondence between clause words and sentence segment types includes: a sentence segment that includes a road name clause word is a road name sentence segment; and a sentence segment that includes a cell name clause word is a cell name sentence segment.

Optionally, the correspondence between clause words and sentence segment types includes: a sentence segment that includes a numeral clause word and a quantifier clause word with no intervening characters between them is a number sentence segment. The determining submodule is further configured to judge whether a road number sentence segment exists in the address character string, where the road number sentence segment is a number sentence segment that follows the road name sentence segment with no intervening characters; if the road number sentence segment exists, to determine the character string after the road number sentence segment and before the first numeral clause word following it as the cell name sentence segment; and if the road number sentence segment does not exist, to determine the character string after the road name sentence segment and before the first numeral clause word following it as the cell name sentence segment.

Optionally, the apparatus further includes: an obtaining module, configured to obtain a first numeral sequence and a second numeral sequence, where the first numeral sequence consists of the numeral clause words that follow the first cell name sentence segment in the first address character string, arranged in their original order, and the second numeral sequence consists of the numeral clause words that follow the second cell name sentence segment in the second address character string, arranged in their original order; and a judging module, configured to judge whether the numeral clause words at the same positions in the first numeral sequence and the second numeral sequence are numerals with the same meaning and, if so, to determine that the first address character string and the second address character string correspond to the same address.

Optionally, the apparatus further includes: a storage module, configured to store, in association with one another in a cell name database, a first road name combination sentence segment formed by the first road name sentence segment and the first road number sentence segment that follows it, a second road name combination sentence segment formed by the second road name sentence segment and the second road number sentence segment that follows it, the first cell name sentence segment, and the second cell name sentence segment; the cell name database is used, when target address information input by a user is received, to search for a cell name or a road name matching the target address information.

Optionally, the calculation module includes: a calculation submodule, configured to determine the similarity according to the number of identical characters in a first sentence segment and a second sentence segment and a target number of swap operations; the target number of swap operations is the minimum number of character swap operations required when a swap operation, which exchanges the positions of any two characters, is performed repeatedly until the order of the target characters in the first sentence segment matches the order of the target characters in the second sentence segment; a target character is a character present in both the first sentence segment and the second sentence segment.

Optionally, the calculation submodule is configured to: determine the ratio of the number of identical characters in the first sentence segment and the second sentence segment to the total number of characters appearing in the first sentence segment and the second sentence segment as an intersection similarity; and determine the ratio of the difference between the number of target characters and the target number of swap operations to the number of target characters as an order similarity; the similarity is the product of the intersection similarity and the order similarity.

In a third aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method of any one of the first aspect of the disclosure.

In a fourth aspect of the present disclosure, an electronic device is provided, which includes a memory and a processor, the memory stores a computer program thereon, and the processor is configured to execute the computer program in the memory to implement the steps of the method of any one of the first aspect of the present disclosure.

The above technical solution achieves at least the following technical effects:

The first and second road name sentence segments and the first and second cell name sentence segments are extracted from the two address character strings, the first similarity of the two road name sentence segments and the second similarity of the two cell name sentence segments are calculated, the first similarity is compared with the first similarity threshold and the second similarity with the second similarity threshold, and the comparison results indicate whether the two address character strings correspond to the same cell. This improves the fault tolerance of sentence segment comparison: small input errors do not affect the analysis, and even when an address character string contains wrong, missing, or extra characters, two similar address character strings can still be found to correspond to the same cell.

Additional features and advantages of the disclosure will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

Fig. 1 is a flowchart illustrating an address information processing method according to an exemplary embodiment.

Fig. 2 is a flow chart illustrating another address information processing method in an exemplary embodiment.

Fig. 3 is a flow chart illustrating another address information processing method in an exemplary embodiment.

Fig. 4 is a block diagram of an address information processing apparatus shown in an exemplary embodiment.

Fig. 5 is a block diagram of an electronic device shown in an exemplary embodiment.

Fig. 6 is a block diagram of another electronic device shown in an exemplary embodiment.

Detailed Description

The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.

Fig. 1 is a flowchart illustrating an address information processing method according to an exemplary embodiment, the method including:

S11, determining the first road name sentence segment and the first cell name sentence segment in the first address character string, and determining the second road name sentence segment and the second cell name sentence segment in the second address character string.

The address character string may be the text of an address entered by the user. For example, the address character string may be "20th floor, Financial Technology Building, 11 Keyuan Road, Nanshan District, Shenzhen, Guangdong", where the road name sentence segment is "Keyuan Road" and the cell name sentence segment is "Financial Technology Building"; as another example, the address character string may be "No. 01, Unit 1, Tangchen Yipin, 28 Huayuan Shiqiao Road, Pudong New Area, Shanghai", where the road name sentence segment is "Huayuan Shiqiao Road" and the cell name sentence segment is "Tangchen Yipin".

When determining the road name sentence segment and the cell name sentence segment, the text the user entered in the road name field and in the cell field can be extracted directly. For example, the address information may be entered as follows: province: Guangdong Province; city: Shenzhen; district: Nanshan District; road: Keyuan Road; road number: 11; cell: Financial Technology Building; specific address: 20th floor. In this case, "Keyuan Road" under the "road" category can be extracted directly as the road name sentence segment, and "Financial Technology Building" under the "cell" category as the cell name sentence segment.

It should be noted that cell names in the cell name sentence segment include not only the names of residential communities but also those of commercial areas and public institutions, such as Wanda Plaza, Tianfu Software Park, Tianjia Dianzhong School, and First People's Hospital.

S12, calculating the first similarity between the first road name sentence segment and the second road name sentence segment, and calculating the second similarity between the first cell name sentence segment and the second cell name sentence segment.

In an actual service scenario, user carelessness or text recognition errors may introduce wrong, missing, or extra characters into an address character string. If only exactly identical sentence segments are treated as the same, sentence segments containing such errors are treated as different, the two address character strings are judged to correspond to different addresses, and subsequent use by the service may be affected. Therefore, whether two sentence segments represent the same information can instead be judged by calculating their similarity.

The similarity between the first road name sentence segment and the second road name sentence segment can be calculated using the Jaccard coefficient (Jaccard index). Specifically, the ratio of the number of characters that appear in both sentence segments to the total number of distinct characters in the two sentence segments is calculated; the resulting Jaccard coefficient represents how similar the two sentence segments are, and the higher its value, the more similar they are. The similarity can also be calculated using edit distance, that is, the minimum number of edit operations required to convert the first road name sentence segment into the second road name sentence segment (or the second into the first), where one edit operation is a single replacement, insertion, or deletion of a character. The similarity can also be calculated by converting the two road name sentence segments into vectors and computing the cosine of the angle between the two vectors; the conversion of sentence segments into vectors can be done with Word2Vec.
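The first two measures described above can be sketched in Python as follows. This is a minimal character-level sketch rather than the disclosed implementation: the function names `jaccard` and `edit_distance` are chosen here for illustration, and the text does not specify how an edit distance is converted into a similarity score, so that step is left out.

```python
def jaccard(a, b):
    """Jaccard coefficient over the sets of characters of two sentence segments."""
    chars_a, chars_b = set(a), set(b)
    if not chars_a and not chars_b:
        return 1.0  # assumption: two empty segments are treated as identical
    return len(chars_a & chars_b) / len(chars_a | chars_b)


def edit_distance(a, b):
    """Minimum number of single-character replacements, insertions, and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # replace or keep
            ))
        previous = current
    return previous[-1]
```

For two segments that differ by one missing character, such as the placeholder strings `"abcde"` and `"abde"`, `jaccard` gives 4/5 and `edit_distance` gives 1.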

Similarly, the similarity between the first cell name sentence segment and the second cell name sentence segment may also be calculated in the manner described above.

S13, if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold, determining that the first address character string and the second address character string correspond to the same cell.

The first similarity threshold and the second similarity threshold may be the same value or different values. For example, both thresholds may be 0.5, in which case neither the two road name sentence segments nor the two cell name sentence segments need to be identical; alternatively, the first similarity threshold may be 1 and the second similarity threshold may be 0.5, in which case the first and second road name sentence segments need to be identical, while the first and second cell name sentence segments may differ.

For example, suppose the first road name sentence segment is "Huayuan Shiqiao Road" and the first cell name sentence segment is "Tangchen Yipin", while the second road name sentence segment is "Huayuan Qiao Road" (one character missing) and the second cell name sentence segment is "Tangchen Yipin" written with one wrong character. Using the Jaccard coefficient method in S12, the similarity of the two road name sentence segments is calculated to be 0.8 and the similarity of the two cell name sentence segments is calculated to be 0.6. Thus, when the first similarity threshold and the second similarity threshold are both 0.5, it can be determined that the first address character string and the second address character string correspond to the same cell; when the first similarity threshold is 1 and the second similarity threshold is 0.5, it can be determined that they do not correspond to the same cell.
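The threshold comparison in S13 can be sketched as follows. The function name `same_cell` and its default thresholds are assumptions for illustration; the comparison is written as a strict "higher than", following the wording of S13.

```python
def same_cell(first_similarity, second_similarity,
              first_threshold=0.5, second_threshold=0.5):
    """True if both similarities are strictly higher than their thresholds."""
    return (first_similarity > first_threshold
            and second_similarity > second_threshold)
```

With the similarities 0.8 and 0.6 from the example, `same_cell(0.8, 0.6)` is true, and `same_cell(0.8, 0.6, first_threshold=1.0)` is false. Note that under a strict comparison a threshold of 1 cannot be exceeded even by identical segments, so an implementation that wants identical segments to pass at that setting would presumably need a "greater than or equal" comparison.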

Fig. 2 is a flowchart illustrating an address information processing method according to an exemplary embodiment, the method including:

S21, determining the first road name sentence segment and the first cell name sentence segment in the first address character string, and determining the second road name sentence segment and the second cell name sentence segment in the second address character string.

S22, determining the first similarity according to the number of identical characters in the first road name sentence segment and the second road name sentence segment and the target number of swap operations, and determining the second similarity according to the number of identical characters in the first cell name sentence segment and the second cell name sentence segment and the target number of swap operations.

The target number of swap operations is the minimum number of character swap operations required when a swap operation, which exchanges the positions of any two characters, is performed repeatedly until the order of the target characters in the first sentence segment matches the order of the target characters in the second sentence segment; a target character is a character present in both the first sentence segment and the second sentence segment.

In an actual service scenario, user carelessness or text recognition errors may introduce wrong, missing, or extra characters into an address character string. If only exactly identical sentence segments are treated as the same, sentence segments containing such errors are treated as different, the two address character strings are judged to correspond to different addresses, and subsequent use by the service may be affected. Therefore, whether two sentence segments represent the same information can instead be judged by calculating their similarity.

Thus, the similarity may be determined from the number of identical characters in the two sentence segments and the target number of swap operations. For example, the similarity may be the ratio of the number of identical characters minus the target number of swap operations to a fixed value (for example, 10).

Alternatively, the first similarity and the second similarity may be determined as follows: the ratio of the number of identical characters in the first sentence segment and the second sentence segment to the total number of characters appearing in the two sentence segments is taken as the intersection similarity, the ratio of the difference between the number of target characters and the target number of swap operations to the number of target characters is taken as the order similarity, and the similarity is the product of the intersection similarity and the order similarity. It should be noted that when the target characters are arranged in the same order in the first sentence segment and the second sentence segment, the order similarity is 1.

For example, if the first road name sentence segment is "Jinxiu East Road" and the second road name sentence segment is "Jinxiu Road", the first similarity is the product of the intersection similarity (3/4, i.e. 0.75) and the order similarity, which is 1 because the three target characters [jin, xiu, lu] appear in the same order in both sentence segments; that is, the first similarity is 0.75. If the first cell name sentence segment is "Meteor Garden" and the second cell name sentence segment is a variant of "Meteor Garden" in which the target characters appear in a different order, the second similarity is the intersection similarity (4/5, i.e. 0.8) multiplied by the order similarity (3/4, i.e. 0.75, since one swap is needed to align the four target characters); that is, the second similarity is 0.6.
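The product of intersection similarity and order similarity can be sketched in Python as follows. The function names `min_swaps` and `segment_similarity` are hypothetical, the sketch assumes that no character is repeated within a sentence segment, and it counts the minimum swaps by decomposing the reordering into cycles, which is one standard way to obtain that number and is not stated in the text.

```python
def min_swaps(source, target):
    """Minimum swaps of any two positions that reorder source into target.

    Assumes both lists hold the same distinct items.
    """
    position_in_target = {item: i for i, item in enumerate(target)}
    permutation = [position_in_target[item] for item in source]
    seen = [False] * len(permutation)
    cycles = 0
    for start in range(len(permutation)):
        if not seen[start]:
            cycles += 1
            j = start
            while not seen[j]:
                seen[j] = True
                j = permutation[j]
    # Each cycle of length k needs k - 1 swaps.
    return len(permutation) - cycles


def segment_similarity(first, second):
    """Intersection similarity multiplied by order similarity."""
    common = set(first) & set(second)           # target characters
    if not common:
        return 0.0
    union = set(first) | set(second)
    intersection_similarity = len(common) / len(union)
    order_first = [ch for ch in first if ch in common]
    order_second = [ch for ch in second if ch in common]
    swaps = min_swaps(order_first, order_second)
    order_similarity = (len(common) - swaps) / len(common)
    return intersection_similarity * order_similarity
```

Using ASCII placeholders in place of Chinese characters, `segment_similarity("abxc", "abc")` reproduces the 3/4 × 1 case, and `segment_similarity("abcd", "bacde")` reproduces the 4/5 × 3/4 case, up to floating-point rounding.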

S23, if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold, determining that the first address character string and the second address character string correspond to the same cell.

With the above technical solution, the first and second road name sentence segments and the first and second cell name sentence segments are extracted from the two address character strings, the first similarity of the two road name sentence segments and the second similarity of the two cell name sentence segments are calculated from the number of identical characters in the two sentence segments and the target number of swap operations, the first similarity is compared with the first similarity threshold and the second similarity with the second similarity threshold, and the comparison results indicate whether the two address character strings correspond to the same cell. This improves the fault tolerance of sentence segment comparison: small input errors do not affect the analysis, and even when an address character string contains wrong, missing, or extra characters, two similar address character strings can still be found to correspond to the same cell.

Fig. 3 is a flow chart illustrating an address information processing method according to an exemplary disclosed embodiment, the method including:

S31, finding clause words belonging to a preset clause word set in the first address character string and the second address character string.

The preset clause word set searched for in the address character strings comprises the following types of clause words: administrative district name clause words, road name clause words, cell name clause words, number word clause words, and quantifier clause words.

The administrative district name clause words comprise any character or character combination as follows: district, town, street office, new district, industrial park, development district; the road name clause words comprise any character or character combination as follows: roads, avenues, communities, and blocks; the cell name clause words comprise any character or character combination as follows: village, new village, Estate, new Estate, district, apartment, garden, home, mansion, pub, villa, neighborhood; the quantifier clause words comprise any character or character combination as follows: number, frame, section, layer, room, unit, building, period; the number word clause words comprise any character or character combination as follows: arabic numerals, Chinese numerals, Roman numerals, capital English letters, lowercase English letters, and Chinese heavenly stems characters.

S32, taking the clause words as starting words and/or ending words of sentence segments, and dividing the address character string into one or more sentence segments that start with a starting word and/or end with an ending word.

During sentence segment division, the address character string can be divided using administrative district name clause words, road name clause words, cell name clause words and quantifier clause words as ending words, or using number word clause words as starting words, or using both, which makes the division result more accurate.

For example, if the address character string is "Wanbailin District, Taiyuan City, Shanxi Province, People Road, No. 1, attached No. 1, International City Cell, Building 5, Unit 2, Room 1202", it can be divided into "Wanbailin District, Taiyuan City, Shanxi Province | People Road | No. 1 | attached No. 1 | International City Cell | Building 5 | Unit 2 | Room 1202": "Wanbailin District, Taiyuan City, Shanxi Province" is obtained with an administrative district name clause word, "People Road" with a road name clause word, "No. 1" and "attached No. 1" with the quantifier clause word "No.", "International City Cell" with a cell name clause word, "Building 5" with the quantifier clause word "building", "Unit 2" with the quantifier clause word "unit", and "Room 1202" with the quantifier clause word "room".
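The ending-word division described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `split_segments`, the small `ENDING_WORDS` set and the Chinese example string are all assumptions (the Chinese text is a back-translated guess at the example address), and only ending words are handled, not starting words.

```python
# Hypothetical ending words; the Chinese strings are back-translated guesses,
# not taken from the original disclosure.
ENDING_WORDS = {"区", "路", "小区", "号", "栋", "单元", "室"}


def split_segments(address, ending_words=ENDING_WORDS):
    """Cut `address` immediately after every occurrence of an ending word."""
    # Try longer words first so that "小区" wins over the shorter "区".
    words = sorted(ending_words, key=len, reverse=True)
    segments, start, i = [], 0, 0
    while i < len(address):
        match = next((w for w in words if address.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        i += len(match)
        segments.append(address[start:i])
        start = i
    if start < len(address):
        segments.append(address[start:])  # trailing text with no ending word
    return segments


example = "山西省太原市万柏林区人民路1号附1号国际城小区5栋2单元1202室"
print(split_segments(example))
```

Matching the longest ending word first is a design choice of this sketch; without it, a multi-character clause word could be cut at a shorter clause word it contains.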

S33, determining the first road name sentence segment, the second road name sentence segment, the first cell name sentence segment and the second cell name sentence segment from the plurality of sentence segments according to the clause words included in each obtained sentence segment and the correspondence between clause words and sentence segment types.

The correspondence between clause words and sentence segment types includes: a sentence segment that includes a road name clause word is a road name sentence segment; a sentence segment that includes a cell name clause word is a cell name sentence segment.

For example, suppose the address character string "Wanbailin District, Taiyuan City, Shanxi Province, People Road, No. 1, attached No. 1, International City Cell, Building 5, Unit 2, Room 1202" has been divided with ending words into "Wanbailin District, Taiyuan City, Shanxi Province" (administrative district name clause word), "People Road" (road name clause word), "No. 1" and "attached No. 1" (quantifier clause word "No."), "International City Cell" (cell name clause word), "Building 5" (quantifier clause word "building"), "Unit 2" (quantifier clause word "unit") and "Room 1202" (quantifier clause word "room"). The sentence segment "People Road", which includes a road name clause word, is then determined as the road name sentence segment, and the sentence segment "International City Cell", which includes a cell name clause word, is determined as the cell name sentence segment. The road name sentence segment and the cell name sentence segment determined from the division results of the first address character string are the first road name sentence segment and the first cell name sentence segment, respectively, and those determined from the division results of the second address character string are the second road name sentence segment and the second cell name sentence segment, respectively.
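The mapping from clause words to sentence segment types might be applied as in the sketch below. The word tuples and the function name `pick_road_and_cell` are hypothetical, and the sketch assumes the clause word sits at the end of its segment, which holds when clause words were used as ending words.

```python
# Hypothetical clause words (back-translated guesses, not from the disclosure).
ROAD_WORDS = ("路", "大道")
CELL_WORDS = ("小区", "花园", "公寓")


def pick_road_and_cell(segments):
    """Return (road name segment, cell name segment); either may be None."""
    # str.endswith accepts a tuple, so each segment is tested against every
    # clause word of the given type.
    road = next((s for s in segments if s.endswith(ROAD_WORDS)), None)
    cell = next((s for s in segments if s.endswith(CELL_WORDS)), None)
    return road, cell


segments = ["山西省太原市万柏林区", "人民路", "1号", "附1号",
            "国际城小区", "5栋", "2单元", "1202室"]
print(pick_road_and_cell(segments))
```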

In some cases, a cell name does not contain an obvious cell name clause word, such as "Tangchen Yipin", "Sihai Yi Jia" or "Zixuan Ge". If the above method of determining the cell name sentence segment by cell name clause words is applied to an address character string containing such a cell name, the cell name sentence segment may be determined incorrectly.

Therefore, in a possible implementation manner, the correspondence between clause words and sentence segment types further includes: a sentence segment that consists of a number word clause word and a quantifier clause word with no intervening characters between them is a number sentence segment. Determining the cell name sentence segment in the address character string then includes: judging whether a road number sentence segment exists in the address character string, where the road number sentence segment is a number sentence segment that follows the road name sentence segment with no intervening characters; if the road number sentence segment exists, determining the character string after the road number sentence segment and before the first number word clause word following the road number sentence segment as the cell name sentence segment; if the road number sentence segment does not exist, determining the character string after the road name sentence segment and before the first number word clause word following the road name sentence segment as the cell name sentence segment.

For example, consider an address character string in which the road name sentence segment for Peony Street is followed by "No. 18", "Sihai Yi Jia" and then "3 ... Unit 3, 3002". After judging that the road number sentence segment "No. 18" exists in the address character string (i.e., a number sentence segment directly after the road name sentence segment), the character string "Sihai Yi Jia", which lies after the road number sentence segment and before the first number word clause word "3" following it, can be determined as the cell name sentence segment. Similarly, if the address character string contains no road number sentence segment, the character string "Sihai Yi Jia" after the road name sentence segment and before the first number word clause word "3" is determined as the cell name sentence segment.
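A regular-expression sketch of this fallback rule is given below. It is only an illustration under simplifying assumptions: the function name `fallback_cell_name`, the road words and the example strings are hypothetical, and only Arabic digits are treated as number word clause words (a cell name such as "四海逸家" itself begins with a Chinese numeral, so a fuller implementation would need extra care there).

```python
import re

# Hypothetical pattern: road clause word, optional road number segment,
# then the cell name up to (not including) the first Arabic digit.
_FALLBACK = re.compile(
    r"(?:大道|路|街)"        # road name clause word closing the road name segment
    r"(?:[0-9]+号)?"         # optional road number segment, e.g. "18号"
    r"(?P<cell>[^0-9]+?)"    # candidate cell name (at least one character)
    r"(?=[0-9])"             # stop right before the first number word
)


def fallback_cell_name(address):
    """Return the inferred cell name segment, or None if the rule does not apply."""
    m = _FALLBACK.search(address)
    return m.group("cell") if m else None


print(fallback_cell_name("牡丹街18号四海逸家3栋3单元3002"))
print(fallback_cell_name("牡丹街四海逸家3栋3单元3002"))
```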

S34, calculating the first similarity of the first road name sentence segment and the second road name sentence segment, and calculating the second similarity of the first cell name sentence segment and the second cell name sentence segment.

In an actual service scenario, user carelessness or character recognition errors may introduce wrong, missing or extra characters into an address character string. If only two identical sentence segments were treated as equivalent, two address character strings whose sentence segments differ merely by such errors would be judged to correspond to different addresses, which may affect service use. Therefore, whether two sentence segments represent the same information can be judged by calculating the similarity of the sentence segments.

Alternatively, the first similarity may be determined according to the number of identical characters of the first road name sentence segment and the second road name sentence segment and the target number of exchange operations, and the second similarity may be determined according to the number of identical characters of the first cell name sentence segment and the second cell name sentence segment and the target number of exchange operations.

The similarity may be determined from the number of identical characters in the two sentence segments and the target number of exchange operations. For example, the similarity may be the ratio of the difference between the number of identical characters and the target number of exchange operations to a fixed value (e.g., the fixed value may be 10).

Alternatively, the first similarity and the second similarity may be determined as follows: the ratio of the number of identical characters of the first sentence segment and the second sentence segment to the total number of characters appearing in the first sentence segment and the second sentence segment is determined as the intersection similarity; the ratio of the difference between the number of target characters and the target number of exchange operations to the number of target characters is determined as the order similarity; and the similarity is the product of the intersection similarity and the order similarity. It should be noted that if the target characters are arranged in the same order in the first sentence segment and the second sentence segment, the order similarity is 1.
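The product of intersection similarity and order similarity could be computed as in the following sketch. It rests on assumptions not stated in the text: each character is treated as occurring at most once per segment (repeats are collapsed to the first occurrence), "total number of characters" is read as the number of distinct characters across both segments, and the minimum number of exchanges is obtained by the standard cycle-counting method for permutations. The function names and the Latin-letter test strings are hypothetical.

```python
def _target_sequences(a, b):
    """Shared characters of a and b, each in its own order of first appearance."""
    common = set(a) & set(b)
    seq_a = [c for c in dict.fromkeys(a) if c in common]
    seq_b = [c for c in dict.fromkeys(b) if c in common]
    return seq_a, seq_b


def min_exchanges(seq_a, seq_b):
    """Minimum swaps of any two positions turning seq_a into seq_b."""
    position_in_b = {c: i for i, c in enumerate(seq_b)}
    perm = [position_in_b[c] for c in seq_a]
    seen = [False] * len(perm)
    cycles = 0
    for i in range(len(perm)):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    # A permutation of n items with c cycles needs n - c swaps.
    return len(perm) - cycles


def similarity(a, b):
    """Intersection similarity multiplied by order similarity."""
    seq_a, seq_b = _target_sequences(a, b)
    if not seq_a:
        return 0.0
    intersection = len(seq_a) / len(set(a) | set(b))
    order = (len(seq_a) - min_exchanges(seq_a, seq_b)) / len(seq_a)
    return intersection * order


print(similarity("ABCD", "ABD"))    # one extra character, same order
print(similarity("ABCD", "BACDE"))  # one extra character, one exchange
```

The two calls mirror the structure of the road name and cell name examples: the first has three of four characters in common with no reordering, the second has four of five in common with a single exchange.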

S35, if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold, determining that the first address character string and the second address character string correspond to the same cell.

In a possible implementation manner, after it is determined that the first address character string and the second address character string correspond to the same cell, the following may be stored in association in a cell name database: a first road name combination sentence segment composed of the first road name sentence segment and the first road number sentence segment following it, a second road name combination sentence segment composed of the second road name sentence segment and the second road number sentence segment following it, the first cell name sentence segment, and the second cell name sentence segment.

The cell name database is used for searching a cell name or a road name matched with target address information in the cell name database when the target address information input by a user is received.

For example, after it is determined that "beautiful east road No. 1650 fragrant plum garden" and "beautiful road plum garden" correspond to the same cell, the first road name combination sentence segment "beautiful east road No. 1650", the first cell name sentence segment "fragrant plum garden" and the second cell name sentence segment "plum garden" may be stored in the database (the second address character string contains no road number sentence segment, so no second road name combination sentence segment is stored). In this way, if "beautiful east road No. 1650" appears in a new address character string input by the user, even if the new address character string contains no cell name sentence segment, a search of the database can find that the new address character string corresponds to the same cell as "beautiful east road No. 1650 fragrant plum garden" and "beautiful road plum garden".
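One way such a cell name database could behave is sketched below with an in-memory dictionary. The class `CellNameDatabase`, its methods and the Chinese strings (back-translated guesses at the example) are assumptions for illustration; the disclosure does not specify the storage structure or the matching rule, and plain substring matching is used here.

```python
class CellNameDatabase:
    """Toy in-memory store: segments known to denote the same cell share a group."""

    def __init__(self):
        self._groups = {}  # stored segment -> set of all segments for that cell

    def store(self, segments):
        """Store the segments of one matched pair together; None entries are skipped."""
        group = {s for s in segments if s}
        for s in list(group):
            group |= self._groups.get(s, set())  # merge with earlier knowledge
        for s in group:
            self._groups[s] = group

    def lookup(self, address):
        """Return every stored segment for the cell matched by `address`."""
        for segment, group in self._groups.items():
            if segment in address:
                return set(group)
        return set()


db = CellNameDatabase()
# The second road name combination segment is absent in the example, hence None.
db.store(["锦绣东路1650号", None, "香梅花园", "梅花园"])
print(db.lookup("上海市锦绣东路1650号3栋"))
```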

S36, obtaining a first number word sequence and a second number word sequence, where the first number word sequence consists of the number word clause words located after the first cell name sentence segment in the first address character string, arranged in their original order, and the second number word sequence consists of the number word clause words located after the second cell name sentence segment in the second address character string, arranged in their original order.

For example, if the first address character string is "Shanghai city, Pudong new garden stone bridge 28, shangcheng, one unit and one unit 1", and the second address character string is "Shanghai city, garden stone bridge, shangcheng, one unit and 1 unit 01", the first number word sequence is "one-one-1", and the second number word sequence is "A-1-01".
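Extracting the number word sequence that follows a cell name sentence segment might look like the sketch below. The function name, the regular expression and the example strings (including the placeholder cell name "某某花园") are hypothetical, and only Arabic digits, English letters and the Chinese numerals one to ten are recognized.

```python
import re

# Hypothetical recognizer for a subset of number word clause words.
_NUMBER_WORD = re.compile(r"[0-9]+|[A-Za-z]+|[一二三四五六七八九十]+")


def number_word_sequence(address, cell_name):
    """Number words appearing after `cell_name`, in their original order."""
    _, found, tail = address.partition(cell_name)
    return _NUMBER_WORD.findall(tail) if found else []


print(number_word_sequence("花园石桥路28号某某花园一栋一单元1", "某某花园"))
print(number_word_sequence("花园石桥路某某花园A栋1单元01", "某某花园"))
```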

S37, judging whether the number word clause words with the same sequence position in the first number word sequence and the second number word sequence are number words corresponding to the same word meaning.

It should be noted that two identical number word clause words correspond to the same meaning, but two different number word clause words may also correspond to the same meaning when they have the same rank in their respective numbering systems. For example, other number word clause words corresponding to the number word clause word 1 may include "one", "first", "I", "a", and the like, and other number word clause words corresponding to the number word clause word 4 may include "four", "fourth", "IV", "D", and the like.

When judging whether the number word clause words at the same sequence position in the first number word sequence "one-one-1" and the second number word sequence "A-1-01" correspond to the same meaning, "one" is compared with "A", "one" with "1", and "1" with "01"; the comparison shows that the number word clause words at the same sequence position in the two sequences correspond to the same meaning.

S38, if the number word clause words at the same sequence position correspond to the same meaning, determining that the first address character string and the second address character string correspond to the same address.

Through the above technical solution, the first and second road name sentence segments and the first and second cell name sentence segments are extracted from the two address character strings by sentence segment division; the first similarity of the two road name sentence segments and the second similarity of the two cell name sentence segments are calculated; the first similarity is compared with the first similarity threshold and the second similarity with the second similarity threshold; and the comparison results determine whether the two address character strings correspond to the same cell. This improves the fault tolerance of sentence segment comparison: small input errors do not affect the analysis, and even when the address character strings contain wrong, missing or extra characters, two similar address character strings can still be determined to correspond to the same cell. In addition, where different users have different writing habits, it can further be judged whether the number word sequences after the cell name sentence segments of two address character strings corresponding to the same cell consist of number words with the same meaning, which improves the accuracy of address information processing.
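The position-by-position comparison can be sketched by mapping each number word to the rank or ranks it may denote. The tables and function names below are assumptions, cover only ranks 1 to 10 for Chinese numerals, heavenly stems and Roman numerals, and resolve ambiguity permissively: "I" is accepted both as Roman numeral 1 and as the ninth letter.

```python
# Hypothetical rank tables for a subset of the numbering systems named in the text.
CHINESE = {c: i for i, c in enumerate("一二三四五六七八九十", 1)}
STEMS = {c: i for i, c in enumerate("甲乙丙丁戊己庚辛壬癸", 1)}
ROMAN = {r: i for i, r in enumerate(
    ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"], 1)}


def ranks(word):
    """Set of ranks that `word` may denote (empty if unrecognized)."""
    out = set()
    if word.isascii() and word.isdigit():
        out.add(int(word))  # "01" and "1" both give 1
    for table in (CHINESE, STEMS, ROMAN):
        if word in table:
            out.add(table[word])
    if len(word) == 1 and word.isascii() and word.isalpha():
        out.add(ord(word.upper()) - ord("A") + 1)  # A/a -> 1, D/d -> 4
    return out


def same_meaning(a, b):
    return a == b or bool(ranks(a) & ranks(b))


def sequences_match(seq_a, seq_b):
    """True if the sequences have equal length and agree at every position."""
    return len(seq_a) == len(seq_b) and all(
        same_meaning(x, y) for x, y in zip(seq_a, seq_b))


print(sequences_match(["一", "一", "1"], ["A", "1", "01"]))
```

Returning a set of candidate ranks instead of a single rank is a choice of this sketch; a stricter implementation could use the surrounding quantifier clause word to disambiguate.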

Fig. 4 is a block diagram illustrating an address information processing apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 includes a determining module 410, a calculating module 420, and a processing module 430.

The determining module 410 is configured to determine a first road name sentence segment and a first cell name sentence segment in a first address character string, and determine a second road name sentence segment and a second cell name sentence segment in a second address character string.

The calculating module 420 is configured to calculate a first similarity between the first road name sentence segment and the second road name sentence segment, and calculate a second similarity between the first cell name sentence segment and the second cell name sentence segment.

The processing module 430 is configured to determine that the first address string and the second address string correspond to the same cell if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold.

Optionally, the determining module 410 includes: a searching submodule, configured to search the address character string for clause words belonging to a preset clause word set, where the preset clause word set comprises the following types of clause words: administrative district name clause words, road name clause words, cell name clause words, number word clause words, and quantifier clause words; a sentence segment division submodule, configured to take the clause words as starting words and/or ending words of sentence segments and divide the address character string into one or more sentence segments that start with a starting word and/or end with an ending word; and a determining submodule, configured to determine a road name sentence segment and a cell name sentence segment from the plurality of sentence segments according to the clause words included in each obtained sentence segment and the correspondence between clause words and sentence segment types. The correspondence between clause words and sentence segment types includes: a sentence segment that includes a road name clause word is a road name sentence segment; a sentence segment that includes a cell name clause word is a cell name sentence segment.

Optionally, the apparatus 400 further comprises: an obtaining module, configured to obtain a first number word sequence and a second number word sequence, where the first number word sequence consists of the number word clause words located after the first cell name sentence segment in the first address character string, arranged in their original order, and the second number word sequence consists of the number word clause words located after the second cell name sentence segment in the second address character string, arranged in their original order; and a judging module, configured to judge whether the number word clause words at the same sequence position in the first number word sequence and the second number word sequence are number words corresponding to the same meaning, and if so, determine that the first address character string and the second address character string correspond to the same address.

Optionally, the apparatus 400 further comprises: a storage module, configured to store, in association in a cell name database, a first road name combination sentence segment composed of the first road name sentence segment and the first road number sentence segment following it, a second road name combination sentence segment composed of the second road name sentence segment and the second road number sentence segment following it, the first cell name sentence segment, and the second cell name sentence segment. The cell name database is used for searching for a cell name or a road name matching target address information when the target address information input by a user is received.

Optionally, the calculation module 420 includes:

a calculation submodule, configured to determine the similarity according to the number of identical characters of the first sentence segment and the second sentence segment and the target number of exchange operations; the target number of exchange operations is the minimum number of character exchange operations required when a character exchange operation, which swaps the positions of any two characters, is repeated until the order of the target characters in the first sentence segment matches the order of the target characters in the second sentence segment; a target character is a character present in both the first sentence segment and the second sentence segment.

Optionally, the calculation submodule is configured to: determine the ratio of the number of identical characters of the first sentence segment and the second sentence segment to the total number of characters appearing in the first sentence segment and the second sentence segment as the intersection similarity; determine the ratio of the difference between the number of target characters and the target number of exchange operations to the number of target characters as the order similarity; and take the product of the intersection similarity and the order similarity as the similarity.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

The present disclosure provides a computer-readable storage medium on which a computer program is stored, which program, when executed by a processor, implements the steps of any of the address information processing methods.

The present disclosure provides an electronic device, including:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to implement the steps of any of the address information processing methods.

Fig. 5 is a block diagram illustrating an electronic device 500 in accordance with an example embodiment. As shown in fig. 5, the electronic device 500 may include: a processor 501 and a memory 502. The electronic device 500 may also include one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.

The processor 501 is configured to control the overall operation of the electronic device 500, so as to complete all or part of the steps in the address information processing method. The memory 502 is used to store various types of data to support operations at the electronic device 500, such as instructions for any application or method operating on the electronic device 500, and application-related data, such as address information, clause words, and the correspondence between clause words and sentence segment types. The memory 502 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 502 or transmitted through the communication component 505. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, such as a keyboard, mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 505 is used for wired or wireless communication between the electronic device 500 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein. The corresponding communication component 505 may thus comprise a Wi-Fi module, a Bluetooth module, an NFC module, and the like.

In an exemplary embodiment, the electronic Device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-mentioned address information Processing method.

In another exemplary embodiment, there is also provided a computer-readable storage medium including program instructions which, when executed by a processor, implement the steps of the address information processing method described above. For example, the computer readable storage medium may be the memory 502 described above including program instructions that are executable by the processor 501 of the electronic device 500 to perform the address information processing method described above.

Fig. 6 is a block diagram illustrating an electronic device 600 according to an example embodiment. For example, the electronic device 600 may be provided as a server. Referring to fig. 6, the electronic device 600 includes a processor 622, which may be one or more in number, and a memory 632 for storing computer programs executable by the processor 622. The computer program stored in memory 632 may include one or more modules that each correspond to a set of instructions. Further, the processor 622 may be configured to execute the computer program to perform the address information processing method described above.

Additionally, the electronic device 600 may also include a power component 626 that may be configured to perform power management of the electronic device 600 and a communication component 650 that may be configured to enable communication, e.g., wired or wireless communication, of the electronic device 600. The electronic device 600 may also include input/output (I/O) interfaces 658. The electronic device 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, and so on.

In another exemplary embodiment, there is also provided a computer-readable storage medium including program instructions which, when executed by a processor, implement the steps of the address information processing method described above. For example, the computer readable storage medium may be the memory 632 including the program instructions, which are executable by the processor 622 of the electronic device 600 to perform the address information processing method described above.

The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.

It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.

In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims

1. An address information processing method, characterized by comprising:

determining a first road name sentence segment and a first cell name sentence segment in a first address character string, and determining a second road name sentence segment and a second cell name sentence segment in a second address character string;

calculating a first similarity of the first road name sentence segment and the second road name sentence segment, and calculating a second similarity of the first cell name sentence segment and the second cell name sentence segment;

and if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold, determining that the first address character string and the second address character string correspond to the same cell.

2. The method of claim 1, wherein determining a road name sentence segment and a cell name sentence segment in an address character string comprises:

searching the address character string for clause words belonging to a preset clause word set, wherein the preset clause word set comprises the following types of clause words: administrative district name clause words, road name clause words, cell name clause words, number word clause words, and quantifier clause words;

taking the clause words as starting words and/or ending words of sentence segments, and dividing the address character string into one or more sentence segments starting with the starting words and/or ending with the ending words;

determining the road name sentence segment and the cell name sentence segment from a plurality of sentence segments according to the clause words included in each obtained sentence segment and the correspondence between clause words and sentence segment types;

the correspondence between clause words and sentence segment types includes:

the sentence segment comprising the road name clause words is the road name sentence segment;

the sentence segment including the cell name clause word is the cell name sentence segment.

3. The method of claim 2,

the administrative district name clause words comprise any character or character combination as follows: district, town, street office, new district, industrial park, development district;

the road name clause words comprise any character or character combination as follows: roads, avenues, communities, and blocks;

the cell name clause words comprise any character or character combination as follows: village, new village, Estate, new Estate, district, apartment, garden, home, mansion, pub, villa, neighborhood;

the quantifier clause words comprise any character or character combination as follows: number, frame, section, layer, room, unit, building, period;

the number word clause words comprise any character or character combination as follows: arabic numerals, Chinese numerals, Roman numerals, capital English letters, lowercase English letters, and Chinese heavenly stems characters.

4. The method of claim 2, wherein the correspondence between clause words and sentence segment types comprises:

the sentence segment that comprises a numeral clause word and a quantifier clause word with no intervening character between them is a number sentence segment;

the determining of the cell name sentence segment in the address character string comprises:

judging whether a road number sentence segment exists in the address character string, wherein the road number sentence segment is a number sentence segment that follows the road name sentence segment with no intervening character between them;

if the road number sentence segment exists in the address character string, determining the character string after the road number sentence segment and before the first numeral clause word following the road number sentence segment as the cell name sentence segment;

if the road number sentence segment does not exist in the address character string, determining the character string after the road name sentence segment and before the first numeral clause word following the road name sentence segment as the cell name sentence segment.
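
The two branches of this claim can be sketched with regular expressions. This is a hedged illustration only: the function name `extract_cell_name`, the parameter `road_end` (the index just past the road name sentence segment), and the Chinese numeral and quantifier characters in the patterns are assumptions, not taken from the source.

```python
import re

# Assumed numeral characters: Arabic digits, English letters, a few Chinese
# numerals and Heavenly Stems. The real sets are defined by the clause-word lists.
NUMERAL = r"[0-9A-Za-z一二三四五六七八九十百零甲乙丙丁]+"
# Assumed quantifier clause words (number, building, floor, room, unit, ...).
QUANTIFIER = r"(?:号|栋|座|层|室|单元|楼|期)"

NUMBER_SEGMENT = re.compile(NUMERAL + QUANTIFIER)
FIRST_NUMERAL = re.compile(NUMERAL)


def extract_cell_name(address, road_end):
    """Return the text between the road name (plus road number) and the next numeral."""
    cell_start = road_end
    # A road number segment must start immediately after the road name segment.
    road_number = NUMBER_SEGMENT.match(address, road_end)
    if road_number:
        cell_start = road_number.end()
    # The cell name runs up to the first numeral clause word that follows.
    next_numeral = FIRST_NUMERAL.search(address, cell_start)
    cell_end = next_numeral.start() if next_numeral else len(address)
    return address[cell_start:cell_end]
```

With these patterns, both `"科苑路15号阳光花园3栋502室"` and `"科苑路阳光花园3栋502室"` (with `road_end=3`) produce the same cell name string, which is the point of handling the road number separately. A cell name that itself contains a numeral character would be cut short by this rule.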

5. The method of claim 2, wherein after determining that the first address string corresponds to the same cell as the second address string, the method further comprises:

acquiring a first numeral sequence and a second numeral sequence, wherein the first numeral sequence consists of the numeral clause words following the first cell name sentence segment in the first address character string, arranged in their original order, and the second numeral sequence consists of the numeral clause words following the second cell name sentence segment in the second address character string, arranged in their original order;

judging whether the numeral clause words at the same position in the first numeral sequence and the second numeral sequence are numerals with the same meaning;

if they are numerals with the same meaning, determining that the first address character string and the second address character string correspond to the same address.
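
One way to compare numerals by meaning is to normalize each numeral to a canonical form before comparing position by position. The sketch below is an assumption-laden illustration: the names `normalize` and `same_numeral_sequences` are hypothetical, Chinese numerals are only handled below 100, letters are compared case-insensitively, and sequences of different length are treated as not matching. None of these choices is stated in the source.

```python
import re

CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

# A numeral token is a run of Arabic digits, Chinese numerals, or letters.
TOKEN = re.compile(r"[0-9]+|[零一二三四五六七八九十]+|[A-Za-z]+")
TENS_FORM = re.compile(r"[一二三四五六七八九]?十[一二三四五六七八九]?")


def normalize(token):
    """Map a numeral token to a canonical string (assumed rules, see lead-in)."""
    if re.fullmatch(r"[0-9]+", token):
        return str(int(token))
    if TENS_FORM.fullmatch(token):
        # Forms such as 十, 十二, 二十, 二十三.
        tens, _, ones = token.partition("十")
        return str(CN_DIGITS.get(tens, 1) * 10 + CN_DIGITS.get(ones, 0))
    if all(ch in CN_DIGITS for ch in token):
        # Digit-by-digit forms such as 五零二.
        return str(int("".join(str(CN_DIGITS[ch]) for ch in token)))
    return token.upper()


def same_numeral_sequences(tail_a, tail_b):
    """Compare the numerals that follow the cell name in two addresses."""
    seq_a = [normalize(t) for t in TOKEN.findall(tail_a)]
    seq_b = [normalize(t) for t in TOKEN.findall(tail_b)]
    return seq_a == seq_b
```

Here `tail_a` and `tail_b` stand for the parts of the two address character strings after their cell name sentence segments, for example `"3栋502室"` and `"三栋五零二室"`.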

6. The method of claim 4, wherein after determining that the first address string corresponds to the same cell as the second address string, the method further comprises:

storing, in association with one another in a cell name database, a first road name combined sentence segment formed by the first road name sentence segment and the first road number sentence segment following it, a second road name combined sentence segment formed by the second road name sentence segment and the second road number sentence segment following it, the first cell name sentence segment, and the second cell name sentence segment;

7. The method of any of claims 1-6, wherein calculating the similarity of the first sentence segment and the second sentence segment comprises:

determining the similarity according to the number of identical characters in the first sentence segment and the second sentence segment and a target number of exchange operations;

the target number of exchange operations is the minimum number of character exchange operations required when the operation of exchanging the positions of any two characters is repeated until the order of the target characters in the first sentence segment matches the order of the target characters in the second sentence segment;

the target character is a character present in both the first sentence segment and the second sentence segment.
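
The minimum number of pairwise exchanges needed to turn one ordering into another equals the number of elements minus the number of cycles in the permutation that relates them. The sketch below uses that standard result; it is not stated in the source. The function name `min_exchanges` is hypothetical, and the sketch assumes each target character is counted once per segment (repeats are collapsed to their first occurrence), which the claim does not specify.

```python
def min_exchanges(first, second):
    """Minimum swaps to reorder the shared characters of `first` like `second`."""
    common = set(first) & set(second)
    # Keep only target characters, in order, dropping repeated occurrences.
    a = list(dict.fromkeys(ch for ch in first if ch in common))
    b = list(dict.fromkeys(ch for ch in second if ch in common))

    # perm[i] is the position that a[i] must end up at to match b.
    position_in_b = {ch: i for i, ch in enumerate(b)}
    perm = [position_in_b[ch] for ch in a]

    # Count the cycles of the permutation; each cycle of length m costs m - 1 swaps.
    visited = [False] * len(perm)
    cycles = 0
    for i in range(len(perm)):
        if not visited[i]:
            cycles += 1
            j = i
            while not visited[j]:
                visited[j] = True
                j = perm[j]
    return len(perm) - cycles
```

For example, `min_exchanges("阳光花园", "光阳花园")` needs a single exchange of the first two characters, while identical orderings need none.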

8. The method of claim 7, wherein said determining said similarity based on a number of identical characters in said first sentence segment and said second sentence segment and a target number of swapping operations comprises:

determining the ratio of the number of the same characters of the first sentence segment and the second sentence segment to the total number of the characters appearing in the first sentence segment and the second sentence segment as the intersection similarity;

determining the ratio of the difference between the number of target characters and the target number of exchange operations to the number of target characters as the order similarity;

the similarity is a product of the intersection similarity and the order similarity.
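
Written as formulas, the claim reads as follows, where the symbols are introduced here for illustration only: n_same is the number of identical characters, n_total is the total number of characters appearing in the two sentence segments, n_t is the number of target characters, and k is the target number of exchange operations. The worked example assumes that n_total counts distinct characters across both segments and that n_same equals n_t; the claim does not state either point, and the ratio is undefined when n_t is zero.

```latex
\[
S_{\cap} = \frac{n_{\text{same}}}{n_{\text{total}}}, \qquad
S_{\text{order}} = \frac{n_{t} - k}{n_{t}}, \qquad
S = S_{\cap} \cdot S_{\text{order}}
\]

% Worked example (assumed inputs): segments "阳光花园" and "光阳花苑".
% Shared characters: 阳, 光, 花, so n_same = n_t = 3.
% Distinct characters overall: 阳, 光, 花, 园, 苑, so n_total = 5.
% One exchange (阳 and 光) aligns the shared characters, so k = 1.
\[
S_{\cap} = \frac{3}{5}, \qquad
S_{\text{order}} = \frac{3 - 1}{3} = \frac{2}{3}, \qquad
S = \frac{3}{5} \cdot \frac{2}{3} = 0.4
\]
```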

9. An address information processing apparatus, characterized in that the apparatus comprises:

the determining module is used for determining a first road name sentence segment and a first cell name sentence segment in the first address character string and determining a second road name sentence segment and a second cell name sentence segment in the second address character string;

the calculation module is used for calculating a first similarity between the first road name sentence segment and the second road name sentence segment and calculating a second similarity between the first cell name sentence segment and the second cell name sentence segment;

and the processing module is used for determining that the first address character string and the second address character string correspond to the same cell if the first similarity is higher than a first similarity threshold and the second similarity is higher than a second similarity threshold.

10. The apparatus of claim 9, wherein the determining module comprises:

the searching submodule is used for searching the address character string for clause words belonging to a preset clause word set, wherein the preset clause word set comprises the following types of clause words: administrative district name clause words, road name clause words, cell name clause words, numeral clause words, and quantifier clause words;

the sentence segmentation submodule is used for taking the clause words as starting words and/or ending words of sentence segments, and dividing the address character string into one or more sentence segments that start with a starting word and/or end with an ending word;

the determining submodule is used for determining a road name sentence segment and a cell name sentence segment from the plurality of sentence segments according to the clause words included in each obtained sentence segment and the correspondence between clause words and sentence segment types;

11. The apparatus of claim 10, wherein the correspondence between clause words and sentence segment types comprises: the sentence segment that comprises a numeral clause word and a quantifier clause word with no intervening character between them is a number sentence segment;

the determining submodule is further used for judging whether a road number sentence segment exists in the address character string, wherein the road number sentence segment is a number sentence segment that follows the road name sentence segment with no intervening character between them; if the road number sentence segment exists in the address character string, determining the character string after the road number sentence segment and before the first numeral clause word following the road number sentence segment as the cell name sentence segment; if the road number sentence segment does not exist in the address character string, determining the character string after the road name sentence segment and before the first numeral clause word following the road name sentence segment as the cell name sentence segment.

12. The apparatus of claim 10, further comprising:

an obtaining module, configured to obtain a first numeral sequence and a second numeral sequence, where the first numeral sequence consists of the numeral clause words following the first cell name sentence segment in the first address character string, arranged in their original order, and the second numeral sequence consists of the numeral clause words following the second cell name sentence segment in the second address character string, arranged in their original order;

the judging module is used for judging whether the numeral clause words at the same position in the first numeral sequence and the second numeral sequence are numerals with the same meaning, and, if they are numerals with the same meaning, determining that the first address character string and the second address character string correspond to the same address.

13. The apparatus of claim 11, further comprising:

a storage module, configured to store, in association with one another in a cell name database, a first road name combined sentence segment formed by the first road name sentence segment and the first road number sentence segment following it, a second road name combined sentence segment formed by the second road name sentence segment and the second road number sentence segment following it, the first cell name sentence segment, and the second cell name sentence segment; the cell name database is used for searching for a cell name or a road name that matches target address information when the target address information input by a user is received.

14. The apparatus of any one of claims 9-13, wherein the calculation module comprises:

15. The apparatus of claim 14, wherein the computation submodule is configured to: determine the ratio of the number of identical characters in the first sentence segment and the second sentence segment to the total number of characters appearing in the first sentence segment and the second sentence segment as the intersection similarity; determine the ratio of the difference between the number of target characters and the target number of exchange operations to the number of target characters as the order similarity; wherein the similarity is the product of the intersection similarity and the order similarity.

16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.

17. An electronic device, comprising:

a memory having a computer program stored thereon;

a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 8.