WO2016127904A1 - Text address processing method and apparatus - Google Patents
Text address processing method and apparatus
- Publication number
- WO2016127904A1 (PCT/CN2016/073441)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- address
- addresses
- original text
- text
- feature
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9574—Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
Definitions
- The present application relates to the field of communications technologies, and in particular to a text address processing method and apparatus.
- Text addresses need to be normalized; that is, different text addresses corresponding to the same address information need to be unified into one text address.
- The existing approach to normalizing addresses is mainly to determine all the text addresses that need to be normalized, extract the standard fragments contained in each text address, calculate the correlation between text addresses based on those standard fragments, and decide from the correlation of two text addresses whether they should be normalized into one.
- Aspects of the present application provide a text address processing method and apparatus for improving the accuracy of the normalized result of text addresses.
- An aspect of the present application provides a text address processing method, including:
- determining at least one address set according to the social relationship circles of users in a service system, each address set in the at least one address set comprising at least two original text addresses;
- for each address set, normalizing the original text addresses in the address set to obtain a target text address corresponding to the address set.
- Another aspect of the present application provides a text address processing apparatus, including:
- a determining module configured to determine, according to a social relationship circle of a user in the service system, at least one address set, where each address set in the at least one address set includes: at least two original text addresses;
- a normalization module configured to perform normalization processing on the original text address in the address set for each address set to obtain a target text address corresponding to the address set.
- In the present application, at least one address set is determined according to the social relationship circles of users in the service system, and the original text addresses in each address set are then normalized, set by set, to obtain the target text address corresponding to each address set, thereby normalizing the text addresses.
- Because the original text addresses to be normalized are divided by the users' social relationship circles, on the one hand the scope of the original text addresses to be normalized is limited to a user's social relationship circle, which narrows the range of original text addresses to be normalized.
- On the other hand, compared with text addresses used by users outside the social relationship circle, the text addresses used by users within it are related to some extent, so normalization is confined to text addresses that have some connection with one another. This makes it easier to control the fault-tolerance boundary between text addresses and helps improve the accuracy of the normalized result.
- FIG. 1 is a schematic flowchart of a text address processing method according to an embodiment of the present application.
- FIG. 2 is a schematic diagram showing a normalization process according to an embodiment of the present application.
- FIG. 3 is a schematic structural diagram of a text address processing apparatus according to an embodiment of the present disclosure.
- FIG. 1 is a schematic flowchart diagram of a text address processing method according to an embodiment of the present application. As shown in Figure 1, the method includes:
- Determine at least one address set according to the social relationship circles of users in the service system, where each address set includes at least two original text addresses.
- For each address set, normalize the original text addresses in the address set to obtain a target text address corresponding to the address set.
- This embodiment provides a text address processing method, which can be executed by a text address processing apparatus.
- The method provided in this embodiment is mainly used for normalizing text addresses.
- A text address in this embodiment is a textual description of address information.
- Different text addresses may be textual descriptions of the same address information.
- A text address before normalization is referred to as an original text address.
- The text address obtained after normalization is referred to as a target text address. Both the original text address and the target text address are textual descriptions of address information.
- Text addresses are normalized only when there is a normalization requirement.
- The need to normalize text addresses usually arises for one or more service systems. Put simply, the text addresses associated with one or more service systems are normalized so that the normalized text addresses can be used to discover new services or new service needs, to perform statistical analysis, and so on.
- The service system may be any service system that involves text addresses, such as an e-commerce system, an online payment system, an instant messaging system, or an email system.
- The original text addresses to be normalized that relate to the service system need to be determined.
- The original text addresses associated with the service system are determined based on the social relationship circles of the users in the service system.
- A user's social relationship circle mainly includes other users who are associated with the user.
- From the other users associated with the user, those whose association with the user is close may be selected as the user's social relationship circle.
- The user's social relationship circle can be obtained in at least one of the following ways:
- Through instant messaging tools, which include but are not limited to WeChat and QQ; preferably, other users whose interaction frequency or communication duration with the user exceeds a certain threshold are taken as the users in the user's social relationship circle;
- Through shared devices, where a device may be a computer, a mobile phone, a WiFi network, and so on; preferably, other users whose frequency or duration of using the same device as the user exceeds a certain threshold are taken as the users in the user's social relationship circle.
- The text address processing apparatus determines at least one address set according to the social relationship circles of users in the service system, where each address set includes at least two original text addresses.
- In other words, the original text addresses related to the service system are divided into different address sets.
- The number of address sets can be determined according to the number of users in the service system; for example, one user corresponds to one address set.
- The text address processing apparatus first determines each user's social relationship circle (in the manner described above), and then takes the text addresses used by the user, together with the text addresses used by the users in that user's social relationship circle, as one address set.
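As a rough illustration of how address sets might be assembled, the sketch below builds one address set per user from that user's own addresses and those of the users in their social relationship circle. The function name `build_address_sets`, the dictionary layout, and the choice to drop sets with fewer than two addresses are assumptions made for this illustration, not details given in the application.

```python
from typing import Dict, Set


def build_address_sets(
    circles: Dict[str, Set[str]],
    used_addresses: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return one address set per user (hypothetical helper).

    circles maps a user to the users in that user's social relationship circle.
    used_addresses maps a user to the text addresses that user has used.
    """
    address_sets: Dict[str, Set[str]] = {}
    for user, circle in circles.items():
        # Start from the user's own addresses, then add the circle's addresses.
        addresses = set(used_addresses.get(user, set()))
        for other in circle:
            addresses |= used_addresses.get(other, set())
        # Assumption: a set with fewer than two addresses has nothing to
        # normalize, so it is skipped.
        if len(addresses) >= 2:
            address_sets[user] = addresses
    return address_sets
```

Because a user can belong to several circles, the same original text address can end up in more than one address set.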
- For each address set, the text address processing apparatus normalizes the original text addresses in the address set to obtain the target text address corresponding to the address set. This confines normalization to each address set, which on the one hand reduces the range of original text addresses to be normalized.
- On the other hand, compared with text addresses used by users outside the social relationship circle, the text addresses used by users within it are related to some extent, so normalization is confined to address information that has some connection. This makes it easier for the apparatus to control the fault-tolerance boundary between text addresses and helps improve the accuracy of the normalized result.
- The process by which the apparatus normalizes the original text addresses in an address set to obtain the corresponding target text address includes:
- The apparatus calculates the similarity of every two original text addresses in the address set according to their features, and determines from that similarity whether the two original text addresses can be normalized into one of the two, so as to obtain the target text address corresponding to the address set.
- There may be one or more target text addresses corresponding to an address set.
- The apparatus performs feature extraction on every two original text addresses in the address set to obtain their features, calculates the similarity between the two original text addresses from the extracted features, and then determines from the similarity whether the two can be normalized into one of them.
- The features of an original text address used in this embodiment may include at least one of a standard fragment feature, a latitude and longitude feature, and an alphanumeric feature.
- Accordingly, the apparatus performs feature extraction on every two original text addresses to obtain at least one of these features; for each such feature, calculates the similarity of the two original text addresses with respect to that feature; and then determines, from the similarity in each feature, whether the two original text addresses should be normalized into one.
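The pairwise procedure described above can be sketched as follows. This is a minimal illustration only: `normalize_set` and the `can_merge` callback are hypothetical names, and two choices are assumptions not stated in the application, namely that the address appearing earlier in the input becomes the target, and that merges are applied transitively.

```python
from itertools import combinations
from typing import Callable, Dict, List


def normalize_set(
    addresses: List[str],
    can_merge: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Map every original text address to its target text address."""
    target = {a: a for a in addresses}

    def find(a: str) -> str:
        # Follow the chain of merges until reaching an address that maps to itself.
        while target[a] != a:
            a = target[a]
        return a

    # Examine every pair of original text addresses in the set.
    for a, b in combinations(addresses, 2):
        ra, rb = find(a), find(b)
        if ra != rb and can_merge(a, b):
            # Assumption: the earlier address is kept as the target.
            target[rb] = ra
    return {a: find(a) for a in addresses}
```

The returned mapping doubles as a record of which original text addresses each target text address was normalized from.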
- The standard fragment feature reflects the standard address fragments included in the original text address.
- The original text address can be parsed structurally to obtain the standard fragments it includes.
- A text address can be divided in advance into 24 standard address fragments.
- A structural analysis of the original text address then determines which of the 24 standard fragments the original text address includes.
- The 24 standard fragments cover information such as province, city, district, development zone, and road.
- The latitude and longitude feature reflects the latitude and longitude of the address information described by the original text address.
- Gaode's geocoding technique can be used to extract the latitude and longitude feature of the original text address.
- Geocoding technology is an encoding method based on spatial positioning technology, which provides a way to convert text addresses into geographic coordinates that can be used in geographic information systems (GIS). For details, refer to the prior art.
- the alphanumeric feature may specifically reflect the letters and/or numbers contained in the original text address. This alphanumeric feature can be extracted directly from the original text address.
- the text address processing apparatus may process the standard segment features of each of the two original text addresses using the SimHash algorithm to obtain the similarity of each of the two original text addresses in the standard segment feature dimension.
- The main idea of the SimHash algorithm is feature dimensionality reduction: the high-dimensional standard fragment features are mapped to a low-dimensional standard fragment feature, and the Hamming distance between two such low-dimensional features is compared to determine whether the two text addresses they identify are duplicates or near-duplicates.
- The number of bit positions in which two codewords differ is called the Hamming distance of the two codewords. In a valid code set, the minimum Hamming distance between any two codewords is called the Hamming distance of the code set. For example, the codewords 10101 and 00110 differ in the first, fourth, and fifth bits, so their Hamming distance is 3.
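A minimal sketch of SimHash over standard fragments, followed by a Hamming-distance comparison, might look like the following. The application does not specify the per-fragment hash, the fingerprint width, fragment weights, or how distance is turned into a similarity; here MD5, 64 bits, equal weights, and `1 - distance / bits` are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable


def simhash(fragments: Iterable[str], bits: int = 64) -> int:
    """Fold a collection of fragments into a single `bits`-wide fingerprint."""
    if bits % 8 != 0 or not 8 <= bits <= 128:
        raise ValueError("bits must be a multiple of 8 between 8 and 128")
    counts = [0] * bits
    for fragment in fragments:
        digest = hashlib.md5(fragment.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each fragment votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def simhash_similarity(a: Iterable[str], b: Iterable[str], bits: int = 64) -> float:
    """Assumed mapping of Hamming distance to a similarity in [0, 1]."""
    return 1.0 - hamming_distance(simhash(a, bits), simhash(b, bits)) / bits
```

For instance, `hamming_distance(0b10101, 0b00110)` reproduces the codeword example above.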
- the text address processing apparatus may process the latitude and longitude features of each of the two original text addresses by using a latitude and longitude distance algorithm to obtain the similarity of each of the two original text addresses in the latitude and longitude feature dimension.
- the text processing apparatus may calculate the distance between the address information described by the two original text addresses according to the latitude and longitude features of the two original text addresses, and determine the similarity of the two original text addresses in the latitude and longitude feature dimension according to the distance.
- The original text address described by some users may be accurate to a point on the map, while that described by others may only be accurate to a line on the map; the precision of the descriptions varies from user to user.
- The original text address is mapped to latitude and longitude. Since all text addresses can be mapped to latitude and longitude, and latitude and longitude are relatively fine-grained, this unifies normalization at a relatively fine granularity, which helps improve the accuracy of the normalized result.
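The application does not name a specific latitude and longitude distance algorithm. One common choice is the haversine great-circle distance, sketched below; the use of haversine, the exponential decay that converts distance to a similarity, and the `scale_m` parameter are assumptions for illustration.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to guard against rounding slightly above 1.0.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))


def latlon_similarity(
    p: Tuple[float, float],
    q: Tuple[float, float],
    scale_m: float = 500.0,
) -> float:
    """Assumed similarity: 1.0 at zero distance, decaying with distance."""
    return math.exp(-haversine_m(p[0], p[1], q[0], q[1]) / scale_m)
```

A smaller `scale_m` makes the similarity fall off faster, which corresponds to a tighter tolerance between addresses.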
- The text address processing apparatus may process the alphanumeric features of every two original text addresses using a Jaccard coefficient algorithm to obtain the similarity of the two original text addresses in the alphanumeric feature dimension.
- The Jaccard coefficient is mainly used to compare the similarity and diversity of sample sets.
- The alphanumeric feature of an original text address is treated as a sample set, with the letters and/or numbers in the alphanumeric feature as the elements of the set.
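The Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below extracts letter runs and digit runs from an address and compares two such sets; the regular expression, the upper-casing of letters, and the value returned for two empty sets are assumptions for illustration.

```python
import re
from typing import Set


def alphanumeric_tokens(address: str) -> Set[str]:
    """Extract runs of ASCII letters and runs of ASCII digits as set elements."""
    return {token.upper() for token in re.findall(r"[A-Za-z]+|[0-9]+", address)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b|."""
    if not a and not b:
        # Assumption: two addresses without any letters or digits are
        # treated as identical in this dimension.
        return 1.0
    return len(a & b) / len(a | b)
```

For example, `alphanumeric_tokens("文一西路969号B座3F")` yields the elements `969`, `B`, `3`, and `F`.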
- The text address processing apparatus may determine whether two original text addresses can be normalized into one of them according to their similarity in the standard fragment feature dimension, their similarity in the latitude and longitude feature dimension, and their similarity in the alphanumeric feature dimension.
- For example, the similarity of the two original text addresses in each dimension can be compared with the corresponding threshold. If the similarity in every dimension is greater than the corresponding threshold, it is determined that the two original text addresses can be normalized into one; otherwise, it is determined that they cannot be normalized into one of them.
- Alternatively, the similarity of the two original text addresses in one particular dimension may be compared with the corresponding threshold first. If the similarity in that dimension is greater than the corresponding threshold, it is directly determined that the two original text addresses can be normalized into one of them.
- Alternatively, a weight may be configured in advance for the similarity in each dimension, and the similarities of the two original text addresses in each dimension combined numerically with the corresponding weights to obtain a processing result, which is compared with a preset threshold. If the result is greater than the threshold, it is determined that the two original text addresses can be normalized into one; otherwise, it is determined that they cannot be normalized into one of them.
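The three decision strategies can be sketched as below. The function and dimension names are hypothetical. Two points are assumptions: the text only says the similarities and weights are "numerically processed", so a weighted sum is used here, and the text does not say what happens when the prioritized dimension falls below its threshold, so this sketch simply returns False in that case.

```python
from typing import Dict


def all_dimensions_pass(sims: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Strategy 1: every dimension must exceed its own threshold."""
    return all(sims[dim] > thresholds[dim] for dim in sims)


def priority_dimension_passes(
    sims: Dict[str, float], thresholds: Dict[str, float], dim: str
) -> bool:
    """Strategy 2: one prioritized dimension decides on its own (assumed False otherwise)."""
    return sims[dim] > thresholds[dim]


def weighted_score_passes(
    sims: Dict[str, float], weights: Dict[str, float], threshold: float
) -> bool:
    """Strategy 3: assumed weighted sum of similarities against one preset threshold."""
    score = sum(weights[dim] * sims[dim] for dim in sims)
    return score > threshold
```

With similarities of 0.9 (fragment), 0.8 (latitude and longitude), and 0.4 (alphanumeric) and a threshold of 0.5 per dimension, the first strategy rejects the pair, while a weighted sum with weights 0.5, 0.3, and 0.2 against a threshold of 0.7 accepts it.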
- A user may have social relationships with several users in the service system at the same time and thus appear in the social relationship circles of several users, which means that an original text address used by that user may appear in different address sets.
- In that case, normalization can be carried out further, between address sets, to obtain a more accurate and more compact normalized result.
- During normalization, the text address processing apparatus can record the correspondence between target text addresses and original text addresses; this correspondence shows which original text addresses each target text address was normalized from.
- After obtaining the target text address corresponding to each address set, the apparatus may determine, from the correspondence formed during normalization, at least two target text addresses that correspond to the same original text address, where these target text addresses belong to different address sets, and then normalize them.
- The apparatus may obtain the standard address fragments included in the original text addresses corresponding to each of the at least two target text addresses, and then take the fragment intersection, that is, the standard address fragments common to the original text addresses corresponding to each of those target text addresses. The at least two target text addresses corresponding to the same original text address are then normalized according to the fragment intersection.
- A specific normalization process is as follows. The apparatus determines whether the fragment intersection can characterize one of the at least two target text addresses corresponding to the same original text address. If so, those target text addresses are normalized into the target text address that the fragment intersection characterizes. If not, that is, if the fragment intersection cannot characterize any of them, no normalization is performed.
- The fragment set required to characterize a target text address may be preset, and the fragment intersection compared with this preset fragment set. If the fragment intersection is consistent with the preset fragment set, it is determined that the fragment intersection can characterize one of the at least two target text addresses corresponding to the same original text address; otherwise, it is determined that the fragment intersection cannot characterize any of them.
- When the fragment intersection can characterize one of the at least two target text addresses corresponding to the same original text address, the fragment intersection and the target text address it characterizes may be stored together in a feature knowledge base. This feature knowledge base can then be used to normalize more original text addresses.
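One possible reading of the fragment-intersection check is sketched below. Fragments are modelled as `(kind, value)` pairs, and "consistent with the preset fragment set" is interpreted as "the intersection contains every required fragment kind". Both the data model and that interpretation are assumptions, as are the names `fragment_intersection` and `can_characterize`.

```python
from typing import Iterable, Set, Tuple

Fragment = Tuple[str, str]  # (fragment kind, fragment value), e.g. ("city", "Hangzhou")


def fragment_intersection(fragment_sets: Iterable[Set[Fragment]]) -> Set[Fragment]:
    """Standard address fragments common to all of the given fragment sets."""
    sets = [set(s) for s in fragment_sets]
    return set.intersection(*sets) if sets else set()


def can_characterize(intersection: Set[Fragment], required_kinds: Set[str]) -> bool:
    """Assumed check: the intersection covers every preset required fragment kind."""
    kinds_present = {kind for kind, _ in intersection}
    return required_kinds <= kinds_present
```

If the check succeeds, the intersection and the target text address it characterizes could be stored as a key-value pair in the feature knowledge base.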
- The following takes a service system that includes a first user and a second user as an example. Assume that the social relationship circle of the first user includes user A, user B, and user C, and that the social relationship circle of the second user includes user D, user E, and user F.
- The text addresses used by the first user and by the users in the first user's social relationship circle constitute a first address set, assumed to include text addresses X1, X2, and X3. There is no fixed correspondence between the first user, user A, user B, and user C and the text addresses X1, X2, and X3: one user may have used one text address, several users may have used the same text address, or one user may have used several text addresses.
- The text addresses used by the second user and by the users in the second user's social relationship circle constitute a second address set, assumed to include text addresses X2, X4, and X5.
- Likewise, there is no fixed correspondence between the second user, user D, user E, and user F and the text addresses X2, X4, and X5: one user may have used one text address, several users may have used the same text address, or one user may have used several text addresses.
- As shown in FIG. 2, the social relationship circle of the first user is determined and the first address set, which includes text addresses X1, X2, and X3, is obtained; the social relationship circle of the second user is determined and the second address set, which includes text addresses X2, X4, and X5, is obtained.
- Pairwise similarity calculation is performed on the text addresses in the first address set, and normalization is completed according to the similarity: text addresses X1 and X2 are normalized into one of X1 and X2, assumed here to be X1.
- Text address X3 is normalized to text address X3. The two target text addresses corresponding to the first address set are therefore X1 and X3, as shown in FIG. 2.
- Similarly, pairwise similarity calculation is performed on the text addresses in the second address set, and normalization is completed according to the similarity: text addresses X2 and X4 are normalized into one of X2 and X4, assumed here to be X4, and text address X5 is normalized to text address X5. The two target text addresses corresponding to the second address set are therefore X4 and X5, as shown in FIG. 2.
- Because target text addresses X1 and X4 both correspond to the same original text address X2, these two target text addresses can be normalized further.
- The two target text addresses are normalized into one of X1 and X4, assumed here to be text address X1, as shown in FIG. 2.
- In the end, the original text addresses X1, X2, X3, X4, and X5 are normalized to text addresses X1, X3, and X5.
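The example can be traced with a short sketch. The two dictionaries record the original-to-target correspondence from each address set as described above; the helper names and the `merges` dictionary that records the final X4-to-X1 decision are hypothetical.

```python
from typing import Dict, List, Set

# Original text address -> target text address, per address set.
first_set = {"X1": "X1", "X2": "X1", "X3": "X3"}
second_set = {"X2": "X4", "X4": "X4", "X5": "X5"}


def targets_sharing_an_original(mappings: List[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Find original addresses that map to two or more different targets."""
    by_original: Dict[str, Set[str]] = {}
    for mapping in mappings:
        for original, target in mapping.items():
            by_original.setdefault(original, set()).add(target)
    return {o: t for o, t in by_original.items() if len(t) >= 2}


def final_targets(mappings: List[Dict[str, str]], merges: Dict[str, str]) -> Set[str]:
    """Apply cross-set merges (old target -> new target) and collect the targets."""
    return {merges.get(t, t) for mapping in mappings for t in mapping.values()}


shared = targets_sharing_an_original([first_set, second_set])
result = final_targets([first_set, second_set], {"X4": "X1"})
```

Here `shared` identifies X2 as the original text address linking targets X1 and X4, and `result` contains the three final target text addresses X1, X3, and X5.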
- In this embodiment, the users' social relationship circles are used to divide the original text addresses to be normalized.
- On the one hand, the scope of the original text addresses to be normalized is limited to each user's social relationship circle, which narrows the range of original text addresses to be normalized.
- On the other hand, compared with text addresses used by users outside the social relationship circle, the text addresses used by users within it are related to some extent, so normalization is confined to text addresses that have some connection with one another. This makes it easier to control the fault-tolerance boundary between text addresses and helps improve the accuracy of the normalized result.
- FIG. 3 is a schematic structural diagram of a text address processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3, the apparatus includes a determination module 31 and a normalization module 32.
- the determining module 31 is configured to determine, according to a social relationship circle of the user in the service system, at least one address set, where each address set in the at least one address set includes: at least two original text addresses.
- the normalization module 32 is configured to perform normalization processing on the original text address in the address set for each address set determined by the determining module 31 to obtain a target text address corresponding to the address set.
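- There may be one or more target text addresses corresponding to an address set.
- The determining module 31 is specifically configured to: determine the social relationship circle of each user in the service system, and obtain the text addresses used by each user and the text addresses used by the users in each user's social relationship circle to form an address set.
- The normalization module 32 is specifically configured to: calculate the similarity of every two original text addresses in the address set according to their features, and determine from that similarity whether the two original text addresses can be normalized into one of the two, so as to obtain the target text address corresponding to the address set.
- When calculating the similarity of every two original text addresses according to their features, the normalization module 32 is specifically configured to: perform feature extraction on every two original text addresses in the address set to obtain at least one of a standard fragment feature, a latitude and longitude feature, and an alphanumeric feature; and, for each feature of the at least one feature, calculate the similarity of the two original text addresses with respect to that feature.
- When calculating, for each feature of the at least one feature, the similarity of the two original text addresses with respect to that feature, the normalization module 32 is specifically configured so that: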
- the standard segment feature of each of the two original text addresses is processed by using a SimHash algorithm to obtain a similarity between the two original text addresses in a standard segment feature dimension;
- the latitude and longitude feature of each of the two original text addresses is processed by using a latitude and longitude distance algorithm to obtain a similarity between the two original text addresses in the latitude and longitude feature dimension;
- the alphanumeric feature of each of the two original text addresses is processed using a Jaccard coefficient algorithm to obtain a similarity of the two original text addresses in an alphanumeric feature dimension.
- The determining module 31 is further configured to: after the normalization module 32 obtains the target text address corresponding to each address set, determine, according to the correspondence between target text addresses and original text addresses formed during normalization, at least two target text addresses corresponding to the same original text address.
- The normalization module 32 is further configured to normalize the at least two target text addresses corresponding to the same original text address.
- When normalizing the at least two target text addresses corresponding to the same original text address, the normalization module 32 is specifically configured to: obtain the fragment intersection of the standard address fragments included in the original text addresses corresponding to each of those target text addresses, and normalize the at least two target text addresses according to the fragment intersection.
- The text address processing apparatus of this embodiment may further include a feature knowledge base, configured to store the fragment intersection together with the target text address it characterizes when the fragment intersection can characterize one of the at least two target text addresses corresponding to the same original text address.
- The text address processing apparatus provided in this embodiment determines at least one address set according to the social relationship circles of users in the service system, and then normalizes the original text addresses in each address set, set by set, to obtain the target text address corresponding to each address set, thereby normalizing the text addresses. Because the apparatus divides the original text addresses to be normalized by the users' social relationship circles, on the one hand the range of original text addresses to be normalized is limited to a user's social relationship circle, which narrows that range.
- On the other hand, the text addresses used by users within a social relationship circle are related to some extent, so normalization is confined to text addresses that have some connection with one another. This makes it easier to control the fault-tolerance boundary between text addresses and helps improve the accuracy of the normalized result.
- The disclosed system, apparatus, and method may be implemented in other manners.
- The device embodiments described above are merely illustrative.
- The division into units is only a division by logical function; in actual implementation there may be other ways of dividing. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit. It can be electrical, mechanical or other form.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
- the above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium.
- The software functional unit described above is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present application.
- The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Transfer Between Computers (AREA)
Claims (17)
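- A text address processing method, comprising: determining at least one address set according to the social relationship circles of users in a service system, wherein each address set of the at least one address set includes at least two original text addresses; and, for each address set, normalizing the original text addresses in the address set to obtain a target text address corresponding to the address set.
- The method according to claim 1, wherein determining at least one address set according to the social relationship circles of users in the service system comprises: determining the social relationship circle of each user in the service system; and obtaining the text addresses used by each user and the text addresses used by the users in that user's social relationship circle to form one address set.
- The method according to claim 1, wherein normalizing the original text addresses in the address set to obtain the target text address corresponding to the address set comprises: calculating the similarity of every two original text addresses in the address set according to the features of the two original text addresses; and determining, according to the similarity of the two original text addresses, whether the two original text addresses can be normalized into one of the two, to obtain the target text address corresponding to the address set.
- The method according to claim 3, wherein calculating the similarity of every two original text addresses in the address set according to the features of the two original text addresses comprises: performing feature extraction on every two original text addresses in the address set to obtain at least one of a standard fragment feature, a latitude and longitude feature, and an alphanumeric feature of the two original text addresses; and, for each feature of the at least one feature, calculating, according to the feature, the similarity of the two original text addresses with respect to the feature.
- The method according to claim 4, wherein calculating, according to the feature, the similarity of the two original text addresses with respect to the feature comprises: if the feature is the standard fragment feature, processing the standard fragment features of the two original text addresses with the SimHash algorithm to obtain the similarity of the two original text addresses in the standard fragment feature dimension; if the feature is the latitude and longitude feature, processing the latitude and longitude features of the two original text addresses with a latitude and longitude distance algorithm to obtain the similarity of the two original text addresses in the latitude and longitude feature dimension; and if the feature is the alphanumeric feature, processing the alphanumeric features of the two original text addresses with the Jaccard coefficient algorithm to obtain the similarity of the two original text addresses in the alphanumeric feature dimension.
- The method according to any one of claims 1 to 5, further comprising, after the target text address corresponding to each address set is obtained: determining, according to the correspondence between target text addresses and original text addresses formed during normalization, at least two target text addresses that correspond to the same original text address; and normalizing the at least two target text addresses that correspond to the same original text address.
- The method according to claim 6, wherein normalizing the at least two target text addresses that correspond to the same original text address comprises: obtaining the fragment intersection of the standard address fragments contained in the original text addresses to which each of the at least two target text addresses corresponds; and normalizing the at least two target text addresses that correspond to the same original text address according to the fragment intersection.
- The method according to claim 7, wherein normalizing the at least two target text addresses that correspond to the same original text address according to the fragment intersection comprises: if the fragment intersection can characterize one of the at least two target text addresses that correspond to the same original text address, normalizing the at least two target text addresses into the target text address that the fragment intersection can characterize.
- The method according to claim 8, further comprising: if the fragment intersection can characterize one of the at least two target text addresses that correspond to the same original text address, storing the fragment intersection and the target text address that the fragment intersection can characterize, in correspondence, in a feature knowledge base.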
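- A text address processing apparatus, comprising: a determining module, configured to determine at least one address set according to the social relationship circles of users in a service system, wherein each address set of the at least one address set includes at least two original text addresses; and a normalization module, configured to, for each address set, normalize the original text addresses in the address set to obtain a target text address corresponding to the address set.
- The apparatus according to claim 10, wherein the determining module is specifically configured to: determine the social relationship circle of each user in the service system; and obtain the text addresses used by each user and the text addresses used by the users in that user's social relationship circle to form one address set.
- The apparatus according to claim 10, wherein the normalization module is specifically configured to: calculate the similarity of every two original text addresses in the address set according to the features of the two original text addresses; and determine, according to the similarity of the two original text addresses, whether the two original text addresses can be normalized into one of the two, to obtain the target text address corresponding to the address set.
- The apparatus according to claim 12, wherein the normalization module is further specifically configured to: perform feature extraction on every two original text addresses in the address set to obtain at least one of a standard fragment feature, a latitude and longitude feature, and an alphanumeric feature of the two original text addresses; and, for each feature of the at least one feature, calculate, according to the feature, the similarity of the two original text addresses with respect to the feature.
- The apparatus according to claim 13, wherein the normalization module is further specifically configured to: if the feature is the standard fragment feature, process the standard fragment features of the two original text addresses with the SimHash algorithm to obtain the similarity of the two original text addresses in the standard fragment feature dimension; if the feature is the latitude and longitude feature, process the latitude and longitude features of the two original text addresses with a latitude and longitude distance algorithm to obtain the similarity of the two original text addresses in the latitude and longitude feature dimension; and if the feature is the alphanumeric feature, process the alphanumeric features of the two original text addresses with the Jaccard coefficient algorithm to obtain the similarity of the two original text addresses in the alphanumeric feature dimension.
- The apparatus according to any one of claims 10 to 14, wherein the determining module is further configured to: after the normalization module obtains the target text address corresponding to each address set, determine, according to the correspondence between target text addresses and original text addresses formed during normalization, at least two target text addresses that correspond to the same original text address; and the normalization module is further configured to normalize the at least two target text addresses that correspond to the same original text address.
- The apparatus according to claim 15, wherein the normalization module is specifically configured to: obtain the fragment intersection of the standard address fragments contained in the original text addresses to which each of the at least two target text addresses corresponds; and normalize the at least two target text addresses that correspond to the same original text address according to the fragment intersection.
- The apparatus according to claim 16, further comprising: a feature knowledge base, configured to store, in correspondence, the fragment intersection and the target text address that the fragment intersection can characterize, when the fragment intersection can characterize one of the at least two target text addresses that correspond to the same original text address.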
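The claims name three per-feature similarity measures (SimHash, a latitude and longitude distance, and the Jaccard coefficient) and describe building one address set per user from that user's social relationship circle. The following is a minimal Python sketch of those pieces, not the applicant's implementation. All function names, the MD5-based token hash, the 64-bit fingerprint width, and the choice of the haversine formula as the distance algorithm are assumptions made for illustration; the claims do not specify how a distance is mapped to a similarity score, so the sketch returns the raw distance in kilometres.

```python
import hashlib
import math


def build_address_sets(addresses_by_user, circle_by_user):
    """Pool each user's addresses with those of the users in their circle.

    Hypothetical helper: the input shapes (dicts keyed by user id) are assumed.
    """
    sets = {}
    for user, own in addresses_by_user.items():
        pooled = set(own)
        for member in circle_by_user.get(user, ()):
            pooled.update(addresses_by_user.get(member, ()))
        if len(pooled) >= 2:  # an address set needs at least two original addresses
            sets[user] = pooled
    return sets


def simhash(tokens, bits=64):
    """Fingerprint a list of standard fragments (bits: multiple of 8, at most 128)."""
    weights = [0] * bits
    for token in tokens:
        raw = hashlib.md5(token.encode("utf-8")).digest()[: bits // 8]
        value = int.from_bytes(raw, "big")
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def simhash_similarity(tokens_a, tokens_b, bits=64):
    """1.0 for identical fingerprints, lower as the Hamming distance grows."""
    hamming = bin(simhash(tokens_a, bits) ^ simhash(tokens_b, bits)).count("1")
    return 1.0 - hamming / bits


def haversine_km(a, b):
    """Great-circle distance between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def jaccard(a, b):
    """Jaccard coefficient of two collections of alphanumeric tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

A caller would compute these scores for every pair of original text addresses in an address set and then apply its own thresholds to decide whether a pair can be normalized into one address; those thresholds are not given in the claims.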
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020177025509A KR102079860B1 (ko) | 2015-02-13 | 2016-02-04 | Text address processing method and apparatus |
SG11201706625YA SG11201706625YA (en) | 2015-02-13 | 2016-02-04 | Text address processing method and apparatus |
EP16748705.7A EP3258397A4 (en) | 2015-02-13 | 2016-02-04 | Text address processing method and apparatus |
JP2017542458A JP6594988B2 (ja) | 2015-02-13 | 2016-02-04 | Method and device for processing address text |
US15/675,177 US10795964B2 (en) | 2015-02-13 | 2017-08-11 | Text address processing method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510079914.6A CN105988988A (zh) | 2015-02-13 | 2015-02-13 | Text address processing method and apparatus |
CN201510079914.6 | 2015-02-13 |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/675,177 Continuation US10795964B2 (en) | 2015-02-13 | 2017-08-11 | Text address processing method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016127904A1 (zh) | 2016-08-18 |
Family
ID=56615030
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/073441 WO2016127904A1 (zh) | 2015-02-13 | 2016-02-04 | Text address processing method and apparatus |
Country Status (7)
Country | Link |
---|---|
US (1) | US10795964B2 (zh) |
EP (1) | EP3258397A4 (zh) |
JP (1) | JP6594988B2 (zh) |
KR (1) | KR102079860B1 (zh) |
CN (1) | CN105988988A (zh) |
SG (2) | SG10201907254XA (zh) |
WO (1) | WO2016127904A1 (zh) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
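CN109635063A (zh) * | 2018-12-06 | 2019-04-16 | 拉扎斯网络科技(上海)有限公司 | Information processing method and apparatus for an address library, electronic device, and storage medium |
CN111274811A (zh) * | 2018-11-19 | 2020-06-12 | 阿里巴巴集团控股有限公司 | Address text similarity determination method and address search method |
CN111435360A (zh) * | 2019-01-15 | 2020-07-21 | 菜鸟智能物流控股有限公司 | Address type identification method and apparatus, and electronic device |
CN111522901A (zh) * | 2020-03-18 | 2020-08-11 | 大箴(杭州)科技有限公司 | Method and apparatus for processing address information in text |
US10795964B2 (en) | 2015-02-13 | 2020-10-06 | Alibaba Group Holding Limited | Text address processing method and apparatus |
CN116402050A (zh) * | 2022-12-26 | 2023-07-07 | 北京码牛科技股份有限公司 | Address normalization and supplementation method and apparatus, electronic device, and storage medium |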
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
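CN108460046A (zh) * | 2017-02-21 | 2018-08-28 | 菜鸟智能物流控股有限公司 | Address aggregation method and device |
CN108804398A (zh) * | 2017-05-03 | 2018-11-13 | 阿里巴巴集团控股有限公司 | Method and apparatus for calculating the similarity of address texts |
CN113591453A (zh) * | 2018-04-10 | 2021-11-02 | 百融云创科技股份有限公司 | Method and system for processing the similarity of manually filled-in address texts |
CN110417841B (zh) * | 2018-04-28 | 2022-01-18 | 阿里巴巴集团控股有限公司 | Address normalization processing method, apparatus, and system, and data processing method |
CN108876440B (zh) * | 2018-05-29 | 2021-09-03 | 创新先进技术有限公司 | Region division method and server |
CN109033225A (zh) * | 2018-06-29 | 2018-12-18 | 福州大学 | Chinese address recognition system |
CN109388634B (zh) * | 2018-09-18 | 2024-05-03 | 平安科技(深圳)有限公司 | Address information processing method, terminal device, and computer-readable storage medium |
CN111488334B (zh) * | 2019-01-29 | 2023-04-14 | 阿里巴巴集团控股有限公司 | Data processing method and electronic device |
CN111723164B (zh) * | 2019-03-18 | 2023-12-12 | 阿里巴巴集团控股有限公司 | Address information processing method and apparatus |
CN110598791A (zh) * | 2019-09-12 | 2019-12-20 | 深圳前海微众银行股份有限公司 | Address similarity evaluation method, apparatus, device, and medium |
CN110851669A (zh) * | 2019-10-17 | 2020-02-28 | 清华大学 | Method and apparatus for disambiguating organization names based on geographic location information |
US11159458B1 (en) | 2020-06-10 | 2021-10-26 | Capital One Services, Llc | Systems and methods for combining and summarizing emoji responses to generate a text reaction from the emoji responses |
CN112711950A (zh) * | 2020-12-23 | 2021-04-27 | 深圳壹账通智能科技有限公司 | Address information extraction method, apparatus, device, and storage medium |
CN115225609A (zh) * | 2021-04-20 | 2022-10-21 | 大金(中国)投资有限公司 | User data processing method and apparatus, and server |
CN114048797A (zh) * | 2021-10-20 | 2022-02-15 | 盐城金堤科技有限公司 | Method, apparatus, medium, and electronic device for determining address similarity |
CN115952779B (zh) * | 2023-03-13 | 2023-09-29 | 中规院(北京)规划设计有限公司 | Location name calibration method and apparatus, computer device, and storage medium |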
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
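CN101996247A (zh) * | 2010-11-10 | 2011-03-30 | 百度在线网络技术(北京)有限公司 | Method and apparatus for constructing an address database |
CN102024024A (zh) * | 2010-11-10 | 2011-04-20 | 百度在线网络技术(北京)有限公司 | Method and apparatus for constructing an address database |
CN102955832A (zh) * | 2011-08-31 | 2013-03-06 | 深圳市华傲数据技术有限公司 | System for recognizing and standardizing mailing addresses |
CN103473289A (zh) * | 2013-08-30 | 2013-12-25 | 深圳市华傲数据技术有限公司 | Apparatus and method for completing communication addresses |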
US20140108442A1 (en) * | 2012-10-16 | 2014-04-17 | Google Inc. | Person-based information aggregation |
Family Cites Families (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003067596A (ja) | 2001-08-30 | 2003-03-07 | Fujitsu Ltd | Seller-buyer location matching device |
JP3803961B2 (ja) * | 2001-12-05 | 2006-08-02 | 日本電信電話株式会社 | Database generation device, database generation processing method, and database generation program |
US7885901B2 (en) * | 2004-01-29 | 2011-02-08 | Yahoo! Inc. | Method and system for seeding online social network contacts |
US7743048B2 (en) * | 2004-10-29 | 2010-06-22 | Microsoft Corporation | System and method for providing a geographic search function |
JP4687089B2 (ja) * | 2004-12-08 | 2011-05-25 | 日本電気株式会社 | Duplicate record detection system and duplicate record detection program |
US20140230030A1 (en) * | 2006-11-22 | 2014-08-14 | Raj Abhyanker | Method and apparatus for geo-spatial and social relationship analysis |
US8050690B2 (en) | 2007-08-14 | 2011-11-01 | Mpanion, Inc. | Location based presence and privacy management |
US20090319515A1 (en) * | 2008-06-02 | 2009-12-24 | Steven Minton | System and method for managing entity knowledgebases |
US20120317217A1 (en) * | 2009-06-22 | 2012-12-13 | United Parents Online Ltd. | Methods and systems for managing virtual identities |
US20120051657A1 (en) * | 2010-08-30 | 2012-03-01 | Microsoft Corporation | Containment coefficient for identifying textual subsets |
KR101556714B1 (ko) * | 2011-01-03 | 2015-10-02 | 네이버 주식회사 | Method, system, and computer-readable recording medium for providing search results |
US20120215853A1 (en) * | 2011-02-17 | 2012-08-23 | Microsoft Corporation | Managing Unwanted Communications Using Template Generation And Fingerprint Comparison Features |
KR20120124581A (ko) | 2011-05-04 | 2012-11-14 | 엔에이치엔(주) | Improved method, apparatus, and computer-readable recording medium for detecting similar documents |
US8676937B2 (en) * | 2011-05-12 | 2014-03-18 | Jeffrey Alan Rapaport | Social-topical adaptive networking (STAN) system allowing for group based contextual transaction offers and acceptances and hot topic watchdogging |
US8515964B2 (en) * | 2011-07-25 | 2013-08-20 | Yahoo! Inc. | Method and system for fast similarity computation in high dimensional space |
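JP5866176B2 (ja) * | 2011-10-31 | 2016-02-17 | 日本郵便株式会社 | Address book management system, address book management method, and address book management program |
JP5676517B2 (ja) | 2012-04-12 | 2015-02-25 | 日本電信電話株式会社 | Character string similarity calculation device, method, and program |
CN103425648B (zh) * | 2012-05-15 | 2016-04-13 | 腾讯科技(深圳)有限公司 | Relationship circle processing method and system |
CN103428164B (zh) * | 2012-05-15 | 2015-07-01 | 腾讯科技(深圳)有限公司 | Method and system for dividing user social network relationship circles |
CN102682128B (zh) * | 2012-05-17 | 2017-08-29 | 厦门雅迅网络股份有限公司 | Deduplication method for point-of-interest information |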
US20140214895A1 (en) * | 2013-01-31 | 2014-07-31 | Inplore | Systems and method for the privacy-maintaining strategic integration of public and multi-user personal electronic data and history |
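CN105320657A (zh) * | 2014-05-30 | 2016-02-10 | 中国电信股份有限公司 | Point-of-interest data fusion method and system |
CN104660581A (zh) * | 2014-11-28 | 2015-05-27 | 华为技术有限公司 | Method, apparatus, and system for determining target users for a service policy |
CN105988988A (zh) | 2015-02-13 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Text address processing method and apparatus |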
US10769426B2 (en) * | 2015-09-30 | 2020-09-08 | Microsoft Technology Licensing, Llc | Inferring attributes of organizations using member graph |
2015
- 2015-02-13: CN application CN201510079914.6A, published as CN105988988A (zh), status: Pending
2016
- 2016-02-04: EP application EP16748705.7A, published as EP3258397A4 (en), status: Withdrawn
- 2016-02-04: JP application JP2017542458A, published as JP6594988B2 (ja), status: Active
- 2016-02-04: WO application PCT/CN2016/073441, published as WO2016127904A1 (zh), status: Application Filing
- 2016-02-04: SG application SG10201907254XA, published as SG10201907254XA (en), status: unknown
- 2016-02-04: KR application KR1020177025509A, published as KR102079860B1 (ko), status: IP Right Grant
- 2016-02-04: SG application SG11201706625YA, published as SG11201706625YA (en), status: unknown
2017
- 2017-08-11: US application US15/675,177, published as US10795964B2 (en), status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
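CN101996247A (zh) * | 2010-11-10 | 2011-03-30 | 百度在线网络技术(北京)有限公司 | Method and apparatus for constructing an address database |
CN102024024A (zh) * | 2010-11-10 | 2011-04-20 | 百度在线网络技术(北京)有限公司 | Method and apparatus for constructing an address database |
CN102955832A (zh) * | 2011-08-31 | 2013-03-06 | 深圳市华傲数据技术有限公司 | System for recognizing and standardizing mailing addresses |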
US20140108442A1 (en) * | 2012-10-16 | 2014-04-17 | Google Inc. | Person-based information aggregation |
CN103473289A (zh) * | 2013-08-30 | 2013-12-25 | 深圳市华傲数据技术有限公司 | Apparatus and method for completing communication addresses |
Non-Patent Citations (1)
Title |
---|
See also references of EP3258397A4 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10795964B2 (en) | 2015-02-13 | 2020-10-06 | Alibaba Group Holding Limited | Text address processing method and apparatus |
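CN111274811A (zh) * | 2018-11-19 | 2020-06-12 | 阿里巴巴集团控股有限公司 | Address text similarity determination method and address search method |
CN111274811B (zh) * | 2018-11-19 | 2023-04-18 | 阿里巴巴集团控股有限公司 | Address text similarity determination method and address search method |
CN109635063A (zh) * | 2018-12-06 | 2019-04-16 | 拉扎斯网络科技(上海)有限公司 | Information processing method and apparatus for an address library, electronic device, and storage medium |
CN111435360A (zh) * | 2019-01-15 | 2020-07-21 | 菜鸟智能物流控股有限公司 | Address type identification method and apparatus, and electronic device |
CN111435360B (zh) * | 2019-01-15 | 2023-08-29 | 菜鸟智能物流控股有限公司 | Address type identification method and apparatus, and electronic device |
CN111522901A (zh) * | 2020-03-18 | 2020-08-11 | 大箴(杭州)科技有限公司 | Method and apparatus for processing address information in text |
CN111522901B (zh) * | 2020-03-18 | 2023-10-20 | 大箴(杭州)科技有限公司 | Method and apparatus for processing address information in text |
CN116402050A (zh) * | 2022-12-26 | 2023-07-07 | 北京码牛科技股份有限公司 | Address normalization and supplementation method and apparatus, electronic device, and storage medium |
CN116402050B (zh) * | 2022-12-26 | 2023-11-10 | 北京码牛科技股份有限公司 | Address normalization and supplementation method and apparatus, electronic device, and storage medium |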
Also Published As
Publication number | Publication date |
---|---|
KR102079860B1 (ko) | 2020-02-20 |
KR20170117481A (ko) | 2017-10-23 |
US10795964B2 (en) | 2020-10-06 |
SG10201907254XA (en) | 2019-09-27 |
SG11201706625YA (en) | 2017-09-28 |
CN105988988A (zh) | 2016-10-05 |
US20170337292A1 (en) | 2017-11-23 |
JP2018510410A (ja) | 2018-04-12 |
EP3258397A1 (en) | 2017-12-20 |
EP3258397A4 (en) | 2017-12-20 |
JP6594988B2 (ja) | 2019-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
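WO2016127904A1 (zh) | Text address processing method and apparatus | |
WO2017215370A1 (zh) | Method and apparatus for constructing a decision model, computer device, and storage device | |
US20170286534A1 (en) | User location profile for personalized search experience | |
US10311288B1 (en) | Determining identity of a person in a digital image | |
TWI703862B (zh) | Content recommendation method and apparatus | |
WO2019024496A1 (zh) | Enterprise recommendation method and application server | |
CN105190595A (zh) | Uniquely identifying a network-connected entity | |
WO2017059717A1 (zh) | Method and system for identifying user information in a social network | |
US11163726B2 (en) | Context aware delta algorithm for genomic files | |
US8725762B2 (en) | Preventing leakage of information over a network | |
US11368901B2 (en) | Method for identifying a type of a wireless hotspot and a network device thereof | |
WO2016101811A1 (zh) | Information ranking method and apparatus | |
WO2020257993A1 (zh) | Content push method and apparatus, server, and storage medium | |
JP2019530046A (ja) | Collecting user information from a computer system | |
CN111160847A (zh) | Method and apparatus for processing flow information | |
CN115145587A (zh) | Product parameter verification method and apparatus, electronic device, and storage medium | |
CN110599278B (zh) | Method and apparatus for aggregating device identifiers, and computer storage medium | |
CN111767481B (zh) | Access processing method, apparatus, device, and storage medium | |
KR101798377B1 (ko) | Method and apparatus for de-identifying personal information | |
US9449110B2 (en) | Geotiles for finding relevant results from a geographically distributed set | |
CN110574018A (zh) | Managing asynchronous analytics operations based on communication exchange | |
US20180046656A1 (en) | Constructing filterable hierarchy based on multidimensional key | |
WO2020233093A1 (zh) | Association graph generation method and apparatus, computer device, and storage medium | |
WO2016021039A1 (ja) | k-anonymization processing system and k-anonymization processing method | |
CN113051293A (zh) | Tree-structure-based resource query method and apparatus, and electronic device |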
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 16748705; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2017542458; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | ENP | Entry into the national phase | Ref document number: 20177025509; Country of ref document: KR; Kind code of ref document: A |
 | REEP | Request for entry into the european phase | Ref document number: 2016748705; Country of ref document: EP |