WO2017059797A1 - Address analysis method and apparatus - Google Patents

Publication number
WO2017059797A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
morphemes
relevance
value
morpheme
Prior art date
Application number
PCT/CN2016/101447
Other languages
English (en)
French (fr)
Inventor
陆青
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Priority to SG11201802898PA
Priority to EP16853101.0A (EP3361392A4)
Publication of WO2017059797A1
Priority to US15/944,569 (US11113474B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text

Definitions

  • the present application relates to the field of computer technologies, and in particular, to an address analysis method and apparatus.
  • the first method described above depends heavily on the position of words in an English address: once the words of an English address are reversed in order or repeated, the edit distance grows, which reduces the accuracy of the address analysis;
  • the second method, although it reduces the effect of word order and repetition, depends on the first letter of each word. For English addresses from regions where English is not the native language, the first letter of some words can be spelled in several ways (for example, Hlinka and Glinka are both common English renderings of the Russian Глинка); in such cases the edit distance grows even further, so the second method has poor applicability.
  • the embodiment of the present application provides an address analysis method and device, which can improve the accuracy and applicability of address analysis.
  • an address analysis method comprising:
  • inputting a first address and a second address, the first address comprising n morphemes and the second address comprising m morphemes, wherein a morpheme is the smallest semantic unit in an address, and n and m are both natural numbers;
  • an address analysis apparatus comprising: an input unit, a determining unit, an obtaining unit, and an analyzing unit;
  • the input unit is configured to input a first address and a second address, the first address includes n morphemes, and the second address includes m morphemes, where a morpheme is the smallest semantic unit in an address, and n and m are both natural numbers;
  • the determining unit is configured to determine a first relevance value of the n morphemes and the m morphemes;
  • the obtaining unit is configured to obtain a second relevance value of the first address and the second address according to the first relevance value determined by the determining unit and a preset algorithm;
  • the analyzing unit is configured to analyze the relevance between the first address and the second address according to the second relevance value obtained by the obtaining unit.
  • the address analysis method and apparatus provided by the present application determine first relevance values between the n morphemes in the first address and the m morphemes in the second address; obtain a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm; and analyze the relevance between the first address and the second address according to the second relevance value. That is, the application first determines the first relevance values between the morphemes of the two addresses, then obtains the second relevance value between the addresses from those values and the preset algorithm, and analyzes the addresses according to the second relevance value. This avoids the heavy dependence on individual words that arises in the prior art, where the two addresses are treated as two strings and their relevance is analyzed by computing the edit distance between the strings, and thereby improves the accuracy and applicability of address analysis.
  • FIG. 2 is a schematic diagram of an apparatus for analyzing an address according to another embodiment of the present application.
  • the address analysis method and apparatus provided by the embodiment of the present application are applicable to scenarios in which the relevance between addresses is analyzed, where relevance may include difference and similarity; for example, they can be used to analyze the relevance between addresses in e-commerce transactions involving physical goods.
  • FIG. 1 is a flowchart of a method for analyzing an address provided by an embodiment of the present application.
  • the method may be executed by a device with processing capability, such as a server, a system, or an apparatus. As shown in FIG. 1, the method may specifically include:
  • Step 110 Input a first address and a second address; the first address includes n morphemes and the second address includes m morphemes, where a morpheme is the smallest semantic unit in an address, and n and m are both natural numbers.
  • the first address and the second address are defined in the same way. Taking the first address as an example, it may be a Chinese address, an English address, or the like.
  • a Chinese address must first be standardized, for example by converting traditional Chinese characters into simplified Chinese characters; the standardized address is then segmented into n words, and these n words are used as the n morphemes.
  • for an English address, the n words contained in the first address can be used directly as the n morphemes.
  • when the first address or the second address is a Chinese address, one morpheme may be one segmented word; when the first address or the second address is an English address, one morpheme may be one word.
  • Step 120 Determine a first relevance value of the n morphemes and the m morphemes.
  • the first relevance value includes a difference value and a similarity value, wherein the difference value may include an edit distance value, and the similarity value may include a Hamming distance value, a Jaccard distance value, an N-Gram distance value, a Jaro–Winkler distance value, or a cosine distance value.
  • in this description, the first relevance value is taken to be an edit distance value as an example. The edit distance value is defined for two morphemes: it is the number of edit operations (deletion, insertion, substitution, and so on) needed to turn one morpheme into the other. It can be computed with the classical edit distance algorithm or with an adjusted edit distance algorithm (for example, one that increases the difference assigned to a substitution or reduces the difference assigned to a missing vowel). For example, suppose one morpheme is cafe and the other is coffee. The transformation from cafe to coffee is cafe → caffe → coffe → coffee, which takes three edits, so the edit distance between cafe and coffee is 3.
  • when the first address or the second address is a Chinese address, the morphemes included in the first address or the second address can also be processed as follows before step 120 is performed:
  • taking the first address as an example, each morpheme (i.e., word) in the first address may be converted into pinyin or strokes, so that the processed first address includes n groups of pinyin or n groups of strokes. The first relevance values are then determined between the n groups of pinyin of the first address and the m groups of pinyin of the second address, or between the n groups of strokes of the first address and the m groups of strokes of the second address. The determination is similar to that for the n words of the first address and the m words of the second address and is not repeated here.
  • step 120 may specifically include: for each of the n morphemes, determining the first relevance value between that morpheme and each of the m morphemes, so as to obtain n×m first relevance values.
  • step 120 can also be described as the following steps: 1) take the i-th of the n morphemes in order, where i = 1, 2, …, n; 2) take the j-th of the m morphemes in order, where j = 1, 2, …, m; 3) compute the first relevance value of the i-th morpheme and the j-th morpheme; 4) traverse all n morphemes and all m morphemes to obtain n×m first relevance values.
  • taking English addresses as an example, the first address contains n words and the second address contains m words. Suppose n is 3 and the three words of the first address are X, Y, and Z, and suppose m is 4 and the four words of the second address are A, B, C, and D. For X, its first relevance values with the four morphemes A, B, C, and D are determined; the same is done for Y and for Z; in total 3×4 first relevance values are obtained. If the first relevance value of X and A is written d(A, X), the 3×4 first relevance values may be as shown in Table 1.
  • Step 130 Obtain a second relevance value of the first address and the second address according to the first relevance value and a preset algorithm.
  • the preset algorithm may include the Hungarian algorithm, an exhaustive method, or the like. In this description, the preset algorithm is taken to be the Hungarian algorithm as an example.
  • Step 130 may specifically include:
  • Step A: according to the n×m first relevance values and the preset algorithm, select from the m morphemes the second morpheme that best matches a first morpheme among the n morphemes, and record the target relevance value of the first morpheme and the second morpheme, until n target relevance values are recorded.
  • the first morpheme is any morpheme in the first address. When the first address contains n morphemes, there are n first morphemes and n second morphemes matched to them, so n target relevance values are recorded.
  • in Step A, selecting the best-matching second morpheme according to the n×m first relevance values and the preset algorithm further includes: constructing an n×m matrix from the n×m first relevance values; pre-processing the n×m matrix according to the preset algorithm; and selecting, according to the pre-processed n×m matrix, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes.
  • when the first relevance value is an edit distance value, selecting the best-matching second morpheme for each first morpheme amounts to solving an optimal matching problem over the edit distance values, that is, ensuring that every word in the first address finds a corresponding word in the second address while the overall difference value is minimized.
  • the pre-processing may include subtracting the smallest element of the row from each row element in the matrix and/or subtracting the smallest element of the column from each column element, and the like.
  • specifically, the first relevance value of the i-th morpheme and the j-th morpheme is used as the element in row i, column j of the matrix; the 3×4 first relevance values of the foregoing example are arranged into a matrix in this way.
  • in the resulting pre-processed matrix, the independent zero element of each row is also marked: the 0 in row 1, column 3, the 0 in row 2, column 1, and the 0 in row 3, column 2 are marked, which gives the optimal matching combination between the morphemes of the two addresses.
  • Table 1 can be updated to Table 2 based on the pre-processed matrix and the marked independent zero elements.
  • the marked 0s (underlined in the original table) are the independent zero elements marked in the pre-processed matrix.
  • as Table 2 shows, when the first morpheme is X, the second morpheme that best matches it is C, and the target relevance value d(C, X) of X and C is recorded; when the first morpheme is Y, the best-matching second morpheme is A, and the target relevance value d(A, Y) of Y and A is recorded; when the first morpheme is Z, the best-matching second morpheme is B, and the target relevance value d(B, Z) of Z and B is recorded. Three target relevance values are thus recorded.
  • Step B Obtain a second relevance value of the first address and the second address according to the n target relevance values.
  • the n target relevance values may be summed, and the sum of the n target relevance values is used as the second relevance value of the first address and the second address.
  • Step 140 Analyze the relevance between the first address and the second address according to the second relevance value.
  • the second relevance value may also include a difference value and a similarity value.
  • when the second relevance value determined in step 130 is a difference value, the larger the second relevance value of the first address and the second address, the less similar the two are; when the second relevance value determined in step 130 is a similarity value, the closer it is to 1 the more similar the two addresses are, and the closer it is to 0 the less similar they are.
  • in addition, when the second relevance value determined in step 130 is a similarity value, it can be converted into a difference value by taking its logarithm and negating the result.
  • analysis results of the relevance of the present application can be used as a basis for cluster analysis, word frequency analysis, and address standardization.
  • the address analysis method provided by the present application determines first relevance values between the n morphemes in the first address and the m morphemes in the second address; obtains a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm; and analyzes the relevance between the first address and the second address according to the second relevance value.
  • an address analysis device provided by the embodiment of the present application, as shown in FIG. 2, includes: an input unit 201, a determining unit 202, an obtaining unit 203, and an analyzing unit 204.
  • the input unit 201 is configured to input a first address and a second address, where the first address includes n morphemes, and the second address includes m morphemes, where the morpheme refers to the smallest semantic unit of the addresses , n and m are both natural numbers.
  • the determining unit 202 is configured to determine a first relevance value of the n morphemes and the m morphemes.
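  • the determining unit 202 is specifically configured to: for each of the n morphemes, determine the first relevance value between that morpheme and each of the m morphemes, so as to obtain n×m first relevance values.
  • the first relevance value includes: an edit distance value, a Hamming distance value, a Jaccard distance value, an N-Gram distance value, a JW distance value, or a cosine distance value.
  • the obtaining unit 203 is configured to obtain a second relevance value of the first address and the second address according to the first relevance value determined by the determining unit 202 and a preset algorithm.
  • the obtaining unit 203 is specifically configured to: select, according to the n×m first relevance values and the preset algorithm, from the m morphemes the second morpheme that best matches a first morpheme among the n morphemes, and record the target relevance value of the first morpheme and the second morpheme, until n target relevance values are recorded; and obtain the second relevance value of the first address and the second address according to the n target relevance values.
  • selecting the best-matching second morpheme according to the n×m first relevance values and the preset algorithm includes: constructing an n×m matrix from the n×m first relevance values; pre-processing the n×m matrix according to the preset algorithm; and selecting, according to the pre-processed n×m matrix, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes.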
  • the analyzing unit 204 is configured to analyze the association between the first address and the second address according to the second relevance value obtained by the obtaining unit 203.
  • in the address analysis apparatus provided by the embodiment of the present application, the input unit 201 inputs a first address and a second address, where the first address includes n morphemes and the second address includes m morphemes, a morpheme being the smallest semantic unit in an address and n and m both being natural numbers; the determining unit 202 determines first relevance values between the n morphemes and the m morphemes; the obtaining unit 203 obtains a second relevance value of the first address and the second address according to the determined first relevance values and a preset algorithm; and the analyzing unit 204 analyzes the relevance between the first address and the second address according to the second relevance value. The accuracy and applicability of address analysis can thereby be improved.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, a software module executed by a processor, or a combination of both.
  • the software module can be placed in random access memory (RAM), memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

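An address analysis method and apparatus, comprising: determining first relevance values between n morphemes in a first address and m morphemes in a second address (S120); obtaining a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm (S130); and analyzing the relevance between the first address and the second address according to the second relevance value (S140). The accuracy and applicability of address analysis can thereby be improved.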

Description

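Address analysis method and apparatus
This application claims priority to Chinese Patent Application No. 201510652677.8, filed on October 10, 2015 and entitled "Address analysis method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of computer technologies, and in particular to an address analysis method and apparatus.
Background
In conventional techniques, when addresses are analyzed, for example when the relevance between two English addresses is analyzed, the two English addresses are treated directly as two long strings, and their relevance is analyzed by computing the edit distance between the two long strings. Alternatively, the two English addresses are first pre-processed: taking one English address as an example, its words can be sorted by first letter and duplicate words then deleted. The two pre-processed English addresses are then treated as two long strings, the edit distance between them is computed, and the relevance between the two English addresses is analyzed according to that edit distance.
However, the first method depends heavily on the position of words in an English address: once the words are reversed in order or repeated, the edit distance grows, which reduces the accuracy of the address analysis. The second method reduces the effect of word order and repetition but depends on the first letter of each word. For English addresses from regions where English is not the native language, the first letter of some words can be spelled in several ways (for example, Hlinka and Glinka are both common English renderings of the Russian Глинка); in such cases the edit distance grows even further, so the second method has poor applicability.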
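Summary
The embodiments of the present application provide an address analysis method and apparatus that can improve the accuracy and applicability of address analysis.
According to a first aspect, an address analysis method is provided, the method including:
inputting a first address and a second address, where the first address contains n morphemes and the second address contains m morphemes, a morpheme being the smallest semantic unit in an address, and n and m both being natural numbers;
determining first relevance values between the n morphemes and the m morphemes;
obtaining a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm; and
analyzing the relevance between the first address and the second address according to the second relevance value.
According to a second aspect, an address analysis apparatus is provided, the apparatus including an input unit, a determining unit, an obtaining unit, and an analyzing unit;
the input unit is configured to input a first address and a second address, where the first address contains n morphemes and the second address contains m morphemes, a morpheme being the smallest semantic unit in an address, and n and m both being natural numbers;
the determining unit is configured to determine first relevance values between the n morphemes and the m morphemes;
the obtaining unit is configured to obtain a second relevance value of the first address and the second address according to the first relevance values determined by the determining unit and a preset algorithm;
the analyzing unit is configured to analyze the relevance between the first address and the second address according to the second relevance value obtained by the obtaining unit.
The address analysis method and apparatus provided by the present application determine first relevance values between the n morphemes in the first address and the m morphemes in the second address; obtain a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm; and analyze the relevance between the first address and the second address according to the second relevance value. That is, the application first determines the first relevance values between the morphemes of the two addresses, then obtains the second relevance value between the addresses from those values and the preset algorithm, and analyzes the addresses according to the second relevance value. This avoids the heavy dependence on individual words that arises in the prior art, where the two addresses are treated as two strings and their relevance is analyzed by computing the edit distance between the strings, and thereby improves the accuracy and applicability of address analysis.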
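Brief Description of the Drawings
FIG. 1 is a flowchart of an address analysis method according to one embodiment of the present application;
FIG. 2 is a schematic diagram of an address analysis apparatus according to another embodiment of the present application.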
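Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
To facilitate understanding, specific embodiments are further explained below with reference to the drawings; these embodiments do not limit the embodiments of the present application.
The address analysis method and apparatus provided by the embodiments of the present application are applicable to scenarios in which the relevance between addresses is analyzed, where relevance may include difference and similarity; for example, they can be used to analyze the relevance between addresses in e-commerce transactions involving physical goods.
It should be noted that the results of analyzing the relevance between addresses can serve as a basis for cluster analysis, word frequency analysis, address standardization, and the like.
FIG. 1 is a flowchart of an address analysis method according to one embodiment of the present application. The method may be executed by a device with processing capability, such as a server, a system, or an apparatus. As shown in FIG. 1, the method may specifically include:
Step 110: input a first address and a second address, where the first address contains n morphemes and the second address contains m morphemes, a morpheme being the smallest semantic unit in an address, and n and m both being natural numbers.
Here, the first address and the second address are defined in the same way. Taking the first address as an example, it may be a Chinese address, an English address, or the like. A Chinese address must first be standardized, for example by converting traditional Chinese characters into simplified Chinese characters; the standardized address is then segmented into n words, and these n words are used as the n morphemes. For an English address, the n words contained in the first address can be used directly as the n morphemes.
It can be understood that when the first address or the second address is a Chinese address, one morpheme may be one segmented word, and when the first address or the second address is an English address, one morpheme may be one word.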
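The English-address case of step 110 can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `english_morphemes`, the lowercasing, and the rule of splitting on any non-alphanumeric character are not taken from the application, and the standardization and word segmentation needed for Chinese addresses are not shown.

```python
import re


def english_morphemes(address: str) -> list[str]:
    """Return the words of an English address as its morphemes.

    Assumption: a word is a maximal run of ASCII letters or digits and
    case is ignored. The application only states that the words of an
    English address are used directly as morphemes.
    """
    return [word for word in re.split(r"[^0-9A-Za-z]+", address.lower()) if word]


print(english_morphemes("12 Cafe Street, Springfield"))
# ['12', 'cafe', 'street', 'springfield']
```

The number of morphemes n (or m) is then simply the length of the returned list.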
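Step 120: determine first relevance values between the n morphemes and the m morphemes.
Here, the first relevance value includes a difference value and a similarity value, where the difference value may include an edit distance value, and the similarity value may include a Hamming distance value, a Jaccard distance value, an N-Gram distance value, a Jaro–Winkler distance value, or a cosine distance value. In this description, the first relevance value is taken to be an edit distance value as an example. The edit distance value is defined for two morphemes: it is the number of edit operations (deletion, insertion, substitution, and so on) needed to turn one morpheme into the other. It can be computed with the classical edit distance algorithm or with an adjusted edit distance algorithm (for example, one that increases the difference assigned to a substitution or reduces the difference assigned to a missing vowel). For example, suppose one morpheme is cafe and the other is coffee. The transformation from cafe to coffee is cafe → caffe → coffe → coffee, which takes three edits, so the edit distance between cafe and coffee is 3.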
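The classical edit distance mentioned above can be sketched with the standard dynamic-programming (Levenshtein) recurrence. The function name `edit_distance` and the `substitution_cost` parameter are illustrative assumptions; the parameter is one hypothetical way to express an "adjusted" algorithm that weights substitutions more heavily, not the adjustment the application itself uses.

```python
def edit_distance(a: str, b: str, substitution_cost: int = 1) -> int:
    """Minimum total cost of insertions, deletions and substitutions
    needed to turn morpheme `a` into morpheme `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitute = previous[j - 1] + (0 if char_a == char_b else substitution_cost)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitute))          # keep or substitute
        previous = current
    return previous[-1]


print(edit_distance("cafe", "coffee"))  # 3, matching the cafe -> coffee example
```

With the default cost of 1 this reproduces the value 3 from the cafe/coffee example; raising `substitution_cost` makes spelling substitutions count for more than insertions and deletions.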
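It should be noted that when the first address or the second address is a Chinese address, the morphemes contained in the first address or the second address can also be processed as follows before step 120 is performed:
Taking the first address as an example, each morpheme (i.e., word) in the first address may be converted into pinyin or strokes, so that the processed first address contains n groups of pinyin or n groups of strokes. The first relevance values are then determined between the n groups of pinyin of the first address and the m groups of pinyin of the second address, or between the n groups of strokes of the first address and the m groups of strokes of the second address. The determination is similar to that for the n words of the first address and the m words of the second address and is not repeated here.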
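Step 120 may specifically include:
for each of the n morphemes, determining the first relevance value between that morpheme and each of the m morphemes, so as to obtain n×m first relevance values.
Step 120 can also be described as the following steps:
1) take the i-th of the n morphemes in order, where i = 1, 2, …, n;
2) take the j-th of the m morphemes in order, where j = 1, 2, …, m;
3) compute the first relevance value of the i-th morpheme and the j-th morpheme;
4) traverse all n morphemes and all m morphemes to obtain n×m first relevance values.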
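Taking English addresses as an example, the first address contains n words and the second address contains m words. Suppose n is 3 and the three words of the first address are X, Y, and Z, and suppose m is 4 and the four words of the second address are A, B, C, and D. For X, its first relevance values with the four morphemes A, B, C, and D are determined; the same is done for Y and for Z; in total 3×4 first relevance values are obtained. If the first relevance value of X and A is written d(A,X), the 3×4 first relevance values may be as shown in Table 1.
Table 1
     A        B        C        D
X    d(A,X)   d(B,X)   d(C,X)   d(D,X)
Y    d(A,Y)   d(B,Y)   d(C,Y)   d(D,Y)
Z    d(A,Z)   d(B,Z)   d(C,Z)   d(D,Z)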
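The traversal that fills Table 1 can be sketched as a nested loop over the two morpheme lists. The function name `relevance_matrix` and the placeholder measure `toy_distance` are assumptions made for illustration; any pairwise measure, such as an edit distance, could be passed in as `d`.

```python
from typing import Callable, Sequence


def relevance_matrix(
    first: Sequence[str],
    second: Sequence[str],
    d: Callable[[str, str], float],
) -> list[list[float]]:
    """Build the n x m table of first relevance values.

    Row i, column j holds d(second[j], first[i]), following the
    d(A,X) notation and the row/column layout of Table 1.
    """
    return [[d(b, a) for b in second] for a in first]


# Placeholder measure (an assumption): 0 for identical morphemes, 1 otherwise.
def toy_distance(a: str, b: str) -> float:
    return 0.0 if a == b else 1.0


table = relevance_matrix(["X", "Y", "Z"], ["A", "B", "C", "D"], toy_distance)
print(len(table), len(table[0]))  # 3 4
```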
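Step 130: obtain a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm.
Here, the preset algorithm may include the Hungarian algorithm, an exhaustive method, or the like. In this description, the preset algorithm is taken to be the Hungarian algorithm as an example.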
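Step 130 may specifically include:
Step A: according to the n×m first relevance values and the preset algorithm, select from the m morphemes the second morpheme that best matches a first morpheme among the n morphemes, and record the target relevance value of the first morpheme and the second morpheme, until n target relevance values are recorded.
Here, the first morpheme is any morpheme in the first address. It can be understood that when the first address contains n morphemes, there are n first morphemes and n second morphemes matched to them, so n target relevance values are recorded.
In Step A, selecting from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes, according to the n×m first relevance values and the preset algorithm, further includes:
constructing an n×m matrix from the n×m first relevance values;
pre-processing the n×m matrix according to the preset algorithm;
selecting, according to the pre-processed n×m matrix, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes.
It should be noted that when the first relevance value is an edit distance value, selecting the best-matching second morpheme for each first morpheme amounts to solving an optimal matching problem over the edit distance values, that is, ensuring that every word in the first address finds a corresponding word in the second address while the overall difference value is minimized.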
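The optimal matching problem can be illustrated with the exhaustive method, which the text names as an alternative preset algorithm; it is used here instead of the Hungarian algorithm only because it fits in a few lines. Its cost grows factorially, so it suits short addresses only. The function name `best_matching`, the requirement n ≤ m, and the matrix values are assumptions; the values were chosen to agree with the totals in the worked example (1 + 4 + 2 = 7), since the application's original matrix is given only as a figure.

```python
from itertools import permutations


def best_matching(matrix: list[list[int]]) -> tuple[tuple[int, ...], int]:
    """Exhaustively assign each row a distinct column with minimal total.

    Assumes n <= m (no more rows than columns). Returns (columns, total),
    where columns[i] is the column matched to row i.
    """
    n, m = len(matrix), len(matrix[0])
    if n > m:
        raise ValueError("this sketch assumes n <= m")

    def total_cost(columns: tuple[int, ...]) -> int:
        return sum(matrix[row][col] for row, col in enumerate(columns))

    # Try every way of giving the n rows n distinct columns out of m.
    best = min(permutations(range(m), n), key=total_cost)
    return best, total_cost(best)


# Assumed edit distances: rows X, Y, Z against columns A, B, C, D.
assumed = [
    [2, 5, 1, 2],
    [4, 4, 6, 4],
    [3, 2, 2, 4],
]
columns, total = best_matching(assumed)
print(columns, total)  # the minimal total for these assumed values is 7
```

Several assignments can share the same minimal total; the sketch returns the first one it encounters.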
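When the optimal matching problem over the edit distance values is solved, the pre-processing may include subtracting the smallest element of each row from every element of that row and/or subtracting the smallest element of each column from every element of that column, and so on.
Specifically, the first relevance value of the i-th morpheme and the j-th morpheme is used as the element in row i, column j of the matrix, so the matrix constructed from the 3×4 first relevance values of the foregoing example is as follows: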
Figure PCTCN2016101447-appb-000001
After this matrix is pre-processed according to the preset algorithm, the following matrix is obtained:
Figure PCTCN2016101447-appb-000002
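It should be noted that the matrix pre-processing described above belongs to the prior art and is not detailed here. In addition, in the final matrix the independent zero element of each row is marked: the 0 in row 1, column 3, the 0 in row 2, column 1, and the 0 in row 3, column 2 are marked, which gives the optimal matching combination between the morphemes of the two addresses. Based on the pre-processed matrix and the marked independent zero elements, Table 1 can be updated to Table 2.
Table 2
     A     B     C     D
X    1     4    [0]    1
Y   [0]    0     2     0
Z    1    [0]    0     2
Here, the 0s shown in square brackets (underlined in the original publication) are the independent zero elements marked in the pre-processed matrix. As Table 2 shows, when the first morpheme is X, the second morpheme that best matches it is C, and the target relevance value d(C,X) of X and C is recorded; when the first morpheme is Y, the best-matching second morpheme is A, and the target relevance value d(A,Y) of Y and A is recorded; when the first morpheme is Z, the best-matching second morpheme is B, and the target relevance value d(B,Z) of Z and B is recorded. Three target relevance values are thus recorded.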
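The row and column reduction used as pre-processing can be sketched as follows. The function names and the input matrix are assumptions: the application's original matrix appears only as a figure, so the values below were chosen so that the reduction reproduces Table 2 and the target values 1, 4, and 2 from the worked example.

```python
def reduce_rows(matrix: list[list[int]]) -> list[list[int]]:
    """Subtract each row's smallest element from every element of that row."""
    return [[value - min(row) for value in row] for row in matrix]


def reduce_columns(matrix: list[list[int]]) -> list[list[int]]:
    """Subtract each column's smallest element from every element of that column."""
    column_minimums = [min(column) for column in zip(*matrix)]
    return [[value - low for value, low in zip(row, column_minimums)] for row in matrix]


# Assumed first relevance values: rows X, Y, Z against columns A, B, C, D.
assumed = [
    [2, 5, 1, 2],
    [4, 4, 6, 4],
    [3, 2, 2, 4],
]
reduced = reduce_columns(reduce_rows(assumed))
for row in reduced:
    print(row)
# [1, 4, 0, 1]
# [0, 0, 2, 0]
# [1, 0, 0, 2]
```

After reduction every row contains at least one zero; choosing one zero per row in distinct columns (the independent zeros) gives the matching X/C, Y/A, Z/B described for Table 2.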
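Step B: obtain the second relevance value of the first address and the second address according to the n target relevance values.
In one example, the n target relevance values can be summed, and their sum used as the second relevance value of the first address and the second address. In the foregoing example, the second relevance value of the first address and the second address = d(C,X) + d(A,Y) + d(B,Z) = 1 + 4 + 2 = 7, where 1, 4, and 2 are the original first relevance values read from Table 1.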
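Steps 110 to 130 can be combined into one hedged end-to-end sketch for English addresses. Everything here is an illustrative assumption rather than the application's implementation: whitespace tokenization, the classical edit distance as the first relevance value, the exhaustive method as the preset algorithm, and swapping the two addresses when n > m so that every morpheme of the shorter address gets a partner.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Classical edit distance between two morphemes."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def address_difference(first: str, second: str) -> int:
    """Second relevance value as a difference: the sum of the target
    edit distances under the best one-to-one matching of morphemes."""
    a, b = first.lower().split(), second.lower().split()
    if len(a) > len(b):  # assumption: match from the shorter address
        a, b = b, a
    matrix = [[edit_distance(x, y) for y in b] for x in a]
    return min(
        sum(matrix[row][col] for row, col in enumerate(columns))
        for columns in permutations(range(len(b)), len(a))
    )


print(address_difference("Main Street Cafe", "cafe main street"))  # 0
print(address_difference("cafe road", "road coffee"))              # 3
```

The first call shows the motivation stated in the background: reordering the words of an address leaves the result at 0, whereas a whole-string edit distance would grow. Unmatched extra words in the longer address contribute nothing in this sketch, which is a simplification.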
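Step 140: analyze the relevance between the first address and the second address according to the second relevance value.
It can be understood that the second relevance value may also include a difference value and a similarity value.
It should be noted that when the second relevance value determined in step 130 is a difference value, the larger the second relevance value of the first address and the second address, the less similar the two are; when the second relevance value determined in step 130 is a similarity value, the closer it is to 1 the more similar the two addresses are, and the closer it is to 0 the less similar they are. In addition, when the second relevance value determined in step 130 is a similarity value, it can be converted into a difference value by taking its logarithm and negating the result.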
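The conversion from a similarity value to a difference value can be written as difference = −log(similarity). A minimal sketch follows; the function name, the use of the natural logarithm (the text does not specify a base), and the restriction to the interval (0, 1] are assumptions.

```python
import math


def similarity_to_difference(similarity: float) -> float:
    """Map a similarity in (0, 1] to a non-negative difference value."""
    if not 0.0 < similarity <= 1.0:
        raise ValueError("similarity must lie in (0, 1]")
    return -math.log(similarity)


print(similarity_to_difference(1.0) == 0)   # identical addresses: no difference
print(similarity_to_difference(0.5) > similarity_to_difference(0.9))  # True
```

A similarity of 1 maps to a difference of 0, and the difference grows without bound as the similarity approaches 0, which matches the ordering described above.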
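In addition, the relevance analysis results of the present application can serve as a basis for cluster analysis, word frequency analysis, address standardization, and the like.
The address analysis method provided by the present application determines first relevance values between the n morphemes in the first address and the m morphemes in the second address; obtains a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm; and analyzes the relevance between the first address and the second address according to the second relevance value. The accuracy and applicability of address analysis can thereby be improved.
In summary, in e-commerce transactions involving physical goods, and for cross-border orders in particular, delivery addresses written in English, especially those from regions where English is not the native language, pose further challenges to address relevance analysis because of subtle spelling differences, habits in writing order, and other practical circumstances. The address analysis method of the present application is therefore necessary.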
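Corresponding to the address analysis method described above, an embodiment of the present application further provides an address analysis apparatus. As shown in FIG. 2, the apparatus includes an input unit 201, a determining unit 202, an obtaining unit 203, and an analyzing unit 204.
The input unit 201 is configured to input a first address and a second address, where the first address contains n morphemes and the second address contains m morphemes, a morpheme being the smallest semantic unit in an address, and n and m both being natural numbers.
The determining unit 202 is configured to determine first relevance values between the n morphemes and the m morphemes.
The determining unit 202 is specifically configured to:
for each of the n morphemes, determine the first relevance value between that morpheme and each of the m morphemes, so as to obtain n×m first relevance values.
The first relevance value includes an edit distance value, a Hamming distance value, a Jaccard distance value, an N-Gram distance value, a JW distance value, or a cosine distance value.
The obtaining unit 203 is configured to obtain a second relevance value of the first address and the second address according to the first relevance values determined by the determining unit 202 and a preset algorithm.
The obtaining unit 203 is specifically configured to:
select, according to the n×m first relevance values and the preset algorithm, from the m morphemes the second morpheme that best matches a first morpheme among the n morphemes, and record the target relevance value of the first morpheme and the second morpheme, until n target relevance values are recorded;
obtain the second relevance value of the first address and the second address according to the n target relevance values.
Selecting from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes, according to the n×m first relevance values and the preset algorithm, includes:
constructing an n×m matrix from the n×m first relevance values;
pre-processing the n×m matrix according to the preset algorithm;
selecting, according to the pre-processed n×m matrix, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes.
The analyzing unit 204 is configured to analyze the relevance between the first address and the second address according to the second relevance value obtained by the obtaining unit 203.
In the address analysis apparatus provided by the embodiment of the present application, the input unit 201 inputs a first address and a second address, where the first address contains n morphemes and the second address contains m morphemes, a morpheme being the smallest semantic unit in an address and n and m both being natural numbers; the determining unit 202 determines first relevance values between the n morphemes and the m morphemes; the obtaining unit 203 obtains a second relevance value of the first address and the second address according to the determined first relevance values and a preset algorithm; and the analyzing unit 204 analyzes the relevance between the first address and the second address according to the second relevance value. The accuracy and applicability of address analysis can thereby be improved.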
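Those skilled in the art should further appreciate that the objects and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. The software module can be placed in random access memory (RAM), memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present application in detail. It should be understood that the foregoing is merely specific embodiments of the present application and is not intended to limit its protection scope; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its protection scope.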

Claims (10)

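  1. An address analysis method, characterized in that the method comprises:
    inputting a first address and a second address, the first address containing n morphemes and the second address containing m morphemes, wherein a morpheme is the smallest semantic unit in an address, and n and m are both natural numbers;
    determining first relevance values between the n morphemes and the m morphemes;
    obtaining a second relevance value of the first address and the second address according to the first relevance values and a preset algorithm;
    analyzing the relevance between the first address and the second address according to the second relevance value.
  2. The method according to claim 1, characterized in that determining the first relevance values between the n morphemes and the m morphemes comprises:
    for each of the n morphemes, determining the first relevance value between that morpheme and each of the m morphemes, so as to obtain n×m first relevance values.
  3. The method according to claim 2, characterized in that obtaining the second relevance value of the first address and the second address according to the first relevance values and the preset algorithm comprises:
    selecting, according to the n×m first relevance values and the preset algorithm, from the m morphemes the second morpheme that best matches a first morpheme among the n morphemes, and recording the target relevance value of the first morpheme and the second morpheme, until n target relevance values are recorded;
    obtaining the second relevance value of the first address and the second address according to the n target relevance values.
  4. The method according to claim 3, characterized in that selecting, according to the n×m first relevance values and the preset algorithm, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes comprises:
    constructing an n×m matrix from the n×m first relevance values;
    pre-processing the n×m matrix according to the preset algorithm;
    selecting, according to the pre-processed n×m matrix, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes.
  5. The method according to any one of claims 1 to 4, characterized in that the first relevance value comprises:
    an edit distance value, a Hamming distance value, a Jaccard distance value, an N-Gram distance value, a JW distance value, or a cosine distance value.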
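  6. An address analysis apparatus, characterized in that the apparatus comprises an input unit, a determining unit, an obtaining unit, and an analyzing unit;
    the input unit is configured to input a first address and a second address, the first address containing n morphemes and the second address containing m morphemes, wherein a morpheme is the smallest semantic unit in an address, and n and m are both natural numbers;
    the determining unit is configured to determine first relevance values between the n morphemes and the m morphemes;
    the obtaining unit is configured to obtain a second relevance value of the first address and the second address according to the first relevance values determined by the determining unit and a preset algorithm;
    the analyzing unit is configured to analyze the relevance between the first address and the second address according to the second relevance value obtained by the obtaining unit.
  7. The apparatus according to claim 6, characterized in that the determining unit is specifically configured to:
    for each of the n morphemes, determine the first relevance value between that morpheme and each of the m morphemes, so as to obtain n×m first relevance values.
  8. The apparatus according to claim 7, characterized in that the obtaining unit is specifically configured to:
    select, according to the n×m first relevance values and the preset algorithm, from the m morphemes the second morpheme that best matches a first morpheme among the n morphemes, and record the target relevance value of the first morpheme and the second morpheme, until n target relevance values are recorded;
    obtain the second relevance value of the first address and the second address according to the n target relevance values.
  9. The apparatus according to claim 8, characterized in that selecting, according to the n×m first relevance values and the preset algorithm, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes comprises:
    constructing an n×m matrix from the n×m first relevance values;
    pre-processing the n×m matrix according to the preset algorithm;
    selecting, according to the pre-processed n×m matrix, from the m morphemes the second morpheme that best matches the first morpheme among the n morphemes.
  10. The apparatus according to any one of claims 6 to 9, characterized in that the first relevance value comprises:
    an edit distance value, a Hamming distance value, a Jaccard distance value, an N-Gram distance value, a JW distance value, or a cosine distance value.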
PCT/CN2016/101447 2015-10-10 2016-10-08 Address analysis method and apparatus WO2017059797A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
SG11201802898PA SG11201802898PA (en) 2015-10-10 2016-10-08 Method and apparatus for address analysis
EP16853101.0A EP3361392A4 (en) 2015-10-10 2016-10-08 Method and device for analyzing address
US15/944,569 US11113474B2 (en) 2015-10-10 2018-04-03 Address analysis using morphemes

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510652677.8A CN106569994B (zh) 2015-10-10 2015-10-10 Address analysis method and apparatus
CN201510652677.8 2015-10-10

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/944,569 Continuation US11113474B2 (en) 2015-10-10 2018-04-03 Address analysis using morphemes

Publications (1)

Publication Number Publication Date
WO2017059797A1 true WO2017059797A1 (zh) 2017-04-13

Family

ID=58487353

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/101447 WO2017059797A1 (zh) 2015-10-10 2016-10-08 Address analysis method and apparatus

Country Status (5)

Country Link
US (1) US11113474B2 (zh)
EP (1) EP3361392A4 (zh)
CN (1) CN106569994B (zh)
SG (1) SG11201802898PA (zh)
WO (1) WO2017059797A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
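CN109255564B (zh) * 2017-07-13 2022-09-06 菜鸟智能物流控股有限公司 Pick-up point address recommendation method and apparatus
CN107492021A (zh) * 2017-08-28 2017-12-19 武汉奇米网络科技有限公司 Order source analysis method and apparatus
CN111159974A (zh) * 2019-12-30 2020-05-15 北京明略软件系统有限公司 Address information standardization method and apparatus, storage medium, and electronic device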

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5423032A (en) * 1991-10-31 1995-06-06 International Business Machines Corporation Method for extracting multi-word technical terms from text
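CN103399907A (zh) * 2013-07-31 2013-11-20 深圳市华傲数据技术有限公司 Method and apparatus for calculating Chinese string similarity based on edit distance
CN103425640A (zh) * 2012-05-14 2013-12-04 华为技术有限公司 Multimedia question answering system and method
CN104679728A (zh) * 2015-02-06 2015-06-03 中国农业大学 Text similarity detection method
CN104699668A (zh) * 2015-03-26 2015-06-10 小米科技有限责任公司 Method and apparatus for determining word similarity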

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6522330B2 (en) * 1997-02-17 2003-02-18 Justsystem Corporation Character processing system and method
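JP2004030323A (ja) * 2002-06-26 2004-01-29 P To Pa:Kk Information transmission system, information transmission method, and program
JP2007241764A (ja) * 2006-03-09 2007-09-20 Fujitsu Ltd Syntax analysis program, syntax analysis method, syntax analysis apparatus, and computer-readable recording medium storing the syntax analysis program
KR100691400B1 (ko) * 2006-03-31 2007-03-12 엔에이치엔(주) Method for analyzing morphemes using additional information and morpheme analyzer performing the method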
US8290961B2 (en) * 2009-01-13 2012-10-16 Sandia Corporation Technique for information retrieval using enhanced latent semantic analysis generating rank approximation matrix by factorizing the weighted morpheme-by-document matrix
TWI601081B (zh) 2013-12-30 2017-10-01 國立台灣科技大學 Method and system for evaluating the commercialization of cultural and creative designs
IN2014MU00789A (zh) * 2014-03-07 2015-09-25 Tata Consultancy Services Ltd
US20160147943A1 (en) * 2014-11-21 2016-05-26 Argo Data Resource Corporation Semantic Address Parsing Using a Graphical Discriminative Probabilistic Model
US9646061B2 (en) * 2015-01-22 2017-05-09 International Business Machines Corporation Distributed fuzzy search and join with edit distance guarantees
US20180260923A1 (en) 2015-09-09 2018-09-13 Jun Sung Lee Copyright database and copyright trading method on internet
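CN107086920A (zh) 2017-06-20 2017-08-22 无锡井通网络科技有限公司 Blockchain-based copyright confirmation method
CN107358551A (zh) 2017-07-03 2017-11-17 重庆小犀智能科技有限公司 Blockchain-based notarization system and method
CN107705114A (zh) 2017-08-31 2018-02-16 中链科技有限公司 Copyright data processing method, system, and storage medium based on blockchain technology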

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3361392A4 *

Also Published As

Publication number Publication date
EP3361392A1 (en) 2018-08-15
US11113474B2 (en) 2021-09-07
EP3361392A4 (en) 2018-11-21
US20180225282A1 (en) 2018-08-09
CN106569994B (zh) 2019-02-26
CN106569994A (zh) 2017-04-19
SG11201802898PA (en) 2018-05-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16853101

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 11201802898P

Country of ref document: SG

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016853101

Country of ref document: EP