CN110147393A - The entity resolution method in data-oriented space - Google Patents

Info

Publication number
CN110147393A
Authority
CN
China
Prior art keywords
attribute
record
similarity
mapping
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910435269.5A
Other languages
Chinese (zh)
Other versions
CN110147393B (en)
Inventor
周连科
赵昱杰
张毅
苏畅
王红滨
王念滨
崔琎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201910435269.5A
Publication of CN110147393A
Application granted
Publication of CN110147393B
Expired - Fee Related (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An entity resolution method oriented to data space; the invention relates to entity resolution methods. The purpose of the invention is to solve the problem that existing entity resolution in a data space has to compare records pairwise, which wastes resources because record pairs from different domains have a very low probability of matching. The process is: Step 1, construct a record graph; Step 2, simplify the record graph with a pruning method; Step 3, partition the pruned record graph into blocks; Step 4, establish an attribute mapping cluster; Step 5, calculate the goodness of each attribute mapping set; Step 6, after the goodness of each mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block. The invention is used in the field of data entity resolution.

Description

Entity Resolution Method Oriented to Data Space

Technical field

The present invention relates to entity resolution methods.

Background art

Entity resolution is the process of identifying different descriptions of the same entity. It aims to ensure data quality and is a key technology in data cleaning, data integration and data mining [1] (Vasilis Efthymiou, Kostas Stefanidis, Vassilis Christophides. Big Data Entity Resolution: From Highly to Somehow Similar Entity Descriptions in the Web [C]// Proceedings of the 2015 IEEE International Conference on Big Data, 2015, 11(1): 401-410). Most traditional entity resolution work relies on schema or semantic mappings between data. A data space is a new approach to data integration: it has no strict data schema or semantic mapping, but gradually incorporates data and establishes relationships according to the needs of its subject. It is a heterogeneous data collection whose data come from multiple data sources [2] (GE Jingjun, HU Changjun, LIU Xin. Virtual Data Space Sharing Model for Domain Science [J]. 小型微型计算机系统 (Minicomputer System), 2014, 35(3): 514-519). Entity resolution in a data space therefore loses one of its most powerful tools, semantic mapping. Entity resolution also has to compare records; record pairs from different domains have a very low probability of matching, so pairwise comparison wastes resources.

Summary of the invention

The purpose of the present invention is to solve the problem that existing entity resolution in a data space has to compare records pairwise, which wastes resources because record pairs from different domains have a very low probability of matching. To that end, an entity resolution method oriented to data space is proposed.

The specific process of the data-space-oriented entity resolution method is:

Step 1. Construct a record graph;

Step 2. Simplify the record graph with a pruning method;

Step 3. Partition the pruned record graph into blocks;

Step 4. Establish an attribute mapping cluster;

Step 5. Calculate the goodness of each attribute mapping set;

Step 6. After the goodness of each mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block.

The beneficial effects of the present invention are:

The invention proposes a blocking technique [3] (Batya Kenig, Avigdor Gal. MFIBlocks: An effective blocking algorithm for entity resolution [J]. Information Systems, 2013, 38(6): 908-926): a low-cost computation pre-screens the data, records that may belong to the same entity are placed in one block, and records are compared only within a block. This solves the problem that existing entity resolution in a data space has to compare records pairwise, which wastes resources because record pairs from different domains have a very low probability of matching.

This work studies, for data spaces, the theory of entity resolution over multi-source heterogeneous data. Even without semantic mapping, two records that refer to the same entity share some attribute values; this observation is combined with the relationships between records to construct a record graph. For record collections in different situations, the record graph is simplified with the applicable pruning method, and an algorithm that partitions the pruned record graph into blocks is proposed.

When entity resolution is performed within a block, attribute values are used to map attributes. By obtaining the information denoted by the attribute names of all data records in the block, records that share values with existing data in the block but still do not match are separated out. A method similar to regular expressions is proposed to calculate the similarity of attribute values, and the values of the mapped attributes of matching records are merged so that more complete entity information is returned to the user.

Experiments verify that the method proposed in the invention has a positive effect on entity resolution.

Description of drawings

Fig. 1 is a flowchart of constructing the record graph according to the invention;

Fig. 2 is a flowchart of pruning the record graph according to the invention;

Fig. 3 is a flowchart of blocking based on the pruned record graph;

Fig. 4a is the data diagram for heterogeneous attribute mapping;

Fig. 4b is the global attribute mapping diagram for heterogeneous attribute mapping;

Fig. 4c is the goodness calculation diagram of the attribute mapping cluster for heterogeneous attribute mapping;

Fig. 5a is a graph of how the two methods vary on the two data sets;

Fig. 5b is a graph of how the two methods vary on the two data sets;

Fig. 6 is a comparison of the entities generated by the two algorithms.

Detailed description of the embodiments

Embodiment 1: The specific process of the data-space-oriented entity resolution method of this embodiment is:

Step 1. Construct a record graph;

Step 2. Simplify the record graph with a pruning method;

Step 3. Partition the pruned record graph into blocks;

Step 4. Establish an attribute mapping cluster;

Step 5. Calculate the goodness of each attribute mapping set;

Step 6. After the goodness of each mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block, so as to exclude data records that were mistakenly placed in the block but refer to other entities.

Embodiment 2: This embodiment differs from Embodiment 1 in how the record graph is constructed in Step 1; the specific process is:

A tag-based method is used to represent data records: each data record is treated as a set of attribute values. This rests on a common-sense assumption [4] (S. Prabhakar Benny, S. Vasavi Dr, P. Anupriya. Hadoop Framework For Entity Resolution Within High Velocity Streams [J]. Procedia Computer Science, 2016, 85: 550-557): if two records refer to the same entity, they necessarily share some attribute values. The relationships between data records are also taken into account to improve accuracy [5] (XIAO Qihua, CHEN Ke, HUANG Dongmei. Data Spatial Feature Extraction Method Considering Spatial Relevance [J]. Computer Simulation, 2014, 31(12): 425-428, 433). A record graph model represents the record nodes of the data space and the relationships among them: the similarity between two records is calculated, an edge is drawn between the two records, and the edge weight is the similarity value. This tag-style blocking method has a simple representation that only needs the attribute values of the records and does not depend on a fixed data schema or strongly mapped semantics, so it is widely applicable to the heterogeneous data sets of a data space.

Let the record set be R = {r1{FullName: Tom Lloyd Malik; Job: producer, Actor; Address: L.A.}, r2{Name: Tom Malik; Producer; birthPlace: L.A.}, r3{Label: MikeStyles; Profession: producer; Place_of_birth: L.A.; Place_of_birth: 1964}, r4{MikeHarry Styles; birthPlace: L.A.; Gender: male}, r5{FullName: Harry Green; Address: LOS; Sex: male; Profession: Writher}, r6{Label: Harry Green; Gender: male; birthYear: 1980; married}}. An overview of the record-graph blocking method based on the record set R is shown in Fig. 1, Fig. 2 and Fig. 3 (to keep the example figures simple, the relationships among the records are not marked).

Step 1.1. Calculate the similarity between two records;

Step 1.2. Construct the record graph from the records of the data space and their similarities.

Other steps and parameters are the same as in Embodiment 1.

Embodiment 3: This embodiment differs from Embodiment 1 or 2 in how the similarity between two records is calculated in Step 1.1; the specific process is:

The tag conversion function tag() converts a record into a tag set (tag: r_i → T(r_i)). The tag similarity of two records is the ratio of the size of the intersection of their tag sets to the size of their union.

Step 1.1.1. Calculate the tag similarity:

The tag conversion function tag() converts each record into a tag set, and the tag similarity of the two records, denoted sim_tag(r_i, r_j), is calculated:

sim_tag(r_i, r_j) = |T(r_i) ∩ T(r_j)| / |T(r_i) ∪ T(r_j)|

where T(r_i) is the normalized tag set obtained by applying the tag conversion function to record r_i, and T(r_j) is the normalized tag set obtained by applying it to record r_j;
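The tag similarity can be illustrated with a minimal Python sketch. The tokenization inside `tag()` (lower-casing and splitting attribute values into alphanumeric tokens) and the placeholder key `"unnamed"` for the value that has no attribute name in r2 are assumptions made for illustration; the source does not specify how tags are normalized.

```python
import re

def tag(record):
    """Convert a record (dict: attribute name -> value) into a normalized tag set.

    Assumption: tags are the lower-cased alphanumeric tokens of the values only.
    """
    tags = set()
    for value in record.values():
        tags.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return tags

def sim_tag(r_i, r_j):
    """Ratio of intersection size to union size of the two tag sets."""
    t_i, t_j = tag(r_i), tag(r_j)
    union = t_i | t_j
    return len(t_i & t_j) / len(union) if union else 0.0

# r1 and r2 from the example record set R
r1 = {"FullName": "Tom Lloyd Malik", "Job": "producer, Actor", "Address": "L.A."}
r2 = {"Name": "Tom Malik", "unnamed": "Producer", "birthPlace": "L.A."}
print(sim_tag(r1, r2))  # 5 shared tags out of 7 distinct tags
```

Attribute names play no part in this comparison, which is what lets it work without a schema or semantic mapping.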

Step 1.1.2. Calculate the relation similarity:

The relation similarity, denoted sim_rel(r_i, r_j), integrates the similarity of the two records over all relations they take part in:

where Nbr(r_i) denotes the set of records connected to record r_i under relation rel, Nbr(r_j) denotes the set of records connected to record r_j under relation rel, and REL denotes the set of all record relations that appear on records r_i and r_j, such as collaboration, teacher-student and teaching relations;

Step 1.1.3. Combine the tag similarity and the relation similarity to obtain the overall similarity sim(r_i, r_j):

sim(r_i, r_j) = α·sim_tag(r_i, r_j) + (1-α)·sim_rel(r_i, r_j)

where α is the weight of the tag similarity.
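As a sketch of how the weighted combination behaves, the Python below mixes a tag similarity with a relation similarity. The exact formula for the relation similarity is not reproduced in this text, so `sim_rel` here is only an assumed stand-in (the mean intersection-over-union of the neighbour sets across all relations on either record); the function names and the default `alpha=0.5` are likewise hypothetical.

```python
def jaccard(a, b):
    """Intersection-over-union of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sim_rel(nbr_i, nbr_j):
    """ASSUMED form of the relation similarity, not taken from the source.

    nbr_x maps a relation name to the set of record ids connected to the
    record under that relation; the result is the mean Jaccard overlap.
    """
    relations = set(nbr_i) | set(nbr_j)
    if not relations:
        return 0.0
    total = sum(jaccard(nbr_i.get(rel, set()), nbr_j.get(rel, set()))
                for rel in relations)
    return total / len(relations)

def sim(tags_i, tags_j, nbr_i, nbr_j, alpha=0.5):
    """Overall similarity: alpha * tag similarity + (1 - alpha) * relation similarity."""
    return alpha * jaccard(tags_i, tags_j) + (1 - alpha) * sim_rel(nbr_i, nbr_j)

score = sim({"tom", "malik", "producer"}, {"tom", "malik"},
            {"coauthor": {"r3", "r4"}},
            {"coauthor": {"r3"}, "teaches": {"r5"}},
            alpha=0.6)
```

With `alpha` close to 1 the result is driven by shared attribute values; lowering it gives more influence to shared neighbours in the relation graph.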

Other steps and parameters are the same as in Embodiment 1 or 2.

Embodiment 4: This embodiment differs from any of Embodiments 1 to 3 in how the record graph is constructed from the records of the data space and their similarities in Step 1.2; the specific process is:

Given the record set R of a data space, construct an undirected graph G = (R, E), called the record graph;

where R is the record set, representing the records in the data space, and E is the edge set; an edge between two records represents the similarity of that record pair.

Once the record graph has been constructed, record pairs joined by edges with small weights have a low probability of matching. Processing the edges of the record graph reduces unnecessary record-pair comparisons.
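A minimal Python sketch of building such a record graph follows. The names `build_record_graph` and `jaccard` are hypothetical, and omitting edges whose similarity is zero is an assumption; the source only states that edges carry the pairwise similarity.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_record_graph(records, similarity):
    """Build the weighted edge set of an undirected record graph.

    records: dict record id -> record representation
    similarity: callable taking two record representations, returning a float
    Returns a dict frozenset({id_a, id_b}) -> edge weight.
    Assumption: pairs with zero similarity get no edge.
    """
    edges = {}
    for a, b in combinations(sorted(records), 2):
        weight = similarity(records[a], records[b])
        if weight > 0:
            edges[frozenset((a, b))] = weight
    return edges

# Records are given here directly as tag sets.
records = {"r1": {"tom", "malik", "la"}, "r2": {"tom", "malik"}, "r3": {"harry"}}
edges = build_record_graph(records, jaccard)
```

Using `frozenset` keys makes the edge (r1, r2) identical to (r2, r1), which matches the undirected graph G = (R, E).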

Other steps and parameters are the same as in any of Embodiments 1 to 3.

Embodiment 5: This embodiment differs from any of Embodiments 1 to 4 in how the record graph is simplified by pruning in Step 2; the specific process is:

Edges with small weights are deleted according to a rule, that is, the graph is pruned to reduce the number of redundant matching operations.

To simplify the following description, the point region is defined first.

Definition 4. Point region: a record r appears in the record graph G as a node. The region formed by this record node, the neighbor records that share an edge with it, and the edges that connect them is called its point region. The point region is a subgraph of the record graph G, denoted G_r = {{r} ∪ R_r, E_r}, where R_r is the set of neighbor nodes connected to r by an edge and E_r is the set of edges connecting r to its neighbor nodes.

The pruning process has two main components: the pruning center and the pruning rule.

Pruning centers fall into two classes. Edge-centered pruning traverses the edge set of the graph to select the globally best pairs to compare, filtering out the edges that do not satisfy the pruning rule. Node-centered pruning traverses all nodes of the graph and, for each record node, finds within its point region the best set of pairs to compare, that is, the records connected to it by the edges with the largest weights.

Pruning rules are divided by function into weight thresholds and cardinality thresholds. A weight threshold specifies the minimum weight of a retained edge, and all edges below it are deleted. A cardinality threshold gives the maximum number of edges retained in the graph, keeping the edges with the top-k weights. A cardinality threshold bounds the number of pairs to compare and suits applications with limited time resources. A weight threshold decides whether to prune the edge joining a record pair according to the matching probability of the pair itself and suits applications that value effectiveness. Pruning rules are also divided by scope into global and local thresholds. A global threshold applies to the whole graph, that is, to all its edges; a local threshold applies to a subset of the graph, namely the point region of one node.

The pruning method is one of edge-centered cardinality pruning, node-centered cardinality pruning, edge-centered threshold pruning and node-centered threshold pruning;

(1) Edge-centered cardinality pruning:

The global cardinality threshold k specifies the total number of edges to retain in the record graph: the k edges with the largest weights are kept. The edge set can be sorted by weight in descending order so that low-weight edges are deleted efficiently.
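Edge-centered cardinality pruning can be sketched in a few lines of Python. The function name `prune_edges_top_k` and the dictionary representation of the edge set are assumptions for illustration; a heap-based selection is used here in place of a full descending sort, which gives the same retained edges.

```python
import heapq

def prune_edges_top_k(edges, k):
    """Keep the k heaviest edges of the whole graph.

    edges: dict (u, v) -> weight
    """
    kept = heapq.nlargest(k, edges.items(), key=lambda item: item[1])
    return dict(kept)

edges = {("r1", "r2"): 0.9, ("r1", "r3"): 0.2, ("r2", "r3"): 0.5, ("r3", "r4"): 0.7}
pruned = prune_edges_top_k(edges, k=2)
```

Because k is fixed in advance, the number of record pairs compared later is bounded regardless of how the weights are distributed.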

(2) Node-centered cardinality pruning:

For each node r_i, the edges with the top-k weights among those connected to r_i (the k edges with the largest weights) are kept. For node r_i, take its point region G_ri = {{r_i} ∪ R_ri, E_ri} and the cardinality threshold k_ri of the current record node, traverse the edges of this subgraph, keep the top-k edges by weight (of the n edges connected to r_i, the k with the highest weights are kept) and delete the other edges connected to r_i;

where R_ri is the set of neighbor nodes connected to record node r_i by an edge, E_ri is the set of edges connecting record node r_i to its neighbor nodes, and the point region G_ri is a subgraph of the record graph G;

In general, the cardinality threshold of each node should depend on the size of the edge set of its point region (for example k_ri = 0.1 × |E_ri|).
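The node-centered variant can be sketched as below. The sketch processes nodes one after another and deletes edges as it goes, which is one literal reading of the description; the processing order (sorted node ids), rounding k_ri up, and keeping at least one edge per node are assumptions, as is the function name.

```python
import math

def prune_node_top_k(edges, ratio=0.1):
    """For each node, keep only its k heaviest remaining incident edges.

    edges: dict (u, v) -> weight
    k for a node is ceil(ratio * number of incident edges), at least 1
    (rounding and the minimum of 1 are assumptions).
    """
    remaining = dict(edges)
    nodes = sorted({node for edge in edges for node in edge})
    for node in nodes:
        incident = sorted((e for e in remaining if node in e),
                          key=lambda e: remaining[e], reverse=True)
        k = max(1, math.ceil(ratio * len(incident)))
        for edge in incident[k:]:
            del remaining[edge]
    return remaining

edges = {("a", "b"): 0.9, ("a", "c"): 0.4, ("a", "d"): 0.3, ("b", "c"): 0.8}
pruned = prune_node_top_k(edges, ratio=0.5)
```

In this example node a keeps its two heaviest edges and node b then keeps only (a, b), so the edge (b, c) is dropped even though its weight is high; this order dependence is a property of the sequential reading used here.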

(3) Edge-centered threshold pruning:

A weight threshold is applied globally: a minimum edge weight w_min is chosen (determined experimentally), all edges in the graph are traversed, and edges whose weight is below w_min are deleted;

The method traverses every edge in the graph, deletes the edges whose weight is below the preset threshold w_min, and keeps and outputs the remaining edges. In general, the weights between matching records are larger than those between non-matching records, so the goal in choosing w_min is to find the balance point between the two.

(4) Node-centered threshold pruning:

Regarding the pruning scope: if a global threshold is used, one uniform threshold is applied to all nodes of the record graph, and the pruning process is then the same as edge-centered threshold pruning. If local thresholds are used, a specific threshold can be chosen for particular nodes according to the user's needs; in effect, edge-centered threshold pruning is applied to the point region of node r_i. The main difference from edge-centered threshold pruning is that a different threshold can be used for each node. The method first obtains the point region of r_i, G_ri = {{r_i} ∪ R_ri, E_ri}, then takes the minimum edge weight for pruning this subgraph from the local threshold supplied as input, and finally traverses the edges in E_ri and deletes those whose weight is below the set threshold.
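Both threshold variants can be covered by one short Python sketch. The function name and the `local_thresholds` parameter are hypothetical; an edge is kept only if it meets the threshold of both of its endpoints, which follows from deleting below-threshold edges in every node's point region.

```python
def prune_by_threshold(edges, w_min, local_thresholds=None):
    """Delete edges whose weight is below the applicable threshold.

    edges: dict (u, v) -> weight
    w_min: global minimum edge weight
    local_thresholds: optional dict node -> threshold that replaces w_min
        inside that node's point region
    """
    local_thresholds = local_thresholds or {}
    kept = {}
    for (u, v), weight in edges.items():
        t_u = local_thresholds.get(u, w_min)
        t_v = local_thresholds.get(v, w_min)
        if weight >= t_u and weight >= t_v:
            kept[(u, v)] = weight
    return kept

edges = {("a", "b"): 0.6, ("b", "c"): 0.3, ("c", "d"): 0.45}
global_pruned = prune_by_threshold(edges, w_min=0.4)
local_pruned = prune_by_threshold(edges, w_min=0.4, local_thresholds={"d": 0.5})
```

Without `local_thresholds` the function reduces to edge-centered threshold pruning; supplying a stricter value for node d removes the edge (c, d) while leaving the rest of the graph under the global threshold.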

Other steps and parameters are the same as in any of Embodiments 1 to 4.

Embodiment 6: This embodiment differs from any of Embodiments 1 to 5 in how the pruned record graph is partitioned into blocks in Step 3; the specific process is:

The pruned graph is G = {R, E}, where R is the record set and E is the edge set;

Take any record r_i, create a block b_i and place r_i in b_i. If a node r_j in the point region of r_i has an edge to every node in b_i, place r_j in b_i and delete the edges between r_j and all nodes in b_i; repeat this until all neighbor nodes of r_i have been traversed. If a node in b_i has then become isolated in the graph, with no edge connected to it, delete that node from the graph. Repeat Step 3 until graph G is empty. Blocking is then complete, yielding the block set B = {b_1, b_2, ..., b_|B|};
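The blocking procedure can be sketched in Python as follows. The function name, the adjacency-set representation and the deterministic choice of the "arbitrary" record (the smallest id) are assumptions made so that the example is reproducible.

```python
def block_records(nodes, edges):
    """Partition a pruned record graph into (possibly overlapping) blocks.

    nodes: iterable of record ids
    edges: iterable of (u, v) pairs that survived pruning
    Returns a list of blocks, each a set of record ids.
    """
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    blocks = []
    while adj:
        r_i = min(adj)                      # "any record": smallest id (assumption)
        block = {r_i}
        for r_j in sorted(adj[r_i]):        # neighbours in the point region of r_i
            if all(r_j in adj[m] for m in block):
                for m in block:             # delete edges between r_j and the block
                    adj[m].discard(r_j)
                    adj[r_j].discard(m)
                block.add(r_j)
        for m in block:                     # drop block members that became isolated
            if not adj[m]:
                del adj[m]
        blocks.append(block)
    return blocks

blocks = block_records(
    ["r1", "r2", "r3", "r4", "r5"],
    [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r3", "r4"), ("r4", "r5")],
)
```

A record that still has edges after its block is formed stays in the graph, so it can appear in more than one block; in the example r3 and r4 each belong to two blocks.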

Record comparison and linking based on attribute mapping

After blocking, the records in a block contain many repeated attribute values, which helps in mapping attributes. By observing the characteristics of the data as a whole and applying semantic mapping, it can be found that the information in one or a few data records differs in some respects; such records can then be separated out to improve accuracy. Merging the information of matching records lets users take what they need and shortens the time spent processing record information.

Mapping of heterogeneous attributes

A common approach in entity resolution is to use unified attributes to calculate attribute value similarity. In a multi-source heterogeneous environment such as a data space, however, there is no accurate attribute mapping, so the attribute values can instead be used to match attributes.

Instantiation: an entity record consists of a set of attributes and values, and the value of an attribute may be misspelled or null. Null values cannot be compared. If an entity is described by an attribute whose value is not null, the attribute is said to instantiate the entity. The whole attribute semantic mapping process is shown in Figs. 4a, 4b and 4c.

Other steps and parameters are the same as in any of Embodiments 1 to 5.

Embodiment 7: This embodiment differs from any of Embodiments 1 to 6 in how the attribute mapping cluster is established in Step 4; the specific process is:

Step 4.1. For two attributes from different entities, calculate the similarity of the attributes:

1) Attribute name similarity (edit distance or another suitable string similarity function may be used), denoted S_L, obtained by comparing the two normalized attribute names (depending on the data set, normalization may unify letter case and expand abbreviations to full names);

Whether this part is included in the calculation can be decided according to how similar the attribute names in the data set are.

2) Attribute value similarity, denoted S_V: all values of the two attributes are compared and the highest similarity score is kept. Two attributes may take different values on different records; for example, if attribute att1 has 3 values over the record set and attribute att2 has 4, the similarity of the most similar pair of values of att1 and att2 is taken as the attribute value similarity of the two attributes.
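A minimal Python sketch of the two attribute similarities follows. It uses `difflib.SequenceMatcher` from the standard library as the string similarity function, which is a substitute chosen for illustration rather than the edit distance mentioned above; lower-casing as the only normalization and the function names are also assumptions.

```python
from difflib import SequenceMatcher
from itertools import product

def string_sim(a, b):
    """String similarity in [0, 1]; stand-in for an edit-distance based measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def name_similarity(name_a, name_b):
    """S_L: similarity of two normalized attribute names."""
    return string_sim(name_a, name_b)

def value_similarity(values_a, values_b):
    """S_V: highest similarity over all pairs of values of the two attributes."""
    return max((string_sim(x, y) for x, y in product(values_a, values_b)),
               default=0.0)

s_l = name_similarity("birthPlace", "Place_of_birth")
s_v = value_similarity(["L.A.", "LOS"], ["l.a.", "New York"])
```

Taking the maximum over all value pairs means two attributes are considered close as soon as one pair of their values agrees, which suits attributes that take different values on different records.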

基于属性名相似度和属性值相似度方法得到属性匹配对,从属性匹配对集合中,通过计算得到属性匹配集,属性匹配集的集合称为属性匹配集群。Attribute matching pairs are obtained based on attribute name similarity and attribute value similarity. From the set of attribute matching pairs, the attribute matching set is obtained through calculation. The set of attribute matching sets is called attribute matching cluster.

属性匹配集中的属性相互之间完全匹配。方法拒绝一个属性映射集包含许多松散相关的属性,且遵循广泛使用的无重复假设[6](Imen Megdiche,Oliver Teste,CassiaTrojahn.An extensible linear approach for holistic ontology matching[C].InISWC,Part I,vol.LNCS 9981.Springer,Kobe,Japan,2016:393-410P),此时限制一个名称空间(一个实体,或一个有语义限制的数据源)下的每个属性最多可以匹配另一个名称空间下的一个属性,设置全局1:1的匹配约束[7](Chuncheng Xiang,Baobao Chang,ZhifangSui.An ontology matching approach based on affinity-preserving random walks[C]//Proceeding IJCAI'15Proceedings of the 24th International Conference onArtificial Intelligence,2015:1471-1477P)。然而,全局1:1匹配约束下的属性映射集的推导不是一个简单的过程,因为一个属性经常涉及多于一个匹配的属性对,而简单地选择具有最高匹配概率估计的对可能会导致冲突。Attributes in an attribute match set match each other exactly. The method rejects that an attribute mapping set contains many loosely related attributes and follows the widely used no-repetition assumption [6] (Imen Megdiche, Oliver Teste, CassiaTrojahn. An extensible linear approach for holistic ontology matching [C].InISWC, Part I, vol.LNCS 9981.Springer, Kobe, Japan, 2016:393-410P), at this time, each attribute under a namespace (an entity, or a data source with semantic restrictions) can match at most another namespace An attribute of , setting a global 1:1 matching constraint [7] (Chuncheng Xiang, Baobao Chang, ZhifangSui.An ontology matching approach based on affinity-preserving random walks[C]//Proceeding IJCAI'15Proceedings of the 24th International Conference onArtificial Intelligence, 2015:1471-1477P). However, the derivation of attribute mapping sets under the global 1:1 matching constraint is not a straightforward process, since an attribute often involves more than one matching attribute pair, and simply selecting the pair with the highest matching probability estimate may lead to conflicts.

将得到整个属性映射集群的过程称为全局属性映射,它将一个属性匹配对集合作为一个输入,返回一个属性映射集群。设I为命名空间数,J为匹配的属性匹配对数,对于两个不同的命名空间ns、nt,ns、nt下的属性数记为Ms、Mt,ns下的第a个属性,nt下的第b个属性分别记为pa s、pb t,通过最大化所有匹配属性对的总匹配概率来优化整体属性匹配,以此满足全局1:1限制:The process of getting the entire cluster of attribute maps is called global attribute map, which takes a set of attribute matching pairs as an input and returns a cluster of attribute maps. Let I be the number of namespaces, J be the number of matching attribute pairs, for two different namespaces n s , n t , the number of attributes under n s , n t is denoted as M s , M t , under n s The a-th attribute and the b-th attribute under n t are denoted as p a s and p b t respectively, and the overall attribute matching is optimized by maximizing the total matching probability of all matching attribute pairs, so as to meet the global 1:1 restriction:

where σ(p_a^s, p_b^t) is the matching probability of the attribute pair p_a^s and p_b^t, and Θ(p_a^s, p_b^t) is an indicator function that equals 1 when the pair p_a^s, p_b^t is selected to form an attribute mapping set and 0 otherwise.

The procedure that forms the attribute mapping cluster from the set of matched attribute pairs is shown in Algorithm 1:

Step 4.2: Sort the set of matched attribute pairs in descending order of matching probability (line 2) and process the pairs in that order. For an attribute pair p_a, p_b:
(1) if the attribute mapping cluster N contains no mapping sets N_i, N_j that include p_a and p_b respectively, add the mapping set {p_a, p_b} to N (lines 6-7);
(2) if N contains a mapping set N_i that includes p_a, and that set already contains an attribute from the same namespace as p_b, discard the pair p_a ≈ p_b (lines 8-9). A data source is a namespace; if the data source of a record cannot be determined, the record itself is a namespace. An attribute from the same namespace as p_b is therefore an attribute that belongs to the same record as p_b;
(3) otherwise, that is, when neither of the two conditions above holds, merge N_i and N_j into a larger mapping set N_k, add N_k to N, and delete N_i and N_j from N (lines 11-12).
Repeat Step 4.2 until the set of matched attribute pairs is empty.
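The greedy procedure of Step 4.2 can be sketched in Python as follows. This is a minimal illustration, not the patent's Algorithm 1: the function name, the `(namespace, attribute)` tuple representation and the symmetric namespace check (any namespace shared between the two sets to be merged counts as a conflict) are assumptions made for the example.

```python
def global_attribute_mapping(pairs):
    """Greedy 1:1 attribute mapping.

    pairs: iterable of (p_a, p_b, probability), where each attribute is a
    hypothetical (namespace, attribute_name) tuple.
    Returns a list of attribute mapping sets (Python sets).
    """
    clusters = []

    def find(attr, current):
        # Return the mapping set that contains attr, or None.
        for mapping_set in current:
            if attr in mapping_set:
                return mapping_set
        return None

    # Process pairs in descending order of matching probability.
    for p_a, p_b, _prob in sorted(pairs, key=lambda t: t[2], reverse=True):
        set_a, set_b = find(p_a, clusters), find(p_b, clusters)
        if set_a is None and set_b is None:
            clusters.append({p_a, p_b})      # case (1): new mapping set
            continue
        if set_a is set_b:
            continue                         # already in the same mapping set
        left = set_a if set_a is not None else {p_a}
        right = set_b if set_b is not None else {p_b}
        # ASSUMPTION: a conflict is any namespace present on both sides.
        if {ns for ns, _ in left} & {ns for ns, _ in right}:
            continue                         # case (2): discard the pair
        # case (3): merge the two sets into a larger one.
        clusters = [c for c in clusters if c is not set_a and c is not set_b]
        clusters.append(left | right)
    return clusters
```

With three namespaces A, B and C, a lower-probability pair that would give a mapping set two attributes from namespace B is discarded, while compatible pairs are merged into one set.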

Other steps and parameters are the same as in any one of Embodiments 1 to 6.

Embodiment 8: This embodiment differs from any one of Embodiments 1 to 7 in that Step 5 calculates the goodness of the attribute mapping sets. The specific process is:

At this point the attribute mapping cluster has been obtained. All attributes within one mapping set are semantically mapped to one another, and the information they refer to is of one kind. Zhen Lingmin et al. [8] (ZHEN Lingmin, YANG Xiaochun, WANG Bin, et al. Entity Analysis Technology Based on Attribute Weight [J]. Computer Research and Development, 2013, 50 (Suppl.): 281-289) argue that assigning higher weights to important attributes, and possibly discarding unimportant ones, increases the accuracy and efficiency of entity resolution. The goodness of an attribute mapping set corresponds to this weight: it is used as the weight in the subsequent entity resolution to improve accuracy.

Goodness of an attribute mapping set (good()): the attribute names in a mapping set may differ, but they refer to one kind of information. The goodness of an attribute mapping set is the degree to which the attribute values of that kind of information help entity resolution.

The present invention calculates the goodness of an attribute mapping set from the following three aspects:

a. Discrimination:

If the values of an attribute mapping set vary widely and span a large range, the set is of little help to entity resolution; a mapping set whose values vary within a relatively small range is more useful. Let R be the record set. For an attribute mapping set N_i, a discrimination goodness function on R is defined, denoted discr(N_i):

where val(r, N_i) extracts the attribute value of record r on the attributes C_i of mapping set N_i, norm() normalizes attribute values from different sources, and ∪ denotes set union. The attributes in a mapping set come from different namespaces and are assumed, on the basis of the computed matching, to refer to the same kind of information. The mapping set is weighted to increase the accuracy of entity resolution, and discrimination is one way of calculating that weight: if the attribute values of the attributes in a mapping set differ greatly, the mapping set is considered less helpful and less likely to refer to one kind of information, so it receives a lower weight to avoid harming accuracy.

b. Richness:

The more values an attribute mapping set has, the richer the information it provides for entity resolution; that is, a record whose attribute on this mapping set has a non-empty value benefits entity resolution. Let R be the record set. For an attribute mapping set N_i, a richness goodness function on R is defined, denoted abund(N_i):

where Θ() is an indicator function that equals 1 if record r has a value on the mapping set N_i and 0 if it does not.
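The richness measure can be illustrated with a short Python sketch. The indicator Θ is taken from the text; normalizing the sum by the number of records |R|, and representing a record as a dictionary from attribute name to value, are assumptions made for this example only.

```python
def has_value(record, mapping_set):
    """Indicator Θ: True if the record has a non-empty value for any
    attribute of the mapping set."""
    return any(record.get(attr) not in (None, "") for attr in mapping_set)


def abund(records, mapping_set):
    """Richness of a mapping set over a record collection.

    ASSUMPTION: the sum of Θ over all records is divided by |R| so that the
    result lies in [0, 1].
    """
    if not records:
        return 0.0
    return sum(has_value(r, mapping_set) for r in records) / len(records)
```

For four records of which two carry a non-empty value on the mapping set, this sketch returns 0.5.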

Discrimination and richness are summed, the mapping sets are sorted by this sum in descending order, and diversity is then calculated in that order.

c. Diversity:

To increase diversity and reduce repetition between different attribute mapping sets, the goodness of a mapping set that carries redundant information can be reduced. Each time an attribute mapping set is selected, it is compared with the mapping sets already selected; if its information largely repeats theirs, its diversity is lowered. Let N_i be the current mapping set and N_selected the cluster of mapping sets already selected. Given N_selected, the diversity of N_i is denoted div(N_i | N_selected):

where S_i is the set of records instantiated by N_i (if the mapping set N_i has n attributes {att_1, att_2, ..., att_n} and record r contains one of them with a non-empty value, N_i is said to instantiate r); S_j is the set of records instantiated by N_j; S_v(v_x, v_y) is the similarity of v_x and v_y, computed with edit distance or another suitable string similarity function; v_x is the attribute value of record r under N_i and v_y is the attribute value of record r under N_j. div(N_i | N_j) measures, once N_j has been selected as a mapping set for calculating record similarity, how much the information of N_i repeats that of N_j when deciding whether to select N_i as well.

The goodness is combined in two steps. The first step merges discrimination and richness, which reflect the static goodness of an attribute mapping set, and sorts the mapping sets in descending order of static goodness. The second step incorporates diversity to update the goodness. Let N be the sorted attribute mapping cluster. For a mapping set N_i ∈ N, the overall goodness of N_i is good(N_i):

comb(N_i) = α·discr(N_i) + (1-α)·abund(N_i)

where α is the weight of the discrimination goodness, γ is the weight of the static goodness, 0 ≤ α, γ ≤ 1, and comb(N_i) is the static goodness of the mapping set, which merges discrimination and richness.
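A small Python sketch of the two-step combination follows. `comb` implements the formula given above. The form of `good`, a γ-weighted sum of static goodness and diversity, is only an assumption suggested by the description of γ as the weight of the static goodness; the function names and the input dictionary are hypothetical.

```python
def comb(discr, abund, alpha):
    """Static goodness: comb = alpha * discr + (1 - alpha) * abund."""
    return alpha * discr + (1 - alpha) * abund


def rank_by_static_goodness(scores, alpha):
    """Sort mapping-set names by static goodness, highest first.

    scores: dict mapping a name to a (discr, abund) tuple.
    """
    return sorted(scores, key=lambda name: comb(*scores[name], alpha),
                  reverse=True)


def good(comb_value, div_value, gamma):
    """Overall goodness.

    ASSUMPTION: gamma weights the static goodness and (1 - gamma) weights
    the diversity; this exact form is not stated in the text.
    """
    return gamma * comb_value + (1 - gamma) * div_value
```

Diversity would then be evaluated for the mapping sets in the order returned by `rank_by_static_goodness`, each one compared against those already selected.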

Other steps and parameters are the same as in any one of Embodiments 1 to 7.

Embodiment 9: This embodiment differs from any one of Embodiments 1 to 8 in that, in Step 6, once the goodness of each mapping set in the attribute mapping cluster has been obtained, entity resolution is performed within each block to exclude data records that were wrongly placed in the block and refer to other entities. The calculation is as follows:

Given a record pair r_i, r_j, where r_i has m attributes and r_j has n attributes, of which p attributes are mapped, p ≤ min(m, n), and the mapped attributes are {att_1, att_2, ..., att_p}, the similarity of r_i and r_j is:

where N_l is an attribute mapping set; sim_content(r_i.att_l, r_j.att_l) is the attribute value similarity of the mapped attribute in the two records r_i and r_j; and att_l is the attribute corresponding to a mapping set that maps an attribute of r_i to an attribute of r_j. The similarity of the two records is compared with a preset threshold λ, and the records are considered a match if it exceeds λ. Because record pairs are processed in order, a matching pair is merged into one new record that covers the information of the original pair. The calculation of sim_content() and the merging process are described in detail below.

sim_content(r_i.att_l, r_j.att_l) is computed as follows. Record information integration based on regular-expression-like expressions:

When records are merged, information can be merged with a method similar to regular expressions, because this kind of merging, like a regular expression, captures a pattern for a class of information. When a record pair is merged, the two values of a mapped attribute are combined into one regular-expression-like expression.

The concept of a regular-expression-like expression

Because the values of mapped attributes are similar, their identical parts can be unified and their differing parts retained, forming a regular-expression-like expression that merges the matched information effectively.

Regular-expression-like expression (Similar to Regular Expression, StRE): let Σ be an alphabet and ε the empty character. A regular-expression-like expression is StRE = S[1]S[2]...S[n], where for any i (1 ≤ i ≤ n) the element S[i] = {c_{i,1}, c_{i,2}, ..., c_{i,n_i}}, and for any j (1 ≤ j ≤ n_i), c_{i,j} ∈ Σ ∪ {ε}. Below, a regular-expression-like expression is simply called an expression.

Let R be an attribute value pair, S the expression derived from that pair, and G the set of instances of S; then the values of R belong to G. For example, the expression {t,n}ight has the instances tight and night.

The expression of a single attribute value is the value itself. An expression that merges two attribute values, or an attribute value and an expression, has to be derived by computation.

(1) Calculating the similarity of regular-expression-like expressions. The specific process is:

The expression of a single attribute value is the value itself, so the similarity of two such expressions can be calculated with edit distance. Let D(i, j) denote the edit distance function, where i and j are the lengths of the two strings a and b; the edit distance is computed by dynamic programming, which gives the edit distance similarity function sim_edit(a, b) between strings:

The similarity between an expression and an attribute value, or between two expressions, is calculated in a way similar to edit distance. Because an expression element may contain several characters or the empty character, the edit distance function is modified slightly according to two principles: (1) for multiple characters, two expression elements match if they contain a common character; (2) for the empty character, if the elements do not match, an element that contains the empty character takes the empty character, which occupies no length and does not affect the edit distance.

These principles give the minimum edit distance of two expressions. The edit distance D(|S_1|, |S_2|) of expressions S_1 and S_2 is:

where S[i] is the i-th element of an expression, S_1[i] is the i-th element of S_1 and S_2[j] is the j-th element of S_2, with the initial conditions:

The function MU corresponds to principle (1): two expression elements match as long as they contain a common character.

The function NU corresponds to principle (2): if an expression element contains the empty character, the element is ignored and contributes nothing to the edit distance, which keeps the edit distance minimal. The edit distance of the two expressions is then D(|S_1|, |S_2|).

The edit distance similarity between the expressions is then sim_edit(S_1, S_2):

Here sim_edit(S_1, S_2) is sim_content(), and S_1 and S_2 are the expressions of the two attribute values r_i.att_l and r_j.att_l in sim_content(r_i.att_l, r_j.att_l).

During comparison and merging, the attribute values of the records are treated as expressions; after the comparison is complete, the attribute information is generated from the expressions.
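The modified edit distance can be sketched in Python as below. The functions MU and NU follow principles (1) and (2); the exact recurrence, the representation of an expression as a list of character sets, and the normalization of the similarity by the longer expression length are assumptions, chosen so that the sketch agrees with the 1-2/7 value of the cute kid / cut kind example in this document.

```python
EPS = ""  # stands for the empty character ε


def to_expression(value):
    """Expression of a plain attribute value: one single-character element
    per character. Dropping whitespace is an assumption taken from the
    cutekid / cutkind example."""
    return [frozenset([ch]) for ch in value if not ch.isspace()]


def mu(e1, e2):
    """Principle (1): elements match (cost 0) if they share a character."""
    return 0 if e1 & e2 else 1


def nu(e):
    """Principle (2): an element containing ε can be skipped at cost 0."""
    return 0 if EPS in e else 1


def expression_distance(s1, s2):
    """Edit distance D(|S1|, |S2|) computed by dynamic programming."""
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        d[i][0] = d[i - 1][0] + nu(s1[i - 1])
    for j in range(1, len(s2) + 1):
        d[0][j] = d[0][j - 1] + nu(s2[j - 1])
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]),  # match/substitute
                d[i - 1][j] + nu(s1[i - 1]),                 # skip S1[i]
                d[i][j - 1] + nu(s2[j - 1]),                 # skip S2[j]
            )
    return d[len(s1)][len(s2)]


def sim_edit(s1, s2):
    """ASSUMPTION: similarity = 1 - distance / max(|S1|, |S2|)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - expression_distance(s1, s2) / longest
```

For plain attribute values the sketch reduces to the ordinary Levenshtein distance; for an expression such as cut{e,ε}ki{n,ε}d, any of its instances is at distance 0.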

(2) Generating regular-expression-like expressions

A new expression is generated from two expressions so as to introduce as few irrelevant instances as possible. For example, for the attribute value pair cute kid and cut kind, one possible expression is S_1 = cut{e,ε}ki{d,n}{d,ε}, which has 8 instances and introduces 6 irrelevant ones; another is S_2 = cut{e,ε}ki{n,ε}d, which has 4 instances and introduces 2 irrelevant ones. S_2 is therefore better than S_1.

The present invention uses the edit distance matrix M of the expressions: backtracking from M[|S_1|, |S_2|] to M[0, 0] yields the merged expression. The generation rules are:

In particular:

where k is a position in the backtracking, S[k] is the element at that position, and the backtracking ends when i = j = 0. Expressions S_1 and S_2 are then merged into the expression S = ...S[k]...S[0], with k in the reverse of the backtracking order. When several of the three generation conditions hold at the same time, the function MU takes priority over the function NU, which favors merging identical characters and reduces the number of irrelevant instances. NU(S_2[j]) is the function applied to the j-th element of S_2, NU(S_1[i]) the function applied to the i-th element of S_1, and MU(S_1[i], S_2[j]) the function applied to the i-th element of S_1 and the j-th element of S_2.
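The instance counts in this example can be checked by enumerating every instance of an expression. The sketch below is illustrative only; the list-of-sets representation and the helper names are assumptions, and whitespace is left out of the values as in the cutekid / cutkind form of the example.

```python
from itertools import product

EPS = ""  # stands for the empty character ε


def lit(text):
    """Expression elements for a run of fixed characters."""
    return [{ch} for ch in text]


def instances(expression):
    """All distinct strings that an expression can be instantiated to."""
    return {"".join(choice) for choice in product(*expression)}


s1 = lit("cut") + [{"e", EPS}] + lit("ki") + [{"d", "n"}, {"d", EPS}]
s2 = lit("cut") + [{"e", EPS}] + lit("ki") + [{"n", EPS}] + lit("d")
originals = {"cutekid", "cutkind"}

irrelevant_s1 = instances(s1) - originals   # instances other than the two inputs
irrelevant_s2 = instances(s2) - originals
```

Enumerating both expressions gives 8 and 4 instances respectively, of which 6 and 2 are not one of the two original values.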

An example illustrates the similarity calculation and the generation of expressions. Let the attribute values be attvalue_1 = "cute kid" and attvalue_2 = "cut kind", whose expressions are S_1 = cutekid and S_2 = cutkind. The distance matrix obtained from the edit distance formula is shown in Table 1.

Table 1. Similarity calculation and generation matrix of the expressions

Each value M[i, j] in the matrix is the edit distance between the first i elements of S_1 and the first j elements of S_2. Because each expression is here the attribute value itself, the edit distance of the expressions equals the edit distance of the attribute values. The edit distance between S_1 and S_2 is M[7, 7] = 2, and the expression similarity is 1 - 2/7 = 5/7. Suppose the similarity of the two records exceeds the threshold and they are confirmed as a match; the two attribute values then have to be merged. Following the generation rules, backtracking starts from M[7, 7] in the matrix M. M[7, 7] = M[6, 6] + MU(S_1[6], S_2[6]), so the backtracking moves to M[6, 6], and so on. The backtracking path is marked in bold in Table 1. The resulting expression is cut{e,ε}ki{ε,n}d.
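The backtracking merge of this example can be reproduced with the following Python sketch. The merge rules used here are assumptions reconstructed from the description: a diagonal step unions the two elements, a step that skips one element adds ε to that element, and the diagonal (MU) move is tried first. The recurrence and all names are hypothetical.

```python
EPS = ""  # stands for the empty character ε


def to_expression(value):
    # One single-character element per non-space character (assumption).
    return [frozenset([ch]) for ch in value if not ch.isspace()]


def mu(e1, e2):
    return 0 if e1 & e2 else 1


def nu(e):
    return 0 if EPS in e else 1


def distance_matrix(s1, s2):
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        m[i][0] = m[i - 1][0] + nu(s1[i - 1])
    for j in range(1, len(s2) + 1):
        m[0][j] = m[0][j - 1] + nu(s2[j - 1])
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            m[i][j] = min(m[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]),
                          m[i - 1][j] + nu(s1[i - 1]),
                          m[i][j - 1] + nu(s2[j - 1]))
    return m


def merge(s1, s2):
    """Backtrack from M[|S1|,|S2|] to M[0,0], preferring the MU move."""
    m = distance_matrix(s1, s2)
    i, j = len(s1), len(s2)
    merged = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and m[i][j] == m[i - 1][j - 1] + mu(s1[i - 1], s2[j - 1]):
            merged.append(s1[i - 1] | s2[j - 1])   # keep both alternatives
            i, j = i - 1, j - 1
        elif i > 0 and m[i][j] == m[i - 1][j] + nu(s1[i - 1]):
            merged.append(s1[i - 1] | {EPS})       # element only in S1
            i -= 1
        else:
            merged.append(s2[j - 1] | {EPS})       # element only in S2
            j -= 1
    merged.reverse()
    return merged


result = merge(to_expression("cute kid"), to_expression("cut kind"))
```

Under these assumed rules, `result` has eight elements corresponding to c, u, t, {e,ε}, k, i, {n,ε}, d, that is, the expression obtained in the worked example.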

(3) Generating entity information

After a block has been processed, the attribute values of each attribute mapping set have gone through successive record comparisons and information merges, which yields a final expression for the mapping set. The frequency of each character can be recorded while the expression is generated. Using this expression with frequencies, the character with the highest frequency is selected in each element as the value of that element, and the resulting string of most frequent characters is taken as the attribute value of the mapping set. For example, the three attribute values Mike Doe, M.Doe and Mike D. give the expression M{i:2,.:1}{k:2,ε:1}{e:2,ε:1}D{o:2,.:1}{e:2,ε:1}, and the attribute value Mike Doe is obtained from the frequencies.

Characters that are wrong because of typesetting, spelling or similar errors occur less often than the correctly spelled characters, so choosing the most frequent character in each element of such an expression helps filter out noise.

Other steps and parameters are the same as in any one of Embodiments 1 to 8.
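The frequency-based selection can be sketched as follows. Each element of the expression is represented as a dictionary from character to frequency, which is an assumed data structure; the counts of 3 for the fixed characters M and D are likewise an assumption, since the expression in the text shows them without a frequency. The expression in the text contains no element for the space, so the sketch yields the value without it.

```python
EPS = ""  # stands for the empty character ε


def most_frequent_value(counted_expression):
    """Pick the most frequent character of every element and join them.

    counted_expression: list of dicts {character: frequency}. On a tie, the
    character listed first in the dict wins (behaviour of max()).
    """
    return "".join(max(element, key=element.get) for element in counted_expression)


mike = [
    {"M": 3},
    {"i": 2, ".": 1},
    {"k": 2, EPS: 1},
    {"e": 2, EPS: 1},
    {"D": 3},
    {"o": 2, ".": 1},
    {"e": 2, EPS: 1},
]
value = most_frequent_value(mike)  # "MikeDoe"
```

An element whose most frequent character is ε contributes nothing to the generated value, which is how rarely occurring extra characters are filtered out.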

The following example is used to verify the beneficial effects of the present invention:

Example 1:

This example is carried out according to the following steps:

Experimental datasets

The blocking experiments use two datasets, referred to as D_1 and D_2. D_1 is extracted from the movie information shared by DBPedia and IMDB. D_2 is extracted from a dataset composed of two versions of DBPedia [9] (George Papadakis, Jonathan Svirsky, Avigdor Gal, Themis Palpanas. Comparative analysis of approximate blocking techniques for entity resolution [J]. Proceedings of the VLDB Endowment, 2016, 9(9): 684-695). D_1 has 50796 records covering 22403 entities, and D_2 has 335479 records covering 89258 entities.

Experimental evaluation criteria

The evaluation criteria of [10] are used (Chirag Nagpal, Kyle Miller, Benedikt Boecking, Artur Dubrawski. An Entity Resolution approach to isolate instances of Human Trafficking online [J]. Computer Science, 2017, 3(18): 10-18):

Blocking criteria: pair completeness (PC), reduction ratio (RR) and F value (F = 2 × PC × RR / (PC + RR));

Resolution criteria: precision (P), recall (R) and F value (F = 2 × P × R / (P + R)).
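The evaluation measures can be written out as a short Python sketch. The F value follows the formulas above. The definitions of PC and RR are the ones commonly used in the blocking literature and are stated here as an assumption, since the text does not define them; all function and parameter names are hypothetical.

```python
def f_value(a, b):
    """Harmonic mean used for both F = 2*PC*RR/(PC+RR) and F = 2*P*R/(P+R)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def pair_completeness(matching_pairs_kept, matching_pairs_total):
    """ASSUMPTION: share of true matching pairs that survive blocking."""
    return matching_pairs_kept / matching_pairs_total


def reduction_ratio(comparisons_after, comparisons_before):
    """ASSUMPTION: share of candidate comparisons removed by blocking."""
    return 1 - comparisons_after / comparisons_before


pc = pair_completeness(matching_pairs_kept=97, matching_pairs_total=100)
rr = reduction_ratio(comparisons_after=400, comparisons_before=1000)
f = f_value(pc, rr)
```

The numbers in the usage lines are made up for illustration and are not experimental results from the text.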

Experimental analysis of the blocking method

The present invention describes four pruning methods for the blocking process. The parameter α used in calculating record similarity is set to 0.6, and pruning is then performed as follows. For the edge-centric weight-threshold pruning scheme (ECWP), the F value is highest when the threshold w_min is 0.6 and 0.5 on the two datasets respectively. For the node-centric weight-threshold pruning scheme (NCWP), the experiment assumes a uniform distribution and uses a single weight threshold, so the procedure and the thresholds are the same as for ECWP. For the edge-centric cardinality pruning scheme (ECCP), k edges are retained in total, where k varies with the total number of edges |E|; with k = 0.5·|E| the F value is highest on both datasets. For the node-centric cardinality pruning scheme (NCCP), k_{r_i} of the edges connected to each node r_i are retained; still under the uniform distribution assumption, the F value is highest with k_{r_i} = 0.5·|E_{r_i}|. The four pruning methods are compared at their best-performing thresholds in Table 2.

Table 2. PC and RR values of the pruning methods on the two datasets

The results show that the edge-centric pruning methods remove more useless record pairs and emphasize efficiency, which suits large-scale entity resolution tasks, especially when few matching entities are expected, whereas the node-centric pruning methods retain more matching record pairs. The edge-centric algorithms keep the edges with the top-k weights or with weights above the preset threshold, while the node-centric algorithms ensure that every node stays connected to its most similar records, which suits applications that emphasize accuracy. The weight-threshold algorithms keep record pairs with high similarity and so preserve accuracy; the cardinality algorithms bound the number of record pairs to compare by keeping only the top-k edges, which affects accuracy but guarantees efficiency.

Overall, accuracy stays above 97% and the reduction ratio above 60%, which shows that pruning the block graph discards a number of useless record pairs and thereby safeguards the efficiency of the algorithm to some extent.
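The pruning schemes compared above can be sketched in Python as follows. The record graph is represented as a dictionary from an edge (a pair of record identifiers) to its weight, which is an assumption, as are the function names. NCWP is not shown separately because, with the single weight threshold used in the experiments, it coincides with ECWP.

```python
from collections import defaultdict


def ecwp(edges, w_min):
    """Edge-centric weight pruning: drop edges whose weight is below w_min."""
    return {edge: w for edge, w in edges.items() if w >= w_min}


def eccp(edges, k):
    """Edge-centric cardinality pruning: keep the k heaviest edges overall."""
    ranked = sorted(edges.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def nccp(edges, k):
    """Node-centric cardinality pruning: keep the top-k edges of every node."""
    incident = defaultdict(list)
    for (u, v), w in edges.items():
        incident[u].append(((u, v), w))
        incident[v].append(((u, v), w))
    kept = {}
    for node_edges in incident.values():
        node_edges.sort(key=lambda item: item[1], reverse=True)
        kept.update(node_edges[:k])   # an edge is kept if either endpoint keeps it
    return kept


graph = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.7, ("c", "d"): 0.2}
```

On this toy graph, `ecwp(graph, 0.5)` and `eccp(graph, 2)` both keep only the edges a-b and b-c, while `nccp(graph, 1)` also keeps c-d because it is the best edge of node d, which illustrates why the node-centric schemes retain more record pairs.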

Experimental analysis of attribute mapping and expressions

When calculating the mapping goodness of the attribute cluster, the experiment sets α = 0.5 and λ = 0.4. The shortest-distance method is chosen for comparison. Two stages are evaluated: first, the result of entity resolution; second, the generation of real entity information.

For the entity resolution results, the precision and recall of the two methods are shown in Figures 5a and 5b. As the threshold rises, recall falls and precision rises sharply. On both datasets the method of this paper reaches a high F value at a threshold of 0.6. The recall of the two methods differs little on both datasets, with the attribute-mapping and expression-based processing of this paper giving slightly better recall, while its precision is clearly higher than that of the shortest-distance method. This indicates that the method is better suited to the characteristics of records in a data space. The shortest-distance method depends more on the semantics of the data and can play a larger role in settings with strong semantics.

For the entity information generation stage, most entity resolution work focuses on the resolution result and pays little attention to merging entity information and generating entities. Using dataset D_1, the numbers of generated entities and of actual entities are shown in Figure 6. Entities are generated with the best threshold found during entity resolution. The figure shows that the resolution results of this paper are more accurate and that the number of generated entities is closer to the number of real entities; its accuracy remains clearly higher than that of the shortest-distance method.

The present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the scope of protection of the appended claims.

Claims (9)

1. An entity resolution method for data spaces, characterized in that the method comprises the following steps:
Step 1: construct a record graph;
Step 2: simplify the record graph using a pruning method;
Step 3: partition the pruned record graph into blocks;
Step 4: establish the attribute mapping cluster;
Step 5: calculate the goodness of the attribute mapping sets;
Step 6: after the goodness of each mapping set in the attribute mapping cluster has been obtained, perform entity resolution within each block.
2. The entity resolution method for data spaces according to claim 1, characterized in that the record graph is constructed in Step 1 as follows:
Step 1.1: calculate the similarity between two records;
Step 1.2: construct the record graph from the records of the data space and their similarities.
3. The entity resolution method for data spaces according to claim 2, characterized in that the similarity between two records is calculated in step 1.1 as follows:
Step 1.1.1: calculate the tag similarity:
Each record is converted into a tag set by the tag conversion function tag(), and the tag similarity of the two records is calculated, denoted sim_tag(r_i, r_j):
where T(r_i) is the normalized tag set into which record r_i is converted by the tag conversion function, and T(r_j) is the normalized tag set into which record r_j is converted by the tag conversion function;
Step 1.1.2: calculate the relationship similarity:
The similarities of the two records over all relationships in which they participate are combined into a single relationship similarity, denoted sim_rel(r_i, r_j):
where Nbr(r_i) denotes the set of records connected to record r_i in relationship rel, Nbr(r_j) denotes the set of records connected to record r_j in relationship rel, and REL denotes the set of all relationships in which records r_i and r_j appear;
Step 1.1.3: combine the tag similarity and the relationship similarity to obtain the combined similarity sim(r_i, r_j):
sim(r_i, r_j) = α·sim_tag(r_i, r_j) + (1 - α)·sim_rel(r_i, r_j)
where α denotes the weight of the tag similarity.
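For illustration only, and not as part of the claim language, the weighted combination in step 1.1.3 can be sketched in Python. The formula for sim_tag is not reproduced in this text, so the sketch assumes a Jaccard overlap of the two tag sets. The function names and the restriction of α to [0, 1] are likewise assumptions.

```python
def tag_similarity(tags_i: set, tags_j: set) -> float:
    """ASSUMED form of sim_tag: Jaccard overlap of the tag sets T(r_i), T(r_j)."""
    union = tags_i | tags_j
    return len(tags_i & tags_j) / len(union) if union else 0.0


def combined_similarity(sim_tag: float, sim_rel: float, alpha: float = 0.5) -> float:
    """sim = alpha * sim_tag + (1 - alpha) * sim_rel, as stated in step 1.1.3."""
    if not 0.0 <= alpha <= 1.0:  # range check is an assumption of this sketch
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_tag + (1.0 - alpha) * sim_rel
```

A larger α lets shared tags dominate the score, while a smaller α favors records that share neighbors in their relationships.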
4. The entity resolution method for data spaces according to claim 3, characterized in that the record graph is constructed in step 1.2 from the records of the data space and their similarities as follows:
Given the record set R of a data space, an undirected graph G = (R, E) is constructed, referred to as the record graph;
where R is the record set, representing the records in the data space, and E is the edge set, in which an edge between two records represents the similarity of that record pair.
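A minimal sketch of such a record graph, assuming records are addressed by their list index and edges are stored as a dictionary from index pairs to weights. The helper name and the choice to omit zero-similarity pairs are assumptions of this sketch, not taken from the claim.

```python
from itertools import combinations


def build_record_graph(records, sim):
    """Return {(i, j): weight} with i < j for every record pair whose similarity is > 0."""
    edges = {}
    for i, j in combinations(range(len(records)), 2):
        weight = sim(records[i], records[j])
        if weight > 0:  # assumption: pairs with zero similarity get no edge
            edges[(i, j)] = weight
    return edges
```

Comparing every pair is quadratic in the number of records, which is why the pruning and blocking steps that follow matter in practice.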
5. The entity resolution method for data spaces according to claim 4, characterized in that the record graph is simplified in step 2 using a pruning method as follows:
The pruning method is one of edge-centric cardinality pruning, node-centric cardinality pruning, edge-centric threshold pruning, or node-centric threshold pruning;
(1) Edge-centric cardinality pruning:
A global cardinality threshold k specifies the total number of edges the record graph retains; the k edges with the largest weights are kept;
(2) Node-centric cardinality pruning:
For each node r_i, the top-k weighted edges incident to r_i are retained;
(3) Edge-centric threshold pruning:
Pruning is performed globally using a weight threshold: a minimum edge weight w_min is chosen, all edges in the graph are traversed, and every edge whose weight is below w_min is deleted;
(4) Node-centric threshold pruning:
A single unified threshold is applied to all nodes of the record graph; the pruning process is the same as in the edge-centric threshold pruning scheme.
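The pruning schemes above can be sketched as follows, assuming the graph is a dictionary from (u, v) pairs to weights. The function names are hypothetical. In the node-centric variant, keeping an edge when it is in the top-k of at least one endpoint is an assumption, since the claim does not say how the two endpoints' choices are reconciled. Because the claim describes scheme (4) as following the same process as scheme (3), one threshold function covers both.

```python
import heapq
from collections import defaultdict


def edge_cardinality_prune(edges: dict, k: int) -> dict:
    """Keep the k heaviest edges of the whole graph."""
    return dict(heapq.nlargest(k, edges.items(), key=lambda kv: kv[1]))


def node_cardinality_prune(edges: dict, k: int) -> dict:
    """Keep an edge if it is among the top-k edges of at least one endpoint (assumed rule)."""
    incident = defaultdict(list)
    for (u, v), w in edges.items():
        incident[u].append(((u, v), w))
        incident[v].append(((u, v), w))
    kept = {}
    for items in incident.values():
        for edge, w in heapq.nlargest(k, items, key=lambda kv: kv[1]):
            kept[edge] = w
    return kept


def threshold_prune(edges: dict, w_min: float) -> dict:
    """Delete every edge whose weight is below w_min."""
    return {edge: w for edge, w in edges.items() if w >= w_min}
```

Cardinality pruning bounds the size of the graph directly, whereas threshold pruning bounds the quality of the edges that survive.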
6. The entity resolution method for data spaces according to claim 5, characterized in that the pruned record graph is partitioned into blocks in step 3 as follows:
The pruned graph is G = (R, E), where R is the record set and E is the edge set;
A record r_i is chosen arbitrarily, a block b_i is created, and r_i is placed in b_i. If a node r_j in the neighborhood of r_i is connected by an edge to every node in b_i, then r_j is placed in b_i and the edges connecting r_j to the nodes in b_i are deleted; this operation is repeated until all neighbors of r_i have been traversed. At this point, if a node in b_i has become an isolated node of the graph, with no edge attached to it, that node is deleted from the graph. Step 3 is repeated until the graph G is empty, at which point the blocking is complete and the block set B = {b_1, b_2, ..., b_|B|} is obtained.
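One way to read the blocking procedure is as a greedy, clique-style sweep over the pruned graph. The sketch below follows that reading. The function name, the adjacency-set representation, and visiting neighbors in sorted order (which requires sortable record ids) are assumptions made for determinism.

```python
def build_blocks(nodes, edges):
    """Greedy clique-style blocking; nodes are sortable ids, edges are (u, v) pairs."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    blocks = []
    while adj:
        seed = next(iter(adj))            # "a record r_i is chosen arbitrarily"
        block = [seed]
        for cand in sorted(adj[seed]):    # snapshot of the seed's neighbors
            if all(cand in adj[m] for m in block):
                for m in block:           # delete the edges absorbed by the block
                    adj[m].discard(cand)
                    adj[cand].discard(m)
                block.append(cand)
        for m in block:                   # drop block members that became isolated
            if not adj[m]:
                del adj[m]
        blocks.append(block)
    return blocks
```

Under this reading a record that still has edges after its block is closed stays in the graph, so blocks may overlap, and a record with no edges ends up in a block of its own.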
7. The entity resolution method for data spaces according to claim 6, characterized in that the attribute mapping cluster is established in step 4 as follows:
Step 4.1: for two attributes from different entities, calculate the similarity value of the attributes:
The similarity value of the attributes is denoted SV; all values of the two attributes are compared and the highest similarity score is retained, yielding the attribute matching pairs;
Let I be the number of namespaces and J the number of matched attribute pairs. For two different namespaces n_s and n_t, the numbers of attributes under n_s and n_t are denoted M_s and M_t, and the a-th attribute under n_s and the b-th attribute under n_t are denoted p_a^s and p_b^t respectively. The overall attribute matching is optimized by maximizing the total matching probability of all attribute pairs, thereby satisfying the global 1:1 constraint:
where σ(p_a^s, p_b^t) is the matching probability of the attribute pair p_a^s and p_b^t, and Θ(p_a^s, p_b^t) is an indicator function whose value is 1 when the attribute pair p_a^s, p_b^t is selected to form an attribute mapping set and 0 otherwise;
Step 4.2: the attribute matching pairs are sorted in descending order of matching probability and processed sequentially. For an attribute pair p_a, p_b: if the attribute mapping cluster N contains no attribute mapping sets N_i, N_j that include p_a and p_b respectively, the attribute mapping set {p_a, p_b} is added to N; if N contains an attribute mapping set N_i that includes p_a, and the attribute mapping cluster N contains an attribute from the same namespace as p_b, the attribute pair p_a ≈ p_b is deleted; otherwise, N_i and N_j are merged into a larger attribute mapping set N_k, N_k is added to N, and N_i and N_j are deleted from N. Step 4.2 is repeated until no matched attribute pair remains (J is empty).
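The sequential merging in step 4.2 can be sketched as below. Attributes are modeled as (namespace, name) tuples, which is an assumption of this sketch. The conflict rule is also an interpretation: a pair is dropped when accepting it would place two attributes of the same namespace in one mapping set, which is one way to honor the 1:1 constraint. The function name and the sample namespaces in the usage note are hypothetical.

```python
def build_mapping_cluster(pairs):
    """pairs: iterable of (probability, attr_a, attr_b); attr = (namespace, name)."""
    cluster = []  # each element is one attribute mapping set
    for _, a, b in sorted(pairs, key=lambda p: p[0], reverse=True):
        set_a = next((s for s in cluster if a in s), None)
        set_b = next((s for s in cluster if b in s), None)
        if set_a is None and set_b is None:
            cluster.append({a, b})        # neither attribute is mapped yet
            continue
        if set_a is set_b:
            continue                      # the pair is already in one mapping set
        merged = (set_a or {a}) | (set_b or {b})
        namespaces = [ns for ns, _ in merged]
        if len(namespaces) != len(set(namespaces)):
            continue                      # assumed 1:1 rule: drop the conflicting pair
        cluster[:] = [s for s in cluster if s is not set_a and s is not set_b]
        cluster.append(merged)
    return cluster
```

For example, given the pairs (imdb.title, dbp.name) at 0.9, (dbp.name, fb.label) at 0.8, and (imdb.title, fb.caption) at 0.7, the first two are merged into one mapping set and the third is dropped, because that set already holds an attribute from the fb namespace.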
8. The entity resolution method for data spaces according to claim 7, characterized in that the goodness of each attribute mapping set is calculated in step 5 as follows:
The goodness of an attribute mapping set is calculated from the following three aspects:
A. Discrimination:
Let R be the record set. For an attribute mapping set N_i, a discrimination goodness function is defined on R, denoted discr(N_i):
where val(r, N_i) extracts the attribute value of record r on attribute C_i in the attribute mapping set, norm() normalizes attribute values from different sources, and ∪ is the union symbol;
B. Richness:
Let R be the record set. For an attribute mapping set N_i, a richness goodness function is defined on R, denoted abund(N_i):
In this case, Θ(·) denotes a function whose value is 1 if record r has a value on attribute mapping set N_i and 0 if it has no value;
Discrimination and richness are summed and the mapping sets are sorted by this sum in descending order; diversity is then calculated in that order;
C. Diversity:
Let N_i be the current mapping set and N_selected the mapping cluster already selected. Given N_selected, the diversity of N_i is denoted div(N_i | N_selected):
where S_i denotes the set of records instantiated by attribute mapping set N_i, S_j denotes the set of records instantiated by attribute mapping set N_j, Sv(v_x, v_y) is the similarity of v_x and v_y, v_x is the attribute value of record r under N_i, and v_y is the attribute value of record r under N_j. div(N_i | N_j) expresses, once N_j has been chosen as a mapping set for calculating record similarity, how diverse the information in N_i is relative to the information in N_j when deciding whether to also choose N_i for calculating record similarity;
Let N be the sorted attribute mapping cluster. For a mapping set N_i ∈ N, the overall goodness of N_i is good(N_i):
comb(N_i) = α·discr(N_i) + (1 - α)·abund(N_i)
where α is the weight of the discrimination goodness, γ is the weight of the static goodness, and 0 ≤ α, γ ≤ 1; comb(N_i) is the static attribute set goodness, which integrates discrimination and richness.
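A small sketch of how the goodness terms might be combined. The formula for comb(N_i) is taken from the claim. The formula for good(N_i) is not reproduced in this text, so the sketch assumes the weighted form γ·comb + (1 - γ)·div, based only on the statement that γ is the weight of the static goodness. All function names are hypothetical, and the discrimination, richness, and diversity values are treated as precomputed inputs.

```python
def comb(discr: float, abund: float, alpha: float) -> float:
    """Static goodness from the claim: alpha * discr + (1 - alpha) * abund."""
    return alpha * discr + (1.0 - alpha) * abund


def good(comb_value: float, div: float, gamma: float) -> float:
    """ASSUMED overall goodness: gamma * comb + (1 - gamma) * div."""
    return gamma * comb_value + (1.0 - gamma) * div


def rank_by_static_goodness(scores: dict) -> list:
    """Order mapping-set names by discr + abund, largest first.

    scores maps a mapping-set name to a (discr, abund) tuple.
    """
    return sorted(scores, key=lambda name: scores[name][0] + scores[name][1], reverse=True)
```

Ranking first and computing diversity afterwards mirrors the order given in the claim: diversity is defined relative to the mapping sets already selected, so it can only be evaluated once an ordering exists.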
9. The entity resolution method for data spaces according to claim 8, characterized in that, after the goodness of each mapping set in the attribute mapping cluster has been obtained in step 6, entity resolution is performed within each block; the calculation proceeds as follows:
Given a record pair r_i, r_j, where r_i has m attributes and r_j has n attributes, of which p attributes are mapped, with p ≤ min(m, n), and the mapped attributes being {att_1, att_2, ..., att_p}, the similarity of r_i and r_j is:
where N_l is an attribute mapping set, sim_content(r_i.att_l, r_j.att_l) is the similarity of the values of the mapped attribute in the two records r_i and r_j, and att_l is the attribute corresponding to an attribute mapping set;
The similarity of the two records is compared with a preset threshold λ; if it is greater than λ, the records are considered a match. Because processing is sequential, when a record pair matches, the pair is merged into a new record that covers the information of the original record pair.
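The in-block matching can be sketched as follows. The similarity formula of this claim is not reproduced in this text, so the sketch assumes a goodness-weighted mean of the per-attribute content similarities. Records are modeled as dictionaries keyed by mapped attribute, and on a match the earlier record's values win any conflict; both choices, like the function names, are assumptions.

```python
def record_similarity(r_i, r_j, goodness, content_sim):
    """ASSUMED form: goodness-weighted mean of per-attribute content similarities."""
    shared = [a for a in goodness if a in r_i and a in r_j]
    total = sum(goodness[a] for a in shared)
    if total == 0:
        return 0.0
    return sum(goodness[a] * content_sim(r_i[a], r_j[a]) for a in shared) / total


def resolve_block(block, goodness, content_sim, lam):
    """Sequentially merge records whose similarity exceeds the threshold lam."""
    resolved = []
    for rec in block:
        for k, seen in enumerate(resolved):
            if record_similarity(seen, rec, goodness, content_sim) > lam:
                resolved[k] = {**rec, **seen}  # merged record covers both originals
                break
        else:
            resolved.append(dict(rec))         # no match found: keep as a new entity
    return resolved
```

Because each merged record replaces its predecessor before the next comparison, later records are compared against the accumulated information rather than against the original pair.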
CN201910435269.5A 2019-05-23 2019-05-23 Entity parsing method for data space in movie information dataset Expired - Fee Related CN110147393B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910435269.5A CN110147393B (en) 2019-05-23 2019-05-23 Entity parsing method for data space in movie information dataset

Publications (2)

Publication Number Publication Date
CN110147393A (en) 2019-08-20
CN110147393B (en) 2021-08-13

Family

ID=67593022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910435269.5A Expired - Fee Related CN110147393B (en) 2019-05-23 2019-05-23 Entity parsing method for data space in movie information dataset

Country Status (1)

Country Link
CN (1) CN110147393B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110145286A1 (en) * 2009-12-15 2011-06-16 Chalklabs, Llc Distributed platform for network analysis
CN102646137A (en) * 2012-04-19 2012-08-22 中国人民解放军总参谋部第六十三研究所 Automatic entity basic information generation system and method based on Markov model
CN106202502A (en) * 2016-07-20 2016-12-07 福州大学 In music information network, user interest finds method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
甄灵敏等: "基于属性权重的实体解析技术", 《计算机研究与发展》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111221995A (en) * 2019-10-10 2020-06-02 南昌市微轲联信息技术有限公司 Sequence matching method based on big data and set theory
CN111221995B (en) * 2019-10-10 2023-10-03 南昌市微轲联信息技术有限公司 Sequence matching method based on big data and set theory

Also Published As

Publication number Publication date
CN110147393B (en) 2021-08-13

Similar Documents

Publication Publication Date Title
Zheng et al. Efficient simrank-based similarity join over large graphs
CN114218400A (en) Semantic-based data lake query system and method
CN107480694B (en) Weighting selection integration three-branch clustering method adopting two-time evaluation based on Spark platform
Megdiche et al. An extensible linear approach for holistic ontology matching
CN104809244B (en) Data digging method and device under a kind of big data environment
CN113535788A (en) A retrieval method, system, device and medium for marine environmental data
CN119129722B (en) Method for constructing knowledge graph based on large language model and vector library
CN109033303A (en) A kind of extensive knowledge mapping fusion method based on reduction anchor point
WO2015051481A1 (en) Determining collection membership in a data graph
US11947596B2 (en) Index machine
Singh et al. Probabilistic data structure-based community detection and storage scheme in online social networks
Essayeh et al. Towards ontology matching based system through terminological, structural and semantic level
CN104156431B (en) A kind of RDF keyword query methods based on sterogram community structure
Drakopoulos et al. A semantically annotated JSON metadata structure for open linked cultural data in Neo4j
Zheng et al. Efficient simrank-based similarity join
CN110147393A (en) The entity resolution method in data-oriented space
Fu et al. IbLT: An effective granular computing framework for hierarchical community detection
CN106933844B (en) Construction method of reachability query index facing large-scale RDF data
Flouris et al. Issues in complex event processing systems
CN116702788A (en) Unsupervised social event detection method based on increment and hierarchical structure entropy minimization
Xu Deep mining method for high-dimensional big data based on association rule
CN115438789A (en) Multi-space semantic data stream reasoning method based on rule embedded representation
Ran et al. Machine learning informed by micro-and mesoscopic statistical physics methods for community detection
Li et al. A novel approach for mining probabilistic frequent itemsets over uncertain data streams
Bagui et al. A review of data mining algorithms on Hadoop's MapReduce

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210813