CN116361476A - Interpolation-based negative sample synthesis method for knowledge graphs - Google Patents

Interpolation-based negative sample synthesis method for knowledge graphs

Info

Publication number
CN116361476A
Authority
CN
China
Prior art keywords
negative
sample
cand
samples
negative sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211455256.2A
Other languages
English (en)
Other versions
CN116361476B (zh)
Inventor
谢禹舜
顾钊铨
方滨兴
张小松
王乐
牛伟纳
韩伟红
李树栋
张登辉
谭润楠
龙宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Higher Research Institute Of University Of Electronic Science And Technology Shenzhen
University of Electronic Science and Technology of China
Guangzhou University
Original Assignee
Higher Research Institute Of University Of Electronic Science And Technology Shenzhen
University of Electronic Science and Technology of China
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Higher Research Institute Of University Of Electronic Science And Technology Shenzhen, University of Electronic Science and Technology of China, Guangzhou University
Priority to CN202211455256.2A
Publication of CN116361476A
Application granted
Publication of CN116361476B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

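The invention discloses an interpolation-based negative sample synthesis method for knowledge graphs, comprising the following steps. S1, candidate set screening: screen a negative sample set cand_il from the negative samples to serve as the candidate set for the mixup operation. S2, mixup sample synthesis: select negative samples from cand_il and synthesize them to obtain cand_im, then apply a second mixup synthesis to the negative samples in cand_im and the positive sample
Figure DDA0003953290660000011
to obtain cand_ik. S3, training update: screen the resulting negative sample sets cand_il, cand_im, and cand_ik again to obtain cand_is, and use it for model training and for updating the strong negative sample set
Figure DDA0003953290660000012
The invention is easy to implement and fast to compute, and does not increase the complexity of the original embedding model. It increases the diversity of virtual negative samples, improves the performance of knowledge graph embedding models, and is easy to layer on top of existing knowledge graph embedding models.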

Description

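Interpolation-based negative sample synthesis method for knowledge graphs
Technical Field
The invention belongs to the field of knowledge graph embedding, and specifically relates to an interpolation-based negative sample synthesis method for knowledge graphs.
Background
A knowledge graph is a large-scale semantic-network knowledge base. It represents knowledge symbolically and stores it in a computer as triples. Because knowledge graphs are semantically rich, structurally friendly, and easy to understand, they have been widely applied in recent years to situational awareness, recommender systems, natural language processing, and other fields.
Despite these clear advantages, a large amount of knowledge is missing from knowledge graphs. The most common technique for completing a knowledge graph is knowledge graph embedding, which embeds the entities and relations of the graph into a low-dimensional continuous space; this makes computation convenient while preserving the structural information of the graph.
Training a knowledge graph embedding requires both positive and negative samples so that the model learns to tell them apart. Positive samples are usually existing factual knowledge, while negative samples are generated by replacing the head or tail entity of a positive sample; this technique is called negative sampling. Existing negative sampling techniques use various kinds of information to filter a large pool of candidate negatives and obtain strong negative samples that benefit model training. For example, the patent "A knowledge graph embedding training method and related apparatus" (CN202110013880.6) uses the topological structure of the graph to help filter negative samples. In recent years, research has also turned to sample synthesis with mixup. For example, the patent "Interpolation contrastive learning method for semi-supervised learning with few labels" (CN202210024335.1) uses mixup interpolation to generate virtual positive sample pairs in the embedding space, addressing the problem of scarce labeled data.
Summary of the Invention
In view of the existing problems, the object of the invention is to provide an interpolation-based negative sample synthesis method for knowledge graphs that solves the above technical problems by improving the algorithm and related technical solutions.
The invention provides the following technical solution:
An interpolation-based negative sample synthesis method for knowledge graphs, comprising the following steps: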
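S1: Candidate set screening: screen a negative sample set cand_il from the negative samples to serve as the candidate set for the mixup operation;
S2: Mixup sample synthesis: mix the negative samples in cand_il by mixup to obtain cand_im, then synthesize, again by mixup, the negative samples in cand_im with the positive sample
Figure BDA0003953290640000021
to obtain the strong negative samples cand_ik;
S3: Training update: screen the resulting negative sample sets cand_il, cand_im, and cand_ik once more to obtain cand_is, use cand_is in model training, and update the strong negative sample set
Figure BDA0003953290640000022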
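Step S1 comprises the following steps:
S11: In the (e+1)-th training round of the embedding model, for the positive sample set of size n
Figure BDA0003953290640000023
and each positive sample in it
Figure BDA0003953290640000024
obtain the corresponding negative sample set of size s
Figure BDA0003953290640000025
and the strong negative sample set of size h produced by the model update in the previous round
Figure BDA0003953290640000026
S12: Randomly select entities from the entity set ε to replace the head h or the tail t of the positive sample
Figure BDA0003953290640000027
generating a candidate negative sample set of size f
Figure BDA0003953290640000028
S13: Select any n1 negative samples from the negative sample set NSi and combine them with the h synthesized negative samples in
Figure BDA0003953290640000029
to obtain a negative sample set of size n2
Figure BDA00039532906400000210
then compute the similarity Ci between every negative sample in cand_i and the positive sample
Figure BDA00039532906400000211
S14: Sort the samples in cand_i by similarity Ci in descending order and take the top-l samples as the negative sample set
Figure BDA00039532906400000212
The negative sample set
Figure BDA00039532906400000213
contains l negative samples, and cand_il is the candidate set for the mixup operation.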
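As a rough illustration of steps S11 to S14, the Python sketch below builds cand_i from random head/tail corruptions plus the strong negatives of the previous round and keeps the top-l most similar samples. The similarity formula is not recoverable from this text, so cosine similarity over the concatenated (h, r, t) embeddings is used purely as an assumption. The function names (`cosine`, `corrupt`, `screen_candidates`) and the representation of a triple as three tuples of floats are likewise hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (an ASSUMED stand-in for Ci)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flatten(triple):
    """Concatenate the head, relation, and tail embeddings into one vector."""
    return [x for part in triple for x in part]

def corrupt(positive, entity_embeddings, rng):
    """S12: replace the head or the tail of the positive triple with a random entity."""
    h, r, t = positive
    e = rng.choice(entity_embeddings)
    return (e, r, t) if rng.random() < 0.5 else (h, r, e)

def screen_candidates(positive, entity_embeddings, strong_negatives, f, n1, l, rng):
    """S12-S14: n1 random corruptions plus the strong negatives, ranked, top-l kept."""
    ns_i = [corrupt(positive, entity_embeddings, rng) for _ in range(f)]
    cand_i = rng.sample(ns_i, n1) + list(strong_negatives)
    scored = [(cosine(flatten(positive), flatten(c)), c) for c in cand_i]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # descending similarity
    return scored[:l]  # list of (similarity, sample) pairs: cand_il
```

The sketch does not filter out corruptions that happen to reproduce the positive sample or another true triple; a real pipeline would likely need such a check.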
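Preferably, step S13 computes the similarity Ci by the following formula:
Figure BDA00039532906400000214
where
Figure BDA00039532906400000215
is the embedding of the positive sample
Figure BDA00039532906400000216
and
Figure BDA00039532906400000217
is the embedding of the negative sample
Figure BDA00039532906400000218
in the negative sample set cand_i.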
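Step S2 comprises the following steps:
S21: In the negative sample set
Figure BDA00039532906400000219
normalize the similarities Ci to obtain the probability P1i of each sample and the multinomial probability distribution over the candidate set cand_il; according to this distribution and the per-sample probabilities P1i, sample from cand_il twice, and apply the mixup synthesis operation to the two samples obtained,
Figure BDA00039532906400000220
and
Figure BDA00039532906400000221
S22: Repeat the above operation m times to obtain the negative sample set
Figure BDA00039532906400000222
S23: Compute the similarity Cj between every negative sample in cand_im and the positive sample
Figure BDA00039532906400000223
S24: In the negative sample set
Figure BDA0003953290640000031
normalize the similarities Cj to compute the probability of each sample
Figure BDA0003953290640000032
Figure BDA0003953290640000033
and the multinomial probability distribution over the candidate set cand_im; according to the probabilities P2j and this distribution, sample from cand_im once, and apply the mixup synthesis operation to the negative sample obtained
Figure BDA0003953290640000034
and the positive sample
Figure BDA0003953290640000035
S25: Repeat the above operation k times to obtain the negative sample set
Figure BDA0003953290640000036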
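Preferably, step S21 computes the probability P1i by the formula
Figure BDA0003953290640000037
and computes the result of the mixup synthesis operation by the formulas
Figure BDA0003953290640000038
Figure BDA0003953290640000039
where αi is a hyperparameter,
Figure BDA00039532906400000310
is the sample obtained by mixup synthesis of the samples
Figure BDA00039532906400000311
and
Figure BDA00039532906400000312
and ||·|| denotes taking the L2 norm of the sample
Figure BDA00039532906400000313
where the L2 norm formula is
Figure BDA00039532906400000314
with n the dimension of W. Step S24 computes the probability P2j by the formula
Figure BDA00039532906400000315
Figure BDA00039532906400000316
and computes the result of the mixup synthesis operation by the formulas
Figure BDA00039532906400000317
Figure BDA00039532906400000318
where βi is a hyperparameter,
Figure BDA00039532906400000319
is the sample obtained by mixup synthesis of the positive sample
Figure BDA00039532906400000320
and the negative sample
Figure BDA00039532906400000321
and ||·|| denotes taking the L2 norm of the sample
Figure BDA00039532906400000322
where the L2 norm formula is
Figure BDA00039532906400000323
with n the dimension of W.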
Step S23 computes the similarity Cj by the following formula:
Figure BDA00039532906400000324
where
Figure BDA00039532906400000325
is the embedding of the positive sample
Figure BDA00039532906400000326
and
Figure BDA00039532906400000327
is the embedding of the negative sample
Figure BDA00039532906400000328
in cand_im.
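Steps S21 to S25 can be sketched in Python as follows. Several details are assumptions, because the formula images are not available here: the sampling probabilities are taken to be proportional to the similarities (which must then be non-negative), the two draws in S21 are made with replacement, and mixup is taken to be the interpolation lam * a + (1 - lam) * b applied to each of the head, relation, and tail embeddings, optionally followed by division by the L2 norm. All function names are hypothetical, and `similarity` is any caller-supplied function.

```python
import math
import random

def l2_normalize(v):
    """Divide a vector by its L2 norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def mixup(a, b, lam, normalize=True):
    """Interpolate two triples component-wise: lam * a + (1 - lam) * b (ASSUMED form)."""
    mixed = []
    for va, vb in zip(a, b):
        v = [lam * x + (1.0 - lam) * y for x, y in zip(va, vb)]
        mixed.append(l2_normalize(v) if normalize else v)
    return tuple(mixed)

def double_mixup(positive, cand_il, sims_il, similarity, m, k, alpha, beta, rng):
    """S21-S25: negative-negative mixup (cand_im), then negative-positive mixup (cand_ik)."""
    cand_im = []
    for _ in range(m):
        # Two similarity-weighted draws from cand_il (with replacement, an assumption).
        a, b = rng.choices(cand_il, weights=sims_il, k=2)
        cand_im.append(mixup(a, b, alpha))
    sims_im = [similarity(positive, c) for c in cand_im]
    cand_ik = []
    for _ in range(k):
        # One similarity-weighted draw from cand_im, mixed with the positive sample.
        (neg,) = rng.choices(cand_im, weights=sims_im, k=1)
        cand_ik.append(mixup(neg, positive, beta))
    return cand_im, cand_ik
```

`random.choices` normalizes the weights internally, so passing raw similarities is equivalent to sampling from the normalized distribution under these assumptions.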
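Step S3 comprises the following steps:
S31: Pool all negative samples in cand_il, cand_im, and cand_ik; for the positive sample
Figure BDA00039532906400000329
they form the corresponding negative sample set
Figure BDA00039532906400000330
S32: Use the embedding model Modele obtained in the e-th training round to score every negative sample in cand_is
Figure BDA00039532906400000331
obtaining
Figure BDA0003953290640000041
then compute the weight Pi of each negative sample from scorei;
S33: Sort the samples in cand_is by weight Pi in descending order and take the top-h samples to update the strong negative sample set
Figure BDA0003953290640000042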
Preferably, step S32 computes the weight Pi by the formula
Figure BDA0003953290640000043
where ε is a hyperparameter and s is the total number of samples in cand_is.
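Steps S32 and S33 can be illustrated with the short Python sketch below. The weight formula itself is not recoverable from this text; the sketch assumes a softmax over the s scores with ε acting as a temperature-like factor, which is one common choice and not necessarily the patented formula. It also assumes that a larger score marks a harder negative. The function names are hypothetical.

```python
import math

def softmax_weights(scores, eps=1.0):
    """ASSUMED weighting: P_i = exp(eps * score_i) / sum_j exp(eps * score_j)."""
    top = max(scores)
    # Subtracting the maximum keeps the exponentials bounded for eps >= 0.
    exps = [math.exp(eps * (x - top)) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def update_strong_negatives(cand_is, scores, h, eps=1.0):
    """S32-S33: weight every sample in cand_is, then keep the top-h as strong negatives."""
    weights = softmax_weights(scores, eps)
    order = sorted(range(len(cand_is)), key=lambda i: weights[i], reverse=True)
    return [cand_is[i] for i in order[:h]], weights
```

Returning the full weight list alongside the top-h samples reflects that the weights are reused as Pj in the loss function.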
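Preferably, when the model being trained is a translational-distance knowledge graph embedding model, the loss function is:
Figure BDA0003953290640000044
where margin is a hyperparameter,
Figure BDA0003953290640000045
is the score that Model assigns to the positive sample
Figure BDA0003953290640000046
Figure BDA0003953290640000047
is the score that Model assigns to the negative sample
Figure BDA0003953290640000048
and Pj takes the value of the weight Pi computed in step S3. When the model being trained is a semantic-matching knowledge graph embedding model, the loss function is:
Figure BDA0003953290640000049
where
Figure BDA00039532906400000410
is the score that Model assigns to the positive sample
Figure BDA00039532906400000411
Figure BDA00039532906400000412
is the score that Model assigns to the negative sample
Figure BDA00039532906400000413
and Pj takes the value of the weight Pi computed in step S3.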
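The two loss formulas are not recoverable from this text. Purely as an assumption, the Python sketch below shows one common way that per-negative weights Pj enter such losses: a weighted margin ranking loss for distance-based models (lower score means more plausible) and a weighted logistic loss for semantic-matching models (higher score means more plausible). Neither function should be read as the patented formula, and all names are hypothetical.

```python
import math

def weighted_margin_loss(pos_dist, neg_dists, weights, margin=1.0):
    """ASSUMED form: sum_j P_j * max(0, margin + d(pos) - d(neg_j))."""
    return sum(p * max(0.0, margin + pos_dist - d) for p, d in zip(weights, neg_dists))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def weighted_logistic_loss(pos_score, neg_scores, weights):
    """ASSUMED form: -log sigmoid(s(pos)) - sum_j P_j * log sigmoid(-s(neg_j))."""
    neg_term = sum(p * log_sigmoid(-s) for p, s in zip(weights, neg_scores))
    return -log_sigmoid(pos_score) - neg_term
```

In both cases a negative with a larger weight contributes more to the gradient, which is the role the text assigns to Pj.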
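The beneficial technical effects of the invention are:
1. The technical solution is easy to implement and computationally efficient, and does not increase the algorithmic complexity of the original embedding model;
2. By transferring mixup to the knowledge graph embedding domain and applying a double mixup operation, the technical solution increases the diversity of virtual negative samples;
3. The technical solution can mine more strong negative samples that benefit model training, improving the overall performance of the knowledge graph embedding model;
4. The technical solution can be layered on top of any existing knowledge graph embedding model.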
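Brief Description of the Drawings
Fig. 1 is a flow diagram of the interpolation-based negative sample synthesis method for knowledge graphs according to an embodiment of the invention.
Detailed Description
Embodiments of the invention are described in detail below. The following embodiments are implemented on the premise of the technical solution of the invention, and detailed implementations and specific operating procedures are given, but the scope of protection of the invention is not limited to these embodiments. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application without creative effort fall within the scope of protection of this application.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. Appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments where no conflict arises.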
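Embodiment 1
Referring to Fig. 1, the interpolation-based negative sample synthesis method for knowledge graphs provided by this embodiment of the invention comprises the following steps:
S1: Candidate set screening: screen a negative sample set cand_il from the negative samples to serve as the candidate set for the mixup operation;
S2: Mixup sample synthesis: mix the negative samples in cand_il by mixup to obtain cand_im, then synthesize, again by mixup, the negative samples in cand_im with the positive sample
Figure BDA0003953290640000051
to obtain the strong negative samples cand_ik;
S3: Training update: screen the resulting negative sample sets cand_il, cand_im, and cand_ik once more to obtain cand_is, use cand_is in model training, and update the strong negative sample set
Figure BDA00039532906400000511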
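Step S1 comprises the following steps:
S11: In the (e+1)-th training round of the embedding model, for the positive sample set of size n
Figure BDA0003953290640000052
and each positive sample in it
Figure BDA0003953290640000053
obtain the corresponding negative sample set of size s
Figure BDA0003953290640000054
and the strong negative sample set of size h produced by the model update in the previous round
Figure BDA0003953290640000055
S12: Randomly select entities from the entity set ε to replace the head h or the tail t of the positive sample
Figure BDA0003953290640000056
generating a candidate negative sample set of size f
Figure BDA0003953290640000057
S13: Select any n1 negative samples from the negative sample set NSi and combine them with the h synthesized negative samples in
Figure BDA0003953290640000058
to obtain a negative sample set of size n2
Figure BDA0003953290640000059
then compute the similarity Ci between every negative sample in cand_i and the positive sample
Figure BDA00039532906400000510
S14: Sort the samples in cand_i by similarity Ci in descending order and take the top-l samples as the negative sample set
Figure BDA0003953290640000061
The negative sample set
Figure BDA0003953290640000062
contains l negative samples, and cand_il is the candidate set for the mixup operation.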
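Step S13 computes the similarity Ci by the following formula:
Figure BDA0003953290640000063
where
Figure BDA0003953290640000064
is the embedding of the positive sample
Figure BDA0003953290640000065
and
Figure BDA0003953290640000066
is the embedding of the negative sample
Figure BDA0003953290640000067
in the negative sample set cand_i.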
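Step S2 comprises the following steps:
S21: In the negative sample set
Figure BDA0003953290640000068
normalize the similarities Ci to obtain the probability P1i of each sample and the multinomial probability distribution over the candidate set cand_il; according to this distribution and the per-sample probabilities P1i, sample from cand_il twice, and apply the mixup synthesis operation to the two samples obtained,
Figure BDA0003953290640000069
and
Figure BDA00039532906400000610
S22: Repeat the above operation m times to obtain the negative sample set
Figure BDA00039532906400000611
S23: Compute the similarity Cj between every negative sample in cand_im and the positive sample
Figure BDA00039532906400000612
S24: In the negative sample set
Figure BDA00039532906400000613
normalize the similarities Cj to compute the probability of each sample
Figure BDA00039532906400000614
Figure BDA00039532906400000615
and the multinomial probability distribution over the candidate set cand_im; according to the probabilities P2j and this distribution, sample from cand_im once, and apply the mixup synthesis operation to the negative sample obtained
Figure BDA00039532906400000616
and the positive sample
Figure BDA00039532906400000617
S25: Repeat the above operation k times to obtain the negative sample set
Figure BDA00039532906400000618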
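Step S21 computes the probability P1i by the formula
Figure BDA00039532906400000619
and computes the result of the mixup synthesis operation by the formulas
Figure BDA00039532906400000620
Figure BDA00039532906400000621
where αi is a hyperparameter,
Figure BDA00039532906400000622
is the sample obtained by mixup synthesis of the samples
Figure BDA00039532906400000623
and
Figure BDA00039532906400000624
and ||·|| denotes taking the L2 norm of the sample
Figure BDA00039532906400000625
where the L2 norm formula is
Figure BDA00039532906400000626
with n the dimension of W. Step S24 computes the probability P2j by the formula
Figure BDA00039532906400000627
Figure BDA00039532906400000628
and computes the result of the mixup synthesis operation by the formulas
Figure BDA00039532906400000629
Figure BDA0003953290640000071
where βi is a hyperparameter,
Figure BDA0003953290640000072
is the sample obtained by mixup synthesis of the positive sample
Figure BDA0003953290640000073
and the negative sample
Figure BDA0003953290640000074
and ||·|| denotes taking the L2 norm of the sample
Figure BDA0003953290640000075
where the L2 norm formula is
Figure BDA0003953290640000076
with n the dimension of W.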
Step S23 computes the similarity Cj by the following formula:
Figure BDA0003953290640000077
where
Figure BDA0003953290640000078
is the embedding of the positive sample
Figure BDA0003953290640000079
and
Figure BDA00039532906400000710
is the embedding of the negative sample
Figure BDA00039532906400000711
in cand_im.
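Step S3 comprises the following steps:
S31: Pool all negative samples in cand_il, cand_im, and cand_ik; for the positive sample
Figure BDA00039532906400000712
they form the corresponding negative sample set
Figure BDA00039532906400000713
S32: Use the embedding model Modele obtained in the e-th training round to score every negative sample in cand_is
Figure BDA00039532906400000714
obtaining
Figure BDA00039532906400000715
then compute the weight Pi of each negative sample from scorei;
S33: Sort the samples in cand_is by weight Pi in descending order and take the top-h samples to update the strong negative sample set
Figure BDA00039532906400000716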
Step S32 computes the weight Pi by the formula
Figure BDA00039532906400000717
where ε is a hyperparameter and s is the total number of samples in cand_is.
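When the model being trained is a translational-distance knowledge graph embedding model, the loss function is:
Figure BDA00039532906400000718
where margin is a hyperparameter,
Figure BDA00039532906400000719
is the score that Model assigns to the positive sample
Figure BDA00039532906400000720
Figure BDA00039532906400000721
is the score that Model assigns to the negative sample
Figure BDA00039532906400000722
and Pj takes the value of the weight Pi computed in step S3. When the model being trained is a semantic-matching knowledge graph embedding model, the loss function is:
Figure BDA00039532906400000723
where
Figure BDA00039532906400000724
is the score that Model assigns to the positive sample
Figure BDA00039532906400000725
Figure BDA00039532906400000726
is the score that Model assigns to the negative sample
Figure BDA00039532906400000727
and Pj takes the value of the weight Pi computed in step S3.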
Embodiment 2
In another preferred embodiment of the invention, building on Embodiment 1, consider a Chinese knowledge graph and a positive sample
Figure BDA0003953290640000081
equal to (江西, 省份, 中国), that is, (Jiangxi, province, China). After embedding by
Figure BDA0003953290640000082
(assuming k=5), it is represented as ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)). A large number of negative samples NSi are generated by randomly replacing the head or tail entity with entities chosen from the entity set ε, and n1 negative samples (assuming n1=5) are selected from NSi:
{((0.4,0.6,0.2,0.5,0.9),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.5,0.3,0.1,0.7,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.2,0.3,0.9,0.5,0.4)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.6,0.5,0.9,0.7,0.5)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.1,0.2,0.3,0.5))}
Meanwhile, the strong negative sample set is
Figure BDA0003953290640000083
Figure BDA0003953290640000084
Figure BDA0003953290640000085
(assuming h=3). The 8 negative samples above are combined to form the negative sample set cand_i:
{((0.4,0.6,0.2,0.5,0.9),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.5,0.3,0.1,0.7,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.2,0.3,0.9,0.5,0.4)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.6,0.5,0.9,0.7,0.5)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.1,0.2,0.3,0.5)),
((0.4,0.2,0.9,0.4,0.6),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.5,0.6,0.8,0.9,0.4),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.6,0.4,0.9,0.7))}.
Compute the similarity Ci between every negative sample in cand_i and the positive sample
Figure BDA0003953290640000086
which gives the similarity list {0.79,0.51,0.82,0.56,0.63,0.92,0.95,0.84}. Sorting by similarity in descending order and taking the top-l samples (assuming l=4) forms the negative sample set cand_il:
{((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.2,0.3,0.9,0.5,0.4)),
((0.4,0.2,0.9,0.4,0.6),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.5,0.6,0.8,0.9,0.4),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.6,0.4,0.9,0.7))}
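The top-l selection in this example can be checked with a few lines of Python. The ranking uses only the similarity list given above. The final normalization into sampling probabilities, which divides each retained similarity by their sum, is an assumption about the unavailable P1i formula.

```python
sims = [0.79, 0.51, 0.82, 0.56, 0.63, 0.92, 0.95, 0.84]
l = 4

# Rank the positions of cand_i by similarity, highest first, and keep the top l.
ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
top_l = ranked[:l]
print([i + 1 for i in top_l])  # 1-based positions in cand_i -> [7, 6, 8, 3]

# ASSUMED normalization: P1_i = C_i / sum of the retained similarities.
total = sum(sims[i] for i in top_l)
p1 = [sims[i] / total for i in top_l]
print([round(p, 3) for p in p1])
```

Positions 3, 6, 7, and 8 of cand_i are exactly the four samples listed in cand_il.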
Normalize the similarities Ci to obtain the probability P1i of each sample and the multinomial probability distribution over the candidate set cand_il. According to this distribution and the per-sample probabilities P1i, sample from cand_il twice and apply the mixup synthesis operation, repeating m times (assuming m=4):
Select ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.2,0.3,0.9,0.5,0.4)) and ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.6,0.4,0.9,0.7)); mixup gives ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.25,0.45,0.65,0.7,0.55)) (assuming α1=0.5).
Select ((0.4,0.2,0.9,0.4,0.6),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)) and ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.6,0.4,0.9,0.7)); mixup gives ((0.25,0.35,0.75,0.35,0.4),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)) (assuming α2=0.5).
Select ((0.5,0.6,0.8,0.9,0.4),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)) and ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.6,0.4,0.9,0.7)); mixup gives ((0.3,0.55,0.7,0.6,0.3),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)) (assuming α3=0.5).
Select ((0.4,0.2,0.9,0.4,0.6),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)) and ((0.5,0.6,0.8,0.9,0.4),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)); mixup gives ((0.45,0.4,0.85,0.65,0.5),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)) (assuming α4=0.5).
The result is cand_im:
{((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.25,0.45,0.65,0.7,0.55)),
((0.25,0.35,0.75,0.35,0.4),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)),
((0.3,0.55,0.7,0.6,0.3),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)),
((0.45,0.4,0.85,0.65,0.5),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7))}.
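The four negative-negative mixups above can be reproduced with the Python sketch below. It assumes plain interpolation with a coefficient of 0.5 and no L2 step, which is what the listed numbers follow. The names `mix`, `mix_triple`, and `s3` to `s8` (the samples at positions 3, 6, 7, and 8 of cand_i) are hypothetical.

```python
def mix(u, v, lam=0.5):
    """Plain interpolation lam*u + (1-lam)*v, rounded to 3 decimals for display."""
    return [round(lam * a + (1 - lam) * b, 3) for a, b in zip(u, v)]

def mix_triple(a, b, lam=0.5):
    """Apply mix() to the head, relation, and tail embeddings of two triples."""
    return tuple(mix(x, y, lam) for x, y in zip(a, b))

r = (0.4, 0.2, 0.6, 0.8, 0.9)  # relation embedding shared by all samples
s3 = ((0.1, 0.5, 0.6, 0.3, 0.2), r, (0.2, 0.3, 0.9, 0.5, 0.4))
s6 = ((0.4, 0.2, 0.9, 0.4, 0.6), r, (0.5, 0.1, 0.6, 0.9, 0.7))
s7 = ((0.5, 0.6, 0.8, 0.9, 0.4), r, (0.5, 0.1, 0.6, 0.9, 0.7))
s8 = ((0.1, 0.5, 0.6, 0.3, 0.2), r, (0.3, 0.6, 0.4, 0.9, 0.7))

# The four pairs chosen in the worked example.
cand_im = [mix_triple(s3, s8), mix_triple(s6, s8), mix_triple(s7, s8), mix_triple(s6, s7)]
for triple in cand_im:
    print(triple)
```

Because both parents share the relation embedding, the relation component is unchanged by the interpolation.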
Compute the similarity Cj between every sample in cand_im and the positive sample, which gives the similarity list {0.89,0.98,0.87,0.92}. Normalize the similarities Cj to compute the probability of each sample
Figure BDA0003953290640000091
and the multinomial probability distribution over the candidate set cand_im. According to this distribution and the per-sample probabilities P2j, sample from cand_im once and mix the result with the positive sample by mixup, repeating k times (assuming k=4):
Select ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.25,0.45,0.65,0.7,0.55)) and the positive sample ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)); mixup gives ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.375,0.275,0.625,0.8,0.625)) (assuming β1=0.5).
Select ((0.25,0.35,0.75,0.35,0.4),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)) and the positive sample ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)); mixup gives ((0.175,0.425,0.675,0.325,0.3),(0.4,0.2,0.6,0.8,0.9),(0.45,0.225,0.55,0.9,0.7)) (assuming β2=0.5).
Select ((0.3,0.55,0.7,0.6,0.3),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)) and the positive sample ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)); mixup gives ((0.2,0.525,0.65,0.45,0.25),(0.4,0.2,0.6,0.8,0.9),(0.45,0.225,0.55,0.9,0.7)) (assuming β3=0.5).
Select ((0.45,0.4,0.85,0.65,0.5),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)) and the positive sample ((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)); mixup gives ((0.275,0.45,0.725,0.475,0.35),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)) (assuming β4=0.5).
The result is cand_ik:
{((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.375,0.275,0.625,0.8,0.625)),
((0.175,0.425,0.675,0.325,0.3),(0.4,0.2,0.6,0.8,0.9),(0.45,0.225,0.55,0.9,0.7)),
((0.2,0.525,0.65,0.45,0.25),(0.4,0.2,0.6,0.8,0.9),(0.45,0.225,0.55,0.9,0.7)),
((0.275,0.45,0.725,0.475,0.35),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7))}.
Pool the negative sample sets cand_il, cand_im, and cand_ik to obtain cand_is:
{((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.2,0.3,0.9,0.5,0.4)),
((0.4,0.2,0.9,0.4,0.6),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.5,0.6,0.8,0.9,0.4),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.3,0.6,0.4,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.25,0.45,0.65,0.7,0.55)),
((0.25,0.35,0.75,0.35,0.4),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)),
((0.3,0.55,0.7,0.6,0.3),(0.4,0.2,0.6,0.8,0.9),(0.4,0.35,0.5,0.9,0.7)),
((0.45,0.4,0.85,0.65,0.5),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7)),
((0.1,0.5,0.6,0.3,0.2),(0.4,0.2,0.6,0.8,0.9),(0.375,0.275,0.625,0.8,0.625)),
((0.175,0.425,0.675,0.325,0.3),(0.4,0.2,0.6,0.8,0.9),(0.45,0.225,0.55,0.9,0.7)),
((0.2,0.525,0.65,0.45,0.25),(0.4,0.2,0.6,0.8,0.9),(0.45,0.225,0.55,0.9,0.7)),
((0.275,0.45,0.725,0.475,0.35),(0.4,0.2,0.6,0.8,0.9),(0.5,0.1,0.6,0.9,0.7))}.
The weight list Pi computed with the embedding model of the previous round is {0.95,0.65,0.574,0.85,0.42,0.285,0.65,0.21,0.98,0.356,0.36,0.6}. Sorting the weights in descending order and taking the top-h negative samples (assuming h=3) updates the strong negative sample set
Figure BDA0003953290640000101
Figure BDA0003953290640000112
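The negative-positive mixup and the top-h selection of this example can be recomputed with the Python sketch below. It assumes plain interpolation with a coefficient of 0.5 and no L2 step, and it ranks the 12 samples of cand_is using only the weight list given in the text. The helper names are hypothetical.

```python
def mix(u, v, lam=0.5):
    """Plain interpolation lam*u + (1-lam)*v, rounded to 3 decimals for display."""
    return [round(lam * a + (1 - lam) * b, 3) for a, b in zip(u, v)]

def mix_triple(a, b, lam=0.5):
    """Apply mix() to the head, relation, and tail embeddings of two triples."""
    return tuple(mix(x, y, lam) for x, y in zip(a, b))

r = (0.4, 0.2, 0.6, 0.8, 0.9)
positive = ((0.1, 0.5, 0.6, 0.3, 0.2), r, (0.5, 0.1, 0.6, 0.9, 0.7))
cand_im = [
    ((0.1, 0.5, 0.6, 0.3, 0.2), r, (0.25, 0.45, 0.65, 0.7, 0.55)),
    ((0.25, 0.35, 0.75, 0.35, 0.4), r, (0.4, 0.35, 0.5, 0.9, 0.7)),
    ((0.3, 0.55, 0.7, 0.6, 0.3), r, (0.4, 0.35, 0.5, 0.9, 0.7)),
    ((0.45, 0.4, 0.85, 0.65, 0.5), r, (0.5, 0.1, 0.6, 0.9, 0.7)),
]
# Each cand_im sample is interpolated once with the positive sample.
cand_ik = [mix_triple(neg, positive) for neg in cand_im]
print(cand_ik[0][2])  # tail of the first synthesized sample -> [0.375, 0.275, 0.625, 0.8, 0.625]

# Rank the 12 samples of cand_is (cand_il, then cand_im, then cand_ik) by weight.
weights = [0.95, 0.65, 0.574, 0.85, 0.42, 0.285, 0.65, 0.21, 0.98, 0.356, 0.36, 0.6]
h = 3
top_h = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:h]
print([i + 1 for i in top_h])  # 1-based positions in cand_is -> [9, 1, 4]
```

Positions 9, 1, and 4 correspond to the first sample of cand_ik and the first and fourth samples of cand_il.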
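Feed the positive sample
Figure BDA0003953290640000111
and the negative sample set cand_is into the model as one pair of training data. Repeating the above steps n times yields n pairs of training data, which are then fed to the embedding model to complete one round of training. Finally, the model is trained for E rounds to obtain the embedding model ModelE.
The method provided by the above embodiments of the invention is easy to implement and fast to compute, and does not increase the complexity of the original embedding model. It increases the diversity of virtual negative samples, improves the performance of knowledge graph embedding models, and is easy to layer on top of existing knowledge graph embedding models.
Preferred specific embodiments of the invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning, or limited experimentation according to the concept of the invention shall fall within the scope of protection defined by the claims.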

Claims (9)

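1. An interpolation-based negative sample synthesis method for knowledge graphs, characterized by comprising the following steps:
S1: Candidate set screening: screen a negative sample set cand_il from the negative samples to serve as the candidate set for the mixup operation;
S2: Mixup sample synthesis: mix the negative samples in cand_il by mixup to obtain cand_im, then synthesize, again by mixup, the negative samples in cand_im with the positive sample
Figure QLYQS_1
to obtain the strong negative samples cand_ik;
S3: Training update: screen the resulting negative sample sets cand_il, cand_im, and cand_ik once more to obtain cand_is, use cand_is in model training, and update the strong negative sample set
Figure QLYQS_2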
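2. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 1, characterized in that step S1 comprises the following steps:
S11: In the (e+1)-th training round of the embedding model, for the positive sample set of size n
Figure QLYQS_3
and each positive sample in it
Figure QLYQS_4
obtain the corresponding negative sample set of size s
Figure QLYQS_5
and the strong negative sample set of size h produced by the model update in the previous round
Figure QLYQS_6
S12: Randomly select entities from the entity set ε to replace the head h or the tail t of the positive sample
Figure QLYQS_7
generating a candidate negative sample set of size f
Figure QLYQS_8
S13: Select any n1 negative samples from the negative sample set NSi and combine them with the h synthesized negative samples in
Figure QLYQS_9
to obtain a negative sample set of size n2
Figure QLYQS_10
then compute the similarity Ci between every negative sample in the negative sample set cand_i and the positive sample
Figure QLYQS_11
S14: Sort the samples in the negative sample set cand_i by similarity Ci in descending order and take the top-l samples as the negative sample set
Figure QLYQS_12
The negative sample set
Figure QLYQS_13
contains l negative samples, and the negative sample set cand_il is the candidate set for the mixup operation.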
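3. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 2, characterized in that step S13 computes the similarity Ci by the following formula:
Figure QLYQS_14
where
Figure QLYQS_15
is the embedding of the positive sample
Figure QLYQS_16
and
Figure QLYQS_17
is the embedding of the negative sample
Figure QLYQS_18
in the negative sample set cand_i.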
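4. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 3, characterized in that step S2 comprises the following steps:
S21: In the negative sample set
Figure QLYQS_19
normalize the similarities Ci to obtain the probability P1i of each sample and the multinomial probability distribution over the candidate set cand_il; according to this distribution and the per-sample probabilities P1i, sample from cand_il twice, and apply the mixup synthesis operation to the two samples obtained,
Figure QLYQS_20
and
Figure QLYQS_21
S22: Repeat the above operation m times to obtain the negative sample set
Figure QLYQS_22
Figure QLYQS_23
S23: Compute the similarity Cj between every negative sample in cand_im and the positive sample
Figure QLYQS_24
S24: In the negative sample set
Figure QLYQS_25
normalize the similarities Cj to compute the probability of each sample
Figure QLYQS_26
Figure QLYQS_27
and the multinomial probability distribution over the candidate set cand_im; according to the probabilities P2j and this distribution, sample from cand_im once, and apply the mixup synthesis operation to the negative sample obtained
Figure QLYQS_28
and the positive sample
Figure QLYQS_29
S25: Repeat the above operation k times to obtain the negative sample set
Figure QLYQS_30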
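5. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 4, characterized in that step S21 computes the probability P1i by the formula
Figure QLYQS_31
and computes the result of the mixup synthesis operation by the formulas
Figure QLYQS_32
Figure QLYQS_33
where αi is a hyperparameter,
Figure QLYQS_34
is the sample obtained by mixup synthesis of the samples
Figure QLYQS_35
and
Figure QLYQS_36
and ‖·‖ denotes taking the L2 norm of the sample
Figure QLYQS_37
where the L2 norm formula is
Figure QLYQS_38
with n the dimension of W; step S24 computes the probability P2j by the formula
Figure QLYQS_39
and computes the result of the mixup synthesis operation by the formulas
Figure QLYQS_40
Figure QLYQS_41
where βi is a hyperparameter,
Figure QLYQS_42
is the sample obtained by mixup synthesis of the positive sample
Figure QLYQS_43
and the negative sample
Figure QLYQS_44
and ‖·‖ denotes taking the L2 norm of the sample
Figure QLYQS_45
where the L2 norm formula is
Figure QLYQS_46
with n the dimension of W.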
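6. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 4, characterized in that step S23 computes the similarity Cj by the following formula:
Figure QLYQS_47
where
Figure QLYQS_48
is the embedding of the positive sample
Figure QLYQS_49
and
Figure QLYQS_50
is the embedding of the negative sample
Figure QLYQS_51
in cand_im.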
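7. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 6, characterized in that step S3 comprises the following steps:
S31: Pool all negative samples in cand_il, cand_im, and cand_ik; for the positive sample
Figure QLYQS_52
they form the corresponding negative sample set
Figure QLYQS_53
S32: Use the embedding model Modele obtained in the e-th training round to score every negative sample in cand_is
Figure QLYQS_54
obtaining
Figure QLYQS_55
then compute the weight Pi of each negative sample from scorei;
S33: Sort the samples in cand_is by weight Pi in descending order and take the top-h samples to update the strong negative sample set
Figure QLYQS_56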
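8. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 7, characterized in that step S32 computes the weight Pi by the formula
Figure QLYQS_57
where ε is a hyperparameter and s is the total number of samples in cand_is.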
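9. The interpolation-based negative sample synthesis method for knowledge graphs according to claim 8, characterized in that, when the model being trained is a translational-distance knowledge graph embedding model, the loss function is:
Figure QLYQS_58
where margin is a hyperparameter,
Figure QLYQS_59
is the score that Model assigns to the positive sample
Figure QLYQS_60
Figure QLYQS_61
is the score that Model assigns to the negative sample
Figure QLYQS_62
and Pj takes the value of the weight Pi computed in step S3; when the model being trained is a semantic-matching knowledge graph embedding model, the loss function is:
Figure QLYQS_63
where
Figure QLYQS_64
is the score that Model assigns to the positive sample
Figure QLYQS_65
Figure QLYQS_66
is the score that Model assigns to the negative sample
Figure QLYQS_67
and Pj takes the value of the weight Pi computed in step S3.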
CN202211455256.2A 2022-11-21 2022-11-21 Interpolation-based negative sample synthesis method for knowledge graphs Active CN116361476B (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211455256.2A CN116361476B (zh) 2022-11-21 2022-11-21 Interpolation-based negative sample synthesis method for knowledge graphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211455256.2A CN116361476B (zh) 2022-11-21 2022-11-21 Interpolation-based negative sample synthesis method for knowledge graphs

Publications (2)

Publication Number Publication Date
CN116361476A (zh) 2023-06-30
CN116361476B (zh) 2024-05-17

Family

ID=86915203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211455256.2A Active CN116361476B (zh) 2022-11-21 2022-11-21 Interpolation-based negative sample synthesis method for knowledge graphs

Country Status (1)

Country Link
CN (1) CN116361476B (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
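CN112182245A (zh) * 2020-09-28 2021-01-05 Institute of Computing Technology, Chinese Academy of Sciences Training method, system, and electronic device for a knowledge graph embedding model
US20210224690A1 (en) * 2020-01-21 2021-07-22 Royal Bank Of Canada System and method for out-of-sample representation learning
CN115048538A (zh) * 2022-08-04 2022-09-13 University of Science and Technology of China Multimodal knowledge graph completion method and system based on relation-enhanced negative sampling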

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TIROSHAN MADUSHANKA et al.: "MDNCaching: A Strategy to Generate Quality Negatives for Knowledge Graph Embedding", SPRINGER NATURE SWITZERLAND AG 2022, 30 June 2022 (2022-06-30), pages 877 - 888 *
雷景生 et al.: "Joint entity and relation extraction based on contextual semantic enhancement", Journal of Computer Applications (计算机应用), vol. 43, no. 5, 30 September 2022 (2022-09-30), pages 1438 - 1444 *

Also Published As

Publication number Publication date
CN116361476B (zh) 2024-05-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant