WO2022083011A1 - Clustering-based automatic identification method, system and device for hierarchical relationships, and storage medium - Google Patents

Clustering-based automatic identification method, system and device for hierarchical relationships, and storage medium

Info

Publication number
WO2022083011A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
words
level
occurrence
cluster
Prior art date
Application number
PCT/CN2021/071206
Other languages
English (en)
French (fr)
Inventor
张凯
刘杰
周建设
赵晴
Original Assignee
首都师范大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 首都师范大学
Publication of WO2022083011A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/247Thesauruses; Synonyms

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a clustering-based automatic identification method, system and device for hierarchical relationships between words, and a computer storage medium.
  • the present application provides a method, system, device and computer storage medium for automatic identification of hierarchical relationship between words based on clustering.
  • a first aspect of the present application provides a clustering-based automatic identification method for hierarchical relationships between words, the method comprising:
  • the calculation formula of the co-occurrence weight between the words is:
  • W(T_i, T_j) represents the co-occurrence weight of words T_i and T_j
  • tf(T_iT_j) represents the co-occurrence frequency of words T_i and T_j in the document
  • tf(T_i) represents the frequency of the word T_i in the document
  • WeightingFactor(T_i, T_j) is the adjustment factor
  • the calculation formula of the adjustment factor is:
  • min(length(d_i)) represents the minimum length among the documents in which words T_i and T_j co-occur, avg(length(d)) represents the average length of the co-occurring documents, and k is the number of co-occurring documents.
  • the calculation formula of the feature vector is:
  • V(T) = (<T_1, W_1>, <T_2, W_2>, ..., <T_k, W_k>)
  • T_1, T_2, ..., T_k represent the words related to word T
  • W_1, W_2, ..., W_k are the co-occurrence weights of word T with T_1, T_2, ..., T_k, respectively.
  • the calculation formula of the semantic similarity is:
  • Sim(T_1, T_2) represents the semantic similarity between words T_1 and T_2
  • W_1i represents the value of the i-th dimension of the feature vector of the word T_1
  • W_2i represents the value of the i-th dimension of the feature vector of the word T_2
  • k represents the dimension of the feature vectors
  • n represents the number of identical words in the two feature vectors.
  • the calculation formula of the rank coefficient is:
  • H(T_i) is the rank coefficient of the word T_i
  • tf(T_i) represents the word frequency of the word T_i
  • len(T_i) represents the word length.
  • the hierarchical clustering algorithm includes: single linkage, complete linkage and average linkage.
  • the hierarchical clustering algorithm is preferably average linkage.
  • the threshold value is 0.1.
  • the algorithm flow for identifying the hypernym-hyponym relationships of the words in the cluster is as follows:
  • S501: determine the number of levels, and assign the words in the cluster to word levels according to their level coefficients; words with higher level coefficients are placed in higher word levels, the highest word level is L_0, and the remaining levels are, in order, L_1, L_2, ..., L_i;
  • S502: establish hypernym-hyponym relationships between adjacent word levels: take a word T in level L_i, compute the similarity between word T and each word in level L_{i-1}, and take the most similar word as the hypernym of word T; continue taking words from level L_i until hypernym-hyponym relationships have been established for all words in L_i; then check the words in level L_{i-1} and move any word that has no hyponym down to level L_i;
  • S503: determine whether the bottom level has been reached; if so, end; otherwise continue to perform the operation of S502.
  • a second aspect of the present application provides a system for automatic identification of hierarchical relationships between words based on clustering, wherein the system includes: a document acquisition module, a word division module, a calculation module, and a result display module, wherein:
  • the document acquisition module is used to acquire documents to identify the hierarchical relationship between words
  • the calculation module implements the calculation method for determining the relationships between words as described above, thereby determining the hierarchical relationships between words;
  • the result display module is used to display the hierarchical relationships between words.
  • a third aspect of the present application provides a device for automatically identifying hierarchical relationships between words based on clustering, wherein the device includes:
  • a processor coupled to the memory
  • the processor invokes the executable program code stored in the memory to execute the aforementioned method.
  • a fourth aspect of the present application provides a computer storage medium, characterized in that, the storage medium stores computer instructions, and when the computer instructions are invoked, they are used to execute the aforementioned method.
  • the similarity between words without literal similarity features can be identified;
  • the distribution of word clusters is relatively uniform, and the similarity between words within the cluster is relatively high;
  • the grade recognition algorithm used can basically classify the words in a cluster into different grades; the hierarchical relationships between words can then be determined through manual judgment and adjustment.
  • FIG. 1 is a schematic flowchart of a clustering-based automatic identification method for hierarchical relationships between words disclosed in an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of the algorithm for identifying the hypernym-hyponym relationships of words within a cluster in a clustering-based method for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a system for automatic identification of hierarchical relationships between words based on clustering disclosed in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a device for automatically identifying hierarchical relationships between words based on clustering disclosed in an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a method for automatically identifying hierarchical relationships between words based on clustering disclosed in an embodiment of the present application.
  • a first aspect of the present application provides a clustering-based automatic identification method for hierarchical relationships between words, the method comprising:
  • the calculation formula for obtaining the co-occurrence weights between words is:
  • W(T_i, T_j) represents the co-occurrence weight of words T_i and T_j
  • tf(T_iT_j) represents the co-occurrence frequency of words T_i and T_j in the document
  • tf(T_i) represents the frequency of the word T_i in the document
  • WeightingFactor(T_i, T_j) is the adjustment factor
  • the calculation formula of the adjustment factor is:
  • min(length(d_i)) represents the minimum length among the documents in which words T_i and T_j co-occur, avg(length(d)) represents the average length of the co-occurring documents, and k is the number of co-occurring documents.
  • the calculation formula for constructing the feature vector is:
  • V(T) = (<T_1, W_1>, <T_2, W_2>, ..., <T_k, W_k>)
  • T_1, T_2, ..., T_k represent the words related to word T
  • W_1, W_2, ..., W_k are the co-occurrence weights of word T with T_1, T_2, ..., T_k, respectively.
  • the calculation formula for obtaining the semantic similarity between words is:
  • Sim(T_1, T_2) represents the semantic similarity between words T_1 and T_2
  • W_1i represents the value of the i-th dimension of the feature vector of the word T_1
  • W_2i represents the value of the i-th dimension of the feature vector of the word T_2
  • k represents the dimension of the feature vectors
  • n represents the number of identical words in the two feature vectors.
  • the calculation formula for obtaining the grade coefficient between words is:
  • H(T_i) is the grade coefficient of the word T_i
  • tf(T_i) represents the word frequency of the word T_i
  • len(T_i) represents the word length.
  • the hierarchical clustering algorithm includes: single linkage, complete linkage, and average linkage.
  • hierarchical clustering using the average-linkage algorithm has a better effect when the threshold is 0.1.
  • S501: determine the number of levels, and assign the words in the cluster to word levels according to their level coefficients; words with higher level coefficients are placed in higher word levels, the highest word level is L_0, and the remaining levels are, in order, L_1, L_2, ..., L_i;
  • S502: establish hypernym-hyponym relationships between adjacent word levels: take a word T in level L_i, compute the similarity between word T and each word in level L_{i-1}, and take the most similar word as the hypernym of word T; continue taking words from level L_i until hypernym-hyponym relationships have been established for all words in L_i; then check the words in level L_{i-1} and move any word that has no hyponym down to level L_i;
  • S503: determine whether the bottom level has been reached; if so, end; otherwise continue to perform the operation of S502.
  • FIG. 3 is a schematic structural diagram of a system for automatic identification of hierarchical relationships between words based on clustering disclosed in an embodiment of the present application.
  • a second aspect of the present application provides a system for automatic identification of hierarchical relationships between words based on clustering, wherein the system includes: a document acquisition module, a word division module, a calculation module, and a result display module, wherein:
  • the document acquisition module is used to acquire documents to identify the hierarchical relationship between words
  • the calculation module implements the calculation method for determining the relationships between words as in Embodiment 1, thereby determining the hierarchical relationships between words;
  • the result display module is used to display the hierarchical relationships between words.
  • FIG. 4 is a schematic structural diagram of a device for automatic identification of hierarchical relationships between words based on clustering disclosed in an embodiment of the present application.
  • a third aspect of the present application provides a device for automatic identification of hierarchical relationships between words based on clustering, wherein the device includes:
  • a processor coupled to the memory
  • the processor invokes the executable program code stored in the memory to execute the method for automatically identifying the hierarchical relationship between words based on clustering in Embodiment 1.
  • This embodiment provides a computer storage medium, characterized in that the storage medium stores computer instructions, and when the computer instructions are invoked, they are used to perform the clustering-based method for automatically identifying hierarchical relationships between words in Embodiment 1.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application proposes a clustering-based method for automatically identifying hierarchical relationships between words, which combines co-occurrence statistics with distributional similarity computation and then identifies the hierarchical relationships between words. Specifically, the calculation of the DICE measure is improved by means of an adjustment factor; in addition, an adjustment coefficient is added to the similarity calculation; the words are then clustered to form clusters; finally, the words within a cluster are assigned to levels according to their level coefficients, and the hypernym-hyponym relationships among them are identified.

Description

Clustering-based automatic identification method, system and device for hierarchical relationships, and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a clustering-based method, system and device for automatically identifying hierarchical relationships between words, and a computer storage medium.
Background Art
The rapid development of the Internet has brought an explosive growth of information resources. While this provides convenience, people have gradually come to feel "drowned" in an ocean of information, and how to obtain the required information from massive amounts of information accurately and efficiently has become a problem demanding an urgent solution. Most current web information retrieval tools (such as search engines) use full-text retrieval based on literal keyword matching. This approach is simple, feasible and convenient for searching, and achieves a high recall rate, but the retrieval returns too much information, of which only a small portion meets the searcher's requirements, so precision is low, and missed and erroneous retrievals also occur. Applying a thesaurus under normalized vocabulary control to the indexing and retrieval processes can effectively improve precision. However, traditional thesauri face difficulties in compilation and maintenance as well as in their application to web information retrieval environments, so research on how to automatically construct natural-language thesauri is of great significance.
Therefore, how to accurately and automatically identify the relationships between words is a technical problem that urgently needs to be solved.
Summary of the Invention
To solve the above technical problem of how to automatically identify relationships between words, the present application provides a clustering-based method, system and device for automatically identifying hierarchical relationships between words, and a computer storage medium.
A first aspect of the present application provides a clustering-based method for automatically identifying hierarchical relationships between words, the method comprising:
S1: selecting a document as the co-occurrence window, obtaining each word in the document, calculating the degree of association of each word using the DICE measure, and adjusting the calculation result of the DICE measure according to the size of the co-occurrence window;
S2: calculating the co-occurrence weight between words according to each word's own frequency in the document, the co-occurrence frequency between words, and an adjustment factor, thereby obtaining the degree of association between the words;
S3: selecting a word T, extracting the K words most related to word T according to the co-occurrence weights between word T and the other words, and constructing a feature vector;
S4: clustering the words by means of a hierarchical clustering algorithm: first treating each word as a cluster of its own and computing the semantic similarity between clusters; setting a threshold and merging the clusters whose semantic similarity is smaller than the threshold, until all words are merged into one large cluster;
S5: identifying the hypernym-hyponym relationships of the words within the large cluster according to their level coefficients.
Preferably, the co-occurrence weight between words is calculated as:
W(T_i, T_j) = [2 · tf(T_iT_j) / (tf(T_i) + tf(T_j))] × WeightingFactor(T_i, T_j)
where W(T_i, T_j) denotes the co-occurrence weight of words T_i and T_j, tf(T_iT_j) denotes the co-occurrence frequency of words T_i and T_j in the document, tf(T_i) denotes the frequency of word T_i in the document, and WeightingFactor(T_i, T_j) is the adjustment factor.
Preferably, the adjustment factor is calculated as:
[Formula image PCTCN2021071206-appb-000002: definition of the adjustment factor WeightingFactor(T_i, T_j)]
where min(length(d_i)) denotes the minimum length among the documents in which words T_i and T_j co-occur, avg(length(d)) denotes the average length of the co-occurring documents, and k is the number of co-occurring documents.
Preferably, the feature vector is calculated as:
V(T) = (<T_1, W_1>, <T_2, W_2>, ..., <T_k, W_k>)
where T_1, T_2, ..., T_k denote the words related to word T, and W_1, W_2, ..., W_k are the co-occurrence weights of word T with T_1, T_2, ..., T_k, respectively.
Preferably, the semantic similarity is calculated as:
[Formula image PCTCN2021071206-appb-000004: definition of the semantic similarity Sim(T_1, T_2)]
where Sim(T_1, T_2) denotes the semantic similarity between words T_1 and T_2, W_1i denotes the value of the i-th dimension of the feature vector of word T_1, W_2i denotes the value of the i-th dimension of the feature vector of word T_2, k denotes the dimension of the feature vectors, and n denotes the number of identical words in the two feature vectors.
Preferably, the level coefficient is calculated as:
[Formula image PCTCN2021071206-appb-000005: definition of the level coefficient H(T_i)]
where H(T_i) is the level coefficient of word T_i, tf(T_i) denotes the word frequency of word T_i, and len(T_i) denotes the word length.
Preferably, the hierarchical clustering algorithm includes single linkage, complete linkage, and average linkage.
Preferably, the hierarchical clustering algorithm is average linkage.
Preferably, the threshold is 0.1.
Preferably, the algorithm flow for identifying the hypernym-hyponym relationships of the words within a cluster is as follows:
S501: Determine the number of levels, and assign the words in the cluster to word levels according to their level coefficients; words with higher level coefficients are placed in higher word levels, the highest word level is L_0, and the remaining levels are, in order, L_1, L_2, ..., L_i;
S502: Establish hypernym-hyponym relationships between adjacent word levels. Take a word T in level L_i, compute the similarity between word T and each word in level L_{i-1}, and take the word with the highest similarity as the hypernym of word T; continue taking words from level L_i until hypernym-hyponym relationships have been established for all words in L_i; then check the words in level L_{i-1} and move any word that has no hyponym down to level L_i;
S503: Determine whether the bottom level has been reached; if so, end; otherwise, continue performing the operation of S502.
A second aspect of the present application provides a clustering-based system for automatically identifying hierarchical relationships between words, wherein the system includes a document acquisition module, a word division module, a calculation module, and a result display module, wherein:
the document acquisition module is configured to acquire the documents on which the hierarchical relationships between words are to be identified;
the word division module is configured to obtain the individual words in the documents;
the calculation module implements the calculation method for determining the relationships between words as described above, thereby determining the hierarchical relationships between words;
the result display module is configured to display the hierarchical relationships between words.
A third aspect of the present application provides a clustering-based device for automatically identifying hierarchical relationships between words, wherein the device includes:
a memory storing executable program code; and
a processor coupled to the memory;
wherein the processor invokes the executable program code stored in the memory to execute the method described above.
A fourth aspect of the present application provides a computer storage medium, wherein the storage medium stores computer instructions which, when invoked, are used to execute the method described above.
The beneficial effects of the present invention are:
By computing the degree of association between words through co-occurrence analysis, similarity between words that share no literal similarity features can be identified. On this basis, the level recognition method can essentially separate words expressing different subject categories; the resulting word clusters are distributed relatively evenly, and the similarity between words within a cluster is relatively high. The level recognition algorithm used can essentially assign the words within a cluster to different levels, after which the hierarchical relationships between words can be determined through manual judgment and adjustment.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the accompanying drawings required in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and therefore should not be regarded as limiting the scope; a person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 is a schematic flowchart of a clustering-based method for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application.
FIG. 2 is a schematic flowchart of the algorithm for identifying the hypernym-hyponym relationships of words within a cluster in a clustering-based method for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application.
FIG. 3 is a schematic structural diagram of a clustering-based system for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application.
FIG. 4 is a schematic structural diagram of a clustering-based device for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations.
Therefore, the following detailed description of the embodiments of the present application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it does not need to be further defined or explained in subsequent drawings.
In the description of the present application, it should be noted that if terms such as "upper", "lower", "inner" and "outer" indicate orientations or positional relationships, they are based on the orientations or positional relationships shown in the drawings, or on the orientations or positional relationships in which the product of the invention is conventionally placed in use; they are used only to facilitate and simplify the description of the present application, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present application.
In addition, if terms such as "first" and "second" appear, they are used only to distinguish between descriptions and should not be construed as indicating or implying relative importance.
It should be noted that, where no conflict arises, the features in the embodiments of the present application may be combined with one another.
Embodiment 1
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a clustering-based method for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application. As shown in FIG. 1, a first aspect of the present application provides a clustering-based method for automatically identifying hierarchical relationships between words, the method comprising:
S1: selecting a document as the co-occurrence window, obtaining each word in the document, calculating the degree of association of each word using the DICE measure, and adjusting the calculation result of the DICE measure according to the size of the co-occurrence window;
S2: calculating the co-occurrence weight between words according to each word's own frequency in the document, the co-occurrence frequency between words, and an adjustment factor, thereby obtaining the degree of association between the words;
S3: selecting a word T, extracting the K words most related to word T according to the co-occurrence weights between word T and the other words, and constructing a feature vector;
S4: clustering the words by means of a hierarchical clustering algorithm: first treating each word as a cluster of its own and computing the semantic similarity between clusters; setting a threshold and merging the clusters whose semantic similarity is smaller than the threshold, until all words are merged into one large cluster;
S5: identifying the hypernym-hyponym relationships of the words within the large cluster according to their level coefficients.
In this embodiment, the calculation formula for obtaining the co-occurrence weight between words is:
W(T_i, T_j) = [2 · tf(T_iT_j) / (tf(T_i) + tf(T_j))] × WeightingFactor(T_i, T_j)
where W(T_i, T_j) denotes the co-occurrence weight of words T_i and T_j, tf(T_iT_j) denotes the co-occurrence frequency of words T_i and T_j in the document, tf(T_i) denotes the frequency of word T_i in the document, and WeightingFactor(T_i, T_j) is the adjustment factor.
In this embodiment, the calculation formula of the adjustment factor is:
[Formula image PCTCN2021071206-appb-000007: definition of the adjustment factor WeightingFactor(T_i, T_j)]
where min(length(d_i)) denotes the minimum length among the documents in which words T_i and T_j co-occur, avg(length(d)) denotes the average length of the co-occurring documents, and k is the number of co-occurring documents. By computing the co-occurrence association between words, an "association concept space" can be constructed: an undirected graph whose vertices are the words and whose edge weights are the co-occurrence weights.
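For illustration only, a minimal sketch of how steps S1-S2 could be implemented is given below. It treats each document as one co-occurrence window and scales the DICE measure by an adjustment factor; because the published formula for WeightingFactor(T_i, T_j) is available only as an image, the minimum-length-over-average-length form used here, like the function and variable names, is an assumption rather than the disclosed formula.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_weights(documents):
    """Illustrative sketch of steps S1-S2.

    `documents` is a list of already-segmented token lists; each document is
    treated as one co-occurrence window. Returns a dict mapping a word pair
    (t_i, t_j) to its co-occurrence weight W(T_i, T_j).
    """
    tf = Counter()        # tf(T_i): frequency of each word over all documents
    co_tf = Counter()     # tf(T_iT_j): number of windows containing both words
    co_lengths = {}       # lengths of the documents in which the pair co-occurs

    for doc in documents:
        tf.update(doc)
        for pair in combinations(sorted(set(doc)), 2):
            co_tf[pair] += 1
            co_lengths.setdefault(pair, []).append(len(doc))

    weights = {}
    for pair, f_ij in co_tf.items():
        t_i, t_j = pair
        dice = 2 * f_ij / (tf[t_i] + tf[t_j])               # DICE measure
        lengths = co_lengths[pair]                          # k = len(lengths)
        # Assumed WeightingFactor(T_i, T_j): minimum co-occurring document
        # length divided by the average co-occurring document length.
        factor = min(lengths) / (sum(lengths) / len(lengths))
        weights[pair] = dice * factor                       # W(T_i, T_j)
    return weights

# Toy usage with two segmented "documents":
docs = [["动物", "哺乳动物", "猫"], ["动物", "哺乳动物", "狗", "猫"]]
print(co_occurrence_weights(docs))
```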
In this embodiment, the calculation formula for constructing the feature vector is:
V(T) = (<T_1, W_1>, <T_2, W_2>, ..., <T_k, W_k>)
where T_1, T_2, ..., T_k denote the words related to word T, and W_1, W_2, ..., W_k are the co-occurrence weights of word T with T_1, T_2, ..., T_k, respectively.
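A corresponding sketch of step S3 is shown below: the feature vector V(T) is built by keeping the K words with the highest co-occurrence weight with T. The helper name and the default value of K are illustrative assumptions.

```python
def build_feature_vector(word, weights, k=10):
    """Sketch of step S3: V(T) = (<T_1, W_1>, ..., <T_k, W_k>), where T_1..T_k
    are the k words most strongly co-occurring with `word` and W_1..W_k are the
    corresponding co-occurrence weights from co_occurrence_weights()."""
    related = []
    for (t_i, t_j), w in weights.items():
        if t_i == word:
            related.append((t_j, w))
        elif t_j == word:
            related.append((t_i, w))
    related.sort(key=lambda pair: pair[1], reverse=True)
    return related[:k]
```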
In this embodiment, the calculation formula for obtaining the semantic similarity between words is:
[Formula image PCTCN2021071206-appb-000009: definition of the semantic similarity Sim(T_1, T_2)]
where Sim(T_1, T_2) denotes the semantic similarity between words T_1 and T_2, W_1i denotes the value of the i-th dimension of the feature vector of word T_1, W_2i denotes the value of the i-th dimension of the feature vector of word T_2, k denotes the dimension of the feature vectors, and n denotes the number of identical words in the two feature vectors.
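The published similarity formula is likewise available only as an image. The sketch below therefore assumes an adjusted cosine similarity: the cosine of the two feature vectors computed over their shared context words, scaled by n/k, which is consistent with the variables defined above but is an assumed form, not the disclosed one.

```python
import math

def semantic_similarity(vec1, vec2, k):
    """Assumed sketch of Sim(T_1, T_2) for feature vectors built in step S3.

    vec1 and vec2 are lists of (word, weight) pairs; k is the feature-vector
    dimension and n is the number of identical words in the two vectors.
    """
    d1, d2 = dict(vec1), dict(vec2)
    shared = set(d1) & set(d2)
    n = len(shared)
    if n == 0:
        return 0.0
    dot = sum(d1[w] * d2[w] for w in shared)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    cosine = dot / (norm1 * norm2) if norm1 and norm2 else 0.0
    return (n / k) * cosine                  # adjustment coefficient n / k
```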
In this embodiment, the calculation formula for obtaining the level coefficients of the words is:
[Formula image PCTCN2021071206-appb-000010: definition of the level coefficient H(T_i)]
where H(T_i) is the level coefficient of word T_i, tf(T_i) denotes the word frequency of word T_i, and len(T_i) denotes the word length.
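The level-coefficient formula is also published only as an image. The one-line sketch below assumes that H(T_i) grows with the word frequency tf(T_i) and shrinks with the word length len(T_i), so that frequent, short, more general words receive higher coefficients; this matches how the coefficient is used in S501 but is an assumption, not the patented formula.

```python
def level_coefficient(word, tf):
    """Assumed sketch of H(T_i): word frequency divided by word length."""
    return tf[word] / len(word)
```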
In this embodiment, the hierarchical clustering algorithm includes single linkage, complete linkage, and average linkage.
Among these, hierarchical clustering with the average-linkage algorithm performs better when the threshold is 0.1.
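A sketch of the average-linkage clustering in step S4 follows. It uses the common agglomerative convention of repeatedly merging the most similar pair of clusters until one large cluster remains and records each merge together with its linkage value; the exact role the patent assigns to the 0.1 threshold is therefore an interpretation here, and the function names are illustrative.

```python
def average_linkage_clustering(words, pair_sim):
    """Sketch of step S4: agglomerative clustering with average linkage.

    `pair_sim` maps frozenset({w1, w2}) to the semantic similarity of the two
    words. Starts with one cluster per word and repeatedly merges the most
    similar pair until a single large cluster remains. Returns the list of
    merge events (cluster_a, cluster_b, average-linkage similarity).
    """
    clusters = [frozenset([w]) for w in words]
    merges = []

    def linkage(c1, c2):
        sims = [pair_sim.get(frozenset({a, b}), 0.0) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        i, j, sim = max(
            ((i, j, linkage(clusters[i], clusters[j]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda t: t[2],
        )
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges
```

Under this reading, cutting the merge sequence where the average-linkage similarity first drops below the 0.1 threshold yields the working clusters on which step S5 operates.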
In this embodiment, the algorithm flow for identifying the hypernym-hyponym relationships of the words within a cluster is as follows:
S501: Determine the number of levels, and assign the words in the cluster to word levels according to their level coefficients; words with higher level coefficients are placed in higher word levels, the highest word level is L_0, and the remaining levels are, in order, L_1, L_2, ..., L_i;
S502: Establish hypernym-hyponym relationships between adjacent word levels. Take a word T in level L_i, compute the similarity between word T and each word in level L_{i-1}, and take the word with the highest similarity as the hypernym of word T; continue taking words from level L_i until hypernym-hyponym relationships have been established for all words in L_i; then check the words in level L_{i-1} and move any word that has no hyponym down to level L_i;
S503: Determine whether the bottom level has been reached; if so, end; otherwise, continue performing the operation of S502.
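A sketch of S501-S503 for a single cluster is given below: words are first binned into levels L_0, L_1, ... by descending level coefficient, each word is then linked to its most similar word in the level above, and words in the upper level that receive no hyponym are moved down. The equal-sized binning rule and the function names are illustrative assumptions; the text only specifies that higher coefficients go to higher levels.

```python
def identify_hierarchy(cluster_words, level_coeff, similarity, num_levels):
    """Sketch of S501-S503 for the words of one cluster.

    level_coeff maps a word to its level coefficient H(T); similarity is a
    function (word_a, word_b) -> semantic similarity. Returns (levels,
    hypernym_of), where levels[0] is the highest level L_0.
    """
    # S501: sort by level coefficient (descending) and split into bins.
    ranked = sorted(cluster_words, key=lambda w: level_coeff[w], reverse=True)
    size = max(1, len(ranked) // num_levels)
    levels = [ranked[i:i + size] for i in range(0, len(ranked), size)]

    hypernym_of = {}
    i = 1
    while i < len(levels):                   # S503: stop after the bottom level
        # S502: link every word in L_i to its most similar word in L_{i-1}.
        for word in list(levels[i]):
            hypernym_of[word] = max(levels[i - 1], key=lambda u: similarity(word, u))
        # Words in L_{i-1} that received no hyponym are moved down to L_i.
        used = {hypernym_of[w] for w in levels[i]}
        for u in [u for u in levels[i - 1] if u not in used]:
            levels[i - 1].remove(u)
            levels[i].append(u)
        i += 1
    return levels, hypernym_of
```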
Embodiment 2
Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a clustering-based system for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application. As shown in FIG. 3, a second aspect of the present application provides a clustering-based system for automatically identifying hierarchical relationships between words, wherein the system includes a document acquisition module, a word division module, a calculation module, and a result display module, wherein:
the document acquisition module is configured to acquire the documents on which the hierarchical relationships between words are to be identified;
the word division module is configured to obtain the individual words in the documents;
the calculation module implements the calculation method for determining the relationships between words as in Embodiment 1, thereby determining the hierarchical relationships between words;
the result display module is configured to display the hierarchical relationships between words.
Embodiment 3
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a clustering-based device for automatically identifying hierarchical relationships between words disclosed in an embodiment of the present application. As shown in FIG. 4, a third aspect of the present application provides a clustering-based device for automatically identifying hierarchical relationships between words, wherein the device includes:
a memory storing executable program code; and
a processor coupled to the memory;
wherein the processor invokes the executable program code stored in the memory to execute the clustering-based method for automatically identifying hierarchical relationships between words of Embodiment 1.
Embodiment 4
This embodiment provides a computer storage medium, wherein the storage medium stores computer instructions which, when invoked, are used to execute the clustering-based method for automatically identifying hierarchical relationships between words of Embodiment 1.
The above are only specific implementations of the present application, but the scope of protection of the present application is not limited thereto. Any variation or replacement that can readily be conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (10)

  1. A clustering-based method for automatically identifying hierarchical relationships between words, applied to an electronic device, wherein the automatic identification method comprises:
    S1: selecting a document as the co-occurrence window, obtaining each word in the document, calculating the degree of association of each word using the DICE measure, and adjusting the calculation result of the DICE measure according to the size of the co-occurrence window;
    S2: calculating the co-occurrence weight between words according to each word's own frequency in the document, the co-occurrence frequency between words, and an adjustment factor, thereby obtaining the degree of association between the words;
    S3: selecting a word T, extracting the K words most related to word T according to the co-occurrence weights between word T and the other words, and constructing a feature vector;
    S4: clustering the words by means of a hierarchical clustering algorithm: first treating each word as a cluster of its own and computing the semantic similarity between clusters; setting a threshold and merging the clusters whose semantic similarity is smaller than the threshold, until all words are merged into one large cluster;
    S5: assigning the words within a cluster to levels according to their level coefficients, and identifying the hypernym-hyponym relationships among them.
  2. The method according to claim 1, wherein the co-occurrence weight between words is calculated as:
    W(T_i, T_j) = [2 · tf(T_iT_j) / (tf(T_i) + tf(T_j))] × WeightingFactor(T_i, T_j)
    where W(T_i, T_j) denotes the co-occurrence weight of words T_i and T_j, tf(T_iT_j) denotes the co-occurrence frequency of words T_i and T_j in the document, tf(T_i) denotes the frequency of word T_i in the document, and WeightingFactor(T_i, T_j) is the adjustment factor.
  3. The method according to claim 2, wherein the adjustment factor is calculated as:
    [Formula image PCTCN2021071206-appb-100002: definition of the adjustment factor WeightingFactor(T_i, T_j)]
    where min(length(d_i)) denotes the minimum length among the documents in which words T_i and T_j co-occur, avg(length(d)) denotes the average length of the co-occurring documents, and k is the number of co-occurring documents.
  4. The method according to claim 1, wherein the feature vector is calculated as:
    V(T) = (<T_1, W_1>, <T_2, W_2>, ..., <T_k, W_k>)
    where T_1, T_2, ..., T_k denote the words related to word T, and W_1, W_2, ..., W_k are the co-occurrence weights of word T with T_1, T_2, ..., T_k, respectively.
  5. The method according to claim 4, wherein the semantic similarity is calculated as:
    [Formula image PCTCN2021071206-appb-100004: definition of the semantic similarity Sim(T_1, T_2)]
    where Sim(T_1, T_2) denotes the semantic similarity between words T_1 and T_2, W_1i denotes the value of the i-th dimension of the feature vector of word T_1, W_2i denotes the value of the i-th dimension of the feature vector of word T_2, k denotes the dimension of the feature vectors, and n denotes the number of identical words in the two feature vectors.
  6. The method according to claim 1, wherein the level coefficient is calculated as:
    [Formula image PCTCN2021071206-appb-100005: definition of the level coefficient H(T_i)]
    where H(T_i) is the level coefficient of word T_i, tf(T_i) denotes the word frequency of word T_i, and len(T_i) denotes the word length.
  7. The method according to claim 1, wherein the hierarchical clustering algorithm includes single linkage, complete linkage, and average linkage.
  8. The method according to claim 7, wherein the hierarchical clustering algorithm is preferably average linkage.
  9. The method according to claim 8, wherein the threshold is preferably 0.1.
  10. The method according to claim 1, wherein the algorithm flow for identifying the hypernym-hyponym relationships of the words within a cluster is as follows:
    Step 1: determining the number of levels, and assigning the words in the cluster to word levels according to their level coefficients; words with higher level coefficients are placed in higher word levels, the highest word level is L_0, and the remaining levels are, in order, L_1, L_2, ..., L_i;
    Step 2: establishing hypernym-hyponym relationships between adjacent word levels; taking a word T in level L_i, computing the similarity between word T and each word in level L_{i-1}, and taking the word with the highest similarity as the hypernym of word T; continuing to take words from level L_i until hypernym-hyponym relationships have been established for all the words in L_i; checking the words in level L_{i-1}, and moving any word that has no hyponym down to level L_i;
    Step 3: determining whether the bottom level has been reached; if so, ending; otherwise, continuing to perform the operation of Step 2.
PCT/CN2021/071206 2020-10-22 2021-01-12 Clustering-based automatic identification method, system and device for hierarchical relationships, and storage medium WO2022083011A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011138197.7A CN112307204A (zh) 2020-10-22 2020-10-22 Clustering-based automatic identification method, system and device for hierarchical relationships, and storage medium
CN202011138197.7 2020-10-22

Publications (1)

Publication Number Publication Date
WO2022083011A1 true WO2022083011A1 (zh) 2022-04-28

Family

ID=74326971

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/071206 WO2022083011A1 (zh) 2020-10-22 2021-01-12 Clustering-based automatic identification method, system and device for hierarchical relationships, and storage medium

Country Status (2)

Country Link
CN (1) CN112307204A (zh)
WO (1) WO2022083011A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204620A (zh) * 2021-05-12 2021-08-03 首都师范大学 Method, system and device for automatic construction of a thesaurus, and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129479A (zh) * 2011-04-29 2011-07-20 南京邮电大学 World Wide Web service discovery method based on a probabilistic latent semantic analysis model
US20120215523A1 (en) * 2010-01-08 2012-08-23 International Business Machines Corporation Time-series analysis of keywords
CN105574005A (zh) * 2014-10-10 2016-05-11 富士通株式会社 Apparatus and method for clustering source data containing multiple documents

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7191175B2 (en) * 2004-02-13 2007-03-13 Attenex Corporation System and method for arranging concept clusters in thematic neighborhood relationships in a two-dimensional visual display space
CN104778204B (zh) * 2015-03-02 2018-03-02 华南理工大学 Multi-document topic discovery method based on two-level clustering
US9852359B2 (en) * 2015-09-14 2017-12-26 International Business Machines Corporation System, method, and recording medium for efficient cohesive subgraph identification in entity collections for inlier and outlier detection
CN106934005A (zh) * 2017-03-07 2017-07-07 重庆邮电大学 Density-based text clustering method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120215523A1 (en) * 2010-01-08 2012-08-23 International Business Machines Corporation Time-series analysis of keywords
CN102129479A (zh) * 2011-04-29 2011-07-20 南京邮电大学 World Wide Web service discovery method based on a probabilistic latent semantic analysis model
CN105574005A (zh) * 2014-10-10 2016-05-11 富士通株式会社 Apparatus and method for clustering source data containing multiple documents

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU HUIPING, HE LIN: "Automatic Recognition of Hierarchical Relationship of Thesaurus Based on Word Clustering", INFORMATION SCIENCE, vol. 26, no. 11, 15 November 2008 (2008-11-15), pages 1680 - 1684, XP055923439, ISSN: 1007-7634 *

Also Published As

Publication number Publication date
CN112307204A (zh) 2021-02-02

Similar Documents

Publication Publication Date Title
US10216766B2 (en) Large-scale image tagging using image-to-topic embedding
US8341112B2 (en) Annotation by search
US10496699B2 (en) Topic association and tagging for dense images
US8180766B2 (en) Bayesian video search reranking
US9508038B2 (en) Using ontological information in open domain type coercion
US8645391B1 (en) Attribute-value extraction from structured documents
US20210216576A1 (en) Systems and methods for providing answers to a query
US20090282012A1 (en) Leveraging cross-document context to label entity
US9864795B1 (en) Identifying entity attributes
CN108304552B (zh) 一种基于知识库特征抽取的命名实体链接方法
US11625537B2 (en) Analysis of theme coverage of documents
Alexandrov et al. An approach to clustering abstracts
CN112559684A (zh) 一种关键词提取及信息检索方法
CN108228541A (zh) 生成文档摘要的方法和装置
Chen et al. Georeferencing places from collective human descriptions using place graphs
CN110232185A (zh) 面向金融行业软件测试基于知识图谱语义相似度计算方法
CN114997288A (zh) 一种设计资源关联方法
WO2022083011A1 (zh) 基于聚类等级关系自动识别方法、系统、设备及存储介质
CN112307364B (zh) 一种面向人物表征的新闻文本发生地抽取方法
US10810266B2 (en) Document search using grammatical units
CN115687960B (zh) 一种面向开源安全情报的文本聚类方法
Wang et al. A joint chinese named entity recognition and disambiguation system
CN107423294A (zh) 一种社群图像检索方法及系统
Weng et al. A study on searching for similar documents based on multiple concepts and distribution of concepts
Lai et al. An unsupervised approach to discover media frames

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21881437

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 010823)

122 Ep: pct application non-entry in european phase

Ref document number: 21881437

Country of ref document: EP

Kind code of ref document: A1