CN101281530A - Key word hierarchy clustering method based on conception deriving tree - Google Patents

Key word hierarchy clustering method based on conception deriving tree

Info

Publication number
CN101281530A
Authority
CN
China
Prior art keywords
node
tree
nodes
concept
weights
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100377271A
Other languages
Chinese (zh)
Inventor
骆祥峰
方宁
徐炜民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CNA2008100377271A priority Critical patent/CN101281530A/en
Publication of CN101281530A publication Critical patent/CN101281530A/en
Pending legal-status Critical Current

Abstract

The invention relates to a keyword hierarchical clustering method based on a concept derivation tree. The method extracts a number of domain keywords from multiple texts in the same field and organizes them into a hierarchical tree model according to their semantic relationships. The semantic associations between keywords are acquired automatically, and their strength is calculated. The method clusters keywords hierarchically in a simple and effective way that is convenient for computers to process and interpret. It improves the performance of knowledge acquisition and thereby provides technical support for personalized intelligent search, automatic recommendation, assisted discovery and acquisition of innovative knowledge, and the accumulation and fusion of related knowledge.

Description

Keyword hierarchical clustering method based on a concept derivation tree
Technical field
The present invention relates to a method for the automatic clustering of text keywords by computer, and more specifically to a hierarchical clustering method for text keywords based on a concept derivation tree.
Background art
Keywords are one of the basic units for acquiring and representing the knowledge in a text. The precision of automatic keyword extraction directly affects the performance of text knowledge acquisition and the quality of the ontology built from the text.
Domain keywords, i.e. the keywords that co-occur across multiple texts belonging to one field, represent the lowest-level knowledge of the texts in that field and are one of the basic units for representing and acquiring that knowledge. The precision of their automatic extraction directly affects the performance of domain knowledge acquisition and the quality of the resulting domain ontology, and thus the quality and effectiveness of network resource services. For the automatic extraction of domain keywords, see the published related patent "extracting method of text key word" (publication number CN101067808). The present invention mainly addresses the automatic acquisition of the derivation relationships between text keywords, the calculation of their strength, and finally the hierarchical clustering of the domain keywords.
The derivation relationship of a domain keyword refers to the concepts (keywords) with related meanings into which a given concept (keyword) can be semantically expanded. For example, in the medical domain the concept "cancer" can be expanded into cancer-related concepts such as lung cancer, stomach cancer and leukaemia.
Domain keywords (concepts) thus give rise to semantic derivation relationships. The present invention organizes these relationships by semantic level and presents them coherently.
The present invention proposes a concept derivation tree model to organize and represent the concepts of a field and the semantic derivation relationships between them. The semantic information between concepts is used to build a hierarchical data model in which the concepts of the field are effectively organized.
The present invention can effectively improve the performance of knowledge acquisition, and thereby provides technical support for personalized intelligent search of resources in a network environment, automatic recommendation, assisted discovery and acquisition of innovative knowledge, and the aggregation and fusion of related knowledge.
Summary of the invention
The object of the present invention is to provide a keyword hierarchical clustering method based on a concept derivation tree that effectively organizes domain keywords (concepts) and arranges them hierarchically according to their derivation relationships. The concept derivation tree model proposed by the invention concerns semantic similarity only.
The idea of the invention is to organize the internal multi-level semantic relationships of domain keywords (concepts) with a weighted, tree-shaped visual structure.
According to this inventive concept, the present invention adopts the following technical solution:
A domain keyword hierarchical clustering method based on a concept derivation tree, characterized in that a number of domain keywords are extracted from multiple texts in the same field and arranged hierarchically into a tree model according to their semantic relationships. The specific steps are as follows:
1. Extract a number of domain keywords from multiple texts in the same field (see the published related patent "extracting method of the keyword of text", publication number CN101067808). Each keyword is a concept, and the domain keywords together form the concept set.
2. Each concept in the concept set serves as a node. The root node is the name of the field to which the keywords belong, and the concept nodes are all nodes other than the root node.
3. To build the first layer of concept nodes, select nodes that have a direct derivation relationship with the root node as its child nodes, the root node being their parent. The several nodes most closely related to this parent are chosen as the first-layer nodes; the number chosen controls the depth of the whole derivation tree. To prevent redundant concept nodes, the first-layer concepts are deleted from the concept set.
4. To build the second layer of concept nodes, for each first-layer node select the several nodes most closely related to it to form its subtree (the number of nodes should not be too large). Concepts may repeat among the nodes of the same layer.
5. Delete from the concept set the nodes already present in the derivation tree, so that no concept appears redundantly at different levels of the tree. Repeat the previous step to build the third and further layers until the concept set is empty or no new node can be added from it. This yields a tree-shaped data model.
6. Calculate the weights between nodes according to the type of derivation relationship between them. The relationship between two nodes falls into one of three cases:
(1) If the two nodes have the same parent node, the weight between the two concepts is calculated with the parallelogram law from mechanics (or a similar method).
(2) If a direct derivation relationship exists between the two nodes, the weight is read directly from the level of the derivation tree at which the node lies.
(3) If no direct derivation relationship exists between the two nodes, find their nearest associated node, calculate the weight of each node with respect to that node, and then combine the two with the parallelogram law (or a similar method) to obtain the weight between the two concepts.
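As a hedged reading of the parallelogram law mentioned in cases (1) and (3): if the two weights are treated as the magnitudes of two vectors, the combined weight is the magnitude of their resultant, which depends on the angle between them. The angle of 120° used below is an assumption that the text does not state; it is chosen because it reproduces the values 0.7 and 0.61 worked out in the embodiment.

```latex
% Resultant of two vectors of magnitudes a and b separated by an angle \theta
|\vec{a} + \vec{b}|^{2} = a^{2} + b^{2} + 2ab\cos\theta

% Assumed angle (not stated in the source): \theta = 120^{\circ}, so \cos\theta = -1/2
\omega_{a\text{-}b} = \sqrt{\omega_a^{2} + \omega_b^{2} - \omega_a\,\omega_b}

% Check with two equal weights of 0.7:
\sqrt{0.49 + 0.49 - 0.49} = 0.7
```

Under this reading, two equally weighted siblings keep their common weight, and a weaker second weight pulls the result below the stronger one.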
Compared with the prior art, the present invention has the following evident substantive features and notable advantages: the method automatically obtains the semantic associations between keywords and calculates their strength, and it represents the semantic relationships between domain keywords hierarchically, concisely and efficiently, in a form convenient for computers to process and interpret. The invention can effectively improve the performance of knowledge acquisition, and thereby provides technical support for personalized intelligent search of resources in large-scale network environments, automatic recommendation, assisted discovery and acquisition of innovative knowledge, and the aggregation and fusion of related knowledge.
Description of drawings
Fig. 1 is a concept derivation tree containing one layer of concepts.
Fig. 2 is the co-occurrence table of the first-layer concept derivation nodes.
Fig. 3 is part of a complete concept derivation tree.
Fig. 4 is a concept derivation tree with the derivation weights between nodes calculated.
Embodiment
A preferred embodiment of the present invention is described in detail below with reference to the drawings.
The specific steps of text keyword hierarchical clustering based on a concept derivation tree are as follows:
1. From the 112 "National People's Congress 5000" papers, 163 domain keywords are extracted (see the published related patent "extracting method of the keyword of text", publication number CN101067808). The resulting concept set has size 163, i.e. it contains 163 distinct concepts.
2. The root node can be chosen according to the field concerned. For the 112 "National People's Congress 5000" training papers, the root node is set to "political economy" according to their field. Constructing the tree this way makes the whole concept derivation tree more representative.
3. To build the first layer, first determine the number of first-layer nodes. This number should not be too large, because it affects the depth of the whole tree. The purpose of the concept derivation tree is to reveal the hierarchical derivation relationships between concepts, so the breadth of the tree should be controlled as far as possible at the low layers. In this example the number of first-layer nodes is taken as the fourth root of the total number of concepts in the concept set, rounded up, which keeps it within a reasonable range. The number of first-layer nodes is therefore
$$\left\lceil \sqrt[4]{163} \right\rceil = 4$$
With the number of first-layer nodes fixed, the concepts are ranked by the number of the 112 papers in which they occur (document frequency), and the top 4 keywords are taken. Because the most frequent concepts are the ones most closely related to the field represented by the 112 papers, they become the first-layer tree nodes, as shown in Fig. 1.
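The first-layer selection in step 3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the input format (one set of keywords per paper), the function name `first_layer_nodes`, and the alphabetical tie-break are all assumptions introduced here.

```python
from collections import Counter

def first_layer_nodes(docs, concepts):
    """Pick the first-layer concepts by document frequency.

    docs     -- list of sets of keywords, one set per paper (assumed format)
    concepts -- the concept set extracted for the field
    """
    # Number of first-layer nodes: fourth root of the concept count, rounded up.
    # Integer search avoids floating-point error at exact fourth powers.
    k = 1
    while k ** 4 < len(concepts):
        k += 1
    # Document frequency: the number of papers in which each concept occurs.
    df = Counter(c for doc in docs for c in doc if c in concepts)
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(concepts, key=lambda c: (-df[c], c))
    return ranked[:k]
```

With 163 concepts the loop stops at k = 4, matching the count used in the example.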
4. To build the second and further layers of derived concept nodes, first construct a co-occurrence table for the first-layer concept derivation nodes, as shown in Fig. 2. The table gives the number of times each first-layer node occurs in the same paper as each other concept; as an example, only the 15 concepts that co-occur most often with the first-layer nodes are shown. The next question is the number of nodes in each subtree. As with the root node, a subtree should not have too many nodes. The number is decided by the following rule: take the top 1/η of the concepts that co-occur with the node, ranked by co-occurrence count. η can be set as needed; in this example η = 15. This keeps the number of children of each node within a suitable range. Nodes of the same level may repeat in the concept derivation tree.
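A hedged sketch of the co-occurrence table and the 1/η rule in step 4 follows. The rule is read here as "keep the top 1/η, rounded up, of the concepts that co-occur with the node at least once"; that reading, the function names, and the alphabetical tie-break are assumptions rather than details given in the text.

```python
import math
from collections import Counter

def cooccurrence(docs, node, candidates):
    """Count, per candidate, the papers in which it appears together with node."""
    counts = Counter()
    for doc in docs:  # each doc is a set of keywords (assumed format)
        if node in doc:
            counts.update(c for c in doc if c in candidates and c != node)
    return counts

def select_children(docs, node, candidates, eta=15):
    """Keep the top 1/eta of the co-occurring concepts (assumed reading of the rule)."""
    counts = cooccurrence(docs, node, candidates)
    if not counts:
        return []
    k = math.ceil(len(counts) / eta)
    # Highest co-occurrence first; ties broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return ranked[:k]
```

A larger η gives each node fewer children, so the tree grows deeper rather than wider.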
5. Once the nodes of a level have been built, the nodes (concepts) of that level are removed from the concept set, which avoids redundant concept-to-concept relationships across different levels of the concept derivation tree. The other levels of the tree are then built one by one with a recursive algorithm until all concepts in the concept set are covered or no new concept node can be added. The complete tree is then finished; Fig. 3 shows part of such a concept derivation tree.
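The level-by-level construction in step 5 can be sketched as below. The text describes a recursive algorithm; this sketch uses an equivalent loop over levels. The callback `pick_children` (standing in for whatever closeness ranking is used) and the dictionary representation of the tree are assumptions introduced for illustration.

```python
def build_tree(root, first, concepts, pick_children):
    """Build the derivation tree one level at a time.

    pick_children(node, remaining) -> list of concepts chosen as children of
    node (hypothetical callback, e.g. a co-occurrence ranking).
    Returns a dict mapping each node to the list of its children.
    """
    tree = {root: list(first)}
    remaining = set(concepts) - set(first)  # first-layer concepts leave the pool
    level = list(first)
    while remaining and level:
        used, next_level = set(), []
        for node in level:
            # Every node of a level draws from the same pool,
            # so a concept may repeat within one level.
            kids = list(pick_children(node, remaining))
            tree[node] = kids
            used.update(kids)
            next_level.extend(k for k in kids if k not in next_level)
        if not used:  # no new node could be added
            break
        remaining -= used  # a finished level is removed from the concept set
        level = next_level
    for node in level:  # nodes of the last level are leaves
        tree.setdefault(node, [])
    return tree
```

Because a finished level is removed from the pool before the next one is built, no concept can appear at two different depths, while repetition inside one level remains possible.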
6. The weight between two concepts can be calculated starting from the top-layer concepts of the constructed concept derivation tree.
When building the concept derivation tree, the number of layers is generally kept within 7. So that the weights of the different layers differ clearly, the weight $\omega_i$ of the direct derivation path between a top-layer concept node and its parent is set to 0.7; for each layer further down, the weight of the direct derivation path between a concept node and its parent is reduced by 0.1, i.e. the level weight difference $\Delta_i$ is 0.1. The derivation relationship between nodes thus weakens steadily as the level increases. $\omega_i$ and $\Delta_i$ can be set according to the needs of different fields. Following this rule, the complete domain concept derivation tree can be constructed, as shown in Fig. 4. The relationship between two nodes falls into one of three cases:
(1) Node $C_5$ and node $C_{11}$ have the same parent node. The parallelogram law from mechanics can be applied by treating the two concepts as two vectors with different directions; the result is the weight of the two vectors combined. The relationship between them is therefore calculated as
$$\omega_{5-11} = \sqrt{\omega_5^2 + \omega_{11}^2 - \omega_5\,\omega_{11}}$$
With $\omega_5 = 0.7$ and $\omega_{11} = 0.7$, this gives $\omega_{5-11} = 0.7$. That is, a concept derivation association exists between $C_5$ and $C_{11}$, with derivation weight 0.7.
(2) A direct derivation relationship exists between node $C_{12}$ and node $C_{11}$, so the weight $\omega_{12-11} = 0.6$ can be read directly from the tree.
(3) The nearest associated node of node $C_{12}$ and node $C_5$ is the root node (Node 0). The weight of each node with respect to the root is the product of the weights of the layers along its path:
$$\omega_{12-\mathrm{node}0} = \omega_{12-11} \times \omega_{11} = 0.6 \times 0.7 = 0.42$$
$$\omega_{5-\mathrm{node}0} = \omega_5$$
As in the first case, the parallelogram law then gives
$$\omega_{5-12} = \sqrt{\omega_5^2 + \omega_{12-\mathrm{node}0}^2 - \omega_5\,\omega_{12-\mathrm{node}0}} = 0.61$$
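The three cases can be checked numerically with a short sketch. It assumes that every node has a single parent (repetition within a level is ignored), that the tree is stored as `parent` and `depth` dictionaries, and that the function names are free choices; none of these come from the text. Case (1) is handled as the special case of (3) in which the nearest associated node is the shared parent.

```python
import math

def level_weight(depth, top=0.7, delta=0.1):
    """Weight of the edge between a node at `depth` (1 = first layer) and its parent."""
    return top - delta * (depth - 1)

def combine(a, b):
    """Combination formula used in the worked example: sqrt(a^2 + b^2 - a*b)."""
    return math.sqrt(a * a + b * b - a * b)

def path_to_root(parent, node):
    """The node followed by its ancestors, up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def weight_to(parent, depth, node, ancestor):
    """Product of the level weights on the path from node up to ancestor."""
    w = 1.0
    while node != ancestor:
        w *= level_weight(depth[node])
        node = parent[node]
    return w

def relation_weight(parent, depth, x, y):
    px, py = path_to_root(parent, x), path_to_root(parent, y)
    common = next(n for n in px if n in py)  # nearest associated node
    wx = weight_to(parent, depth, x, common)
    wy = weight_to(parent, depth, y, common)
    if common == y:  # case (2): y is on the path from x to the root
        return wx
    if common == x:  # case (2), roles swapped
        return wy
    return combine(wx, wy)  # cases (1) and (3)

# The fragment of the tree used in the worked example.
parent = {"C5": "Node0", "C11": "Node0", "C12": "C11"}
depth = {"C5": 1, "C11": 1, "C12": 2}
```

On this fragment, `relation_weight` gives 0.7 for C5 and C11, 0.6 for C12 and C11, and about 0.61 for C12 and C5, matching the values worked out above up to floating-point rounding.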

Claims (2)

1. A keyword hierarchical clustering method based on a concept derivation tree, characterized in that a number of domain keywords are extracted from multiple texts in the same field and arranged hierarchically into a tree model according to their semantic relationships, the specific steps being as follows:
a) extract a number of domain keywords from multiple texts in the same field, each keyword being a concept and the domain keywords together forming the concept set;
b) each concept in the concept set serves as a node, the root node being the name of the field to which the keywords belong and the concept nodes being all nodes other than the root node;
c) to build the first layer of concept nodes, select nodes that have a direct derivation relationship with the root node as its child nodes, the root node being their parent; the several nodes most closely related to this parent are chosen as the first-layer nodes, the number chosen controlling the depth of the whole derivation tree; to prevent redundant concept nodes, the first-layer concepts are deleted from the concept set;
d) to build the second layer of concept nodes, for each first-layer node select the several nodes most closely related to it to form its subtree, concepts being allowed to repeat among the nodes of the same layer;
e) delete from the concept set the nodes already present in the derivation tree, so that no concept appears redundantly at different levels of the tree; repeat the previous step to build the third and further layers until the concept set is empty or no new node can be added from it, which yields a tree-shaped data model;
f) calculate the weights between nodes according to the type of derivation relationship between them.
2. The keyword hierarchical clustering method based on a concept derivation tree according to claim 1, characterized in that, in step f), the weights between nodes are calculated according to the type of derivation relationship between them, the relationship between two nodes falling into one of three cases:
a) if the two nodes have the same parent node, the weight between the two concepts is calculated with the parallelogram law from mechanics;
b) if a direct derivation relationship exists between the two nodes, the weight is read directly from the level of the derivation tree at which the node lies;
c) if no direct derivation relationship exists between the two nodes, their nearest associated node is found, the weight of each node with respect to that node is calculated, and the two are then combined with the parallelogram law to obtain the weight between the two concepts.
CNA2008100377271A 2008-05-20 2008-05-20 Key word hierarchy clustering method based on conception deriving tree Pending CN101281530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100377271A CN101281530A (en) 2008-05-20 2008-05-20 Key word hierarchy clustering method based on conception deriving tree

Publications (1)

Publication Number Publication Date
CN101281530A true CN101281530A (en) 2008-10-08

Family

ID=40014005

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100377271A Pending CN101281530A (en) 2008-05-20 2008-05-20 Key word hierarchy clustering method based on conception deriving tree

Country Status (1)

Country Link
CN (1) CN101281530A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20081008