CN101281530A - Key word hierarchy clustering method based on conception deriving tree - Google Patents
- Publication number
- CN101281530A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a keyword hierarchical clustering method based on a concept derivation tree. The method extracts a number of domain keywords from multiple texts in the same field and builds a hierarchical tree model of those keywords according to their semantic relationships. The semantic association between keywords is acquired automatically, and its strength is calculated. Keywords are thereby clustered hierarchically, simply and effectively, in a form that computers can readily process. The method improves the performance of knowledge acquisition and provides technical support for personalized intelligent search, automatic recommendation, aided discovery and acquisition of innovative knowledge, and the aggregation and fusion of related knowledge.
Description
Technical field
The present invention relates to a method for automatically clustering text keywords by computer, and more particularly to a hierarchical clustering method for text keywords based on a concept derivation tree.
Background technology
The keyword is one of the basic units for acquiring and representing text knowledge. The precision of automatic keyword extraction directly affects the performance of text knowledge acquisition and the quality of the text ontology that is built.
Domain keywords, that is, keywords that co-occur across multiple texts belonging to one field, represent the lowest-level knowledge of the texts in that field and are one of the basic units for representing and acquiring that knowledge. The precision with which domain keywords are extracted automatically directly affects the performance of domain knowledge acquisition and the quality of the resulting domain ontology, and therefore the quality and effectiveness of network resource services. For the automatic extraction of domain keywords, see the published related patent "extracting method of text key word" (publication number CN101067808). The present invention mainly addresses the automatic acquisition of the derivation relationships between text keywords, the calculation of their strength, and finally the hierarchical clustering of the domain keywords.
The derivation relationship of a domain keyword refers to the semantically related concepts (keywords) into which a concept (keyword) can be expanded. For example, in the medical field the concept "cancer" can be expanded into related concepts such as lung cancer, stomach cancer, and leukaemia.
Domain keywords (concepts) give rise to semantic derivation relationships. The present invention organizes these relationships by semantic level and presents them coherently.
The present invention proposes a concept derivation tree model to organize and represent the concepts in a field and the semantic derivation relationships between them. Using the semantic information between concepts, it builds them into a hierarchical data model and thereby organizes the concepts of the field effectively.
The present invention can effectively improve the performance of knowledge acquisition. It provides technical support for personalized intelligent search of resources in a network environment, automatic recommendation, aided discovery and acquisition of innovative knowledge, and the aggregation and fusion of related knowledge.
Summary of the invention
The object of the present invention is to provide a keyword hierarchical clustering method based on a concept derivation tree that organizes domain keywords (concepts) together effectively and arranges them into levels according to their derivation relationships. The concept derivation tree model proposed here concerns semantic similarity only.
The idea of the invention is to organize the internal multi-layer semantic relationships of domain keywords (concepts) in a weighted, tree-shaped visual structure.
According to the above inventive concept, the present invention adopts the following technical solution:
A domain keyword hierarchical clustering method based on a concept derivation tree, characterized in that a number of domain keywords are extracted from multiple texts in the same field and arranged into a tree model according to their semantic relationships. The concrete steps are as follows:
1. Extract a number of domain keywords from multiple texts in the same field (see the published related patent "extracting method of the keyword of text", publication number CN101067808). Each keyword is a concept, and the set of domain keywords is the concept set.
2. Each concept in the concept set serves as a node. The root node is the name of the field to which the keywords belong, and the concept nodes are all nodes other than the root node.
3. When building the first layer of concept nodes, the nodes that have a direct derivation relationship with the root node become its child nodes, and the root node is their parent node. The several nodes whose derivation relationship with this parent node is closest are selected as the first-layer nodes; the number of nodes selected controls the depth of the whole derivation tree. The first-layer concept nodes are then deleted from the concept set to prevent redundant concept nodes.
4. When building the second layer of concept nodes, for each first-layer node select the several nodes whose derivation relationship with it is closest to form its subtree (the number of nodes should not be too large). Concepts within the same layer may repeat.
5. Delete from the concept set the nodes already present in the derivation tree, so that nodes do not recur at different levels of the tree. Repeat the previous step to build the third and further layers until the concept set is empty or no new node can be added from it. This yields a tree-shaped data model.
6. Calculate the weights between nodes according to the kind of concept derivation relationship between them. The relationship between two nodes falls into one of three cases:
(1) If the two nodes have the same parent node, the weight between the two concepts is calculated with the parallelogram law from mechanics (or another similar method).
(2) If a direct derivation relationship exists between the two nodes, the weight is read directly from the level of the derivation tree at which the nodes sit.
(3) If no direct derivation relationship exists between the two nodes, find their nearest associated node, calculate the weight of each node to this associated node, and then calculate the weight between the two concepts with the parallelogram law (or another similar method).
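The three cases can be told apart mechanically once the tree is stored as a child-to-parent map. The following is a minimal sketch, not the patent's implementation: the node names reuse the cancer example from the background section plus one invented grandchild, the function names are hypothetical, and the "nearest associated node" is assumed here to mean the nearest common ancestor.

```python
# Hypothetical child -> parent map; the root ("cancer") has no entry.
PARENT = {
    "lung cancer": "cancer",
    "stomach cancer": "cancer",
    "small-cell lung cancer": "lung cancer",  # invented grandchild, for illustration only
}

def path_to_root(node, parent):
    """Return [node, its parent, ..., root]."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def classify(a, b, parent):
    """Return (case, reference node) for the pair (a, b)."""
    pa, pb = parent.get(a), parent.get(b)
    if pa is not None and pa == pb:
        return ("same_parent", pa)      # case (1): shared parent
    if pa == b:
        return ("direct", b)            # case (2): b is the parent of a
    if pb == a:
        return ("direct", a)            # case (2): a is the parent of b
    # case (3): assumed to be the nearest common ancestor of a and b
    seen = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in seen:
            return ("indirect", node)
    raise ValueError("nodes are not in the same tree")
```

For instance, `classify("small-cell lung cancer", "stomach cancer", PARENT)` falls into case (3) with the root as the associated node.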
Compared with the prior art, the present invention has the following evident substantive features and notable advantages: the method obtains the semantic association between keywords automatically, calculates its strength, and represents the semantic relationships between domain keywords hierarchically, concisely, and efficiently, in a form that computers can readily process. The invention can effectively improve the performance of knowledge acquisition and provides technical support for personalized intelligent search of resources in a large-scale network environment, automatic recommendation, aided discovery and acquisition of innovative knowledge, and the aggregation and fusion of related knowledge.
Description of drawings
Fig. 1 is a concept derivation tree containing one layer of concepts.
Fig. 2 is the co-occurrence table of the first-layer concept derivation nodes.
Fig. 3 is part of a complete concept derivation tree.
Fig. 4 is a concept derivation tree with the derivation weights between nodes calculated.
Embodiment
A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.
The concrete steps of text keyword hierarchical clustering based on the concept derivation tree are as follows:
1. From the 112 papers of the National People's Congress 5000 set, 163 domain keywords can be extracted (see the published related patent "extracting method of the keyword of text", publication number CN101067808). The resulting concept set has size 163, that is, it contains 163 distinct concepts.
2. The root node can be chosen according to the field concerned. For the 112-paper training sample, the root node is set to "political economy" according to the field to which the papers belong. Building the tree this way makes the whole concept derivation tree more representative.
3. To build the first layer, the number of first-layer nodes must be determined first. It should not be too large, because it affects the depth of the whole tree. The purpose of building a concept derivation tree is to find the hierarchical derivation relationships between concepts, so the breadth of the tree should be limited as far as possible at the low layers (those nearest the root). In this example the number of first-layer nodes is taken as the fourth root of the total number of concepts in the concept set, rounded up, which keeps the number within a reasonable range. The number of first-layer tree nodes is therefore $\lceil \sqrt[4]{163} \rceil = 4$.
Once the number of first-layer nodes is determined, the concepts that occur in the largest number of the 112 papers (document frequency) are selected from the concept set, and the top 4 keywords are taken. Because these most frequent concepts are the most closely related to the field formed by the 112 papers, they are used as the first-layer tree nodes, as shown in Fig. 1.
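The first-layer selection can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy documents are hypothetical, each document is assumed to be already reduced to its set of keywords, and ties in document frequency are broken alphabetically, which the text does not specify.

```python
from collections import Counter

def first_layer_size(num_concepts):
    """Fourth root of the concept count, rounded up (integer arithmetic, no float error)."""
    k = 1
    while k ** 4 < num_concepts:
        k += 1
    return k

def first_layer(concepts, documents, k):
    """Pick the k concepts that appear in the most documents."""
    doc_freq = Counter()
    for doc in documents:              # each doc is a set of keywords
        for concept in concepts:
            if concept in doc:
                doc_freq[concept] += 1
    # Highest document frequency first; alphabetical tie-break is an assumption.
    ranked = sorted(concepts, key=lambda c: (-doc_freq[c], c))
    return ranked[:k]

# Toy data, invented for illustration.
docs = [{"market", "tax"}, {"market"}, {"trade", "market"}, {"tax"}]
top = first_layer(["market", "tax", "trade"], docs, 2)
```

With the 163 concepts of this embodiment, `first_layer_size(163)` gives 4, matching the four first-layer nodes described above.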
4. To build the concept nodes of the second and further layers, a co-occurrence table of the first-layer concept derivation nodes is built first, as shown in Fig. 2. The table gives the number of times each first-layer node and each other concept appear in the same paper; as an example, only the 15 concepts that co-occur most often with the first-layer nodes are shown. The next issue is the number of nodes in each subtree. As with the root node, a subtree should not have too many nodes. The number of nodes is decided by a formula that, as described, takes the top 1/η of the concepts ranked by co-occurrence count, where η can be chosen according to actual needs; in this example η = 15. The number of child nodes of each node is thereby kept within a suitable range. Nodes in the same level of the concept derivation tree may repeat.
5. After the nodes of one level have been built, the nodes (concepts) of that level are removed from the concept set, which avoids redundant relationships between concepts at different levels of the concept derivation tree. The other levels of the tree are then built one by one with a recursive algorithm, until all concepts in the concept set are covered or no new concept node can be added. The complete tree is then finished; Fig. 3 shows part of such a concept derivation tree.
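Steps 4 and 5 can be sketched as a layer-by-layer loop. This is a rough sketch under stated assumptions, not the patented procedure: "closest derivation relationship" is approximated by co-occurrence count, the "top 1/η" rule is read as keeping `ceil(len(candidates) / eta)` candidates per parent, an iterative loop stands in for the recursive algorithm mentioned in the text, and all names and the toy data are hypothetical. Concepts are removed from the concept set only after a whole layer is built, so the same concept may appear under two parents of one layer.

```python
import math
from collections import defaultdict
from itertools import combinations

def cooccurrence(documents):
    """Count how often each unordered keyword pair shares a document."""
    co = defaultdict(int)
    for doc in documents:
        for a, b in combinations(sorted(doc), 2):
            co[(a, b)] += 1
    return co

def co_count(co, a, b):
    return co.get((a, b) if a < b else (b, a), 0)

def build_tree(root, first_layer, concepts, documents, eta, max_depth=7):
    """Return a parent -> children map built one layer at a time."""
    co = cooccurrence(documents)
    remaining = set(concepts) - set(first_layer)
    children = {root: list(first_layer)}
    layer, depth = list(first_layer), 1
    while remaining and layer and depth < max_depth:
        next_layer = []
        for parent in layer:
            candidates = sorted(
                (c for c in remaining if co_count(co, parent, c) > 0),
                key=lambda c: (-co_count(co, parent, c), c),
            )
            keep = math.ceil(len(candidates) / eta)   # assumed reading of "top 1/eta"
            children[parent] = candidates[:keep]
            next_layer.extend(children[parent])
        if not next_layer:
            break                                     # no new node can be added
        remaining -= set(next_layer)                  # delete the finished layer
        layer = list(dict.fromkeys(next_layer))       # expand each concept once
        depth += 1
    return children

# Toy data, invented for illustration.
docs = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
tree = build_tree("R", ["a"], ["a", "b", "c", "d"], docs, eta=2)
```

In the toy run, concept `c` is never attached: it co-occurs only with `a`, where it loses to `b`, which illustrates the "no new node can be added" stopping condition.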
6. The weight between two concepts can now be calculated, starting from the top-layer concepts of the finished concept derivation tree.
In building the concept derivation tree, the number of layers is generally kept within 7. To make the weights of different layers clearly distinct, the weight $\omega_i$ of the path between a top-layer concept node and its parent node is set to 0.7. For each layer further down, the weight of the path between a concept node and its parent node decreases by 0.1, that is, the level weight difference $\Delta_i$ is 0.1. The derivation relationship between nodes thus weakens steadily as the level increases. $\omega_i$ and $\Delta_i$ can be set according to the needs of different fields. Following this rule, the complete domain concept derivation tree can be constructed, as shown in Fig. 4. The relationship between two nodes falls into one of three cases:
(1) Node $C_5$ and node $C_{11}$ have the same parent node. The parallelogram law from mechanics can be applied by treating the two concepts as two vectors in different directions; the result is the weight of the combination of the two vectors. The relationship between them can therefore be calculated as follows: $\omega_5 = 0.7$ and $\omega_{11} = 0.7$, so $\omega_{5-11} = 0.7$. That is, a concept derivation association exists between $C_5$ and $C_{11}$, and its derivation weight is 0.7.
(2) A direct derivation relationship exists between node $C_{12}$ and node $C_{11}$, so the weight $\omega_{12-11} = 0.6$ can be read directly from the tree.
(3) For node $C_{12}$ and node $C_5$, the nearest associated node is the root node ($Node_0$). The weight of each node to the root is obtained by multiplying the weights of the layers along its path. Then, as in the first case, the parallelogram law gives the weight $\omega_{12-5}$ between the two concepts.
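The layer weights and the path products used in cases (2) and (3) can be checked with a short sketch. It assumes the small tree implied by the weights quoted above ($C_5$ and $C_{11}$ in the first layer under $Node_0$, $C_{12}$ a child of $C_{11}$); the function names are hypothetical. The final parallelogram-law combination is deliberately left out, because the text does not give its formula.

```python
# Tree fragment inferred from the weights quoted in the text (an assumption).
PARENT = {"C5": "Node0", "C11": "Node0", "C12": "C11"}

def depth(node, parent):
    """Layer of a node: 1 for children of the root, 2 for their children, ..."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

def edge_weight(layer, top=0.7, delta=0.1):
    """Weight of the edge from a node in `layer` to its parent: 0.7, 0.6, 0.5, ..."""
    return round(top - delta * (layer - 1), 10)

def weight_to_ancestor(node, ancestor, parent):
    """Multiply the edge weights along the path from node up to ancestor."""
    w = 1.0
    while node != ancestor:
        w *= edge_weight(depth(node, parent))
        node = parent[node]
    return w

w_direct = weight_to_ancestor("C12", "C11", PARENT)   # case (2), one edge
w_c12 = weight_to_ancestor("C12", "Node0", PARENT)    # case (3), first operand
w_c5 = weight_to_ancestor("C5", "Node0", PARENT)      # case (3), second operand
```

Here `w_direct` reproduces the 0.6 of case (2), and the two case (3) operands come out as about 0.42 (0.6 × 0.7) and 0.7; these are the values that would then be combined by the parallelogram law.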
Claims (2)
1. A keyword hierarchical clustering method based on a concept derivation tree, characterized in that a number of domain keywords are extracted from multiple texts in the same field and arranged into a tree model according to their semantic relationships, the concrete steps being as follows:
a) extracting a number of domain keywords from multiple texts in the same field, each keyword being a concept and the set of domain keywords being the concept set;
b) taking each concept in the concept set as a node, the root node being the name of the field to which the keywords belong and the concept nodes being all nodes other than the root node;
c) when building the first layer of concept nodes, taking the nodes that have a direct derivation relationship with the root node as its child nodes, the root node being their parent node, and selecting the several nodes whose derivation relationship with this parent node is closest as the first-layer nodes, the number of nodes selected controlling the depth of the whole derivation tree; then deleting the first-layer concept nodes from the concept set to prevent redundant concept nodes;
d) when building the second layer of concept nodes, selecting for each first-layer node the several nodes whose derivation relationship with it is closest to form its subtree, concepts within the same layer being allowed to repeat;
e) deleting from the concept set the nodes already present in the derivation tree, so that nodes do not recur at different levels of the tree, and repeating the previous step to build the third and further layers until the concept set is empty or no new node can be added from it, thereby constructing a tree-shaped data model;
f) calculating the weights between nodes according to the kind of concept derivation relationship between them.
2. The keyword hierarchical clustering method based on a concept derivation tree according to claim 1, characterized in that, in step (f), the weights between nodes are calculated according to the kind of concept derivation relationship between them, the relationship between two nodes falling into one of three cases:
a) if the two nodes have the same parent node, the weight between the two concepts is calculated with the parallelogram law from mechanics;
b) if a direct derivation relationship exists between the two nodes, the weight is read directly from the level of the derivation tree at which the nodes sit;
c) if no direct derivation relationship exists between the two nodes, their nearest associated node is found, the weight of each node to this associated node is calculated, and the weight between the two concepts is then calculated with the parallelogram law.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008100377271A CN101281530A (en) | 2008-05-20 | 2008-05-20 | Key word hierarchy clustering method based on conception deriving tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101281530A true CN101281530A (en) | 2008-10-08 |
Family
ID=40014005
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008100377271A Pending CN101281530A (en) | 2008-05-20 | 2008-05-20 | Key word hierarchy clustering method based on conception deriving tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101281530A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Open date: 20081008 |