CN103412878A - Document theme partitioning method based on domain knowledge map community structure - Google Patents

Info

Publication number
CN103412878A
Authority
CN
China
Prior art keywords
community
document
knowledge
node
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013102990478A
Other languages
Chinese (zh)
Other versions
CN103412878B (en
Inventor
郑庆华
董博
刘均
徐海鹏
李冰
贺欢
马天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhirongjie Intellectual Property Service Co ltd
Taiyuan University of Technology
Original Assignee
Xian Jiaotong University
Application filed by Xian Jiaotong University
Priority to CN201310299047.8A
Publication of CN103412878A
Application granted
Publication of CN103412878B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document theme partitioning method based on the community structure of a domain knowledge map. It mainly addresses the partitioning of document resources related to a subject or to domain knowledge, so that thematically related documents can be stored in logically adjacent locations and learning efficiency is improved. The method proposes a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm and uses it to build a theme structure tree. During feature extraction, knowledge units serve directly as feature items; because knowledge units are semantically complete, they reflect the thematic character of the feature vector better than traditional methods based on word segmentation. To compute feature vector values, the method combines degree centrality with the frequency of each knowledge unit in the document, where degree centrality reflects the standing of a knowledge unit in the knowledge map as a whole. The method improves the accuracy of document theme partitioning and is suited to document theme partitioning based on domain knowledge map community structure in general scenarios.

Description

Document theme partitioning method based on domain knowledge map community structure
Technical field
The present invention relates to partitioning documents by theme on the basis of the community structure of a domain knowledge map. It mainly addresses the partitioning of document resources related to a subject or to domain knowledge, so that thematically related documents are stored in logically adjacent locations, improving storage and access efficiency.
Background art
As online course platforms expand, the volume of documents for each subject keeps growing. If documents with similar themes are stored in logically adjacent locations, then when a learner studies one resource, other resources related to its theme can be prefetched, which reduces the time spent reading files and improves storage and access efficiency.
For document theme partitioning, the following three patent documents provide different technical schemes:
1. Text classification feature selection and weight calculation method based on domain knowledge (CN101290626)
2. Text classification method based on a modified k-nearest-neighbor algorithm (CN102033949A)
3. A new method and device for feature vector weights in text classification (CN1719436A)
The method of document 1 comprises: (1) collecting domain texts and non-domain texts as training and test corpora; (2) preprocessing the texts, including word segmentation and counting term frequency and document frequency; (3) choosing the classification feature space and computing feature weights with an improved TF-IDF method; (4) on the basis of step (3), expanding the feature space with domain terms; (5) choosing the classification feature space and using the improved TF-IDF algorithm to compute and adjust feature weights; (6) training a text classifier with the SVM machine learning method, building a domain text classification model, and validating it experimentally on domain texts.
The method of document 2 comprises: (1) text preprocessing: each document in the training set is first segmented into words, stop words are removed, and the text is represented as a set of feature items; (2) text feature selection: the dimensionality of the text vector is then reduced, and an evaluation function is constructed to score feature words, selecting as few document features as possible that are closely related to the document's theme; (3) text classification: finally, a classifier is built with the modified k-nearest-neighbor text classification algorithm and used to obtain the classification results.
The method of document 3 comprises: (1) collecting training and test corpora by domain; (2) removing "junk" from web page text, followed by word segmentation and part-of-speech tagging; (3) extracting the vocabulary of each domain from the training corpus, and extracting the overall vocabulary; (4) building, from the overall vocabulary and the domain vocabularies, classification vocabularies with different numbers of keywords; (5) classifying the test texts with the TF-IWF-DBV algorithm and tuning it to obtain the optimal threshold; (6) determining the optimal number of keywords from the classification results. Because the TF-IDF and TF-IWF methods both rely too heavily on term frequency and cannot express how unevenly vector elements are distributed across categories, document 3 proposes a new weight calculation method (TF-IWF-DBV), which introduces DBV and the n-th root of TF into the TF-IWF method to make up for these deficiencies.
The methods above concentrate on optimizing feature extraction for text classification, but they still select terms as feature items through traditional word segmentation and do not fully account for the thematic character of the feature items, which leaves classification accuracy unsatisfactory.
Summary of the invention
To solve the theme partitioning problem for the documents of each subject in existing large-scale online courses, the present invention provides a partitioning method that combines domain knowledge map community structure with document theme partitioning, so as to group documents with similar themes.
To achieve the above purpose, the present invention adopts the following technical scheme:
A document theme partitioning method based on domain knowledge map community structure, characterized in that it comprises the following steps:
One. Building the community structure tree of the domain knowledge map:
(1) Domain knowledge map preprocessing: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to CAQ, the queue of community nodes to be analyzed. A community node is formalized as follows:
$CNode(V_C, Children, Parent)$ (1)
where $V_C$ is the set of knowledge units contained in the community node, Children is the set of child nodes of the community node, and Parent is the parent node of the community node;
(2) Hierarchical community partitioning of the domain knowledge map: take the head node CH from CAQ, partition the domain knowledge map (or subgraph) corresponding to CH with the Fast Greedy algorithm and with the GN algorithm separately, and introduce a modularity threshold. If the modularity values of the partitions produced by the two algorithms are both below the threshold, the partition is invalid and step (3) is executed. Otherwise, compare the modularity values of the two partitions, choose the partition with the larger modularity value, create a community node for each community in it as a child community node of CH, and add these nodes to CAQ;
(3) Repeat step (2) for every node in CAQ until CAQ is empty, which yields the community structure tree C-Tree of the domain knowledge map, formalized as follows:
$C\text{-}Tree(CNodeSet, croot, n)$ (2)
where CNodeSet is the set of community nodes of the community structure tree, croot is its root community node, and n is the number of community nodes, that is, the number of communities present in the network;
Two. Identify the theme of each community in the community structure tree obtained in step one and build the domain theme structure tree, thereby mapping the community structure to a theme structure;
Three. Extracting document feature vectors:
(1) Constructing the feature space: take all knowledge units in the domain knowledge map as feature items, forming a multi-dimensional feature space;
(2) Document preprocessing: convert each document to plain text and extract its text segments; use the TF-IDF algorithm based on the vector space model to match each text segment of the document against the text segment content associated with a knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ, the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;
(3) Use formula (3) to compute the degree centrality, within the domain knowledge map, of the knowledge units in the feature space and, combined with the frequency with which each knowledge unit occurs in the document, abstract the document into the following form:
$X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where n is the dimension of the feature vector and $W_i$ is the weight of the i-th feature item, formalized as follows:
$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d)$ (7)
where $kuf(ku_i, d)$ is the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ is the degree centrality of knowledge unit $ku_i$;
Four. Building the document theme partitioning model:
(1) Constructing the training dataset: for each document in the given training dataset D, extract its feature vector with the method of step three and, using the community structure tree C-Tree from step one and the domain theme structure tree T-Tree from step two, abstract the training dataset into the following form:
$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_j, Y_j), \ldots, (X_m, Y_m)\}$ (8)
where $X_j$ (j = 1, 2, ..., m) is the feature vector of the j-th document and $Y_j$ (j = 1, 2, ..., m) is the theme label set of the j-th document, formalized as follows:
$Y_j = \{L_1, L_2, \ldots, L_i, \ldots, L_k\}$ (9)
where m is the number of documents in the training set and k is the number of community themes;
(2) Training: select the BR-SVM algorithm and, using cross-validation on the training document set D, train the document theme partitioning model M;
Five. Document theme partitioning: for a document to be partitioned, extract the knowledge units it contains, obtain its feature vector representation with the method of step three, and apply the document theme partitioning model obtained in step four to partition the document by theme.
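The weighting in formula (7) can be illustrated with a minimal Python sketch. The toy knowledge map, the function names and the choice to order dimensions alphabetically are assumptions made for this example; they are not specified in the source.

```python
from collections import Counter


def degree_centrality(adjacency):
    """Formula (3): each unit's degree divided by the sum of all degrees."""
    total = sum(len(neighbours) for neighbours in adjacency.values())
    return {ku: (len(neighbours) / total if total else 0.0)
            for ku, neighbours in adjacency.items()}


def feature_vector(adjacency, units_in_doc):
    """Formula (7): W_i = C_deg(ku_i) * kuf(ku_i, d), one dimension per unit."""
    c_deg = degree_centrality(adjacency)
    kuf = Counter(units_in_doc)            # occurrences of each unit in document d
    # Dimension order is alphabetical here purely for reproducibility (assumption).
    return [c_deg[ku] * kuf[ku] for ku in sorted(adjacency)]


# Hypothetical knowledge map: "stack" is linked to the two other units.
toy_map = {"stack": {"queue", "array"}, "queue": {"stack"}, "array": {"stack"}}
vector = feature_vector(toy_map, ["stack", "stack", "queue"])
```

In this toy case the dimensions are ordered array, queue, stack. A unit that never occurs in the document gets weight 0 regardless of its centrality.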
In the above method, the specific steps for building the domain theme structure tree are:
(1) Community center node analysis: for each community node in C-Tree, compute the degree centrality of each knowledge unit it contains within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with the higher degree centrality as the community center node set CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as follows:
$C_{deg}(ku_i) = \dfrac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU$ (3)
where $\deg(ku_i)$ is the degree of knowledge unit $ku_i$ within the community, and KU is the set of knowledge units contained in the domain knowledge map or its subgraph;
(2) For the knowledge units in CCNS, look up the domain knowledge map library to obtain the core terms contained in CCNS; combining the degree centrality of each knowledge unit with the frequency with which a core term occurs in the knowledge units of CCNS, compute the centrality weight $W_{Central}$ of each core term, formalized as follows:
$W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku)$ (4)
where $C(ku)$ is the centrality of knowledge unit ku in CCNS and $\delta(term, ku)$ is the frequency with which term occurs in ku; the core term with the largest centrality weight is selected as the theme of the community;
(3) Carry out step (2) for each community node of C-Tree to build the domain theme structure tree T-Tree, thereby mapping the community structure to a theme structure. T-Tree is formalized as follows:
$T\text{-}Tree(CTopicSet, troot, n)$ (5)
where CTopicSet is the set of community theme nodes, troot is the root node of the theme structure tree, and n is the number of themes. A community theme node is formalized as follows:
$CTopic(Y_C, SubTopics, PTopic)$ (6)
where $Y_C$ is the community theme label, SubTopics is the set of child nodes of the theme node, and PTopic is the parent node of the theme node.
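Formulas (3) and (4) can be sketched in Python as follows. The source does not say how many high-centrality units make up CCNS, so the `top_n` parameter is an assumption. The `core_terms` mapping (knowledge unit to term frequencies) is a hypothetical stand-in for a lookup in the domain knowledge map library.

```python
def community_theme(adjacency, core_terms, top_n=2):
    """Pick a community theme from its subgraph, following formulas (3) and (4)."""
    # Formula (3): degree centrality within the community subgraph.
    total = sum(len(neighbours) for neighbours in adjacency.values())
    centrality = {ku: len(neighbours) / total for ku, neighbours in adjacency.items()}
    # CCNS: the top_n units by degree centrality (top_n is an assumption).
    ccns = sorted(centrality, key=centrality.get, reverse=True)[:top_n]
    # Formula (4): sum over CCNS of C(ku) * delta(term, ku).
    weights = {}
    for ku in ccns:
        for term, freq in core_terms.get(ku, {}).items():
            weights[term] = weights.get(term, 0.0) + centrality[ku] * freq
    theme = max(weights, key=weights.get)
    return theme, weights


# Hypothetical community of four knowledge units.
community = {
    "binary tree": {"tree traversal", "heap", "bst"},
    "tree traversal": {"binary tree", "bst"},
    "heap": {"binary tree"},
    "bst": {"binary tree", "tree traversal"},
}
# Hypothetical core-term frequencies per knowledge unit.
terms = {
    "binary tree": {"tree": 2, "node": 1},
    "tree traversal": {"tree": 1, "traversal": 2},
}
theme, weights = community_theme(community, terms)
```

Here "tree" accumulates weight from both center units and is chosen as the theme of the community.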
Compared with the prior art, the method of the present invention has the following advantages. In building the theme structure tree, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed for building the community structure tree. In feature extraction, knowledge units are used directly as feature items; because knowledge units are semantically complete, they reflect the thematic character of the feature vector better than traditional methods based on word segmentation. In computing feature vector values, degree centrality is combined with the frequency of each knowledge unit in the document, where degree centrality reflects the standing of a knowledge unit in the knowledge map as a whole. With these improvements, the accuracy of document theme partitioning is improved relative to traditional methods.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of document theme partitioning based on knowledge map community structure according to the present invention.
Fig. 2 is a flowchart of the domain knowledge map theme system construction in Fig. 1.
Fig. 3 is a flowchart of the feature vector extraction in Fig. 1.
Detailed description of the embodiments
A domain knowledge map is a complex network describing the knowledge within a given domain (a course or subject) and the associations among that knowledge. A knowledge unit is a basic knowledge fragment in the knowledge map that can express a complete meaning. The domain knowledge map library is the database that stores the knowledge units of a domain; it records the details of each knowledge unit, such as its name, its corresponding text segment, the core terms it contains, and the relations between knowledge units. The knowledge map of a subject is usually built from the document resources of that subject and is expressed as a network of knowledge units and their associations. After a complex-network community discovery algorithm partitions the domain knowledge map into a community structure, each community has a relatively independent theme. The community structure of knowledge units can therefore serve as the basis for document theme partitioning.
As shown in Fig. 1, document theme partitioning based on knowledge map community structure is implemented in two parts: building the document theme partitioning model, and partitioning the documents to be partitioned by theme.
Building the document theme partitioning model takes three steps:
1. Domain knowledge map theme system construction. First, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed and used to partition the domain knowledge map into communities, yielding the community structure tree of the domain knowledge map. Fast Greedy is an agglomerative community discovery algorithm proposed by Newman et al.: initially each node is its own community; the modularity increment of merging any two communities in the network is then computed, and the two communities with the largest increment are merged; this process repeats until modularity no longer increases. GN is a divisive community discovery algorithm proposed by Girvan and Newman: during execution it repeatedly computes the edge betweenness of the edges in the network and each time removes the edge with the largest betweenness, until modularity no longer increases. Each node of the community structure tree represents one community of the domain knowledge map, and the knowledge units of the same community are thematically consistent. Second, the theme of each community is determined by analyzing the community center nodes (the important nodes in a community, characterized by the degree centrality of knowledge units), and the domain theme structure tree is built, mapping the community structure to a theme structure;
2. Constructing the feature space and computing the feature vector value of each dimension: take all knowledge units of the domain knowledge map as feature items to construct the feature space; extract the knowledge units contained in each document and, combined with the degree centrality of the knowledge units, compute the feature vector value of each dimension;
3. Constructing the training dataset and training the theme partitioning model: construct the training dataset, select the multi-label classification algorithm BR-SVM, and train on the training dataset to obtain the document theme partitioning model.
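The agglomerative idea behind Fast Greedy (merge the pair of communities with the largest modularity gain until no merge helps) can be illustrated with a deliberately naive Python sketch. It recomputes modularity from scratch for every candidate merge, so it is not the optimized Fast Greedy algorithm itself, and GN is not shown. The toy graph and function names are assumptions for illustration.

```python
from itertools import combinations


def modularity(graph, communities):
    """Newman modularity Q of a partition of an undirected simple graph."""
    m = sum(len(neighbours) for neighbours in graph.values()) / 2   # edge count
    q = 0.0
    for community in communities:
        # Every internal edge is seen from both endpoints, hence the division by 2.
        internal = sum(1 for u in community for v in graph[u] if v in community) / 2
        degree_sum = sum(len(graph[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


def greedy_merge(graph):
    """Naive agglomeration: start from singletons, merge the best pair each round."""
    communities = [{u} for u in sorted(graph)]
    current = modularity(graph, communities)
    while len(communities) > 1:
        best_q, best_pair = current, None
        for i, j in combinations(range(len(communities)), 2):
            rest = [c for k, c in enumerate(communities) if k not in (i, j)]
            q = modularity(graph, rest + [communities[i] | communities[j]])
            if q > best_q:
                best_q, best_pair = q, (i, j)
        if best_pair is None:          # no merge increases modularity: stop
            break
        i, j = best_pair
        merged = communities[i] | communities[j]
        communities = [c for k, c in enumerate(communities) if k not in (i, j)]
        communities.append(merged)
        current = best_q
    return communities, current


# Toy graph: two triangles joined by the single edge c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
found, q = greedy_merge(toy)
```

On this toy graph the sketch recovers the two triangles with modularity 5/14 (about 0.357), just above the 0.35 default threshold that the embodiment uses for accepting a partition.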
The specific steps for partitioning a document by theme are as follows:
1. Document feature vector representation: for a document d to be partitioned, extract its knowledge units with the method described in step 2 of the model-building part, obtaining the feature vector $X_d$ of the document to be partitioned;
2. Document theme partitioning: feed the feature vector $X_d$ into the domain document theme partitioning model M; the output of the model is the theme label set $Y_d$ of document d; the theme partition of document d is then derived from the correspondence between $Y_d$ and the domain theme structure tree T-Tree.
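One way to read "the correspondence between the label set and T-Tree" is to look up each predicted label in the tree and report its path from the root. The Python sketch below follows the CTopic form of formula (6); the helper names, the sample themes and the assumption that labels are unique within the tree are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class CTopic:
    label: str                                        # Y_C: community theme label
    sub_topics: list = field(default_factory=list)    # SubTopics
    parent: "CTopic | None" = None                    # PTopic

    def add(self, label):
        """Create a child theme node and link it in both directions."""
        child = CTopic(label, parent=self)
        self.sub_topics.append(child)
        return child


def index_by_label(root):
    """Map each theme label to its node (assumes labels are unique)."""
    index, stack = {}, [root]
    while stack:
        node = stack.pop()
        index[node.label] = node
        stack.extend(node.sub_topics)
    return index


def theme_path(node):
    """Labels from the root of T-Tree down to the given theme node."""
    path = []
    while node is not None:
        path.append(node.label)
        node = node.parent
    return path[::-1]


# Hypothetical theme tree for a data structures course.
troot = CTopic("data structures")
tree = troot.add("tree")
tree.add("binary search tree")
troot.add("graph")

index = index_by_label(troot)
predicted = {"binary search tree", "graph"}           # stand-in for the model output Y_d
paths = sorted(theme_path(index[label]) for label in predicted)
```

Storing documents under such paths would keep thematically related documents close together, which is the stated goal of the method.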
As shown in Fig. 2, the domain knowledge map theme system construction process is implemented in the following steps:
(1) Domain knowledge map preprocessing: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to CAQ, the queue of community nodes to be analyzed. A community node is formalized as follows:
$CNode(V_C, Children, Parent)$ (1)
where $V_C$ is the set of knowledge units contained in the community node, Children is the set of child nodes of the community node, and Parent is the parent node of the community node;
(2) Hierarchical community partitioning of the domain knowledge map: take the head node CH from CAQ, partition the domain knowledge map (or subgraph) corresponding to CH with the Fast Greedy algorithm and with the GN algorithm separately, and introduce a modularity threshold (default value 0.35). If the modularity values of the partitions produced by the two algorithms are both below 0.35, the partition is invalid and step (3) is executed. Otherwise, compare the modularity values of the two partitions, choose the partition with the larger modularity value, create a community node for each community in it as a child community node of CH, and add these nodes to CAQ;
(3) Repeat step (2) for every node in CAQ until CAQ is empty, which yields the community structure tree C-Tree of the domain knowledge map, formalized as follows:
$C\text{-}Tree(CNodeSet, croot, n)$ (2)
where CNodeSet is the set of community nodes of the community structure tree, croot is its root community node, and n is the number of community nodes, that is, the number of communities present in the network;
(4) Community center node analysis: for each community node in C-Tree, compute the degree centrality of each knowledge unit it contains within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with the higher degree centrality as the community center node set CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as follows:
$C_{deg}(ku_i) = \dfrac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU$ (3)
where $\deg(ku_i)$ is the degree of knowledge unit $ku_i$ within the community, and KU is the set of knowledge units contained in the domain knowledge map or its subgraph;
(5) For the knowledge units in CCNS, look up the domain knowledge map library to obtain the core terms contained in CCNS; combining the degree centrality of each knowledge unit with the frequency with which a core term occurs in the knowledge units of CCNS, compute the centrality weight $W_{Central}$ of each core term, formalized as follows:
$W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku)$ (4)
where $C(ku)$ is the centrality of knowledge unit ku in CCNS and $\delta(term, ku)$ is the frequency with which term occurs in ku. The core term with the largest centrality weight is selected as the theme of the community;
(6) Carry out step (5) for each community node of C-Tree to build the domain theme structure tree T-Tree, thereby mapping the community structure to a theme structure. T-Tree is formalized as follows:
$T\text{-}Tree(CTopicSet, troot, n)$ (5)
where CTopicSet is the set of community theme nodes, troot is the root node of the theme structure tree, and n is the number of themes. A community theme node is formalized as follows:
$CTopic(Y_C, SubTopics, PTopic)$ (6)
where $Y_C$ is the community theme label, SubTopics is the set of child nodes of the theme node, and PTopic is the parent node of the theme node.
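The queue-driven construction of C-Tree in steps (1) to (3) can be sketched in Python as below. The two partitioners are passed in as plain functions. In the example they are trivial stand-ins (`halves` and `singletons`, both hypothetical), not Fast Greedy or GN, which would be supplied by a graph library or a separate implementation. The 0.35 default is the threshold given in the embodiment.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class CNode:
    units: frozenset                                  # V_C
    children: list = field(default_factory=list)      # Children
    parent: "CNode | None" = None                     # Parent


def modularity(graph, communities):
    """Newman modularity Q of a partition of an undirected simple graph."""
    m = sum(len(neighbours) for neighbours in graph.values()) / 2
    q = 0.0
    for community in communities:
        internal = sum(1 for u in community for v in graph[u] if v in community) / 2
        degree_sum = sum(len(graph[u]) for u in community)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q


def build_c_tree(adjacency, partitioners, threshold=0.35):
    """Return (croot, n): the root community node and the number of nodes."""
    croot = CNode(frozenset(adjacency))
    caq = deque([croot])                              # queue of nodes to analyze
    n = 1
    while caq:
        ch = caq.popleft()                            # head node CH
        graph = {u: adjacency[u] & ch.units for u in ch.units}   # induced subgraph
        if not any(graph.values()):                   # no edges: nothing to split
            continue
        best = max((p(graph) for p in partitioners),
                   key=lambda c: modularity(graph, c))
        if modularity(graph, best) < threshold:       # both partitions invalid
            continue
        for community in best:
            child = CNode(frozenset(community), parent=ch)
            ch.children.append(child)
            caq.append(child)
            n += 1
    return croot, n


def halves(graph):                                    # toy stand-in partitioner
    nodes = sorted(graph)
    return [set(nodes[:len(nodes) // 2]), set(nodes[len(nodes) // 2:])]


def singletons(graph):                                # toy stand-in partitioner
    return [{u} for u in graph]


# Toy knowledge map: two triangles joined by the single edge c-d.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
croot, n = build_c_tree(toy, [halves, singletons])
```

On the toy map the root is split once into the two triangles (modularity about 0.357). Neither triangle can be split with modularity above the threshold, so both become leaves and the tree has three nodes.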
As shown in Fig. 3, constructing the feature space and computing the feature vector value of each dimension is implemented in the following steps:
(1) Constructing the feature space: take all knowledge units in the domain knowledge map as feature items, forming a multi-dimensional feature space (each knowledge unit is one dimension);
(2) Document preprocessing: convert each document to plain text (a txt file) and extract its text segments; use the TF-IDF algorithm based on the vector space model (which uses TF-IDF to represent a text as a feature vector with terms as feature items, and expresses the similarity between documents as the cosine of the angle between their vectors) to match each text segment of the document against the text segment content associated with a knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ (default value 0.8), the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;
(3) Compute the degree centrality, within the domain knowledge map, of the knowledge units in the feature space (see formula (3)) and, combined with the frequency with which each knowledge unit occurs in the document, abstract the document into the following form: $X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where n is the dimension of the feature vector and $W_i$ is the weight of the i-th feature item, formalized as follows:
$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d)$ (7)
where $kuf(ku_i, d)$ is the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ is the degree centrality of knowledge unit $ku_i$.
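The matching in step (2) can be sketched in plain Python. The source does not specify the exact TF-IDF variant or the tokenizer, so the smoothed IDF formula and whitespace tokenization below are assumptions, as are the function names and sample texts. Chinese text would need a word segmenter instead of `split()`.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """One sparse TF-IDF vector (dict) per text; smoothed IDF is an assumption."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf})
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def units_in_document(doc_segments, unit_segments, mu=0.8):
    """Return one entry per (segment, unit) pair whose similarity reaches mu."""
    names = list(unit_segments)
    corpus = [s.split() for s in doc_segments] + [unit_segments[k].split() for k in names]
    vectors = tfidf_vectors(corpus)
    doc_vecs, unit_vecs = vectors[:len(doc_segments)], vectors[len(doc_segments):]
    found = []
    for dv in doc_vecs:
        for name, uv in zip(names, unit_vecs):
            if cosine(dv, uv) >= mu:
                found.append(name)
    return found


# Hypothetical knowledge map library entries and document segments.
library = {"stack": "push pop last in first out",
           "sorting": "quick sort merge sort heap sort"}
segments = ["push pop last in first out", "unrelated text about networks"]
matched = units_in_document(segments, library)
```

Counting the entries of `matched` per unit (for instance with `collections.Counter`) gives the frequency kuf used in formula (7).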
Constructing the training dataset and training the theme partitioning model comprises the following steps:
(1) Constructing the training dataset: for each document in the given training dataset D, extract its feature vector with the feature vector extraction method described above and, using the community structure tree C-Tree and the domain theme structure tree T-Tree, abstract the training dataset into the following form:
$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_j, Y_j), \ldots, (X_m, Y_m)\}$ (8)
where $X_j$ (j = 1, 2, ..., m) is the feature vector of the j-th document and $Y_j$ (j = 1, 2, ..., m) is the theme label set of the j-th document, formalized as follows:
$Y_j = \{L_1, L_2, \ldots, L_i, \ldots, L_k\}$ (9)
where m is the number of documents in the training set and k is the number of community themes;
(2) Training: select the BR-SVM algorithm (the BR-SVM method uses a one-vs-rest strategy to convert the multi-label problem into several binary classification problems, and trains each of them with SVM, a mature method for binary classification) and, using cross-validation on the training document set D, train the document theme partitioning model M.
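The binary relevance (BR) decomposition can be sketched in Python as follows. To stay dependency-free, the base learner in the example is a toy nearest-centroid classifier that stands in for the SVM; it is not an SVM, and cross-validation is omitted. All class and function names here are assumptions for illustration.

```python
def binary_relevance_targets(label_sets, labels):
    """One 0/1 target list per label: the one-vs-rest decomposition."""
    return {label: [1 if label in ys else 0 for ys in label_sets] for label in labels}


class NearestCentroid:
    """Toy binary base learner used as a stand-in for the SVM (NOT an SVM)."""

    def fit(self, X, y):
        self.centroids = {}
        for cls in (0, 1):
            rows = [x for x, target in zip(X, y) if target == cls]
            if rows:
                self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda cls: squared_distance(self.centroids[cls]))


class BinaryRelevance:
    """Train one independent binary classifier per theme label."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier
        self.models = {}

    def fit(self, X, label_sets, labels):
        targets = binary_relevance_targets(label_sets, labels)
        for label in labels:
            self.models[label] = self.make_classifier().fit(X, targets[label])
        return self

    def predict(self, x):
        return {label for label, model in self.models.items() if model.predict(x) == 1}


# Hypothetical training set D: feature vectors X_j with theme label sets Y_j.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [{"trees"}, {"graphs"}, {"trees", "graphs"}]
model = BinaryRelevance(NearestCentroid).fit(X, Y, ["trees", "graphs"])
```

Because each label is decided independently, a document can receive several theme labels at once, or none.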

Claims (2)

1. A document theme partitioning method based on domain knowledge map community structure, characterized in that it comprises the following steps:
One. Building the community structure tree of the domain knowledge map:
(1) Domain knowledge map preprocessing: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to CAQ, the queue of community nodes to be analyzed. A community node is formalized as follows:
$CNode(V_C, Children, Parent)$ (1)
where $V_C$ is the set of knowledge units contained in the community node, Children is the set of child nodes of the community node, and Parent is the parent node of the community node;
(2) Hierarchical community partitioning of the domain knowledge map: take the head node CH from CAQ, partition the domain knowledge map (or subgraph) corresponding to CH with the Fast Greedy algorithm and with the GN algorithm separately, and introduce a modularity threshold. If the modularity values of the partitions produced by the two algorithms are both below the threshold, the partition is invalid and step (3) is executed. Otherwise, compare the modularity values of the two partitions, choose the partition with the larger modularity value, create a community node for each community in it as a child community node of CH, and add these nodes to CAQ;
(3) Repeat step (2) for every node in CAQ until CAQ is empty, which yields the community structure tree C-Tree of the domain knowledge map, formalized as follows:
$C\text{-}Tree(CNodeSet, croot, n)$ (2)
where CNodeSet is the set of community nodes of the community structure tree, croot is its root community node, and n is the number of community nodes, that is, the number of communities present in the network;
Two. Identify the theme of each community in the community structure tree obtained in step one and build the domain theme structure tree, thereby mapping the community structure to a theme structure;
Three. Extracting document feature vectors:
(1) Constructing the feature space: take all knowledge units in the domain knowledge map as feature items, forming a multi-dimensional feature space;
(2) Document preprocessing: convert each document to plain text and extract its text segments; use the TF-IDF algorithm based on the vector space model to match each text segment of the document against the text segment content associated with a knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ, the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;
(3) Use formula (3) to compute the degree centrality, within the domain knowledge map, of the knowledge units in the feature space and, combined with the frequency with which each knowledge unit occurs in the document, abstract the document into the following form:
$X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where n is the dimension of the feature vector and $W_i$ is the weight of the i-th feature item, formalized as follows:
$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d)$ (7)
where $kuf(ku_i, d)$ is the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ is the degree centrality of knowledge unit $ku_i$;
Four, the document subject matter partitioning model builds:
(1) structure training dataset, for each document in given training dataset D, use the described method of step 3 to extract its proper vector, field thematic structure tree T-Tree in domain knowledge map community structure in integrating step one tree C-Tree and step 2, by training dataset abstract be following form:
D={(X 1,Y 1),(X 2,Y 2),...,(X j,Y j),...,(X m,Y m)} (8)
Wherein, X j(j=1,2 ..., the m) proper vector of j document of expression, Y j(j=1,2 ..., m) meaning the theme label set of j document, its formalization representation is as follows:
Y j={L 1,L 2,...,L i...,L k} (9)
Wherein, m is training set document number, and k is community's theme number;
(2) training process is selected the BR-SVM algorithm, adopts the cross validation mode, and based on training document sets D, training obtains document subject matter partitioning model M;
Five, document theme partitioning: for a document to be partitioned, extract the knowledge units it contains, obtain its feature vector representation using the method of step three, and partition the document by theme using the document theme partitioning model obtained in step four.
2. The document theme partitioning method based on domain knowledge map community structure as claimed in claim 1, characterized in that the specific steps for constructing the domain theme structure tree are:
(1) Community center analysis: for each community node in the C-Tree, compute the degree centrality of every knowledge unit the node contains within the domain knowledge map subgraph corresponding to that community, and select the set of nodes with the larger degree centrality as the community center node set CCNS; the degree centrality of a knowledge unit in the domain knowledge map subgraph corresponding to a community is computed as follows:
$C_{deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{i=1}^{n} \deg(ku_i)}, \quad ku_i \in KU$ (3)
where $\deg(ku_i)$ denotes the degree of knowledge unit $ku_i$ within the community, and KU denotes the set of knowledge units contained in the domain knowledge map or its subgraph;
(2) For the knowledge units in CCNS, look up the domain knowledge map library to obtain the core terms that CCNS contains; combining the degree centrality of each knowledge unit with the frequency with which each core term occurs in the knowledge units of CCNS, compute the centrality weight $W_{Central}$ of each core term, formally expressed as follows:
$W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku)$ (4)
where C(ku) denotes the centrality of a knowledge unit in CCNS and δ(term, ku) denotes the frequency with which term occurs in ku; the core term with the largest centrality weight is selected as the theme of the community;
(3) For each community node of the C-Tree, carry out step (2), thereby constructing the domain theme structure tree T-Tree and realizing the mapping from community structure to theme structure; the T-Tree is formally expressed as follows:
T-Tree(CTopicSet, troot, n) (5)
where CTopicSet denotes the set of community theme nodes, troot denotes the root node of the theme structure tree, and n denotes the number of themes; a community theme node is formally expressed as follows:
CTopic($Y_C$, SubTopics, PTopic) (6)
where $Y_C$ denotes the community theme label, SubTopics denotes the set of child nodes of the theme node, and PTopic denotes the parent node of the theme node.
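The three steps of this claim can be sketched in Python as follows. This is an illustrative reading rather than the patented implementation: the claim does not say how many nodes form CCNS, whether edges are directed, or how ties are broken, so the `top_k` cutoff, the undirected degree count, the alphabetical tie-breaking, the dictionary layout of a C-Tree node (`"id"`, `"children"`), and all function and class names are assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


def degree_centrality(edges):
    """Formula (3): each node's degree divided by the sum of all degrees."""
    degree = Counter()
    for u, v in edges:  # edges of the community subgraph, treated as undirected
        degree[u] += 1
        degree[v] += 1
    total = sum(degree.values())
    return {node: d / total for node, d in degree.items()}


def community_center_nodes(edges, top_k=2):
    """CCNS: the top_k nodes by degree centrality (the cutoff is an assumption)."""
    centrality = degree_centrality(edges)
    return sorted(centrality, key=lambda n: (-centrality[n], n))[:top_k]


def community_theme(ccns, centrality, ku_terms):
    """Formula (4): pick the core term with the largest centrality weight."""
    weights = Counter()
    for ku in ccns:
        # Counter(...) gives delta(term, ku), the term's frequency in ku.
        for term, freq in Counter(ku_terms.get(ku, [])).items():
            weights[term] += centrality[ku] * freq
    if not weights:
        return None
    return max(sorted(weights), key=weights.get)


@dataclass(eq=False)
class CTopic:
    """Formula (6): theme label Y_C, child nodes SubTopics, parent node PTopic."""
    label: str
    sub_topics: List["CTopic"] = field(default_factory=list)
    parent: Optional["CTopic"] = field(default=None, repr=False)


def build_ttree(c_node, theme_of, parent=None):
    """Map a C-Tree node and its descendants to theme nodes; returns troot."""
    topic = CTopic(label=theme_of(c_node["id"]), parent=parent)
    topic.sub_topics = [
        build_ttree(child, theme_of, topic) for child in c_node.get("children", [])
    ]
    return topic


def all_topics(root):
    """CTopicSet in pre-order; its length is taken here as the theme count n."""
    nodes = [root]
    for child in root.sub_topics:
        nodes.extend(all_topics(child))
    return nodes
```

Here `theme_of` is any callable that returns the theme chosen for a community id, for example the result of `community_theme` cached in a dictionary. Note that formula (3) normalizes by the sum of degrees, which differs from the n - 1 normalization used by some graph libraries.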
CN201310299047.8A 2013-07-16 2013-07-16 Document theme partitioning method based on domain knowledge map community structure Expired - Fee Related CN103412878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310299047.8A CN103412878B (en) 2013-07-16 2013-07-16 Document theme partitioning method based on domain knowledge map community structure

Publications (2)

Publication Number Publication Date
CN103412878A true CN103412878A (en) 2013-11-27
CN103412878B CN103412878B (en) 2015-03-04

Family

ID=49605890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310299047.8A Expired - Fee Related CN103412878B (en) 2013-07-16 2013-07-16 Document theme partitioning method based on domain knowledge map community structure

Country Status (1)

Country Link
CN (1) CN103412878B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004348239A (en) * 2003-05-20 2004-12-09 Fujitsu Ltd Text classification program
US7949646B1 (en) * 2005-12-23 2011-05-24 At&T Intellectual Property Ii, L.P. Method and apparatus for building sales tools by mining data from websites
CN102141997A (en) * 2010-02-02 2011-08-03 三星电子(中国)研发中心 Intelligent decision support system and intelligent decision method thereof
CN102567464A (en) * 2011-11-29 2012-07-11 西安交通大学 Theme map expansion based knowledge resource organizing method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933621A (en) * 2015-06-19 2015-09-23 天睿信科技术(北京)有限公司 Big data analysis system and method for guarantee ring
CN106528540A (en) * 2016-12-16 2017-03-22 广州索答信息科技有限公司 Word segmentation method and word segmentation system for seed questions
CN107368558A (en) * 2017-07-05 2017-11-21 腾讯科技(深圳)有限公司 The return method and device of data object
CN107766412A (en) * 2017-09-05 2018-03-06 华南师范大学 A kind of mthods, systems and devices for establishing thematic map
CN109697642A (en) * 2017-10-23 2019-04-30 北京京东尚科信息技术有限公司 Data push method, device and computer readable storage medium
CN110427494A (en) * 2019-07-29 2019-11-08 北京明略软件系统有限公司 Methods of exhibiting, device, storage medium and the electronic device of knowledge mapping
CN110427494B (en) * 2019-07-29 2022-11-15 北京明略软件系统有限公司 Knowledge graph display method and device, storage medium and electronic device
CN110737777A (en) * 2019-08-28 2020-01-31 南京航空航天大学 knowledge map construction method based on GHSOM algorithm

Also Published As

Publication number Publication date
CN103412878B (en) 2015-03-04

Similar Documents

Publication Publication Date Title
CN103412878B (en) Document theme partitioning method based on domain knowledge map community structure
CN102567464B (en) Based on the knowledge resource method for organizing of expansion thematic map
CN110929161B (en) Large-scale user-oriented personalized teaching resource recommendation method
CN109670039A (en) Sentiment analysis method is commented on based on the semi-supervised electric business of tripartite graph and clustering
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN106126751A (en) A kind of sorting technique with time availability and device
CN103984681A (en) News event evolution analysis method based on time sequence distribution information and topic model
CN102411611B (en) Instant interactive text oriented event identifying and tracking method
CN103870474A (en) News topic organizing method and device
CN102289522A (en) Method of intelligently classifying texts
Xue et al. Optimizing ontology alignment through memetic algorithm based on partial reference alignment
CN103971161A (en) Hybrid recommendation method based on Cauchy distribution quantum-behaved particle swarm optimization
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN102622609B (en) Method for automatically classifying three-dimensional models based on support vector machine
CN106407208A (en) Establishment method and system for city management ontology knowledge base
US20220318317A1 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
CN103678436A (en) Information processing system and information processing method
Faddoul et al. Boosting multi-task weak learners with applications to textual and social data
CN103761286B (en) A kind of Service Source search method based on user interest
CN109949174A (en) A kind of isomery social network user entity anchor chain connects recognition methods
CN101174316A (en) Device and method for cases illation based on cases tree
CN105869058A (en) Method for user portrait extraction based on multilayer latent variable model
Campbell et al. Content+ context networks for user classification in twitter
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
Azzam et al. Text-based question routing for question answering communities via deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Han Xiaoxia

Inventor after: Zhao Chaofan

Inventor after: Zhao Shuyan

Inventor before: Zheng Qinghua

Inventor before: Dong Bo

Inventor before: Liu Jun

Inventor before: Xu Haipeng

Inventor before: Li Bing

Inventor before: He Huan

Inventor before: Ma Tian

TR01 Transfer of patent right

Effective date of registration: 20171130

Address after: 030024 Yingze West Street, Taiyuan City, Taiyuan, Shanxi

Patentee after: Taiyuan University of Technology

Address before: 511442 1402 room 1402, No. 383 office building, North 383 Panyu Avenue, Panyu District South Village, Panyu District, Guangdong

Patentee before: Guangzhou Zhirongjie Intellectual Property Service Co.,Ltd.

Effective date of registration: 20171130

Address after: 511442 1402 room 1402, No. 383 office building, North 383 Panyu Avenue, Panyu District South Village, Panyu District, Guangdong

Patentee after: Guangzhou Zhirongjie Intellectual Property Service Co.,Ltd.

Address before: 710049 Xianning West Road, Shaanxi, China, No. 28, No.

Patentee before: Xi'an Jiaotong University

CP02 Change in the address of a patent holder

Address after: 030024 West Street, Taiyuan, Shanxi, No. 79, No.

Patentee after: Taiyuan University of Technology

Address before: 030024 Yingze West Street, Taiyuan City, Taiyuan, Shanxi

Patentee before: Taiyuan University of Technology

CB03 Change of inventor or designer information

Inventor after: Zheng Qinghua

Inventor after: Dong Bo

Inventor after: Liu Jun

Inventor after: Xu Haipeng

Inventor after: Li Bing

Inventor after: He Huan

Inventor after: Ma Tian

Inventor before: Han Xiaoxia

Inventor before: Zhao Chaofan

Inventor before: Zhao Shuyan

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150304

Termination date: 20190716
