CN103412878B - Document theme partitioning method based on domain knowledge map community structure - Google Patents

Info

Publication number
CN103412878B
Authority
CN
China
Prior art keywords
community
document
knowledge
node
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310299047.8A
Other languages
Chinese (zh)
Other versions
CN103412878A (en
Inventor
郑庆华
董博
刘均
徐海鹏
李冰
贺欢
马天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhirongjie Intellectual Property Service Co ltd
Taiyuan University of Technology
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201310299047.8A priority Critical patent/CN103412878B/en
Publication of CN103412878A publication Critical patent/CN103412878A/en
Application granted granted Critical
Publication of CN103412878B publication Critical patent/CN103412878B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document theme partitioning method based on the community structure of a domain knowledge map. It mainly addresses the partitioning of document resources related to subject or domain knowledge, so that documents with related themes can be stored in logically close locations, improving learning efficiency. The method proposes a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm and uses it to build a theme structure tree. During feature extraction, knowledge units serve directly as feature items; because knowledge units are semantically complete, they reflect the theme characteristics of a document better than features obtained by traditional word segmentation. When computing feature vector values, the method combines degree centrality with the frequency of knowledge units in the document, where degree centrality reflects the global standing of a knowledge unit in the knowledge map. The method effectively improves the accuracy of document theme partitioning and is suitable for general scenarios of document theme partitioning based on the community structure of a domain knowledge map.

Description

Document theme partitioning method based on domain knowledge map community structure
Technical field
The present invention relates to document theme partitioning on the basis of the community structure of a domain knowledge map. It mainly solves the partitioning of document resources related to subject or domain knowledge, so that documents with related themes are stored in logically close locations, improving storage and access efficiency.
Background art
As online course platforms expand, the volume of documents for each subject keeps growing. When documents with similar themes are stored in logically close locations, the other resources related to the theme of a resource can be prefetched while a learner studies it, which reduces the time overhead of reading files and improves storage and access efficiency.
For document theme partitioning, the following three patent documents provide different technical solutions:
1. Text classification feature selection and weight calculation method based on domain knowledge (CN101290626)
2. Text classification method based on corrected k-nearest neighbors (CN102033949A)
3. Method and device for a new feature vector weight for text classification (CN1719436A)
The method of document 1 comprises: (1) collecting domain texts and non-domain texts as training and test corpora; (2) preprocessing the texts, including word segmentation and counting term frequency and document frequency; (3) selecting the classification feature space and computing feature weights with an improved TF-IDF method; (4) on the basis of step (3), extending the feature space with domain terms; (5) selecting the classification feature space, then computing and adjusting feature weights with the improved TF-IDF algorithm; (6) training a text classifier with the SVM machine learning method, building a domain text classification model, and verifying it experimentally on domain texts.
The method of document 2 comprises: (1) text preprocessing: segment each document in the training text set into words, remove stop words, and represent each text in item-based form; (2) text feature selection: reduce the dimensionality of the text vectors by constructing a feature function that scores feature words and selecting as few document features as possible that are closely related to the document theme; (3) text classification: build a classifier with the deviation-based k-nearest-neighbor text classification algorithm and classify to obtain the classification results.
The method of document 3 comprises: (1) collecting training and test corpora by domain; (2) removing "junk" from web page text, followed by word segmentation and part-of-speech tagging; (3) extracting the vocabulary of each domain from the training corpus and extracting the overall vocabulary; (4) building, from the overall vocabulary and the domain vocabularies, information vocabularies with different numbers of keywords for classification; (5) classifying the test texts with the TF-IWF-DBV algorithm and optimizing to obtain the best threshold; (6) determining the optimal number of keywords from the classification results. Because both TF-IDF and TF-IWF rely too heavily on term frequency and cannot express the uneven distribution of vector elements across categories, document 3 proposes a new weight calculation method (TF-IWF-DBV), which introduces DBV and the n-th root of TF into the TF-IWF method to make up for these shortcomings.
The methods above concentrate on optimizing feature extraction for text classification, but they still select terms produced by traditional word segmentation as feature items and do not fully consider the theme characteristics of the feature items, so classification accuracy remains unsatisfactory.
Summary of the invention
To solve the theme partitioning problem for the documents of each subject in existing large-scale online courses, the present invention provides a partitioning method that combines the community structure of a domain knowledge map with document theme partitioning, so as to group documents with similar themes.
To achieve the above object, the present invention adopts the following technical solution:
A document theme partitioning method based on domain knowledge map community structure, characterized by comprising the following steps:
Step one: build the community structure tree of the domain knowledge map:
(1) Preprocessing of the domain knowledge map: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to the queue CAQ of nodes to be analyzed. A community node is formally expressed as:
$$\mathrm{CNode}(V_C, \mathrm{Children}, \mathrm{Parent}) \tag{1}$$
where $V_C$ is the set of knowledge units contained in the community node, Children is the set of child nodes of the community node, and Parent is its parent node;
(2) Hierarchical community division of the domain knowledge map: take the head node CH from CAQ, apply the Fast Greedy algorithm and the GN algorithm separately to divide the domain knowledge map (or subgraph) corresponding to CH into communities, and introduce a modularity threshold. If the modularity values of the divisions produced by both algorithms are below the threshold, the division is invalid and step (3) is performed. Otherwise, compare the modularity values of the two divisions, choose the division with the larger modularity value, create a community node for each of its communities as a child community node of CH, and add these nodes to CAQ;
(3) Repeat step (2) for all nodes in CAQ until CAQ is empty, which yields the community structure tree C-Tree of the domain knowledge map, formally expressed as:
$$\text{C-Tree}(\mathrm{CNodeSet}, \mathrm{croot}, n) \tag{2}$$
where CNodeSet is the set of community nodes of the community structure tree, croot is its root community node, and n is the number of community nodes, i.e. the number of communities in the network;
Step two: identify community themes from the community structure tree of the domain knowledge map obtained in step one and build the domain theme structure tree, realizing the mapping from community structure to theme structure;
Step three: extract document feature vectors:
(1) Construct the feature space: take all knowledge units in the domain knowledge map as feature items to form a multi-dimensional feature space;
(2) Document preprocessing: convert each document to plain text and extract its text segments. Use the TF-IDF algorithm based on the vector space model to match each text segment of the document against the text segment of each knowledge unit ku in the domain knowledge map library. If the similarity reaches the threshold μ, the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;
(3) Use formula (3) to compute the degree centrality, within the domain knowledge map, of the knowledge units in the feature space, and combine it with the frequency with which each knowledge unit occurs in the document to abstract the document into the following form:
$X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where n is the dimension of the feature vector and $W_i$ is the weight of the i-th feature item, formally expressed as:
$$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d) \tag{7}$$
where $kuf(ku_i, d)$ is the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ is the degree centrality of $ku_i$;
Step four: build the document theme partitioning model:
(1) Construct the training data set: for each document in the given training data set D, extract its feature vector with the method of step three and, combining the community structure tree C-Tree of step one with the domain theme structure tree T-Tree of step two, abstract the training data set into the following form:
$$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_j, Y_j), \ldots, (X_m, Y_m)\} \tag{8}$$
where $X_j$ (j = 1, 2, ..., m) is the feature vector of the j-th document and $Y_j$ (j = 1, 2, ..., m) is the theme label set of the j-th document, formally expressed as:
$$Y_j = \{L_1, L_2, \ldots, L_i, \ldots, L_k\} \tag{9}$$
where m is the number of documents in the training set and k is the number of community themes;
(2) Training: select the BR-SVM algorithm and, using cross-validation on the training document set D, train the document theme partitioning model M;
Step five: document theme partitioning: for a document to be partitioned, extract the knowledge units it contains, obtain its feature vector representation with the method of step three, and apply the document theme partitioning model obtained in step four to partition the document by theme.
In the above method, the specific steps for building the domain theme structure tree are:
(1) Community center node analysis: for each community node in C-Tree, compute the degree centrality of the knowledge units it contains within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with larger centrality as the community center node group CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as:
$$C_{deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU \tag{3}$$
where $\deg(ku_i)$ is the degree of knowledge unit $ku_i$ in the community and KU is the set of knowledge units contained in the domain knowledge map or its subgraph;
(2) For the knowledge units in CCNS, query the domain knowledge map library to obtain the set of central terms contained in CCNS. Combining the degree centrality of each knowledge unit with the frequency with which a central term occurs in the knowledge units of CCNS, compute the centrality weight $W_{Central}^{term}$ of each central term, formally expressed as:
$$W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku) \tag{4}$$
where C(ku) is the centrality of knowledge unit ku in CCNS and δ(term, ku) is the frequency with which term occurs in ku. The central term with the largest centrality weight is selected as the theme of the community;
(3) Carry out step (2) for each community node of C-Tree to build the domain theme structure tree T-Tree, realizing the mapping from community structure to theme structure. T-Tree is formally expressed as:
$$\text{T-Tree}(\mathrm{CTopicSet}, \mathrm{troot}, n) \tag{5}$$
where CTopicSet is the set of community theme nodes, troot is the root node of the theme structure tree, and n is the number of themes. A community theme node is formally expressed as:
$$\mathrm{CTopic}(Y_C, \mathrm{SubTopics}, \mathrm{PTopic}) \tag{6}$$
where $Y_C$ is the community theme label, SubTopics is the set of child nodes of the theme node, and PTopic is its parent node.
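The theme identification of formulas (3) and (4) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `unit_terms` input format, and the `top_k` cut-off used to pick the "larger centrality" nodes are assumptions made only for illustration, since the text does not say how large CCNS should be.

```python
def degree_centrality(adj):
    """Formula (3): degree of each unit divided by the sum of all degrees."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {v: (len(nbrs) / total if total else 0.0) for v, nbrs in adj.items()}

def community_theme(adj, unit_terms, top_k=2):
    """Pick a community theme from the central terms of CCNS.

    adj: {unit: set of neighbouring units} for one community subgraph.
    unit_terms: {unit: {term: frequency of the term in that unit}} (assumed format).
    top_k: assumed size of the community center node group CCNS.
    """
    centrality = degree_centrality(adj)
    ccns = sorted(centrality, key=centrality.get, reverse=True)[:top_k]
    weights = {}
    for ku in ccns:
        for term, freq in unit_terms.get(ku, {}).items():
            # Formula (4): sum over CCNS of C(ku) * delta(term, ku)
            weights[term] = weights.get(term, 0.0) + centrality[ku] * freq
    return max(weights, key=weights.get), weights

# Toy community of four knowledge units (illustrative data only).
adj = {
    "tree": {"binary tree", "traversal", "forest"},
    "binary tree": {"tree", "traversal", "forest"},
    "traversal": {"tree", "binary tree"},
    "forest": {"tree", "binary tree"},
}
unit_terms = {
    "tree": {"tree": 3, "node": 1},
    "binary tree": {"tree": 2, "binary": 2},
}
theme, weights = community_theme(adj, unit_terms)
print(theme, weights)
```

Here the two units of highest degree form CCNS, and the term "tree" accumulates the largest weight, so it becomes the community theme.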
Compared with the prior art, the method of the invention has the following advantages. When building the theme structure tree, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed to build the community structure tree. During feature extraction, knowledge units are used directly as feature items; because knowledge units are semantically complete, they reflect the theme characteristics of a document better than traditional methods based on word segmentation. When computing feature vector values, degree centrality is combined with the document frequency of knowledge units, where degree centrality reflects the global standing of a knowledge unit in the knowledge map. With these improvements, the accuracy of document theme partitioning is effectively improved over traditional methods.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of document theme partitioning based on knowledge map community structure according to the present invention.
Fig. 2 is a flow chart of the construction of the domain knowledge map theme system in Fig. 1.
Fig. 3 is a flow chart of feature vector extraction in Fig. 1.
Detailed description of the embodiments
A domain knowledge map is a complex network that describes the knowledge in a domain (a course or a subject) and the associations between items of that knowledge. A knowledge unit is a basic knowledge fragment in the knowledge map that is complete in what it expresses. The domain knowledge map library is the database that stores the knowledge units of a domain; it records the details of each knowledge unit, such as its name, its corresponding text segment, the central terms it contains, and its relations to other knowledge units. The knowledge map of a subject is usually built from the document resources of that subject and is expressed as a network of knowledge units and their associations. After a complex-network community discovery algorithm divides the domain knowledge map into a community structure, each community has a relatively independent theme. The community structure of knowledge units can therefore serve as the basis for document theme partitioning.
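As a rough illustration of these definitions, the sketch below models one record of the domain knowledge map library and converts a list of relations into a simple undirected graph. The class name `KnowledgeUnit`, its field names, and the helper `build_undirected_graph` are hypothetical; the patent does not specify a storage schema.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    """One record of the domain knowledge map library (field names are assumed)."""
    name: str
    text: str                                   # text segment of the unit
    central_terms: list = field(default_factory=list)

def build_undirected_graph(units, relations):
    """Return an adjacency map {unit name: set of neighbouring unit names}.

    relations is an iterable of (source, target) name pairs. Direction,
    self-loops and duplicate edges are discarded, which gives a simple
    undirected graph.
    """
    adj = {u.name: set() for u in units}
    for a, b in relations:
        if a != b and a in adj and b in adj:
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Illustrative data only.
units = [
    KnowledgeUnit("linear list", "A linear list is a finite sequence.", ["list"]),
    KnowledgeUnit("stack", "A stack is a last in first out list.", ["stack"]),
    KnowledgeUnit("queue", "A queue is a first in first out list.", ["queue"]),
]
relations = [
    ("linear list", "stack"), ("stack", "linear list"),  # duplicate in both directions
    ("linear list", "queue"), ("queue", "queue"),        # self-loop is dropped
]
graph = build_undirected_graph(units, relations)
print(graph)
```

An adjacency map of sets is used because it makes duplicate edges collapse automatically and keeps degree lookups cheap.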
As shown in Fig. 1, document theme partitioning based on knowledge map community structure consists of two parts: building the document theme partitioning model, and partitioning the documents to be partitioned by theme.
Building the document theme partitioning model takes three steps:
1. Construct the theme system of the domain knowledge map. First, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed and used to divide the domain knowledge map into communities, yielding the community structure tree of the domain knowledge map. Fast Greedy is an agglomerative community discovery algorithm proposed by Newman et al.: initially each node is its own community; the modularity increment obtained by merging any two communities in the network is then computed, and the two communities with the largest increment are merged; this process is repeated until modularity no longer increases. GN is a divisive community discovery algorithm proposed by Girvan and Newman: it repeatedly computes the edge betweenness of the edges in the network and removes the edge with the largest betweenness, until modularity no longer increases. Each node of the community structure tree represents one community of the domain knowledge map, and the knowledge units of the same community are thematically consistent. Second, the community theme is determined by analyzing the community center nodes (the important nodes of a community, characterized by the degree centrality of knowledge units), so as to build the domain theme structure tree and realize the mapping from community structure to theme structure;
2. Construct the feature space and compute the feature vector value of each dimension: take all knowledge units of the domain knowledge map as feature items to construct the feature space; extract the knowledge units contained in a document and, combined with the degree centrality of the knowledge units, compute the feature vector value of each dimension;
3. Construct the training data set and train the theme partitioning model: construct the training data set, select the BR-SVM multi-label classification algorithm, and train on the training data set to obtain the document theme partitioning model.
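Both community discovery algorithms in step 1 use modularity as their stopping criterion. The patent does not spell out the formula, so the sketch below assumes the standard Newman modularity, Q = Σ_c [e_c / m − (d_c / 2m)²], where e_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges in the graph. The function name and input format are illustrative.

```python
def modularity(adj, communities):
    """Newman modularity Q of a partition of a simple undirected graph.

    adj: {node: set of neighbours}; communities: iterable of node collections.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        members = set(community)
        internal = sum(len(adj[v] & members) for v in members) / 2
        degree_sum = sum(len(adj[v]) for v in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q

# Two triangles joined by a single edge (illustrative graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

print(modularity(adj, [{0, 1, 2}, {3, 4, 5}]))  # split into the two triangles
print(modularity(adj, [set(adj)]))              # everything in one community
```

Splitting this graph into its two triangles gives Q = 5/14, while leaving it as one community gives Q = 0, which is why a modularity-driven algorithm prefers the split.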
The specific steps for partitioning a document by theme are as follows:
1. Document feature vector representation: for a document d to be partitioned, extract its knowledge units with the method of step 2 of the model-building part above and obtain its feature vector $X_d$;
2. Document theme partitioning: take the feature vector $X_d$ as the input of the domain document theme partitioning model M; the output of the model is the theme label $Y_d$ of document d. The theme partition of document d is obtained from the correspondence between $Y_d$ and the domain theme structure tree T-Tree.
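One possible way to relate the predicted labels $Y_d$ to T-Tree is sketched below. It assumes that theme labels are unique within the tree and that each theme node keeps a parent pointer, as in CTopic(Y_C, SubTopics, PTopic); the helpers `index_themes` and `theme_paths` are hypothetical names, not part of the patent.

```python
class CTopic:
    """Theme node CTopic(Y_C, SubTopics, PTopic)."""
    def __init__(self, label, parent=None):
        self.label = label
        self.sub_topics = []
        self.parent = parent
        if parent is not None:
            parent.sub_topics.append(self)

def index_themes(troot):
    """Map each theme label to its node (assumes labels are unique)."""
    index, stack = {}, [troot]
    while stack:
        node = stack.pop()
        index[node.label] = node
        stack.extend(node.sub_topics)
    return index

def theme_paths(predicted_labels, troot):
    """Return the root-to-node path in T-Tree for every predicted label."""
    index = index_themes(troot)
    paths = []
    for label in sorted(predicted_labels):
        node, path = index[label], []
        while node is not None:          # walk up through PTopic links
            path.append(node.label)
            node = node.parent
        paths.append(" / ".join(reversed(path)))
    return paths

# Illustrative theme tree and model output.
troot = CTopic("data structures")
tree = CTopic("tree", troot)
graph = CTopic("graph", troot)
binary_tree = CTopic("binary tree", tree)

paths = theme_paths({"binary tree", "graph"}, troot)
print(paths)
```

Reporting the whole path rather than only the leaf label keeps the hierarchical context of the partition, which matters when documents are placed by theme.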
As shown in Fig. 2, the specific steps for constructing the theme system of the domain knowledge map are as follows:
(1) Preprocessing of the domain knowledge map: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to the queue CAQ of nodes to be analyzed. A community node is formally expressed as:
$$\mathrm{CNode}(V_C, \mathrm{Children}, \mathrm{Parent}) \tag{1}$$
where $V_C$ is the set of knowledge units contained in the community node, Children is the set of child nodes of the community node, and Parent is its parent node;
(2) Hierarchical community division of the domain knowledge map: take the head node CH from CAQ, apply the Fast Greedy algorithm and the GN algorithm separately to divide the domain knowledge map (or subgraph) corresponding to CH into communities, and introduce a modularity threshold (default 0.35). If the modularity values of the divisions produced by both algorithms are below 0.35, the division is invalid and step (3) is performed. Otherwise, compare the modularity values of the two divisions, choose the division with the larger modularity value, create a community node for each of its communities as a child community node of CH, and add these nodes to CAQ;
(3) Repeat step (2) for all nodes in CAQ until CAQ is empty, which yields the community structure tree C-Tree of the domain knowledge map, formally expressed as:
$$\text{C-Tree}(\mathrm{CNodeSet}, \mathrm{croot}, n) \tag{2}$$
where CNodeSet is the set of community nodes of the community structure tree, croot is its root community node, and n is the number of community nodes, i.e. the number of communities in the network;
(4) Community center node analysis: for each community node in C-Tree, compute the degree centrality of the knowledge units it contains within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with larger centrality as the community center node group CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as:
$$C_{deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU \tag{3}$$
where $\deg(ku_i)$ is the degree of knowledge unit $ku_i$ in the community and KU is the set of knowledge units contained in the domain knowledge map or its subgraph;
(5) For the knowledge units in CCNS, query the domain knowledge map library to obtain the set of central terms contained in CCNS. Combining the degree centrality of each knowledge unit with the frequency with which a central term occurs in the knowledge units of CCNS, compute the centrality weight $W_{Central}^{term}$ of each central term, formally expressed as:
$$W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku) \tag{4}$$
where C(ku) is the centrality of knowledge unit ku in CCNS and δ(term, ku) is the frequency with which term occurs in ku. The central term with the largest centrality weight is selected as the theme of the community;
(6) Carry out steps (4) and (5) for each community node of C-Tree to build the domain theme structure tree T-Tree, realizing the mapping from community structure to theme structure. T-Tree is formally expressed as:
$$\text{T-Tree}(\mathrm{CTopicSet}, \mathrm{troot}, n) \tag{5}$$
where CTopicSet is the set of community theme nodes, troot is the root node of the theme structure tree, and n is the number of themes. A community theme node is formally expressed as:
$$\mathrm{CTopic}(Y_C, \mathrm{SubTopics}, \mathrm{PTopic}) \tag{6}$$
where $Y_C$ is the community theme label, SubTopics is the set of child nodes of the theme node, and PTopic is its parent node.
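The queue-driven division of steps (1) to (3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the two partitioners `components` and `split_halves` are toy stand-ins for Fast Greedy and GN (in practice a graph library's implementations would be plugged in), modularity is assumed to be the standard Newman definition, and the `len(best) < 2` guard is an added safeguard against re-queuing an undivided community.

```python
from collections import deque

class CNode:
    """Community node CNode(V_C, Children, Parent), as in formula (1)."""
    def __init__(self, units, parent=None):
        self.units, self.children, self.parent = set(units), [], parent

def subgraph(adj, nodes):
    """Restrict an adjacency map {node: set of neighbours} to `nodes`."""
    return {v: adj[v] & nodes for v in nodes}

def modularity(adj, communities):
    """Standard Newman modularity (assumed definition)."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for members in communities:
        internal = sum(len(adj[v] & members) for v in members) / 2
        degree_sum = sum(len(adj[v]) for v in members)
        q += internal / m - (degree_sum / (2 * m)) ** 2
    return q

def components(adj):
    """Toy partitioner 1: connected components."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def split_halves(adj):
    """Toy partitioner 2: split the sorted node list into two halves."""
    ordered = sorted(adj)
    half = len(ordered) // 2
    return [set(p) for p in (ordered[:half], ordered[half:]) if p]

def build_c_tree(adj, partitioners, threshold=0.35):
    """Return (croot, n): root community node and number of community nodes."""
    root = CNode(adj)
    caq, count = deque([root]), 1            # CAQ: queue of nodes to analyze
    while caq:
        ch = caq.popleft()                   # head node CH
        sub = subgraph(adj, ch.units)
        best = max((p(sub) for p in partitioners),
                   key=lambda c: modularity(sub, c))
        if len(best) < 2 or modularity(sub, best) < threshold:
            continue                         # invalid division: CH stays a leaf
        for community in best:
            child = CNode(community, parent=ch)
            ch.children.append(child)
            caq.append(child)
            count += 1
    return root, count

# Two triangles joined by one edge (illustrative graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)
root, n = build_c_tree(adj, [components, split_halves])
print(n, [sorted(c.units) for c in root.children])
```

Checking only the better of the two divisions against the threshold is equivalent to the rule in step (2): if the larger modularity value is below the threshold, both are.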
As shown in Fig. 3, the specific steps for constructing the feature space and computing the feature vector value of each dimension are as follows:
(1) Construct the feature space: take all knowledge units in the domain knowledge map as feature items to form a multi-dimensional feature space (each knowledge unit is one dimension);
(2) Document preprocessing: convert each document to plain text (i.e. a txt file) and extract its text segments. Use the TF-IDF algorithm based on the vector space model (which represents a text as a feature vector whose feature items are terms and measures the similarity between documents by the cosine of the angle between their vectors) to match each text segment of the document against the text segment of each knowledge unit ku in the domain knowledge map library. If the similarity reaches the threshold μ (default 0.8), the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;
(3) Compute the degree centrality, within the domain knowledge map, of the knowledge units in the feature space (see formula (3)), and combine it with the frequency with which each knowledge unit occurs in the document to abstract the document into the following form: $X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where n is the dimension of the feature vector and $W_i$ is the weight of the i-th feature item, formally expressed as:
$$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d) \tag{7}$$
where $kuf(ku_i, d)$ is the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ is the degree centrality of $ku_i$.
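The weighting of formula (7) can be sketched as follows. To keep the example short, the similarity function is a plain term-count cosine, which is a stand-in for the TF-IDF matching described in step (2) and omits IDF weighting. The function names, the input formats, and the choice to count kuf as the number of matching segments are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity of raw term-count vectors (stand-in: no IDF weighting)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def feature_vector(segments, unit_texts, centrality, mu=0.8):
    """Return [W_1, ..., W_n] with W_i = C_deg(ku_i) * kuf(ku_i, d), formula (7).

    segments: text segments of document d.
    unit_texts: {knowledge unit: its text segment}; one dimension per unit.
    centrality: {knowledge unit: degree centrality from formula (3)}.
    kuf is taken here as the number of segments whose similarity reaches mu.
    """
    vector = []
    for ku, text in unit_texts.items():
        kuf = sum(1 for seg in segments if cosine(seg, text) >= mu)
        vector.append(centrality.get(ku, 0.0) * kuf)
    return vector

# Illustrative data only.
unit_texts = {
    "stack": "stack push pop last in first out",
    "queue": "queue enqueue dequeue first in first out",
}
centrality = {"stack": 0.6, "queue": 0.4}
segments = [
    "stack push pop last in first out",
    "stack push pop last in first out",
    "today we study sorting",
]
x_d = feature_vector(segments, unit_texts, centrality)
print(x_d)
```

In this toy document only the "stack" unit is matched (twice), so its dimension carries the weight 0.6 × 2 and the "queue" dimension stays at zero.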
The specific steps for constructing the training data set and training the theme partitioning model are:
(1) Construct the training data set: for each document in the given training data set D, extract its feature vector with the feature extraction method described above and, combining the community structure tree C-Tree of the domain knowledge map with the domain theme structure tree T-Tree, abstract the training data set into the following form:
$$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_j, Y_j), \ldots, (X_m, Y_m)\} \tag{8}$$
where $X_j$ (j = 1, 2, ..., m) is the feature vector of the j-th document and $Y_j$ (j = 1, 2, ..., m) is the theme label set of the j-th document, formally expressed as:
$$Y_j = \{L_1, L_2, \ldots, L_i, \ldots, L_k\} \tag{9}$$
where m is the number of documents in the training set and k is the number of community themes;
(2) Training: select the BR-SVM algorithm (BR-SVM uses a one-vs-rest strategy to convert the multi-label problem into multiple binary classification problems and trains each of them with SVM, a mature binary classification method) and, using cross-validation on the training document set D, train the document theme partitioning model M.
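The binary relevance decomposition behind BR-SVM can be sketched as below. To stay self-contained, the binary learner is a simple perceptron, which is a stand-in and not an SVM, and cross-validation is omitted; a real system would substitute an SVM implementation for `train_perceptron`. All function names and the toy data are assumptions for illustration.

```python
def train_perceptron(samples, labels, epochs=20):
    """Stand-in binary learner (NOT an SVM): returns (weights, bias)."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):            # y is +1 or -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                       # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def train_binary_relevance(dataset, themes, learner=train_perceptron):
    """Train one binary classifier per theme label (one-vs-rest).

    dataset: list of (X_j, Y_j) pairs as in formula (8), Y_j a set of labels.
    """
    xs = [x for x, _ in dataset]
    return {t: learner(xs, [1 if t in y else -1 for _, y in dataset])
            for t in themes}

def predict(model, x):
    """Return the set of theme labels whose classifier fires on x."""
    return {t for t, (w, b) in model.items()
            if sum(wi * xi for wi, xi in zip(w, x)) + b > 0}

# Toy training set D with two themes (illustrative data only).
D = [
    ([1.0, 0.0], {"tree"}),
    ([0.0, 1.0], {"graph"}),
    ([1.0, 1.0], {"tree", "graph"}),
    ([0.0, 0.0], set()),
]
M = train_binary_relevance(D, ["tree", "graph"])
print(predict(M, [1.0, 1.0]))
```

Because each theme gets its own classifier, a document can receive several labels at once or none, which is what makes the scheme suitable for multi-label partitioning.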

Claims (1)

1. A document theme partitioning method based on domain knowledge map community structure, characterized by comprising the following steps:
Step one: build the community structure tree of the domain knowledge map:
(1) Preprocessing of the domain knowledge map: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to the queue CAQ of nodes to be analyzed; a community node is formally expressed as:
$$\mathrm{CNode}(V_C, \mathrm{Children}, \mathrm{Parent}) \tag{1}$$
where $V_C$ is the set of knowledge units contained in the community node, Children is the set of child nodes of the community node, and Parent is its parent node;
(2) Hierarchical community division of the domain knowledge map: take the head node CH from CAQ, apply the Fast Greedy algorithm and the GN algorithm separately to divide the domain knowledge map (or subgraph) corresponding to CH into communities, and introduce a modularity threshold; if the modularity values of the divisions produced by both algorithms are below the threshold, the division is invalid and step (3) is performed; otherwise, compare the modularity values of the two divisions, choose the division with the larger modularity value, create a community node for each of its communities as a child community node of CH, and add these nodes to CAQ;
(3) Repeat step (2) for all nodes in CAQ until CAQ is empty, which yields the community structure tree C-Tree of the domain knowledge map, formally expressed as:
$$\text{C-Tree}(\mathrm{CNodeSet}, \mathrm{croot}, n) \tag{2}$$
where CNodeSet is the set of community nodes of the community structure tree, croot is its root community node, and n is the number of community nodes, i.e. the number of communities in the network;
Step two: identify community themes from the community structure tree of the domain knowledge map obtained in step one and build the domain theme structure tree, realizing the mapping from community structure to theme structure;
Step three: extract document feature vectors:
(1) Construct the feature space: take all knowledge units in the domain knowledge map as feature items to form a multi-dimensional feature space;
(2) Document preprocessing: convert each document to plain text and extract its text segments; use the TF-IDF algorithm based on the vector space model to match each text segment of the document against the text segment of each knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ, the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;
(3) Use formula (3) to compute the degree centrality, within the domain knowledge map, of the knowledge units in the feature space, and combine it with the frequency with which each knowledge unit occurs in the document to abstract the document into the following form:
$X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$, where n is the dimension of the feature vector and $W_i$ is the weight of the i-th feature item, formally expressed as:
$$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d) \tag{7}$$
where $kuf(ku_i, d)$ is the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ is the degree centrality of $ku_i$;
Formula (3) is:
$$C_{deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU \tag{3}$$
where $\deg(ku_i)$ is the degree of knowledge unit $ku_i$ in the community and KU is the set of knowledge units contained in the domain knowledge map or its subgraph;
Four, document subject matter partitioning model builds:
(1) training dataset is constructed, for each document in given training dataset D, method described in step 3 is used to extract its proper vector, field thematic structure tree T-Tree in domain knowledge map community structure tree C-Tree in integrating step one and step 2, by abstract for training dataset be following form:
D={(X 1,Y 1),(X 2,Y 2),...,(X j,Y j),...,(X m,Y m)} (8)
Wherein, X j(j=1,2 ..., m) represent the proper vector of a jth document, Y j(j=1,2 ..., m) represent the theme label set of a jth document, its formalization representation is as follows:
Y j={L 1,L 2,...,L i...,L k} (9)
Wherein, m is training set document number, and k is community's theme number;
(2) training process selects BR-SVM algorithm, adopts cross validation mode, and based on Training document collection D, training obtains document subject matter partitioning model M;
Five, document theme partitioning: for a document to be partitioned, extract the knowledge units that the document contains, obtain the feature vector representation of the document using the method of step three, and partition the document by theme using the document theme partitioning model obtained in step four;
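Extracting the knowledge units that a document contains relies on the TF-IDF similarity matching against the threshold μ. A minimal sketch of that matching is given below. It assumes whitespace tokenization, a smoothed IDF, and cosine similarity as the similarity measure; none of these details is fixed by the text, and the helper names and sample strings are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per token list, using a smoothed IDF."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def contained_units(doc_segments, ku_texts, mu):
    """Return ids of knowledge units whose text matches some segment with similarity >= mu."""
    ids = list(ku_texts)
    corpus = [s.lower().split() for s in doc_segments]
    corpus += [ku_texts[i].lower().split() for i in ids]
    vecs = tfidf_vectors(corpus)
    seg_vecs, ku_vecs = vecs[:len(doc_segments)], vecs[len(doc_segments):]
    return [ku for ku, kv in zip(ids, ku_vecs)
            if any(cosine(sv, kv) >= mu for sv in seg_vecs)]

# Hypothetical document segments and knowledge unit texts.
segments = ["a binary tree has at most two children",
            "quick sort partitions the array"]
ku_texts = {"ku_tree": "binary tree has at most two children",
            "ku_hash": "hash table maps keys to buckets"}
found = contained_units(segments, ku_texts, mu=0.5)
print(found)
```

For Chinese text, the whitespace split would have to be replaced by a word segmenter. The returned list of knowledge units is what the feature extraction of step three takes as input.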
In the above steps, the specific method of building the domain theme structure tree described in step two is:
(1) Community center node analysis: for each community node in C-Tree, calculate the degree centrality of the knowledge units it contains within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with the larger degree centrality as the community center node set CCNS; the degree centrality of a knowledge unit in the domain knowledge map subgraph corresponding to the community is calculated by formula (3);
(2) For the knowledge units in CCNS, search the domain knowledge map library to obtain the set of central terms that CCNS contains; combining the degree centrality of each knowledge unit with the frequency with which a central term occurs in the knowledge units of CCNS, calculate the centrality weight $W_{central}^{term}$ of each central term, formalized as follows:
$$W_{central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku) \tag{4}$$
where C(ku) is the centrality of knowledge unit ku in CCNS and δ(term, ku) is the frequency with which term occurs in ku; the central term with the largest centrality weight is selected as the theme of the community;
(3) Carry out step (2) for each community node of C-Tree, thereby building the domain theme structure tree T-Tree and realizing the mapping from community structure to theme structure; T-Tree is formalized as follows:
T-Tree(CTopicSet, troot, n) (5)
where CTopicSet is the set of community theme nodes, troot is the root node of the theme structure tree, and n is the number of themes; a community theme node is formalized as follows:
CTopic($Y_C$, SubTopics, PTopic) (6)
where $Y_C$ is the community theme label, SubTopics is the set of child nodes of the theme node, and PTopic is the parent node of the theme node.
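The theme selection of formula (4) and the node structure of formula (6) can be sketched in Python as follows. The sketch assumes that each knowledge unit in CCNS is available as a list of its terms and that the centralities have already been computed with formula (3); the names `central_term_weights`, `CTopic.label`, `sub_topics`, `p_topic`, and the toy values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

def central_term_weights(ccns_centrality, ku_terms):
    """Formula (4): W(term) = sum over ku in CCNS of C(ku) * delta(term, ku)."""
    weights = Counter()
    for ku, c in ccns_centrality.items():
        for term, freq in Counter(ku_terms[ku]).items():  # freq plays the role of delta(term, ku)
            weights[term] += c * freq
    return weights

@dataclass
class CTopic:
    """Formula (6): a community theme node with label, child nodes, and parent node."""
    label: str
    sub_topics: List["CTopic"] = field(default_factory=list)
    # Excluded from repr and comparison so parent/child cycles do not recurse.
    p_topic: Optional["CTopic"] = field(default=None, repr=False, compare=False)

# Hypothetical CCNS with precomputed centralities and the terms of each knowledge unit.
ccns = {"ku1": 0.5, "ku2": 0.25}
terms = {"ku1": ["tree", "node", "tree"], "ku2": ["tree", "graph"]}

w = central_term_weights(ccns, terms)
theme = max(w, key=w.get)  # central term with the largest centrality weight

root = CTopic(label=theme)
child = CTopic(label="graph", p_topic=root)
root.sub_topics.append(child)
print(theme, w[theme])
```

Repeating the weight computation for every community node of C-Tree and linking the resulting `CTopic` nodes along the C-Tree parent and child relations would yield a structure of the form T-Tree(CTopicSet, troot, n).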
CN201310299047.8A 2013-07-16 2013-07-16 Document theme partitioning method based on domain knowledge map community structure Expired - Fee Related CN103412878B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310299047.8A CN103412878B (en) 2013-07-16 2013-07-16 Document theme partitioning method based on domain knowledge map community structure


Publications (2)

Publication Number Publication Date
CN103412878A CN103412878A (en) 2013-11-27
CN103412878B true CN103412878B (en) 2015-03-04

Family

ID=49605890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310299047.8A Expired - Fee Related CN103412878B (en) 2013-07-16 2013-07-16 Document theme partitioning method based on domain knowledge map community structure

Country Status (1)

Country Link
CN (1) CN103412878B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933621A (en) * 2015-06-19 2015-09-23 天睿信科技术(北京)有限公司 Big data analysis system and method for guarantee ring
CN106528540A (en) * 2016-12-16 2017-03-22 广州索答信息科技有限公司 Word segmentation method and word segmentation system for seed questions
CN107368558B (en) * 2017-07-05 2021-05-14 腾讯科技(深圳)有限公司 Data object returning method and device
CN107766412A (en) * 2017-09-05 2018-03-06 华南师范大学 A kind of mthods, systems and devices for establishing thematic map
CN109697642A (en) * 2017-10-23 2019-04-30 北京京东尚科信息技术有限公司 Data push method, device and computer readable storage medium
CN110427494B (en) * 2019-07-29 2022-11-15 北京明略软件系统有限公司 Knowledge graph display method and device, storage medium and electronic device
CN110737777A (en) * 2019-08-28 2020-01-31 南京航空航天大学 knowledge map construction method based on GHSOM algorithm

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7949646B1 (en) * 2005-12-23 2011-05-24 At&T Intellectual Property Ii, L.P. Method and apparatus for building sales tools by mining data from websites
CN102141997A (en) * 2010-02-02 2011-08-03 三星电子(中国)研发中心 Intelligent decision support system and intelligent decision method thereof
CN102567464A (en) * 2011-11-29 2012-07-11 西安交通大学 Theme map expansion based knowledge resource organizing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004348239A (en) * 2003-05-20 2004-12-09 Fujitsu Ltd Text classification program


Also Published As

Publication number Publication date
CN103412878A (en) 2013-11-27


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Han Xiaoxia

Inventor after: Zhao Chaofan

Inventor after: Zhao Shuyan

Inventor before: Zheng Qinghua

Inventor before: Dong Bo

Inventor before: Liu Jun

Inventor before: Xu Haipeng

Inventor before: Li Bing

Inventor before: He Huan

Inventor before: Ma Tian

TR01 Transfer of patent right

Effective date of registration: 20171130

Address after: 030024 Yingze West Street, Taiyuan City, Taiyuan, Shanxi

Patentee after: Taiyuan University of Technology

Address before: 511442 1402 room 1402, No. 383 office building, North 383 Panyu Avenue, Panyu District South Village, Panyu District, Guangdong

Patentee before: Guangzhou Zhirongjie Intellectual Property Service Co.,Ltd.

Effective date of registration: 20171130

Address after: 511442 1402 room 1402, No. 383 office building, North 383 Panyu Avenue, Panyu District South Village, Panyu District, Guangdong

Patentee after: Guangzhou Zhirongjie Intellectual Property Service Co.,Ltd.

Address before: 710049 Xianning West Road, Shaanxi, China, No. 28, No.

Patentee before: Xi'an Jiaotong University

CP02 Change in the address of a patent holder

Address after: 030024 West Street, Taiyuan, Shanxi, No. 79, No.

Patentee after: Taiyuan University of Technology

Address before: 030024 Yingze West Street, Taiyuan City, Taiyuan, Shanxi

Patentee before: Taiyuan University of Technology

CB03 Change of inventor or designer information

Inventor after: Zheng Qinghua

Inventor after: Dong Bo

Inventor after: Liu Jun

Inventor after: Xu Haipeng

Inventor after: Li Bing

Inventor after: He Huan

Inventor after: Ma Tian

Inventor before: Han Xiaoxia

Inventor before: Zhao Chaofan

Inventor before: Zhao Shuyan

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150304

Termination date: 20190716