Document topic partitioning method based on the community structure of a domain knowledge map
Technical field
The present invention relates to document topic partitioning based on the community structure of a domain knowledge map. It mainly addresses the problem of partitioning document resources related to a subject or domain, so that topically related documents are stored at nearby logical locations, which improves storage and access efficiency.
Background technology
As online course platforms expand, the volume of documents for each subject of an online course keeps growing. If documents with similar topics are stored at nearby logical locations, then when a learner studies a given resource, other resources related to its topic can be prefetched, which reduces the time spent reading files and improves storage and access efficiency.
For topic partitioning of documents, the following three patent documents provide different technical solutions:
1. Feature selection and weight calculation method for text classification based on domain knowledge (CN101290626)
2. Text classification method based on modified k-nearest neighbors (CN102033949A)
3. A new method and device for feature vector weights in text classification (CN1719436A)
The method of document 1 comprises: (1) collecting domain texts and non-domain texts as the training corpus and the test corpus; (2) preprocessing the texts, including word segmentation and counting term frequency and document frequency; (3) selecting the classification feature space and calculating feature weights with an improved TF-IDF method; (4) on the basis of step (3), selecting the feature space and adding domain terms to it; (5) selecting the classification feature space, then calculating and adjusting feature weights with the improved TF-IDF algorithm; (6) training a text classifier with the SVM machine learning method, building a domain text classification model, and verifying it experimentally on domain texts.
The method of document 2 comprises: (1) text preprocessing: each document in the training text set is first segmented into words, stop words are removed, and the text is given an item-based representation; (2) text feature selection: the dimensionality of the text vectors is then reduced, a feature evaluation function is constructed to score the feature words, and as few features as possible that are closely related to the document topic are selected; (3) text classification: finally, a classifier is built with a deviation-based k-nearest-neighbor text classification algorithm and used to classify the texts, yielding the classification result.
The method of document 3 comprises: (1) collecting the training corpus and the test corpus by domain; (2) removing "junk" from web page text, then performing word segmentation and part-of-speech tagging; (3) extracting the vocabulary of each domain from the training corpus, and extracting the overall vocabulary; (4) building, from the overall vocabulary and the domain vocabularies, information vocabularies with different numbers of keywords for classification; (5) classifying the test texts with the TF-IWF-DBV algorithm and optimizing to obtain the best threshold; (6) determining the optimal number of keywords from the classification results. Because TF-IDF and TF-IWF both rely too heavily on term frequency and cannot express how unevenly vector elements are distributed across classes, document 3 proposes a new weight calculation method (TF-IWF-DBV), which introduces DBV and the n-th root of TF into the TF-IWF method to compensate for these shortcomings.
The methods described in the above documents focus mainly on optimizing feature extraction for text classification. They still select terms obtained by conventional word segmentation as feature items and do not fully take the topical character of the feature items into account, so classification accuracy remains unsatisfactory.
Summary of the invention
To solve the problem of topic partitioning for the documents of each subject in existing large-scale online courses, the present invention provides a partitioning method that combines the community structure of a domain knowledge map with document topic partitioning, so as to group documents with similar topics.
To achieve the above object, the present invention adopts the following technical solution:
A document topic partitioning method based on the community structure of a domain knowledge map, characterized in that it comprises the following steps:
Step 1. Construction of the community structure tree of the domain knowledge map:
(1) Preprocessing of the domain knowledge map: the domain knowledge map is converted to a simple undirected graph, and the converted map is taken as the root community node of the community structure tree and added to the queue CAQ of nodes to be analyzed. A community node is formally represented as:
$$\mathrm{CNode}(V_C,\ \mathrm{Children},\ \mathrm{Parent}) \tag{1}$$
where $V_C$ denotes the set of knowledge units contained in the community node, Children denotes the set of child nodes of the community node, and Parent denotes its parent node;
(2) Hierarchical community partitioning of the domain knowledge map: the head node CH is taken from CAQ, and the Fast Greedy algorithm and the GN algorithm are each used to partition into communities the domain knowledge map (or subgraph) corresponding to CH. A modularity threshold is introduced: if the modularity values of the community partitions produced by the two algorithms are both below the threshold, the partition is invalid and step (3) is performed; otherwise, the modularity values of the two partitions are compared, the partition with the larger modularity is chosen, a community node is created for each of its communities as a child community node of CH, and these nodes are added to CAQ;
(3) Step (2) is repeated for every node in CAQ until the queue is empty, which yields the community structure tree C-Tree of the domain knowledge map, formally represented as:
$$\text{C-Tree}(\mathrm{CNodeSet},\ \mathrm{croot},\ n) \tag{2}$$
where CNodeSet denotes the set of community nodes of the community structure tree, croot denotes its root community node, and n denotes the number of community nodes, that is, the number of communities present in the network;
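The queue-driven procedure of steps (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `build_c_tree`, `Partitioner` and `Scorer` are hypothetical, the community detection algorithms (Fast Greedy, GN) and the modularity function are passed in as callables rather than implemented here, and the guard against single-community partitions is an added safety assumption.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple

Graph = Dict[str, Set[str]]                    # adjacency sets of a simple undirected graph
Partition = List[Set[str]]                     # one set of knowledge units per community
Partitioner = Callable[[Graph], Partition]     # e.g. Fast Greedy or GN, supplied by the caller
Scorer = Callable[[Graph, Partition], float]   # e.g. modularity, supplied by the caller


@dataclass(eq=False)
class CNode:
    """Community node CNode(V_C, Children, Parent) of formula (1)."""
    v_c: Set[str]
    children: List["CNode"] = field(default_factory=list)
    parent: Optional["CNode"] = None


def subgraph(graph: Graph, nodes: Set[str]) -> Graph:
    """Induced subgraph on the given knowledge units."""
    return {u: graph[u] & nodes for u in nodes}


def build_c_tree(graph: Graph, partitioners: List[Partitioner],
                 score: Scorer, threshold: float) -> Tuple[CNode, int]:
    """Return (croot, n): the root community node and the number of community nodes."""
    root = CNode(v_c=set(graph))
    caq = deque([root])                        # queue CAQ of nodes to be analyzed
    n = 1
    while caq:
        ch = caq.popleft()                     # head node CH
        sub = subgraph(graph, ch.v_c)
        scored = [(score(sub, part), part) for part in (p(sub) for p in partitioners)]
        best_q, best = max(scored, key=lambda item: item[0])
        # If even the best partition is below the threshold, every partition is,
        # so the partition is invalid and CH stays a leaf.
        # Assumption: a partition with fewer than two communities is also rejected.
        if best_q < threshold or len(best) < 2:
            continue
        for community in best:
            child = CNode(v_c=set(community), parent=ch)
            ch.children.append(child)
            caq.append(child)
            n += 1
    return root, n
```

In practice the two partitioners could be thin wrappers around existing implementations, for example `greedy_modularity_communities` and `girvan_newman` from networkx; that choice is a suggestion, not something the text prescribes.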
Step 2. Community topic identification is performed on the community structure tree of the domain knowledge map obtained in Step 1, and the domain topic structure tree is built, mapping the community structure to a topic structure;
Step 3. Document feature vector extraction:
(1) Construction of the feature space: all knowledge units in the domain knowledge map are taken as feature items, forming a multidimensional feature space;
(2) Document preprocessing: each document is converted to plain text and its text segments are extracted; the TF-IDF algorithm based on the vector space model is used to match the similarity between the document's text segments and the text segment of each knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ, the document is considered to contain ku, and all knowledge units contained in the document are extracted in this way;
(3) The degree centrality of each knowledge unit of the feature space in the domain knowledge map is calculated with formula (3) and combined with the number of occurrences of the knowledge unit in the document, so that the document is abstracted into the following form:
$$X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$$
where n denotes the dimension of the feature vector and $W_i$ denotes the weight of the i-th feature item, formally:
$$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d) \tag{7}$$
where $kuf(ku_i, d)$ denotes the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ denotes the degree centrality of knowledge unit $ku_i$;
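The weighting of formula (7) can be illustrated with a short sketch. The text refers to formula (3) for degree centrality; the sketch assumes the common normalized form deg(ku) / (|KU| - 1), which is an assumption rather than a quotation of that formula. The function names and the `feature_order` parameter are hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # knowledge map as adjacency sets (simple undirected graph)


def degree_centrality(graph: Graph) -> Dict[str, float]:
    """Assumed form of degree centrality: deg(ku) / (|KU| - 1)."""
    denom = max(len(graph) - 1, 1)   # avoid division by zero for a single-node graph
    return {ku: len(neighbors) / denom for ku, neighbors in graph.items()}


def feature_vector(graph: Graph, feature_order: List[str],
                   kuf: Dict[str, int]) -> List[float]:
    """W_i = C_deg(ku_i) * kuf(ku_i, d) for every knowledge unit in the feature space.

    kuf maps a knowledge unit to its number of occurrences in document d;
    units absent from the document get weight 0.
    """
    c_deg = degree_centrality(graph)
    return [c_deg[ku] * kuf.get(ku, 0) for ku in feature_order]
```

Each knowledge unit contributes one dimension, so the vector is as long as the knowledge map has units and is typically sparse for any single document.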
Step 4. Construction of the document topic partitioning model:
(1) Construction of the training data set: for each document in a given training data set D, its feature vector is extracted with the method of Step 3; combining the community structure tree C-Tree from Step 1 with the domain topic structure tree T-Tree from Step 2, the training data set is abstracted into the following form:
$$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_j, Y_j), \ldots, (X_m, Y_m)\} \tag{8}$$
where $X_j$ (j = 1, 2, ..., m) denotes the feature vector of the j-th document and $Y_j$ (j = 1, 2, ..., m) denotes the topic label set of the j-th document, formally:
$$Y_j = \{L_1, L_2, \ldots, L_i, \ldots, L_k\} \tag{9}$$
where m is the number of documents in the training set and k is the number of community topics;
(2) Training: the BR-SVM algorithm is selected and cross-validation is used; the document topic partitioning model M is trained on the training document set D;
Step 5. Document topic partitioning: for a document to be partitioned, the knowledge units it contains are extracted, its feature vector representation is obtained with the method of Step 3, and the document topic partitioning model obtained in Step 4 is used to partition the document by topic.
In the above method, the specific steps for building the domain topic structure tree are:
(1) Community center node analysis: for each community node in C-Tree, the degree centrality of the knowledge units it contains is calculated within the domain knowledge map subgraph corresponding to the community, and the set of nodes with the larger centrality values is selected as the community center node set CCNS. The degree centrality of a knowledge unit in the subgraph corresponding to the community is calculated by formula (3), where $deg(ku_i)$ denotes the degree of knowledge unit $ku_i$ within the community and KU denotes the set of knowledge units contained in the domain knowledge map or its subgraph;
(2) For the knowledge units in CCNS, the domain knowledge map library is searched to obtain the set of core terms contained in CCNS; the degree centrality of the knowledge units is combined with the frequency with which each core term occurs in the knowledge units of CCNS to calculate the centrality weight $W_{central}$ of the core term, where C(ku) denotes the centrality of a knowledge unit in CCNS and δ(term, ku) denotes the frequency with which term occurs in ku; the core term with the largest centrality weight is selected as the topic of the community;
(3) Step (2) is carried out for every community node of C-Tree, which builds the domain topic structure tree T-Tree and maps the community structure to a topic structure. T-Tree is formally represented as:
$$\text{T-Tree}(\mathrm{CTopicSet},\ \mathrm{troot},\ n) \tag{5}$$
where CTopicSet denotes the set of community topic nodes, troot denotes the root node of the topic structure tree, and n denotes the number of topics. A community topic node is formally represented as:
$$\mathrm{CTopic}(Y_C,\ \mathrm{SubTopics},\ \mathrm{PTopic}) \tag{6}$$
where $Y_C$ denotes the community topic label, SubTopics denotes the set of child nodes of the topic node, and PTopic denotes its parent node.
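The topic identification of steps (1) and (2) can be sketched as below. Two points are assumptions, since the text does not fix them: the center node set CCNS is taken as the `top_k` knowledge units with the highest centrality, and the centrality weight of a core term is taken as the sum over CCNS of C(ku) · δ(term, ku). The function name and parameters are hypothetical.

```python
from typing import Dict, List


def community_topic(centrality: Dict[str, float],
                    core_terms: Dict[str, List[str]],
                    top_k: int = 3) -> str:
    """Pick the core term with the largest centrality weight as the community topic.

    centrality: C(ku) for each knowledge unit of the community.
    core_terms: core terms of each knowledge unit; a term listed twice counts twice,
                which plays the role of the frequency delta(term, ku).
    """
    # Assumption: CCNS = the top_k knowledge units by centrality (ties broken by name).
    ccns = sorted(centrality, key=lambda ku: (-centrality[ku], ku))[:top_k]

    # Assumption: W_central(term) = sum over ku in CCNS of C(ku) * delta(term, ku).
    weights: Dict[str, float] = {}
    for ku in ccns:
        for term in core_terms.get(ku, []):
            weights[term] = weights.get(term, 0.0) + centrality[ku]

    if not weights:
        raise ValueError("no core terms found for the community center nodes")
    # Iterating in sorted order makes ties resolve to the alphabetically first term.
    return max(sorted(weights), key=lambda term: weights[term])
```

Running this once per community node of C-Tree, and copying the parent and child links, would give the topic nodes of T-Tree.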
Compared with the prior art, the method of the present invention has the following advantages. In building the topic structure tree, a hierarchical community discovery algorithm based on the Fast Greedy and GN algorithms is proposed to build the community structure tree. In feature extraction, knowledge units are used directly as feature items; because knowledge units are semantically complete, they reflect the topical character of the feature vector better than conventional methods based on word segmentation. In calculating feature vector values, degree centrality is combined with the frequency of knowledge units in the document, where degree centrality reflects the standing of a knowledge unit in the knowledge map as a whole. These improvements raise the accuracy of document topic partitioning relative to conventional methods.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of document topic partitioning based on the community structure of a knowledge map according to the present invention.
Fig. 2 is a flow chart of the construction of the domain knowledge map topic system in Fig. 1.
Fig. 3 is a flow chart of feature vector extraction in Fig. 1.
Detailed description of the embodiments
A domain knowledge map is a complex network that describes the knowledge in a given domain (a course or subject) and the associations between items of knowledge. A knowledge unit is a basic fragment of knowledge in the knowledge map that is complete enough to be expressed on its own. The domain knowledge map library is the database that stores the knowledge units of the domain and records their details, such as the name of each knowledge unit, its corresponding text segment, the core terms it contains, and the relations between knowledge units. The knowledge map of a subject is usually built from the document resources of that subject and is expressed as a network of knowledge units and their associations. After a complex-network community discovery algorithm partitions the domain knowledge map into communities, each community has a relatively independent topic. The community structure of the knowledge units can therefore serve as the basis for document topic partitioning.
The implementation of document topic partitioning based on the knowledge map community structure is shown in Fig. 1 and consists of two parts: building the document topic partitioning model, and partitioning the documents to be partitioned by topic.
Building the document topic partitioning model takes three steps:
1. Construction of the domain knowledge map topic system. First, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed and used to partition the domain knowledge map into communities, yielding the community structure tree of the domain knowledge map. Fast Greedy is an agglomerative community discovery algorithm proposed by Newman et al.: initially every node is its own community; the modularity increment that would result from merging any two communities in the network is then calculated, and the two communities with the largest increment are merged; this process is repeated until modularity no longer increases. GN is a divisive community discovery algorithm proposed by Girvan and Newman: during execution it repeatedly calculates the edge betweenness of the edges in the network and each time removes the edge with the largest betweenness, until modularity no longer increases. Each node of the community structure tree represents one community of the domain knowledge map, and the knowledge units of a community are topically consistent. Second, the topic of each community is determined by analyzing its community center nodes (the important nodes of the community, characterized by the degree centrality of the knowledge units), which builds the domain topic structure tree and maps the community structure to a topic structure;
2. Construction of the feature space and calculation of the feature vector value in each dimension: all knowledge units of the domain knowledge map are taken as feature items to construct the feature space; the knowledge units contained in a document are extracted and combined with their degree centrality to calculate the feature vector value in each dimension;
3. Construction of the training data set and training of the topic partitioning model: the training data set is constructed, the BR-SVM multi-label classification algorithm is selected, and training on the data set yields the document topic partitioning model.
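Both community discovery algorithms described in step 1 stop when modularity no longer increases. The text does not spell out the modularity measure, so the sketch below assumes the standard Newman modularity for an unweighted undirected graph, Q = Σ_c [ l_c / m − (d_c / 2m)² ], where m is the number of edges, l_c the number of edges inside community c, and d_c the sum of the degrees of its nodes. The function name is hypothetical.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # adjacency sets of a simple undirected graph


def modularity(graph: Graph, communities: List[Set[str]]) -> float:
    """Newman modularity of a partition (assumed definition, see lead-in)."""
    m = sum(len(neighbors) for neighbors in graph.values()) / 2   # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for community in communities:
        # Each intra-community edge is seen from both endpoints, hence the division by 2.
        intra_edges = sum(len(graph[u] & community) for u in community) / 2
        degree_sum = sum(len(graph[u]) for u in community)
        q += intra_edges / m - (degree_sum / (2 * m)) ** 2
    return q
```

As a worked example, two triangles joined by a single edge and split into the two triangles give Q = 2 · (3/7 − (7/14)²) = 5/14 ≈ 0.357, while leaving the whole graph as one community gives Q = 0.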
The specific steps for partitioning a document to be partitioned by topic are as follows:
1. Document feature vector representation: for a document d to be partitioned, the knowledge units of the document are extracted with the method of step 2 of the model-building part above, giving the feature vector $X_d$ of the document;
2. Document topic partitioning: the feature vector $X_d$ of the document is fed to the domain document topic partitioning model M, whose output is the topic label $Y_d$ of the document; the topic partition of document d is obtained from the correspondence between $Y_d$ and the domain topic structure tree T-Tree.
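The last step, reading the topics of document d off the T-Tree from the label output $Y_d$, might look like the following sketch. It assumes that $Y_d$ is a 0/1 indicator list aligned with a fixed ordering of the topic nodes; the text does not state the encoding, so this is an illustrative choice, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class CTopic:
    """Topic node with a topic label, child topic nodes, and a parent topic node."""
    y_c: str
    sub_topics: List["CTopic"] = field(default_factory=list)
    p_topic: Optional["CTopic"] = None


def add_sub_topic(parent: CTopic, label: str) -> CTopic:
    """Create a child topic node and link it to its parent."""
    child = CTopic(y_c=label, p_topic=parent)
    parent.sub_topics.append(child)
    return child


def topic_path(node: Optional[CTopic]) -> List[str]:
    """Labels from the root of T-Tree down to the given topic node."""
    path: List[str] = []
    while node is not None:
        path.append(node.y_c)
        node = node.p_topic
    return path[::-1]


def topics_of(y_d: List[int], topic_nodes: List[CTopic]) -> List[List[str]]:
    """Topic paths of every label that the model switched on for document d."""
    return [topic_path(t) for flag, t in zip(y_d, topic_nodes) if flag == 1]
```

Returning the full path rather than the bare label keeps the hierarchy visible, which is useful when documents are to be stored near others with the same parent topic.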
As shown in Fig. 2, the specific steps for constructing the domain knowledge map topic system are as follows:
(1) Preprocessing of the domain knowledge map: the domain knowledge map is converted to a simple undirected graph, and the converted map is taken as the root community node of the community structure tree and added to the queue CAQ of nodes to be analyzed. A community node is formally represented as:
$$\mathrm{CNode}(V_C,\ \mathrm{Children},\ \mathrm{Parent}) \tag{1}$$
where $V_C$ denotes the set of knowledge units contained in the community node, Children denotes the set of child nodes of the community node, and Parent denotes its parent node;
(2) Hierarchical community partitioning of the domain knowledge map: the head node CH is taken from CAQ, and the Fast Greedy algorithm and the GN algorithm are each used to partition into communities the domain knowledge map (or subgraph) corresponding to CH. A modularity threshold is introduced (default value 0.35): if the modularity values of the community partitions produced by the two algorithms are both below 0.35, the partition is invalid and step (3) is performed; otherwise, the modularity values of the two partitions are compared, the partition with the larger modularity is chosen, a community node is created for each of its communities as a child community node of CH, and these nodes are added to CAQ;
(3) Step (2) is repeated for every node in CAQ until the queue is empty, which yields the community structure tree C-Tree of the domain knowledge map, formally represented as:
$$\text{C-Tree}(\mathrm{CNodeSet},\ \mathrm{croot},\ n) \tag{2}$$
where CNodeSet denotes the set of community nodes of the community structure tree, croot denotes its root community node, and n denotes the number of community nodes, that is, the number of communities present in the network;
(4) Community center node analysis: for each community node in C-Tree, the degree centrality of the knowledge units it contains is calculated within the domain knowledge map subgraph corresponding to the community, and the set of nodes with the larger centrality values is selected as the community center node set CCNS. The degree centrality of a knowledge unit in the subgraph corresponding to the community is calculated by formula (3), where $deg(ku_i)$ denotes the degree of knowledge unit $ku_i$ within the community and KU denotes the set of knowledge units contained in the domain knowledge map or its subgraph;
(5) For the knowledge units in CCNS, the domain knowledge map library is searched to obtain the set of core terms contained in CCNS; the degree centrality of the knowledge units is combined with the frequency with which each core term occurs in the knowledge units of CCNS to calculate the centrality weight $W_{central}$ of the core term, where C(ku) denotes the centrality of a knowledge unit in CCNS and δ(term, ku) denotes the frequency with which term occurs in ku. The core term with the largest centrality weight is selected as the topic of the community;
(6) Step (5) is carried out for every community node of C-Tree, which builds the domain topic structure tree T-Tree and maps the community structure to a topic structure. T-Tree is formally represented as:
$$\text{T-Tree}(\mathrm{CTopicSet},\ \mathrm{troot},\ n) \tag{5}$$
where CTopicSet denotes the set of community topic nodes, troot denotes the root node of the topic structure tree, and n denotes the number of topics. A community topic node is formally represented as:
$$\mathrm{CTopic}(Y_C,\ \mathrm{SubTopics},\ \mathrm{PTopic}) \tag{6}$$
where $Y_C$ denotes the community topic label, SubTopics denotes the set of child nodes of the topic node, and PTopic denotes its parent node.
As shown in Fig. 3, the specific steps for constructing the feature space and calculating the feature vector value in each dimension are as follows:
(1) Construction of the feature space: all knowledge units in the domain knowledge map are taken as feature items, forming a multidimensional feature space (each knowledge unit is one dimension);
(2) Document preprocessing: each document is converted to plain text (a txt file) and its text segments are extracted; the TF-IDF algorithm based on the vector space model is used to match the similarity between the document's text segments and the text segment of each knowledge unit ku in the domain knowledge map library (this algorithm uses TF-IDF to represent a text as a feature vector whose feature items are terms, and measures the similarity between documents by the cosine of the angle between their vectors); if the similarity reaches the threshold μ (default value 0.8), the document is considered to contain ku, and all knowledge units contained in the document are extracted in this way;
(3) The degree centrality of each knowledge unit of the feature space in the domain knowledge map is calculated (see formula (3)) and combined with the number of occurrences of the knowledge unit in the document, so that the document is abstracted into the following form:
$$X_j = \{W_1, W_2, \ldots, W_i, \ldots, W_n\}$$
where n denotes the dimension of the feature vector and $W_i$ denotes the weight of the i-th feature item, formally:
$$W_i = C_{deg}(ku_i) \cdot kuf(ku_i, d) \tag{7}$$
where $kuf(ku_i, d)$ denotes the frequency with which knowledge unit $ku_i$ occurs in document d, and $C_{deg}(ku_i)$ denotes the degree centrality of knowledge unit $ku_i$.
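The similarity matching of step (2) can be sketched with a small TF-IDF and cosine implementation. Several details are assumptions, because the text does not fix them: tokens are obtained by lowercasing and splitting on whitespace (Chinese text would need a word segmenter), IDF is computed over the knowledge unit text segments with the smoothed form log((1 + N) / (1 + df)) + 1, and a document contains a knowledge unit if any one of its text segments reaches μ. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Assumption: whitespace tokenization stands in for word segmentation."""
    return text.lower().split()


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unknown to the IDF table get weight 0."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contained_units(doc_segments: List[str], ku_texts: Dict[str, str],
                    mu: float = 0.8) -> Set[str]:
    """Knowledge units whose text segment matches some document segment with similarity >= mu."""
    corpus = [tokenize(text) for text in ku_texts.values()]
    n = len(corpus)
    df = Counter(term for tokens in corpus for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    ku_vectors = {ku: tfidf(tokenize(text), idf) for ku, text in ku_texts.items()}

    found: Set[str] = set()
    for segment in doc_segments:
        vector = tfidf(tokenize(segment), idf)
        for ku, ku_vector in ku_vectors.items():
            if cosine(vector, ku_vector) >= mu:
                found.add(ku)
    return found
```

Counting how many document segments match each knowledge unit, instead of collecting a set, would give the frequency $kuf(ku_i, d)$ used in formula (7).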
The specific steps for constructing the training data set and training the topic partitioning model are as follows:
(1) Construction of the training data set: for each document in a given training data set D, its feature vector is extracted with the feature extraction method described above; combining the domain knowledge map community structure tree C-Tree with the domain topic structure tree T-Tree, the training data set is abstracted into the following form:
$$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_j, Y_j), \ldots, (X_m, Y_m)\} \tag{8}$$
where $X_j$ (j = 1, 2, ..., m) denotes the feature vector of the j-th document and $Y_j$ (j = 1, 2, ..., m) denotes the topic label set of the j-th document, formally:
$$Y_j = \{L_1, L_2, \ldots, L_i, \ldots, L_k\} \tag{9}$$
where m is the number of documents in the training set and k is the number of community topics;
(2) Training: the BR-SVM algorithm is selected (BR-SVM uses a one-versus-rest strategy to convert the multi-label problem into multiple binary classification problems and trains this series of binary problems with SVM, a mature training method for binary classification), cross-validation is used, and the document topic partitioning model M is trained on the training document set D.
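The binary relevance decomposition behind BR-SVM can be sketched independently of the binary learner. In the sketch below, `train_binary_relevance` and `predict_labels` are hypothetical names, each $Y_j$ is assumed to be a 0/1 list of length k, and `centroid_trainer` is a deliberately simple nearest-centroid stand-in that keeps the example self-contained; it is not an SVM, and an SVM trainer with the same signature would take its place in the method described above. Cross-validation is omitted.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
BinaryModel = Callable[[Vector], int]                           # returns 0 or 1
BinaryTrainer = Callable[[List[Vector], List[int]], BinaryModel]


def train_binary_relevance(xs: List[Vector], ys: List[List[int]],
                           train_binary: BinaryTrainer) -> List[BinaryModel]:
    """Train one binary model per topic label (one-versus-rest)."""
    k = len(ys[0])
    return [train_binary(xs, [y[i] for y in ys]) for i in range(k)]


def predict_labels(models: List[BinaryModel], x: Vector) -> List[int]:
    """Label vector Y for a feature vector X: one 0/1 decision per topic."""
    return [model(x) for model in models]


def centroid_trainer(xs: List[Vector], labels: List[int]) -> BinaryModel:
    """Stand-in binary learner (NOT an SVM): assign the class of the nearer centroid."""
    def mean(rows: List[Vector]) -> List[float]:
        return [sum(column) / len(rows) for column in zip(*rows)]

    def dist2(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    positives = [x for x, label in zip(xs, labels) if label == 1]
    negatives = [x for x, label in zip(xs, labels) if label == 0]
    if not positives:
        return lambda x: 0
    if not negatives:
        return lambda x: 1
    c_pos, c_neg = mean(positives), mean(negatives)
    return lambda x: 1 if dist2(x, c_pos) <= dist2(x, c_neg) else 0
```

Because each label gets its own independent model, a document can receive several topic labels at once, which is what makes the approach suitable for multi-label topic partitioning.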