CN103412878A

CN103412878A - Document theme partitioning method based on domain knowledge map community structure

Info

Publication number: CN103412878A
Application number: CN2013102990478A
Authority: CN
Inventors: 郑庆华; 董博; 刘均; 徐海鹏; 李冰; 贺欢; 马天
Original assignee: Xian Jiaotong University
Current assignee: Guangzhou Zhirongjie Intellectual Property Service Co ltd; Taiyuan University of Technology
Priority date: 2013-07-16
Filing date: 2013-07-16
Publication date: 2013-11-27
Anticipated expiration: 2033-07-16
Also published as: CN103412878B

Abstract

The invention discloses a document theme partitioning method based on the community structure of a domain knowledge map. It mainly addresses the partitioning of document resources related to a subject or a body of domain knowledge, so that documents on related themes can be stored in logically close locations and learning efficiency is improved. The method proposes a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm and uses it to build a theme structure tree. During feature extraction, knowledge units serve directly as feature items; because knowledge units are semantically complete, they reflect the theme characteristics of a document better than traditional features based on word segmentation. Feature values are computed by combining degree centrality with the frequency of knowledge units in the document, where degree centrality reflects the global standing of a knowledge unit in the knowledge map. The method effectively improves the accuracy of document theme partitioning and is suitable for general scenarios of document theme partitioning based on domain knowledge map community structure.

Description

Document theme partitioning method based on domain knowledge map community structure

Technical field

The present invention relates to document theme partitioning based on the community structure of a domain knowledge map. It mainly addresses the partitioning of document resources related to a subject or a body of domain knowledge, so that documents on related themes are stored in logically close locations, improving storage and access efficiency.

Background art

As online course platforms expand, the volume of documents for each subject keeps growing. If documents with similar themes are stored in logically close locations, then when a learner studies one resource, other resources on related themes can be prefetched, which reduces the time spent reading files and improves storage and access efficiency.

For document theme partitioning, the following three patent documents provide different technical schemes:

1. Text classification feature selection and weight calculation method based on domain knowledge (CN101290626)

2. Correction-based k-nearest-neighbor text classification method (CN102033949A)

3. A new method and device for feature vector weighting in text classification (CN1719436A)

The method of document 1 comprises: (1) collecting domain texts and non-domain texts as training and test corpora; (2) preprocessing the texts, including word segmentation and counting term frequency and document frequency; (3) selecting the classification feature space and computing feature weights with an improved TF-IDF method; (4) expanding the feature space selected in step (3) with domain terms; (5) selecting the classification feature space and using the improved TF-IDF algorithm to compute and adjust the feature weights; (6) training a text classifier with the SVM machine learning method, building a domain text classification model, and verifying it experimentally on domain texts.

The method of document 2 comprises: (1) text preprocessing: each document in the training text set is segmented into words, stop words are removed, and the text is represented in terms of feature items; (2) text feature selection: the dimensionality of the text vectors is reduced, an evaluation function is constructed to score feature words, and as few features as possible are selected that are closely related to the theme concept of the document; (3) text classification: a classifier is constructed with the correction-based k-nearest-neighbor text classification algorithm and applied to obtain the classification results.

The method of document 3 comprises: (1) collecting training and test corpora by domain; (2) removing junk content from web page texts, then performing word segmentation and part-of-speech tagging; (3) extracting the vocabulary of each domain from the training corpus, and extracting the overall vocabulary; (4) based on the overall vocabulary and the domain vocabularies, building information vocabularies with different numbers of keywords for classification; (5) classifying the test texts with the TF-IWF-DBV algorithm and optimizing to obtain the best threshold; (6) determining the optimal number of keywords from the classification results. Because both TF-IDF and TF-IWF rely too heavily on term frequency and cannot express how unevenly vector elements are distributed across categories, document 3 proposes a new weighting method (TF-IWF-DBV) that introduces DBV and the n-th root of TF into TF-IWF to compensate for these deficiencies.

The methods described in the above documents mainly concentrate on optimizing feature extraction for text classification. They still select terms as feature items on the basis of traditional word segmentation and do not fully consider the theme characteristics of the feature items, which leads to unsatisfactory classification accuracy.

Summary of the invention

To solve the theme partitioning problem for the documents of each subject in existing large-scale online courses, the present invention provides a partitioning method that combines the community structure of a domain knowledge map with document theme partitioning, so that documents with similar themes can be identified.

To achieve this purpose, the present invention adopts the following technical scheme:

A document theme partitioning method based on domain knowledge map community structure, characterized by comprising the following steps:

Step 1. Construction of the domain knowledge map community structure tree:

(1) Domain knowledge map preprocessing: the domain knowledge map is converted into a simple undirected graph; the converted map becomes the root community node of the community structure tree and is added to the queue of community nodes awaiting analysis, CAQ. A community node is formally expressed as:

CNode(V_C, Children, Parent)    (1)

where V_C denotes the set of knowledge units contained in the community node, Children denotes the set of child nodes of the community node, and Parent denotes its parent node;

(2) Hierarchical community partitioning of the domain knowledge map: the head node CH is taken from CAQ, and the Fast Greedy and GN algorithms are each applied to partition the domain knowledge map (or subgraph) corresponding to CH into communities. A modularity threshold is introduced: if the modularity values of the partitions produced by both algorithms are below the threshold, the partition is invalid and step (3) is executed. Otherwise, the modularity values of the two partitions are compared and the partition with the larger modularity is selected; a community node is created for each community in it, set as a child community node of CH, and added to CAQ;

(3) Step (2) is repeated for all nodes in CAQ until the queue is empty, yielding the community structure tree C-Tree of the domain knowledge map, formally expressed as:

C-Tree(CNodeSet, croot, n)    (2)

where CNodeSet denotes the set of community nodes of the community structure tree, croot denotes its root community node, and n denotes the number of community nodes, that is, the number of communities found in the network;
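The queue-driven construction in Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the graph is a plain adjacency dictionary, `build_c_tree`, `greedy_merge`, and the dictionary-based node layout are hypothetical names, and `greedy_merge` is a naive agglomerative stand-in for Fast Greedy. The list of detectors is a parameter, so a GN implementation could be passed alongside it; the threshold value 0.35 is the default given later in the embodiment.

```python
from collections import deque
from itertools import combinations


def modularity(adj, communities):
    """Newman modularity Q of a partition of a simple undirected graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for comm in communities:
        inside = sum(1 for u in comm for v in adj[u] if v in comm) / 2
        degree = sum(len(adj[u]) for u in comm)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def greedy_merge(adj):
    """Naive agglomerative detector (a stand-in for Fast Greedy, not the optimized algorithm)."""
    comms = [frozenset([v]) for v in adj]
    best_q = modularity(adj, comms)
    while len(comms) > 1:
        trials = []
        for a, b in combinations(comms, 2):
            merged = [c for c in comms if c not in (a, b)] + [a | b]
            trials.append((modularity(adj, merged), merged))
        q, merged = max(trials, key=lambda t: t[0])
        if q <= best_q:  # modularity no longer increases
            break
        best_q, comms = q, merged
    return comms


def subgraph(adj, nodes):
    """Induced subgraph on the given node set."""
    return {u: {v for v in adj[u] if v in nodes} for u in nodes}


def build_c_tree(adj, detectors, threshold=0.35):
    """Build the community tree; returns (root, list of all community nodes)."""
    root = {"nodes": frozenset(adj), "children": [], "parent": None}
    caq = deque([root])  # queue of community nodes awaiting analysis
    all_nodes = [root]
    while caq:
        ch = caq.popleft()
        sub = subgraph(adj, ch["nodes"])
        scored = [(modularity(sub, part), part) for part in (d(sub) for d in detectors)]
        q, best = max(scored, key=lambda t: t[0])
        if q < threshold or len(best) < 2:
            continue  # every detector fell below the threshold: partition invalid
        for comm in best:
            child = {"nodes": frozenset(comm), "children": [], "parent": ch}
            ch["children"].append(child)
            all_nodes.append(child)
            caq.append(child)
    return root, all_nodes


# Two triangles joined by one edge split once, then no further.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
root, nodes = build_c_tree(adj, [greedy_merge])
print(len(nodes))  # 3
```

Taking the maximum modularity over all detectors and comparing it with the threshold is equivalent to the rule in step (2): the partition is rejected only when every algorithm falls below the threshold.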

Step 2. Community themes are identified on the community structure tree obtained in Step 1, and a domain theme structure tree is built, mapping the community structure to a theme structure;

Step 3. Document feature vector extraction:

(1) Feature space construction: all knowledge units in the domain knowledge map are taken as feature items, forming a multi-dimensional feature space;

(2) Document preprocessing: each document is converted to plain text and its text segments are extracted. Using the TF-IDF algorithm based on the vector space model, each text segment of the document is matched for similarity against the text segment that corresponds to a knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ, the document is considered to contain ku. All knowledge units contained in the document are extracted in this way;

(3) The degree centrality of each knowledge unit of the feature space in the domain knowledge map is computed with formula (3) and combined with the number of occurrences of the knowledge unit in the document, so that the document is abstracted as:

X_j = {W_1, W_2, ..., W_i, ..., W_n}

where n is the dimension of the feature vector and W_i is the weight of the i-th feature item, formally expressed as:

W_i = C_deg(ku_i) * kuf(ku_i, d)    (7)

where kuf(ku_i, d) denotes the frequency with which knowledge unit ku_i occurs in document d, and C_deg(ku_i) denotes the degree centrality of knowledge unit ku_i;
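Formula (7) can be illustrated with a short sketch. The function name `feature_vector`, the knowledge unit names, and the centrality values below are hypothetical and chosen only for illustration; the centralities are assumed to have been computed beforehand with formula (3).

```python
def feature_vector(ku_order, centrality, ku_counts):
    """W_i = C_deg(ku_i) * kuf(ku_i, d), one weight per knowledge unit in a fixed order."""
    return [centrality.get(ku, 0.0) * ku_counts.get(ku, 0) for ku in ku_order]


# Feature space: one dimension per knowledge unit of the (toy) knowledge map.
ku_order = ["stack", "queue", "tree"]
centrality = {"stack": 0.25, "queue": 0.25, "tree": 0.5}  # assumed C_deg values
ku_counts = {"stack": 2, "tree": 1}  # kuf: occurrences of each unit in document d

x_d = feature_vector(ku_order, centrality, ku_counts)
print(x_d)  # [0.5, 0.0, 0.5]
```

A knowledge unit that does not occur in the document gets weight 0, so the vector keeps the full dimension of the feature space.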

Step 4. Construction of the document theme partitioning model:

(1) Training dataset construction: for each document in the given training dataset D, its feature vector is extracted with the method of Step 3; combined with the community structure tree C-Tree from Step 1 and the domain theme structure tree T-Tree from Step 2, the training dataset is abstracted as:

D = {(X_1, Y_1), (X_2, Y_2), ..., (X_j, Y_j), ..., (X_m, Y_m)}    (8)

where X_j (j = 1, 2, ..., m) denotes the feature vector of the j-th document and Y_j (j = 1, 2, ..., m) denotes the set of theme labels of the j-th document, formally expressed as:

Y_j = {L_1, L_2, ..., L_i, ..., L_k}    (9)

where m is the number of documents in the training set and k is the number of community themes;

(2) Training: the BR-SVM algorithm is selected and, using cross-validation, the document theme partitioning model M is trained on the training document set D;

Step 5. Document theme partitioning: for a document to be partitioned, the knowledge units it contains are extracted, its feature vector representation is obtained with the method of Step 3, and the model obtained in Step 4 is used to partition the document by theme.

In the above method, the domain theme structure tree is built as follows:

(1) Community center analysis: for each community node in C-Tree, the degree centrality of the knowledge units it contains is computed within the domain knowledge map subgraph corresponding to the community, and the set of nodes with larger centrality is selected as the community center node set CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as:

C_{\deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU    (3)

where deg(ku_i) denotes the degree of knowledge unit ku_i within the community, and KU denotes the set of knowledge units contained in the domain knowledge map or its subgraph;

(2) For the knowledge units in CCNS, the domain knowledge map library is queried to obtain the core terms contained in CCNS. Combining the degree centrality of each knowledge unit with the frequency with which a core term occurs in the knowledge units of CCNS, the centrality weight W_Central of each core term is computed as:

W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku)    (4)

where C(ku) denotes the centrality of a knowledge unit in CCNS and δ(term, ku) denotes the frequency with which term occurs in ku; the core term with the largest centrality weight is selected as the theme of the community;

(3) Step (2) is performed for every community node of C-Tree, which builds the domain theme structure tree T-Tree and maps the community structure to a theme structure. T-Tree is formally expressed as:

T-Tree(CTopicSet, troot, n)    (5)

where CTopicSet denotes the set of community theme nodes, troot denotes the root node of the theme structure tree, and n denotes the number of themes. A community theme node is formally expressed as:

CTopic(Y_C, SubTopics, PTopic)    (6)

where Y_C denotes the theme label of the community, SubTopics denotes the set of child nodes of the theme node, and PTopic denotes its parent node.
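Formulas (3) and (4) can be sketched together as follows. The text does not state how many nodes count as having "larger" centrality, so the `top_k` cutoff is an assumption, as are the function names and the toy data; `core_terms` stands in for the lookup in the domain knowledge map library and maps each knowledge unit to the frequencies of its core terms.

```python
def degree_centrality(adj):
    """Formula (3): degree of each unit divided by the sum of all degrees in the (sub)graph."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {u: (len(adj[u]) / total if total else 0.0) for u in adj}


def community_theme(adj, core_terms, top_k=2):
    """Pick the core term with the largest centrality weight, formula (4)."""
    cent = degree_centrality(adj)
    # CCNS: the top_k units by centrality (cutoff rule is an assumption);
    # ties are broken by name so the result is deterministic.
    ccns = sorted(adj, key=lambda u: (-cent[u], u))[:top_k]
    weights = {}
    for ku in ccns:
        for term, freq in core_terms.get(ku, {}).items():
            weights[term] = weights.get(term, 0.0) + cent[ku] * freq
    theme = max(weights, key=weights.get) if weights else None
    return theme, weights


# Toy community: a hub unit "sorting" linked to three algorithm units.
community = {
    "sorting": {"quicksort", "mergesort", "heapsort"},
    "quicksort": {"sorting"},
    "mergesort": {"sorting"},
    "heapsort": {"sorting"},
}
core_terms = {
    "sorting": {"sort": 2, "array": 1},
    "heapsort": {"heap": 3, "sort": 1},
}
theme, weights = community_theme(community, core_terms)
print(theme)  # sort
```

Here the hub has centrality 3/6 and each leaf 1/6, so "sort" scores 0.5 * 2 + (1/6) * 1 and outweighs "heap" at (1/6) * 3.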

Compared with the prior art, the method has the following advantages. In building the theme structure tree, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed to build the community structure tree. In feature extraction, knowledge units serve directly as feature items; because knowledge units are semantically complete, they reflect the theme characteristics of a document better than traditional features based on word segmentation. In computing feature values, degree centrality is combined with the frequency of knowledge units in the document, where degree centrality reflects the global standing of a knowledge unit in the knowledge map. With these improvements, the accuracy of document theme partitioning is effectively improved relative to traditional methods.

Brief description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is the flow chart of document theme partitioning based on knowledge map community structure according to the invention.

Fig. 2 is the flow chart of domain knowledge map theme system construction in Fig. 1.

Fig. 3 is the flow chart of feature vector extraction in Fig. 1.

Embodiment

The domain knowledge map is a complex network describing the knowledge within a domain (a course or subject) and the associations among that knowledge. A knowledge unit is a basic knowledge fragment in the knowledge map that is complete in what it expresses. The domain knowledge map library is the database that stores the knowledge units of the domain and records their details, such as the name of the knowledge unit, its corresponding text segment, the core terms it contains, and the relations between knowledge units. The knowledge map of a subject is usually built from the document resources of that subject and is expressed as a network of knowledge units and their associations. After a complex-network community discovery algorithm partitions the domain knowledge map into communities, each community has a relatively independent theme. The community structure of knowledge units can therefore serve as the basis for document theme partitioning.

The document theme partitioning process based on knowledge map community structure is shown in Fig. 1 and consists of two parts: construction of the document theme partitioning model, and theme partitioning of the documents to be partitioned.

Construction of the document theme partitioning model is divided into three steps:

1. Construction of the domain knowledge map theme system. First, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed and used to partition the domain knowledge map into communities, yielding the community structure tree of the map. Fast Greedy is an agglomerative community discovery algorithm proposed by Newman et al.: initially each node is its own community; the modularity increment of merging any two communities in the network is then computed, and the two communities with the largest increment are merged; this process repeats until modularity no longer increases. GN is a divisive community discovery algorithm proposed by Girvan and Newman: it repeatedly computes the edge betweenness of the edges in the network and removes the edge with the largest betweenness, until modularity no longer increases. Each node of the community structure tree represents one community of the domain knowledge map, and the knowledge units of the same community are thematically consistent. Second, community center nodes (the important nodes of a community, characterized by the degree centrality of knowledge units) are analyzed to determine the theme of each community, which builds the domain theme structure tree and maps the community structure to a theme structure;

2. Feature space construction and computation of the feature value of each dimension: all knowledge units of the domain knowledge map are taken as feature items to construct the feature space; the knowledge units contained in each document are extracted and, combined with their degree centrality, the feature value of each dimension is computed;

3. Training dataset construction and training of the theme partitioning model: the training dataset is constructed, the BR-SVM multi-label classification algorithm is selected and trained on the dataset, and the document theme partitioning model is obtained.
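The divisive GN procedure summarized in step 1 above can be sketched in plain Python. This is an illustrative sketch, not the patent's code: all function names are hypothetical, the graph is an adjacency dictionary of sets, and edge betweenness follows Brandes-style accumulation. One deliberate difference is labeled here: instead of stopping as soon as modularity stops rising, the sketch removes edges until none remain and keeps the partition with the highest modularity, which is a common way to run GN.

```python
from collections import deque


def modularity(adj, communities):
    """Newman modularity Q of a partition, measured on the original graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for comm in communities:
        inside = sum(1 for u in comm for v in adj[u] if v in comm) / 2
        degree = sum(len(adj[u]) for u in comm)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def components(adj):
    """Connected components as frozensets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps


def edge_betweenness(adj):
    """Edge betweenness for an unweighted graph (each node pair counted twice)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order, queue = {s: 0}, [], deque([s])
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return bet


def girvan_newman(adj):
    """Remove highest-betweenness edges; return the best partition and its modularity."""
    work = {u: set(vs) for u, vs in adj.items()}
    best = components(work)
    best_q = modularity(adj, best)
    while any(work.values()):
        bet = edge_betweenness(work)
        u, v = max(bet, key=bet.get)
        work[u].discard(v)
        work[v].discard(u)
        parts = components(work)
        q = modularity(adj, parts)
        if q > best_q:
            best_q, best = q, parts
    return best, best_q


graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts, q = girvan_newman(graph)
print(sorted(sorted(p) for p in parts))  # [[0, 1, 2], [3, 4, 5]]
```

On this toy graph the bridge between the two triangles carries the most shortest paths, so it is removed first, and the resulting two-community partition has modularity 5/14, just above the 0.35 default threshold.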

Theme partitioning of a document to be partitioned proceeds as follows:

1. Feature vector representation: for a document d to be partitioned, the knowledge units of the document are extracted following the method of step 2 of the model construction part, yielding the feature vector X_d of the document;

2. Document theme partitioning: the feature vector X_d is fed to the domain document theme partitioning model M, whose output is the theme labels Y_d of the document; the theme partition of document d is obtained from the correspondence between Y_d and the domain theme structure tree T-Tree.

As shown in Fig. 2, the domain knowledge map theme system is constructed in the following steps:

(1) Domain knowledge map preprocessing: the domain knowledge map is converted into a simple undirected graph; the converted map becomes the root community node of the community structure tree and is added to the queue of community nodes awaiting analysis, CAQ. A community node is formally expressed as:

CNode(V_C, Children, Parent)    (1)

(2) Hierarchical community partitioning of the domain knowledge map: the head node CH is taken from CAQ, and the Fast Greedy and GN algorithms are each applied to partition the domain knowledge map (or subgraph) corresponding to CH into communities. A modularity threshold is introduced (default value 0.35): if the modularity values of the partitions produced by both algorithms are below 0.35, the partition is invalid and step (3) is executed. Otherwise, the modularity values of the two partitions are compared and the partition with the larger modularity is selected; a community node is created for each community in it, set as a child community node of CH, and added to CAQ;

(3) Step (2) is repeated for all nodes in CAQ until the queue is empty, yielding the community structure tree C-Tree of the domain knowledge map, formally expressed as:

C-Tree(CNodeSet, croot, n)    (2)

(4) Community center analysis: for each community node in C-Tree, the degree centrality of the knowledge units it contains is computed within the domain knowledge map subgraph corresponding to the community, and the set of nodes with larger centrality is selected as the community center node set CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as:

C_{\deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU    (3)

(5) For the knowledge units in CCNS, the domain knowledge map library is queried to obtain the core terms contained in CCNS. Combining the degree centrality of each knowledge unit with the frequency with which a core term occurs in the knowledge units of CCNS, the centrality weight W_Central of each core term is computed as:

W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku)    (4)

where C(ku) denotes the centrality of a knowledge unit in CCNS and δ(term, ku) denotes the frequency with which term occurs in ku. The core term with the largest centrality weight is selected as the theme of the community;

(6) Step (5) is performed for every community node of C-Tree, which builds the domain theme structure tree T-Tree and maps the community structure to a theme structure. T-Tree is formally expressed as:

T-Tree(CTopicSet, troot, n)    (5)

where CTopicSet denotes the set of community theme nodes, troot denotes the root node of the theme structure tree, and n denotes the number of themes. A community theme node is formally expressed as:

CTopic(Y_C, SubTopics, PTopic)    (6)

As shown in Fig. 3, the feature space is constructed and the feature value of each dimension is computed in the following steps:

(1) Feature space construction: all knowledge units in the domain knowledge map are taken as feature items, forming a multi-dimensional feature space (each knowledge unit is one dimension);

(2) Document preprocessing: each document is converted to plain text (a txt file) and its text segments are extracted. Using the TF-IDF algorithm based on the vector space model (which represents a text as a feature vector whose feature items are terms weighted by TF-IDF, and expresses the similarity between documents as the cosine of the angle between their vectors), each text segment of the document is matched for similarity against the text segment that corresponds to a knowledge unit ku in the domain knowledge map library; if the similarity reaches the threshold μ (default value 0.8), the document is considered to contain ku. All knowledge units contained in the document are extracted in this way;

(3) The degree centrality of each knowledge unit of the feature space in the domain knowledge map is computed (see formula (3)) and combined with the number of occurrences of the knowledge unit in the document, so that the document is abstracted as X_j = {W_1, W_2, ..., W_i, ..., W_n}, where n is the dimension of the feature vector and W_i is the weight of the i-th feature item, formally expressed as:

W_i = C_deg(ku_i) * kuf(ku_i, d)    (7)

where kuf(ku_i, d) denotes the frequency with which knowledge unit ku_i occurs in document d, and C_deg(ku_i) denotes the degree centrality of knowledge unit ku_i.
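The similarity matching of step (2) can be sketched as below. The sketch makes several assumptions that the text does not specify: whitespace tokenization in place of real word segmentation, a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, and the hypothetical names `tfidf_vectors`, `cosine`, and `match_knowledge_units`. Only the cosine comparison against the threshold μ = 0.8 comes from the description.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """One sparse TF-IDF vector (dict term -> weight) per tokenized text."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)  # smoothed IDF (assumption)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_knowledge_units(doc_segments, ku_segments, mu=0.8):
    """Count, per knowledge unit, the document segments whose similarity reaches mu."""
    names = list(ku_segments)
    corpus = [s.lower().split() for s in doc_segments]
    corpus += [ku_segments[name].lower().split() for name in names]
    vectors = tfidf_vectors(corpus)
    doc_vecs = vectors[:len(doc_segments)]
    ku_vecs = vectors[len(doc_segments):]
    counts = Counter()
    for dv in doc_vecs:
        for name, kv in zip(names, ku_vecs):
            if cosine(dv, kv) >= mu:
                counts[name] += 1
    return counts


doc_segments = ["a binary tree has two children", "unrelated text about cooking"]
ku_segments = {
    "binary_tree": "a binary tree has two children",
    "hash_table": "hash table maps keys",
}
print(dict(match_knowledge_units(doc_segments, ku_segments)))  # {'binary_tree': 1}
```

The resulting counts play the role of kuf(ku_i, d) in formula (7).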

The training dataset is constructed and the theme partitioning model is trained in the following steps:

(1) Training dataset construction: for each document in the given training dataset D, its feature vector is extracted with the feature extraction method described above (Fig. 3); combined with the domain knowledge map community structure tree C-Tree and the domain theme structure tree T-Tree, the training dataset is abstracted as:

D = {(X_1, Y_1), (X_2, Y_2), ..., (X_j, Y_j), ..., (X_m, Y_m)}    (8)

Y_j = {L_1, L_2, ..., L_i, ..., L_k}    (9)

where m is the number of documents in the training set and k is the number of community themes;

(2) Training: the BR-SVM algorithm is selected (BR-SVM adopts a one-vs-rest strategy to convert the multi-label problem into several binary classification problems, and trains this series of binary problems with SVM, a mature binary classification training method); using cross-validation, the document theme partitioning model M is trained on the training document set D.
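The one-vs-rest idea behind BR-SVM can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the binary learner is a linear SVM trained by plain subgradient descent on the regularized hinge loss rather than a production SVM solver, the hyperparameters are arbitrary, and cross-validation is omitted for brevity.

```python
def train_linear_svm(X, y, epochs=1000, lr=0.05, lam=0.01):
    """Binary linear SVM (labels in {-1, +1}) via full-batch subgradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0  # gradient of the L2 regularizer
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # point violates the margin: hinge loss is active
                for j in range(dim):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def train_br_svm(X, Y, labels):
    """Binary relevance: one independent binary SVM per theme label."""
    return {
        lab: train_linear_svm(X, [1 if lab in ys else -1 for ys in Y])
        for lab in labels
    }


def predict_br_svm(model, x):
    """Return the set of labels whose binary classifier scores x as positive."""
    return {
        lab for lab, (w, b) in model.items()
        if sum(wj * xj for wj, xj in zip(w, x)) + b > 0
    }


# Toy training set D = {(X_j, Y_j)}: two feature dimensions, two theme labels.
X = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [0.0, 0.0]]
Y = [{"A"}, {"B"}, {"A", "B"}, set()]
model = train_br_svm(X, Y, ["A", "B"])
print(predict_br_svm(model, [2.0, 0.0]))
```

Because each label gets its own classifier, a document can receive several theme labels or none, which matches the label set form of Y_j in formula (9).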

Claims

1. A document theme partitioning method based on domain knowledge map community structure, characterized by comprising the following steps:

Step 1. Construction of the domain knowledge map community structure tree:

CNode(V_C, Children, Parent)    (1)

C-Tree(CNodeSet, croot, n)    (2)

Step 3. Document feature vector extraction:

W_i = C_deg(ku_i) * kuf(ku_i, d)    (7)

Step 4. Construction of the document theme partitioning model:

D = {(X_1, Y_1), (X_2, Y_2), ..., (X_j, Y_j), ..., (X_m, Y_m)}    (8)

Y_j = {L_1, L_2, ..., L_i, ..., L_k}    (9)

where m is the number of documents in the training set and k is the number of community themes;

2. The document theme partitioning method based on domain knowledge map community structure according to claim 1, characterized in that the domain theme structure tree is built as follows:

C_{\deg}(ku_i) = \frac{\deg(ku_i)}{\sum_{j=1}^{n} \deg(ku_j)}, \quad ku_i \in KU    (3)

W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) \cdot \delta(term, ku)    (4)

T-Tree(CTopicSet, troot, n)    (5)

CTopic(Y_C, SubTopics, PTopic)    (6)