CN103412878B

CN103412878B - Document theme partitioning method based on domain knowledge map community structure

Info

Publication number: CN103412878B
Application number: CN201310299047.8A
Authority: CN
Inventors: 郑庆华; 董博; 刘均; 徐海鹏; 李冰; 贺欢; 马天
Original assignee: Xian Jiaotong University
Current assignee: Guangzhou Zhirongjie Intellectual Property Service Co ltd; Taiyuan University of Technology
Priority date: 2013-07-16
Filing date: 2013-07-16
Publication date: 2015-03-04
Anticipated expiration: 2033-07-16
Also published as: CN103412878A

Abstract

The invention discloses a document topic division method based on the community structure of a domain knowledge map. It mainly addresses the division of document resources related to a discipline or knowledge domain, so that topically related documents are stored in nearby logical locations and learning efficiency improves. The method is characterized in that: a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed to construct the topic structure tree; in feature extraction, knowledge units are used directly as feature items and, because a knowledge unit is semantically complete, they reflect the topical character of the feature vector better than traditional methods based on word segmentation; feature weights are computed by combining degree centrality with knowledge unit document frequency, where degree centrality reflects the standing of a knowledge unit within the knowledge map as a whole. These measures effectively improve the accuracy of document topic division, and the method is suitable for document topic division based on knowledge map community structure in general scenarios.

Description

Document topic division method based on domain knowledge map community structure

Technical Field

The invention relates to dividing documents by topic on the basis of the community structure of a domain knowledge map. It mainly addresses the division of document resources related to a discipline or knowledge domain, so that topically related documents are stored in nearby logical locations, improving storage and access efficiency.

Background Art

As online course platforms expand, the volume of documents for each discipline keeps growing. If documents with similar topics are stored in nearby logical locations, then when a learner studies one resource, other resources related to its topic can be prefetched, which reduces the time spent reading files and improves storage and access efficiency.

For document topic division, the following three patent documents provide different technical solutions:

1. Text classification feature selection and weight calculation method based on domain knowledge (CN101290626)

2. Text classification method based on modified K-nearest neighbors (CN102033949A)

3. A new method and device for feature vector weighting oriented to text classification (CN1719436A)

The method of Document 1 comprises: (1) collecting domain texts and non-domain texts as training and test corpora; (2) preprocessing the texts, including word segmentation and counting term frequency and document frequency; (3) selecting a classification feature space and computing feature weights with an improved TF-IDF method; (4) on the basis of step (3), selecting the feature space and extending it with domain terms; (5) selecting the classification feature space and computing and adjusting the feature weights with the improved TF-IDF algorithm; (6) training a text classifier with the SVM machine learning method, building a domain text classification model, and validating it experimentally on domain texts.

The method of Document 2 comprises: (1) text preprocessing: each document in the training set is segmented into words, stop words are removed, and the text is represented as items; (2) text feature selection: the text vectors are reduced in dimension, a feature function is constructed to score feature words, and as few document features as possible are selected, all closely related to the topic concepts of the document; (3) text classification: a classifier is built with a deviation-based K-nearest-neighbor text classification algorithm to obtain the classification result.

The method of Document 3 comprises: (1) collecting training and test corpora by domain; (2) removing "junk" from web page text, followed by word segmentation and part-of-speech tagging; (3) extracting a vocabulary for each domain from the training corpus, and extracting a general vocabulary; (4) building, from the general and domain vocabularies, information vocabularies with different numbers of keywords for classification; (5) classifying the test texts with the TF-IWF-DBV algorithm and optimizing to obtain the optimal threshold; (6) determining the optimal number of keywords from the classification results. Because both TF-IDF and TF-IWF rely too heavily on term frequency and cannot express the uneven distribution of vector elements across categories, Document 3 proposes a new weighting method (TF-IWF-DBV), which introduces DBV and the n-th root of TF into the TF-IWF method to compensate for these deficiencies.

The methods described above concentrate on optimizing feature extraction for text classification. They still select terms as feature items through traditional word segmentation, however, and do not fully account for the topical character of the feature items, which results in poor classification accuracy.

Summary of the Invention

To solve the topic division problem for the documents of the various disciplines in existing large-scale online courses, the invention provides a division method that combines the community structure of a domain knowledge map with document topic division, so as to group documents with similar topics.

To achieve this purpose, the invention adopts the following technical solution:

A document topic division method based on domain knowledge map community structure, characterized in that it comprises the following steps:

Step 1. Construction of the community structure tree of the domain knowledge map:

(1) Domain knowledge map preprocessing: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to the queue of nodes to be analyzed, CAQ. A community node is formally represented as:

CNode(V_C,Children,Parent) (1)

where V_C denotes the set of knowledge units contained in the community node, Children denotes the set of child nodes of the community node, and Parent denotes the parent node of the community node;

(2) Hierarchical community division of the domain knowledge map: take the head node CH from CAQ, divide the domain knowledge map (or subgraph) corresponding to CH into communities with the Fast Greedy algorithm and with the GN algorithm separately, and introduce a modularity threshold. If the modularity values of the community divisions produced by both algorithms are below the threshold, the division is invalid and step (3) is executed; otherwise, compare the modularity values of the two divisions, select the division with the larger modularity value, create a community node for each of its communities as a child community node of CH, and add these nodes to CAQ;

(3) Repeat step (2) for all nodes in CAQ until the queue is empty, which yields the community structure tree C-Tree of the domain knowledge map, formally represented as:

C-Tree(CNodeSet,croot,n) (2)

where CNodeSet denotes the set of community nodes of the community structure tree, croot denotes the root community node of the community structure tree, and n denotes the number of community nodes, that is, the number of communities in the network;
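The queue-driven construction in Step 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the graph is an adjacency-set dictionary, the two community detectors are passed in as plain callables standing in for Fast Greedy and GN, and the names `build_c_tree`, `subgraph`, and `modularity` are hypothetical. The default threshold of 0.35 is the default value given in the detailed description.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

Graph = Dict[str, Set[str]]            # adjacency sets of a simple undirected graph
Partition = List[Set[str]]
Detector = Callable[[Graph], Partition]


@dataclass
class CNode:
    """Community node, mirroring CNode(V_C, Children, Parent)."""
    v_c: Set[str]
    children: List["CNode"] = field(default_factory=list)
    parent: Optional["CNode"] = field(default=None, repr=False, compare=False)


def subgraph(g: Graph, nodes: Set[str]) -> Graph:
    """Restrict the graph to the given node set."""
    return {u: g[u] & nodes for u in nodes}


def modularity(g: Graph, parts: Partition) -> float:
    """Newman modularity: sum over communities of e_c/m - (d_c/2m)^2."""
    m = sum(len(nbrs) for nbrs in g.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for c in parts:
        e_c = sum(len(g[u] & c) for u in c) / 2   # edges inside community c
        d_c = sum(len(g[u]) for u in c)           # total degree of community c
        q += e_c / m - (d_c / (2 * m)) ** 2
    return q


def build_c_tree(g: Graph, detectors: List[Detector],
                 threshold: float = 0.35) -> CNode:
    """Breadth-first construction of the community structure tree."""
    root = CNode(v_c=set(g))
    caq = deque([root])                           # queue of nodes to be analyzed
    while caq:
        ch = caq.popleft()                        # head node CH
        sub = subgraph(g, ch.v_c)
        candidates = [detect(sub) for detect in detectors]
        best = max(candidates, key=lambda p: modularity(sub, p))
        if modularity(sub, best) < threshold:
            continue                              # division invalid: CH stays a leaf
        for community in best:
            child = CNode(v_c=set(community), parent=ch)
            ch.children.append(child)
            caq.append(child)
    return root
```

Taking the maximum over the candidate divisions and comparing it with the threshold is equivalent to the rule in step (2): the division is rejected only when every detector scores below the threshold, and otherwise the higher-scoring division wins.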

Step 2. Identify the topic of each community in the community structure tree obtained in Step 1 for the domain knowledge map, construct the domain topic structure tree, and thereby map the community structure to a topic structure;

Step 3. Document feature vector extraction:

(1) Construct the feature space: take all knowledge units in the domain knowledge map as feature items to form a multi-dimensional feature space;

(2) Document preprocessing: convert each document to plain text, extract its text segments, and use the TF-IDF algorithm based on the vector space model to match each text segment of the document against the text segment of each knowledge unit ku in the domain knowledge map library. If the similarity reaches the threshold μ, the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;

(3) Use formula (3) to compute the degree centrality, within the domain knowledge map, of each knowledge unit in the feature space and, combining it with the frequency with which the knowledge unit occurs in the document, abstract the document into the following form:

X_j={W₁,W₂,...,W_i,...,W_n}, where n denotes the dimension of the feature vector and W_i denotes the weight of the i-th feature item, formally:

W_i=C_deg(ku_i)*kuf(ku_i,d) (7)

where kuf(ku_i,d) denotes the frequency with which knowledge unit ku_i occurs in document d, and C_deg(ku_i) denotes the degree centrality of knowledge unit ku_i;
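As a small illustration of formula (7), the weights of one document can be computed as below. This is a sketch under assumptions: `feature_vector` is a hypothetical helper, the degree centralities are supplied as a precomputed dictionary, and `matched_units` is assumed to list one entry per matched occurrence of a knowledge unit in the document.

```python
from collections import Counter
from typing import Dict, List, Sequence


def feature_vector(feature_space: Sequence[str],
                   centrality: Dict[str, float],
                   matched_units: Sequence[str]) -> List[float]:
    """W_i = C_deg(ku_i) * kuf(ku_i, d) for every knowledge unit in the feature space."""
    kuf = Counter(matched_units)      # occurrences of each knowledge unit in document d
    return [centrality.get(ku, 0.0) * kuf[ku] for ku in feature_space]
```

A knowledge unit that never occurs in the document gets weight 0, so most documents map to sparse vectors over the full feature space.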

Step 4. Construction of the document topic division model:

(1) Construct the training data set: for each document in the given training data set D, extract its feature vector by the method of Step 3 and, using the community structure tree C-Tree from Step 1 and the domain topic structure tree T-Tree from Step 2, abstract the training data set into the following form:

D={(X₁,Y₁),(X₂,Y₂),...,(X_j,Y_j),...,(X_m,Y_m)} (8)

where X_j (j=1,2,...,m) denotes the feature vector of the j-th document and Y_j (j=1,2,...,m) denotes the topic label set of the j-th document, formally:

Y_j={L₁,L₂,...,L_i,...,L_k} (9)

where m is the number of documents in the training set and k is the number of community topics;

(2) Training: select the BR-SVM algorithm and, using cross-validation on the training document set D, train the document topic division model M;

Step 5. Document topic division: for a document to be divided, extract the knowledge units it contains, obtain its feature vector by the method of Step 3, and apply the document topic division model obtained in Step 4 to divide the document by topic.

In the above method, the specific steps for constructing the domain topic structure tree are:

(1) Community center analysis: for each community node in C-Tree, compute the degree centrality of each of its knowledge units within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with larger centrality as the community central node set CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed as:

$C_{deg}(ku_i) = \frac{deg(ku_i)}{\sum_{i=1}^{n} deg(ku_i)}, \quad ku_i \in KU \qquad (3)$

where deg(ku_i) denotes the degree of knowledge unit ku_i within the community, and KU denotes the set of knowledge units contained in the domain knowledge map or its subgraph;
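Formula (3) normalizes each degree by the sum of all degrees in the (sub)graph. A minimal sketch follows; the adjacency-set representation and the function names are assumptions, and because the text only says that nodes "with larger centrality" are selected, the `top_k` cut-off used here is a hypothetical selection rule.

```python
from typing import Dict, List, Set


def degree_centrality(adjacency: Dict[str, Set[str]]) -> Dict[str, float]:
    """C_deg(ku_i) = deg(ku_i) / sum of deg over all knowledge units in KU."""
    total = sum(len(nbrs) for nbrs in adjacency.values())
    if total == 0:
        return {ku: 0.0 for ku in adjacency}
    return {ku: len(nbrs) / total for ku, nbrs in adjacency.items()}


def central_node_set(adjacency: Dict[str, Set[str]], top_k: int = 2) -> List[str]:
    """Pick the top_k knowledge units by centrality (ties broken by name)."""
    c = degree_centrality(adjacency)
    return sorted(c, key=lambda ku: (-c[ku], ku))[:top_k]
```

With this normalization the centralities of a community sum to 1, so they can be compared across communities of different sizes.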

(2) For the knowledge units in CCNS, look up the domain knowledge map library to obtain the set of core terms contained in CCNS, then combine the degree centrality of each knowledge unit with the frequency with which each core term occurs in the knowledge units of CCNS to compute the centrality weight W_Central of the core term, formally:

$W_{Central}^{term} = \sum_{ku \in CCNS} C(ku) * \delta(term, ku) \qquad (4)$

where C(ku) denotes the centrality of knowledge unit ku in CCNS and δ(term,ku) denotes the frequency with which term occurs in ku; the core term with the largest centrality weight is selected as the topic of the community;
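Formula (4) and the topic choice can be illustrated as follows. This is a sketch, assuming the centralities of the CCNS knowledge units and the per-unit core-term counts have already been read from the domain knowledge map library; `term_weights` and `community_topic` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict


def term_weights(ccns_centrality: Dict[str, float],
                 term_counts: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """W_Central(term) = sum over ku in CCNS of C(ku) * delta(term, ku)."""
    weights: Dict[str, float] = defaultdict(float)
    for ku, c in ccns_centrality.items():
        for term, freq in term_counts.get(ku, {}).items():
            weights[term] += c * freq
    return dict(weights)


def community_topic(weights: Dict[str, float]) -> str:
    """The core term with the largest centrality weight names the community."""
    return max(weights, key=weights.get)
```

A term therefore scores highly either by occurring often or by occurring in highly connected knowledge units.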

(3) Perform step (2) for each community node of C-Tree, thereby constructing the domain topic structure tree T-Tree and mapping the community structure to a topic structure. T-Tree is formally represented as:

T-Tree(CTopicSet,troot,n) (5)

where CTopicSet denotes the set of community topic nodes, troot denotes the root node of the topic structure tree, and n denotes the number of topics. A community topic node is formally represented as:

CTopic(Y_C,SubTopics,PTopic) (6)

where Y_C denotes the community topic label, SubTopics denotes the set of child nodes of the topic node, and PTopic denotes the parent node of the topic node.
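The two node types, CNode from formula (1) and CTopic from formula (6), together with the community-to-topic mapping, might be modeled as below. This is only an illustrative sketch: the field names, `build_t_tree`, and `count_topics` are hypothetical, and `topic_of` stands for whatever procedure assigns a topic label to a community, such as the core-term weighting of step (2).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class CNode:
    """Community node: CNode(V_C, Children, Parent)."""
    v_c: Set[str]
    children: List["CNode"] = field(default_factory=list)
    parent: Optional["CNode"] = field(default=None, repr=False, compare=False)


@dataclass
class CTopic:
    """Community topic node: CTopic(Y_C, SubTopics, PTopic)."""
    y_c: str
    sub_topics: List["CTopic"] = field(default_factory=list)
    p_topic: Optional["CTopic"] = field(default=None, repr=False, compare=False)


def build_t_tree(cnode: CNode,
                 topic_of: Callable[[Set[str]], str],
                 parent: Optional[CTopic] = None) -> CTopic:
    """Mirror the community tree node by node, labelling each community."""
    topic = CTopic(y_c=topic_of(cnode.v_c), p_topic=parent)
    topic.sub_topics = [build_t_tree(child, topic_of, topic)
                        for child in cnode.children]
    return topic


def count_topics(topic: CTopic) -> int:
    """Number of topic nodes n in T-Tree(CTopicSet, troot, n)."""
    return 1 + sum(count_topics(t) for t in topic.sub_topics)
```

Because T-Tree is built by walking C-Tree, the two trees have the same shape, which is what lets a predicted topic label be traced back to a community.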

Compared with the prior art, the method of the invention has the following advantages. In constructing the topic structure tree, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm is proposed to build the community structure tree. In feature extraction, knowledge units are used directly as feature items; because a knowledge unit is semantically complete, they reflect the topical character of the feature vector better than traditional methods based on word segmentation. Feature weights are computed by combining degree centrality with knowledge unit document frequency, where degree centrality reflects the standing of a knowledge unit within the knowledge map as a whole. With these improvements, the accuracy of document topic division is effectively improved relative to traditional methods.

Brief Description of the Drawings

The invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flowchart of document topic division based on knowledge map community structure according to the invention.

Fig. 2 is a flowchart of the construction of the domain knowledge map topic system in Fig. 1.

Fig. 3 is a flowchart of feature vector extraction in Fig. 1.

Detailed Description

A domain knowledge map is a complex network that describes the knowledge within a domain (a course or discipline) and the associations among that knowledge. A knowledge unit is a basic fragment of knowledge in the knowledge map that is capable of complete expression. The domain knowledge map library is a database that stores the knowledge units of the domain and records their details, such as the name of each knowledge unit, its corresponding text segment, the core terms it contains, and the relations between knowledge units. The knowledge map of a discipline is usually constructed from the document resources of that discipline and is represented as a network of knowledge units and their associations. After a complex-network community discovery algorithm divides the domain knowledge map into a community structure, each community has a relatively independent topic. The community structure of the knowledge units can therefore serve as the basis for document topic division.

The implementation of document topic division based on knowledge map community structure is shown in Fig. 1 and consists of two parts: construction of the document topic division model, and topic division of the documents to be divided.

Construction of the document topic classification model comprises three steps:

1. Construction of the domain knowledge map topic system. First, a hierarchical community discovery algorithm based on the Fast Greedy algorithm and the GN algorithm divides the domain knowledge map into communities and produces its community structure tree. Fast Greedy is an agglomerative community discovery algorithm proposed by Newman et al.: initially every node is a community of its own; the modularity increment of merging each pair of communities is computed and the pair with the largest increment is merged; the process repeats until modularity no longer increases. GN is a divisive community discovery algorithm proposed by Girvan and Newman: it repeatedly computes the edge betweenness of the edges in the network and removes the edge with the largest edge betweenness, until modularity no longer increases. Each node of the community structure tree represents one community of the domain knowledge map, and the knowledge units of a community are topically consistent. Second, the topic of each community is determined by analyzing its community central nodes (important nodes of the community, characterized by the degree centrality of the knowledge units), which yields the domain topic structure tree and maps the community structure to a topic structure;

2. Construction of the feature space and computation of the feature value of each dimension: all knowledge units of the domain knowledge map are taken as feature items to build the feature space; the knowledge units contained in a document are extracted and, combined with their degree centrality, used to compute the feature value of each dimension;

3. Construction of the training data set and training of the topic division model: a training data set is constructed, the BR-SVM multi-label classification algorithm is selected, and training on the data set yields the document topic division model.
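The agglomerative procedure described for Fast Greedy in step 1 can be sketched as below. This is a deliberately naive illustration: it recomputes modularity for every candidate merge rather than maintaining the incremental gain tables that make the published algorithm fast, and the names `fast_greedy`, `merge`, and `modularity` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]   # adjacency sets of a simple undirected graph


def modularity(g: Graph, parts: List[Set[str]]) -> float:
    """Newman modularity: sum over communities of e_c/m - (d_c/2m)^2."""
    m = sum(len(nbrs) for nbrs in g.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for c in parts:
        e_c = sum(len(g[u] & c) for u in c) / 2   # edges inside community c
        d_c = sum(len(g[u]) for u in c)           # total degree of community c
        q += e_c / m - (d_c / (2 * m)) ** 2
    return q


def merge(parts: List[Set[str]], i: int, j: int) -> List[Set[str]]:
    """Return a new partition with communities i and j merged."""
    rest = [p for k, p in enumerate(parts) if k not in (i, j)]
    return rest + [parts[i] | parts[j]]


def fast_greedy(g: Graph) -> List[Set[str]]:
    """Merge the pair with the largest modularity gain until no merge helps."""
    parts = [{u} for u in g]                      # every node starts alone
    q = modularity(g, parts)
    while len(parts) > 1:
        best_gain, best_pair = 1e-12, None        # small tolerance for float noise
        for i, j in combinations(range(len(parts)), 2):
            gain = modularity(g, merge(parts, i, j)) - q
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break                                 # modularity no longer increases
        parts = merge(parts, *best_pair)
        q = modularity(g, parts)
    return parts
```

On two triangles joined by a single edge, this merges nodes within each triangle and stops before joining the triangles, since that last merge would lower modularity.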

The specific steps for dividing a document to be divided by topic are:

1. Document feature vector representation: for a document d to be divided, apply the method of step 2 in the construction of the document topic classification model to extract the knowledge units of the document and obtain its feature vector X_d;

2. Document topic division: the feature vector X_d of the document is input to the domain document topic division model M, whose output is the topic label Y_d of the document; the topic division of document d is then obtained from the correspondence between Y_d and the domain topic structure tree T-Tree.

As shown in Fig. 2, the construction of the domain knowledge map topic system is implemented in the following steps:

(1) Domain knowledge map preprocessing: convert the domain knowledge map into a simple undirected graph, take the converted map as the root community node of the community structure tree, and add it to the queue of nodes to be analyzed, CAQ. A community node is formally represented as:

CNode(V_C,Children,Parent) (1)

(2) Hierarchical community division of the domain knowledge map: take the head node CH from CAQ, divide the domain knowledge map (or subgraph) corresponding to CH into communities with the Fast Greedy algorithm and with the GN algorithm separately, and introduce a modularity threshold (default value 0.35). If the modularity values of the community divisions produced by both algorithms are below the threshold, the division is invalid and step (3) is executed; otherwise, compare the modularity values of the two divisions, select the division with the larger modularity value, create a community node for each of its communities as a child community node of CH, and add these nodes to CAQ;

C-Tree(CNodeSet,croot,n) (2)

(4) Community center analysis: for each community node in C-Tree, compute the degree centrality of each of its knowledge units within the domain knowledge map subgraph corresponding to the community, and select the set of nodes with larger centrality as the community central node set CCNS. The degree centrality of a knowledge unit within the subgraph corresponding to the community is computed by formula (3).

(5) For the knowledge units in CCNS, look up the domain knowledge map library to obtain the set of core terms contained in CCNS, then combine the degree centrality of each knowledge unit with the frequency with which each core term occurs in the knowledge units of CCNS to compute the centrality weight W_Central of the core term, as given by formula (4),

where C(ku) denotes the centrality of knowledge unit ku in CCNS and δ(term,ku) denotes the frequency with which term occurs in ku. The core term with the largest centrality weight is selected as the topic of the community;

(6) Perform step (5) for each community node of C-Tree, thereby constructing the domain topic structure tree T-Tree and mapping the community structure to a topic structure. T-Tree is formally represented as:

T-Tree(CTopicSet,troot,n) (5)

where CTopicSet denotes the set of community topic nodes, troot denotes the root node of the topic structure tree, and n denotes the number of topics. A community topic node is formally represented as:

CTopic(Y_C,SubTopics,PTopic) (6)
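The GN algorithm used in step (2) above can be sketched in plain Python as follows. This is an illustration under assumptions rather than the patent's implementation: edge betweenness is computed with a breadth-first accumulation in the style of Brandes, and instead of stopping at the first drop in modularity, the sketch removes every edge and keeps the best-scoring division seen, which is one common way to realize the stopping rule. All function names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]   # adjacency sets of a simple undirected graph


def modularity(g: Graph, parts: List[Set[str]]) -> float:
    """Newman modularity, always measured on the original graph g."""
    m = sum(len(n) for n in g.values()) / 2
    if m == 0:
        return 0.0
    return sum(sum(len(g[u] & c) for u in c) / (2 * m)
               - (sum(len(g[u]) for u in c) / (2 * m)) ** 2 for c in parts)


def components(g: Graph) -> List[Set[str]]:
    """Connected components by depth-first search."""
    seen: Set[str] = set()
    parts = []
    for s in g:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in g[u] - comp:
                comp.add(v)
                stack.append(v)
        seen |= comp
        parts.append(comp)
    return parts


def edge_betweenness(g: Graph) -> Dict[FrozenSet[str], float]:
    """Shortest-path edge betweenness (each pair counted in both directions)."""
    bet = {frozenset((u, v)): 0.0 for u in g for v in g[u]}
    for s in g:
        dist, sigma = {s: 0}, {u: 0.0 for u in g}
        sigma[s] = 1.0
        preds: Dict[str, List[str]] = {u: [] for u in g}
        order, queue = [], deque([s])
        while queue:                               # BFS counting shortest paths
            u = queue.popleft()
            order.append(u)
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {u: 0.0 for u in g}
        for v in reversed(order):                  # accumulate dependencies
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    return bet


def girvan_newman(g: Graph) -> List[Set[str]]:
    """Remove highest-betweenness edges; return the best division by modularity."""
    work = {u: set(n) for u, n in g.items()}       # working copy that loses edges
    best_parts = components(work)
    best_q = modularity(g, best_parts)
    while any(work.values()):
        bet = edge_betweenness(work)
        u, v = max(bet, key=bet.get)
        work[u].discard(v)
        work[v].discard(u)
        parts = components(work)
        q = modularity(g, parts)
        if q > best_q:
            best_q, best_parts = q, parts
    return best_parts
```

Recomputing betweenness after every removal is what makes GN expensive on large graphs, which may be one reason the method pairs it with the cheaper Fast Greedy algorithm and keeps whichever division scores higher.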

As shown in Fig. 3, constructing the feature space and computing the feature value of each dimension is implemented in the following steps:

(1) Construct the feature space: take all knowledge units in the domain knowledge map as feature items to form a multi-dimensional feature space, with one dimension per knowledge unit;

(2) Document preprocessing: convert each document to plain text (a txt file), extract its text segments, and use the TF-IDF algorithm based on the vector space model to match each text segment of the document against the text segment of each knowledge unit ku in the domain knowledge map library. This algorithm represents each text as a feature vector whose feature items are terms weighted by TF-IDF, and measures the similarity between texts by the cosine of the angle between their vectors. If the similarity reaches the threshold μ (default value 0.8), the document is considered to contain ku; all knowledge units contained in the document are extracted in this way;

(3) Compute the degree centrality, within the domain knowledge map, of each knowledge unit in the feature space (see formula (3)) and, combining it with the frequency with which the knowledge unit occurs in the document, abstract the document into the following form: X_j={W₁,W₂,...,W_i,...,W_n}, where n denotes the dimension of the feature vector and W_i denotes the weight of the i-th feature item, formally:

W_i=C_deg(ku_i)*kuf(ku_i,d) (7)

where kuf(ku_i,d) denotes the frequency with which knowledge unit ku_i occurs in document d, and C_deg(ku_i) denotes the degree centrality of knowledge unit ku_i.
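The matching in step (2) can be sketched with a small TF-IDF and cosine-similarity routine. Several details here are assumptions not fixed by the text: whitespace tokenization (Chinese text would need a word segmenter), a smoothed IDF of the form log((1+N)/(1+df))+1, and IDF statistics computed over the document segments and knowledge unit texts together. The function names are hypothetical; the default μ = 0.8 is the default value stated above.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(texts: Sequence[str]) -> List[Dict[str, float]]:
    """Represent each text as a sparse {term: tf * idf} vector."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)     # document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_knowledge_units(segments: Sequence[str],
                          ku_texts: Dict[str, str],
                          mu: float = 0.8) -> List[str]:
    """Return one entry per (segment, ku) pair whose similarity reaches mu."""
    names = list(ku_texts)
    vecs = tfidf_vectors(list(segments) + [ku_texts[k] for k in names])
    seg_vecs, ku_vecs = vecs[:len(segments)], vecs[len(segments):]
    matched = []
    for sv in seg_vecs:
        for name, kv in zip(names, ku_vecs):
            if cosine(sv, kv) >= mu:
                matched.append(name)
    return matched
```

The returned list keeps repeats on purpose, so counting its entries gives the frequency kuf(ku_i,d) used in formula (7).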

The specific steps for constructing the training data set and training the topic division model are:

(1) Construct the training data set: for each document in the given training data set D, extract its feature vector by the feature extraction method described above and, using the community structure tree C-Tree and the domain topic structure tree T-Tree of the domain knowledge map, abstract the training data set into the form given by formula (8);

(2) Training: select the BR-SVM algorithm, which uses a "one-vs-rest" strategy to convert the multi-label problem into multiple binary classification problems and trains each of them with SVM, a mature method for binary classification. Using cross-validation on the training document set D, train the document topic division model M.
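The binary relevance (BR) decomposition behind BR-SVM can be illustrated as follows. Note the substitution: to keep the sketch free of external libraries, a simple perceptron stands in for the SVM binary learner, and cross-validation is omitted. The function names and the `train_binary` parameter are hypothetical; in practice an SVM trainer would be passed in its place.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
BinaryModel = Tuple[List[float], float]            # (weights, bias)


def binary_targets(label_sets: Sequence[Set[str]],
                   labels: Sequence[str]) -> Dict[str, List[int]]:
    """One +1/-1 target list per topic label (the one-vs-rest split)."""
    return {lab: [1 if lab in ys else -1 for ys in label_sets] for lab in labels}


def score(model: BinaryModel, x: Vector) -> float:
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(xs: Sequence[Vector], ys: Sequence[int],
                     epochs: int = 50) -> BinaryModel:
    """Stand-in binary learner (NOT an SVM): the classic perceptron rule."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * score((w, b), x) <= 0:          # misclassified: update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def train_br(xs: Sequence[Vector], label_sets: Sequence[Set[str]],
             labels: Sequence[str],
             train_binary: Callable[..., BinaryModel] = train_perceptron
             ) -> Dict[str, BinaryModel]:
    """Train one independent binary model per topic label."""
    targets = binary_targets(label_sets, labels)
    return {lab: train_binary(xs, targets[lab]) for lab in labels}


def predict_br(model: Dict[str, BinaryModel], x: Vector) -> Set[str]:
    """The predicted label set Y_d: every label whose binary model fires."""
    return {lab for lab, m in model.items() if score(m, x) > 0}
```

Because each label gets its own classifier, a document can receive several topic labels at once, which matches the label set Y_j of formula (9).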

Claims

1. A method for dividing the subject of documents based on domain knowledge map community structure, characterized in that it comprises the following steps:

1. Domain knowledge map community structure tree construction:

(1) The domain knowledge map preprocessing process, transforming the domain knowledge map into a simple undirected graph, and taking the converted domain knowledge map as the root community node of the community structure tree, adding it to the node queue CAQ to be analyzed; community The formal representation of the node is as follows:

CNode(V _C ,Children,Parent) (1)

Among them, V _C represents the knowledge unit set contained in the community node, Children represents the child node set of the community node, and Parent represents the parent node of the community node;

(2) Hierarchical community division of the domain knowledge map: dequeue the head node CH from the CAQ, and use the Fast Greedy and GN algorithms to divide into communities the domain knowledge map, or the subgraph of it, that corresponds to CH, applying a modularity threshold. If the modularity values of the community divisions obtained by both algorithms are below the threshold, the division is invalid, and the method proceeds to step (3); otherwise, compare the modularity values of the two division results, select the division with the larger modularity value, create a community node for each of its communities as a sub-community node of CH, and add these nodes to the CAQ queue;

(3) Perform step (2) on all nodes in the CAQ until the CAQ queue is empty, so as to obtain the community structure tree C-Tree corresponding to the domain knowledge map, and its formal expression is as follows:

C-Tree(CNodeSet,croot,n) (2)

Among them, CNodeSet represents the community node set of the community structure tree, croot represents the root community node of the community structure tree, and n represents the number of community nodes, that is, the number of communities existing in the network;
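The queue-driven construction of step 1 can be sketched as follows. This is an illustrative sketch only: the community detection algorithms are passed in as callables, and the two partitioners shown, `split_in_half` and `singletons`, are hypothetical placeholders and are not the Fast Greedy or GN algorithms. The threshold value 0.3, the class and function names, and the guard that rejects a division into fewer than two communities are likewise assumptions made for the example.

```python
from collections import deque


class CNode:
    """Community node CNode(V_C, Children, Parent)."""
    def __init__(self, vertices, parent=None):
        self.vertices = set(vertices)
        self.children = []
        self.parent = parent


def modularity(nodes, edges, communities):
    """Modularity Q of a partition of an undirected simple graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    label = {v: i for i, com in enumerate(communities) for v in com}
    degree = {v: 0 for v in nodes}
    inside = [0] * len(communities)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inside[i] / m - (sum(degree[v] for v in com) / (2 * m)) ** 2
               for i, com in enumerate(communities))


def build_c_tree(nodes, edges, partitioners, threshold):
    """Return (root, n), where n counts the community nodes of the tree."""
    root = CNode(nodes)
    caq = deque([root])  # queue of community nodes still to be analyzed
    n = 1
    while caq:
        ch = caq.popleft()
        sub = [(u, v) for u, v in edges
               if u in ch.vertices and v in ch.vertices]
        # Keep the division with the larger modularity value.
        scored = [(modularity(ch.vertices, sub, part), part)
                  for part in (p(ch.vertices, sub) for p in partitioners)]
        best_q, best = max(scored, key=lambda item: item[0])
        # Invalid division: below the threshold (or, as an added guard, trivial).
        if best_q < threshold or len(best) < 2:
            continue
        for com in best:
            child = CNode(com, ch)
            ch.children.append(child)
            caq.append(child)
            n += 1
    return root, n


def split_in_half(vertices, edges):
    """Placeholder partitioner: halves the sorted vertex list."""
    ordered = sorted(vertices)
    mid = len(ordered) // 2
    return [set(ordered[:mid]), set(ordered[mid:])] if mid else [set(ordered)]


def singletons(vertices, edges):
    """Placeholder partitioner: every vertex is its own community."""
    return [{v} for v in vertices]


# Two triangles joined by one edge (hypothetical knowledge map).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
root, n = build_c_tree(set("abcdef"), edges, [split_in_half, singletons], 0.3)
```

In this example the root is divided into the two triangles, and neither triangle is divided further because no candidate division of a triangle reaches the threshold.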

2. Identify the community topics of the community structure tree obtained in step 1 for the domain knowledge map, and thereby construct the domain topic structure tree, realizing the mapping from the community structure to the topic structure;

3. Document feature vector extraction:

(1) Construct a feature space, and use all knowledge units in the domain knowledge map as feature items to form a multi-dimensional feature space;

(2) Document preprocessing: convert the document into plain text and extract the text segments of each document. Using the TF-IDF algorithm based on the vector space model, perform similarity matching between each text segment of the document and the text segment content corresponding to knowledge unit ku in the domain knowledge map library. If the similarity reaches the threshold μ, the document is considered to contain ku, and all knowledge units contained in the document are extracted in this way;

(3) Use formula (3) to calculate the degree centrality, in the domain knowledge map, of each knowledge unit in the feature space, and combine it with the frequency with which the knowledge unit occurs in the document, so that the document is abstracted into the following form:

X_j = {W_1, W_2, ..., W_i, ..., W_n}, where n represents the dimension of the feature vector and W_i represents the weight of the i-th feature item, formally expressed as follows:

W_i = C_deg(ku_i) * kuf(ku_i, d) (7)

Among them, kuf(ku_i, d) represents the frequency with which knowledge unit ku_i appears in document d, and C_deg(ku_i) represents the degree centrality of knowledge unit ku_i;

The expression of formula (3) is:

$$C_{deg}(ku_i) = \frac{deg(ku_i)}{\sum_{i=1}^{n} deg(ku_i)}, \quad ku_i \in KU \qquad (3)$$

Among them, deg(ku_i) represents the degree of knowledge unit ku_i, and KU represents the set of knowledge units contained in the domain knowledge map or its subgraph;
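The similarity matching of step 3 (2) can be sketched as follows, for illustration only. The sketch assumes whitespace tokenisation, a smoothed IDF, cosine similarity as the similarity measure, and that the frequency of a knowledge unit in a document is the number of text segments that match it. None of these choices is fixed by the text above, and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """TF-IDF vector (as a dict) for each text, with a smoothed IDF."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs]


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def contained_units(segments, ku_texts, mu):
    """Count, per knowledge unit, the segments whose similarity reaches mu."""
    names = list(ku_texts)
    vecs = tfidf_vectors([ku_texts[k] for k in names] + list(segments))
    ku_vecs, seg_vecs = vecs[:len(names)], vecs[len(names):]
    counts = Counter()
    for seg in seg_vecs:
        for name, ku_vec in zip(names, ku_vecs):
            if cosine(seg, ku_vec) >= mu:
                counts[name] += 1
    return counts


# Hypothetical knowledge unit texts and document segments.
ku_texts = {
    "stack": "a stack is a last in first out structure",
    "queue": "a queue is a first in first out structure",
}
segments = [
    "a stack is a last in first out structure",
    "binary trees store ordered keys",
]
counts = contained_units(segments, ku_texts, mu=0.9)
```

The resulting counts can serve as the kuf(ku_i, d) values in formula (7); a lower μ admits looser matches at the cost of more false positives.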

4. Construction of the document subject division model:

(1) Construct a training data set. For each document in a given training data set D, use the method described in step 3 to extract its feature vector; combining the domain knowledge map community structure tree C-Tree of step 1 and the domain topic structure tree T-Tree of step 2, the training data set is abstracted into the following form:

D = {(X_1, Y_1), (X_2, Y_2), ..., (X_j, Y_j), ..., (X_m, Y_m)} (8)

Among them, X_j (j = 1, 2, ..., m) represents the feature vector of the j-th document, and Y_j (j = 1, 2, ..., m) represents the topic label set of the j-th document, which is formalized as follows:

Y_j = {L_1, L_2, ..., L_i, ..., L_k} (9)

Among them, m is the number of documents in the training set, and k is the number of community topics;

(2) The BR-SVM algorithm is selected for the training process, and the cross-validation method is used to obtain the document topic division model M based on the training document set D;

5. Document subject division: For the document to be divided, extract the knowledge units contained in the document, use the method of step 3 to obtain the document feature vector representation, and use the document subject division model obtained in step 4 to realize document subject division;

In the above steps, the specific method of constructing the domain subject structure tree in step 2 is as follows:

(1) Community central point analysis, calculate the degree centrality of the knowledge units contained in each community node in the C-Tree in the domain knowledge map subgraph corresponding to the community, and select the node set with a large centrality as the community central node group CCNS; the calculation of the degree centrality of the knowledge unit in the domain knowledge map subgraph corresponding to the community is carried out according to formula (3);

(2) For the knowledge units in CCNS, search the domain knowledge map library to obtain the set of core terms contained in CCNS, then combine the degree centrality of each knowledge unit with the frequency of each core term within the knowledge units of CCNS to calculate the centrality weight W_Central^term of the core term, formally expressed as follows:

$$W_{Central}^{term} = \sum_{ku}^{CCNS} C(ku) * \delta(term, ku) \qquad (4)$$

Among them, C(ku) represents the centrality of knowledge unit ku in CCNS, and δ(term, ku) represents the frequency with which the term appears in ku; the core term with the largest centrality weight is selected as the topic of the community;

(3) Step (2) is performed for each community node of the C-Tree, thereby constructing a domain topic structure tree T-Tree, and realizing the mapping from the community structure to the topic structure. The T-Tree is formally expressed as follows:

T-Tree(CTopicSet,troot,n) (5)

Among them, CTopicSet represents the collection of community topic nodes, troot represents the root node of the topic structure tree, and n represents the number of topics; the formal representation of community topic nodes is as follows:

CTopic(Y_C, SubTopics, PTopic) (6)

Among them, Y_C represents the label of the community topic, SubTopics represents the set of child nodes of the topic node, and PTopic represents the parent node of the topic node.
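The selection of a community topic by formula (4) can be sketched as follows, for illustration only. The sketch assumes that the centralities C(ku) of the knowledge units in CCNS and the term frequencies δ(term, ku) have already been computed; the function name `community_topic` and the sample values are hypothetical.

```python
def community_topic(centrality, term_freq):
    """Return the core term with the largest centrality weight, and all weights.

    centrality: {ku: C(ku)} for the knowledge units in CCNS
    term_freq:  {ku: {term: delta(term, ku)}}
    """
    weights = {}
    for ku, c in centrality.items():
        for term, freq in term_freq.get(ku, {}).items():
            # Accumulate C(ku) * delta(term, ku) over the knowledge units.
            weights[term] = weights.get(term, 0.0) + c * freq
    topic = max(weights, key=weights.get)
    return topic, weights


# Hypothetical CCNS with three knowledge units and two core terms.
centrality = {"stack": 0.5, "queue": 0.25, "deque": 0.25}
term_freq = {
    "stack": {"linear list": 2, "pointer": 1},
    "queue": {"linear list": 1},
    "deque": {"pointer": 2},
}
topic, weights = community_topic(centrality, term_freq)
```

Here "linear list" obtains the weight 0.5 * 2 + 0.25 * 1 = 1.25 and "pointer" obtains 0.5 * 1 + 0.25 * 2 = 1.0, so "linear list" would label the community.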