CN107577785A - A Hierarchical Multi-Label Classification Approach for Legal Identification - Google Patents
- Publication number: CN107577785A (application CN201710832304.8A)
- Authority: CN (China)
- Legal status: Granted
Description
Technical Field
The invention belongs to the field of computer data analysis and mining, and relates to a hierarchical multi-label classification method suitable for legal identification.
Background Art
Hierarchical multi-label classification is a special case of multi-label classification. Unlike general multi-label classification, in a hierarchical multi-label classification problem each sample may carry multiple category labels, and the label space is organized hierarchically as a tree or a directed acyclic graph (DAG). In a DAG a node may have several parents, which makes it more complex than a tree and harder to design algorithms for; current research on hierarchical multi-label classification therefore focuses mainly on tree-shaped label structures. According to how an algorithm examines the category hierarchy, hierarchical multi-label classification algorithms can be divided into local algorithms and global algorithms.
Local algorithms examine the local classification information at each internal node of the category hierarchy one by one, turning the hierarchical multi-label classification problem into several multi-label classification problems. When training the multi-label classifier at an internal node, a suitable local sample set must be selected, and in the prediction stage a top-down scheme is used so that the predictions satisfy the hierarchy constraint. Esuli et al. (ESULI A, FAGNI T, SEBASTIANI F. TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization. String Processing and Information Retrieval, 2006: 13-24) proposed the TreeBoost.MH algorithm for hierarchical multi-label text classification. It recursively trains a multi-label classifier at every non-leaf node of the category label tree, using AdaBoost.MH as the base classifier; feature selection and training-sample selection are performed locally for each classifier. Experiments show that TreeBoost.MH outperforms AdaBoost.MH in both time efficiency and predictive performance.
Cerri et al. (CERRI R, BARROS R C, DE CARVALHO A C. Hierarchical multi-label classification using local neural networks. Journal of Computer and System Sciences, 2014, 80(1): 39-56) proposed a local hierarchical multi-label classification algorithm based on multi-layer perceptrons: one network is trained per level of the category hierarchy and predicts the labels at that level, and the predictions of one level are fed as input to the network of the next level. Because every level's network is trained on the same sample set, the predictions may violate the hierarchy constraint and must be post-processed to enforce it.
Local algorithms have two drawbacks. First, they require training many classifiers, which makes the model complex and hurts its interpretability. Second, a blocking problem arises during prediction: samples misclassified at an upper level never reach the classifiers at lower levels. Although three strategies (lowering the threshold, restricted voting, and extended multiplicative thresholds) have been proposed to mitigate blocking, local algorithms often remain unsatisfactory in prediction accuracy.
Global algorithms consider the category hierarchy as a whole and train a single hierarchical multi-label classifier to predict unseen instances. They can be grouped by how they handle the label hierarchy. One family uses category clustering: the similarity between a test sample and each category is computed first, and the sample is then assigned to the nearest category. Another family converts the hierarchical multi-label problem into an ordinary multi-label problem: Kiritchenko et al. (KIRITCHENKO S, MATWIN S, FAMILI F. Functional annotation of genes using hierarchical text categorization, 2005) extend the label set of each training sample with its ancestor labels and then apply a flat multi-label learner.
In the test phase, because the multi-label algorithm they adopt (AdaBoost.MH) ignores the category hierarchy, this approach faces the same problem as local algorithms: predictions may be hierarchically inconsistent, and the model output must be corrected to enforce the hierarchy constraint. A third family adapts existing non-hierarchical algorithms so that they can process hierarchy information directly and exploit it to improve performance. Vens et al. (VENS C, STRUYF J, SCHIETGAT L, et al. Decision trees for hierarchical multilabel classification. Machine Learning, 2008, 73(2): 185-214) proposed the Clus-HMC algorithm based on predictive clustering trees (PCT), training a single decision tree for the hierarchical multi-label problem, and compared it with Clus-HSC and Clus-SC: Clus-SC ignores the label hierarchy and trains an independent classifier per label, while Clus-HSC is a hierarchical version of Clus-SC whose predictions satisfy the hierarchy constraint. Experiments show that the global Clus-HMC algorithm beats Clus-SC and Clus-HSC in both predictive performance and time efficiency.
In general, global algorithms have two characteristics: they consider the category hierarchy as a whole in one pass, and they lack the modularity peculiar to local algorithms. The key difference between global and local algorithms lies in the training process; in the test phase a global algorithm may even predict the categories of unseen instances top-down, just like a local algorithm.
Because the category labels of a hierarchical multi-label classification problem are organized hierarchically, a sample carrying label ci implicitly carries all ancestor labels of ci. Conversely, when predicting the categories of an unseen instance, the hierarchy constraint must hold: an unseen instance cannot be predicted to belong to a category without also belonging to all ancestor categories of that category. Ordinary hierarchical multi-label algorithms often cannot guarantee that their predictions satisfy this constraint, or they fail to reach the best attainable performance because they ignore the hierarchical structure of the label space. A hierarchical multi-label algorithm should therefore both exploit the correlations and hierarchy among category labels to improve the predictive performance of the classification model and ensure that its predictions satisfy the hierarchy constraint.
Automatically identifying the law applicable to a case is in essence a hierarchical multi-label classification problem: the category labels of a sample, i.e. the legal provisions applicable to the case, are organized as a tree; a case may be governed by several provisions, and those provisions may differ in specificity. A hierarchical multi-label classification algorithm for this task must handle a tree-shaped category hierarchy and perform non-mandatory leaf-node prediction, i.e. a predicted category label may correspond to any node of the hierarchy, not only a leaf.
Summary of the Invention
Purpose of the invention: the technical problem to be solved by the present invention is to provide, in view of the deficiencies of the prior art, an effective hierarchical multi-label classification method suitable for legal identification.
Technical solution: the present invention discloses a hierarchical multi-label classification method suitable for legal identification, comprising the following steps:
Step 1. Crawl the required corpus of judgment documents from the Internet with a jsoup-based crawler; one judgment document corresponds to one sample. Randomly split the corpus into a training set and a test set at a ratio of 7:3. Then preprocess the judgment documents: following the standard structure of a judgment document, extract the case facts and the legal provisions applied; the case facts are used to generate the sample's feature vector, and the applied provisions serve as its category labels. This turns the raw text corpus into semi-structured multi-label training and test sets whose samples have the form (case-fact description, legal-provision text). Errors and inconsistent formats in the cited provisions are corrected. The Language Technology Platform (LTP) of Harbin Institute of Technology is used for word segmentation and part-of-speech tagging of the case-fact descriptions. (LTP is a complete Chinese language processing system: it defines an XML-based representation of processing results and, on top of it, provides a rich and efficient set of bottom-up Chinese processing modules covering six core technologies including lexical, syntactic and semantic analysis, together with DLL-based application programming interfaces and visualization tools, and it can also be used as a web service.)
Step 2. Because the legal provisions of a legal system are organized as a tree, the label space formed by the category labels of the multi-label training set is correspondingly tree-shaped. Based on the hierarchical structure of this label space, expand the legal provisions associated with the case facts of every sample, so that the label set of each case fact is a subset of the label space that satisfies the hierarchy constraint.
Step 3. Perform feature selection on the segmentation results of the training set from Step 1 (i.e. the segmented case-fact part of the semi-structured multi-label training set), choosing feature words that adequately represent the case facts to build feature vectors. After text representation this yields a structured expanded multi-label training set Tr and test set Te.
Step 4. Build the prediction model: for an unseen instance x from the expanded multi-label test set Te (an unseen instance is a case fact to be classified), find its k-nearest-neighbour sample set N(x) in the expanded multi-label training set Tr; assign a weight to each neighbour; from the classification weights of the k neighbours for each category of the label space, compute the confidence that x belongs to each category; and predict the label set h(x) of x, with h(x) satisfying the hierarchy constraint. Finally, using the tree structure of the label space, remove the hierarchy closure from the predicted label set h(x) (the inverse of label expansion) to obtain the specific legal provisions applicable to the unseen instance.
Step 2 includes:
Step 2-1. In a hierarchical multi-label classification problem, given a d-dimensional instance space X ⊆ R^d (R being the set of real numbers) and a label space Y = {y1, y2, ..., yq} of q categories, where yi denotes the i-th category, the hierarchical label space can be represented by the pair (Y, <), where < is a partial order on the category labels expressing the "belongs to" relation: if yi, yj ∈ Y and yi < yj, then category yi belongs to category yj, yi is a descendant category of yj, and yj is an ancestor category of yi. The partial order < is asymmetric, irreflexive and transitive, and can be described by the following four properties:
a) the unique root of the label hierarchy is a virtual category label R, and yi < R for every yi ∈ Y;
b) for any yi, yj ∈ Y, if yi < yj, then yj < yi does not hold (asymmetry);
c) for any yi ∈ Y, yi < yi does not hold (irreflexivity);
d) for any yi, yj, yk ∈ Y, if yi < yj and yj < yk, then yi < yk (transitivity).
Any multi-label classification problem whose label organization satisfies the above four properties can be regarded as a hierarchical multi-label classification problem. From this formal definition, in a hierarchical label space every other category node (excluding the start node) on the unique path from any category node up to the root is an ancestor of that node. Hence, if a sample has label yi, it implicitly also has every ancestor label of yi, and the classifier's predicted label set h(x) for an unseen instance must likewise satisfy the hierarchy constraint: for every y′ ∈ h(x) and every y″ with y′ < y″, it must hold that y″ ∈ h(x), where y′ is a category in h(x) and y″ is one of its ancestor categories.
Step 2-2. For any training sample (xi, hi) (1 ≤ i ≤ m), where m is the number of judgment-document samples obtained, xi ∈ X is the d-dimensional feature vector representing the case-fact part, and hi ⊆ Y is the label set associated with xi, i.e. the legal provisions applied to xi. Let the expanded label set be hi′; then hi′ contains every label in hi together with all of its ancestor labels. Formally, hi′ = {y ∈ Y | y ∈ hi, or y′ < y for some y′ ∈ hi}.
The label-expansion step expresses the hierarchy of the category labels explicitly in each sample's labels: if a sample is marked with certain categories, after expansion the ancestor categories of those categories are also explicitly assigned to it. Each sample's label set can therefore be viewed as a subtree of the label-space tree whose top is the root node. Consequently, if yi, yj ∈ Y and yi < yj, then among the k nearest neighbours of an unseen instance in the expanded training set, the number of samples carrying the ancestor label yj is never smaller than the number carrying label yi. Label expansion is an essential step in guaranteeing that the predictions of this learning algorithm satisfy the hierarchy constraint.
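The label expansion of Step 2-2 can be sketched as follows, assuming the label tree is given as a child-to-parent map and "R" denotes the virtual root (the map and the label names are illustrative, not from the patent):

```python
def expand_labels(labels, parent):
    """Return the ancestor closure h' of a label set h: every label in h
    plus all of its ancestors up to, but excluding, the virtual root R."""
    expanded = set()
    for y in labels:
        node = y
        while node != "R":
            expanded.add(node)
            node = parent[node]   # climb one level towards the root
    return expanded

# Toy hierarchy: Art.196 < Criminal Law Ch.5 < Criminal Law < R
parent = {"Criminal Law": "R",
          "Criminal Law Ch.5": "Criminal Law",
          "Art.196": "Criminal Law Ch.5"}
print(sorted(expand_labels({"Art.196"}, parent)))
```

After expansion, a sample labelled only with "Art.196" explicitly carries its whole ancestor chain, so each label set is indeed a root-anchored subtree of the label space.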
Step 3 comprises the following steps:
Step 3-1. The purpose of feature selection is dimensionality reduction. Because ordinary text feature selection algorithms cannot handle multi-label data sets directly, the multi-label data must first be converted to single-label data. The conversion is as follows: each multi-label sample (x, h), where |h| denotes the number of labels in the label set h, is replaced by |h| new single-label samples (x, yi) (1 ≤ i ≤ |h|, yi ∈ h); the class yi of each new sample is one category label from the original sample's label set h. Table 1 gives an example of converting a multi-label sample into single-label samples under this strategy.
Table 1. Multi-label sample conversion process
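The conversion of Step 3-1 can be sketched as follows: each multi-label sample (x, h) is replaced by |h| single-label copies so that an ordinary single-label feature selector can be applied (the sample contents are illustrative):

```python
def to_single_label(samples):
    """samples: list of (x, label_set) -> list of (x, single_label)."""
    out = []
    for x, h in samples:
        for y in sorted(h):          # one new sample per label in h
            out.append((x, y))
    return out

multi = [("case 1 facts", {"Art.196", "Art.264"})]
print(to_single_label(multi))
```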
Step 3-2. After the conversion of Step 3-1, each multi-label case sample has become several single-label case samples, and an ordinary feature selection algorithm can be applied to the segmentation results of the original training set from Step 1. A certain number of discriminative feature words are selected to form the feature space (the number usually depends on the corpus; for instance, when using information gain, the selected words should capture as much of the total information gain as possible without the vocabulary becoming too large, and at least 100 feature words are generally kept). The case-fact part of each case sample is then represented by feature words from this feature space. The attribute value of each feature word, i.e. its feature weight, is computed with the standard TF-IDF scheme. Treating the case-fact part of each sample as a segmented document, the case-fact parts of all samples form a document collection. The feature weight tf-idfij of the j-th feature in the i-th document of the collection is defined as

tf-idf_ij = tf_ij · log(N / n_j) / sqrt( Σ_s ( tf_is · log(N / n_s) )² ),

where tfij is the frequency of feature word tj in document di, idfj = log(N/nj) is the inverse document frequency of tj in the collection, N is the total number of documents, nj is the document frequency of tj (the number of documents containing tj), and the denominator is a normalization factor.
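The TF-IDF weighting of Step 3-2, with the L2 normalization implied by the "normalization factor" remark, can be sketched on a toy corpus (the corpus, vocabulary and function names are illustrative):

```python
import math

def tfidf_matrix(docs, vocab):
    """docs: list of token lists; returns one L2-normalized tf-idf row per doc."""
    n_docs = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}  # n_j per feature
    rows = []
    for d in docs:
        raw = [d.count(t) * math.log(n_docs / df[t]) if df[t] else 0.0
               for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0   # normalization factor
        rows.append([w / norm for w in raw])
    return rows

docs = [["theft", "bank", "branch"], ["theft", "large", "amount"]]
vocab = ["theft", "bank", "large"]
m = tfidf_matrix(docs, vocab)
```

Note that a word occurring in every document ("theft" above) gets idf = log(1) = 0 and thus carries no weight, which is the intended behaviour of the scheme.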
Step 3-3. Perform feature selection on the segmentation results of the original training set from Step 1 and select roughly 100 of the most discriminative feature words to form the feature vector. Common text feature selection methods are based on measures such as document frequency (DF), mutual information (MI), information gain (IG) and the chi-square statistic (χ² statistic, CHI). Selection by document frequency alone is too crude and often fails to pick the most informative words, while mutual information is easily biased by the marginal probabilities of feature words; this hierarchical multi-label classification method therefore uses information gain or the chi-square statistic for feature selection.
Step 3-3 may use the information gain algorithm for feature selection. The information gain IG(t) of a feature word t is defined as

IG(t) = − Σ_{i=1}^{q} Pr(yi) log Pr(yi) + Pr(t) Σ_{i=1}^{q} Pr(yi|t) log Pr(yi|t) + Pr(t̄) Σ_{i=1}^{q} Pr(yi|t̄) log Pr(yi|t̄),

where Pr(yi) is the probability that category yi occurs, Pr(t) is the probability that feature t occurs, Pr(yi|t) is the probability of category yi given that t occurs, Pr(t̄) is the probability that t does not occur, and Pr(yi|t̄) is the probability of yi given that t does not occur. The information gain is computed for every feature word in the document collection, and words whose information gain falls below a preset threshold (e.g. 0.15; the threshold should keep the total information gain of the selected words as large as possible without the vocabulary becoming too large) are excluded from the feature space.
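The information-gain criterion can be sketched on a toy corpus of (token list, class) pairs; the function and data names are illustrative:

```python
import math

def info_gain(data, term, classes):
    n = len(data)
    present = [(d, y) for d, y in data if term in d]
    absent = [(d, y) for d, y in data if term not in d]

    def plogp(subset):
        # returns sum_i P(y_i | subset) * log2 P(y_i | subset)
        if not subset:
            return 0.0
        s = 0.0
        for c in classes:
            p = sum(1 for _, y in subset if y == c) / len(subset)
            if p > 0:
                s += p * math.log2(p)
        return s

    # IG(t) = -sum P(y)logP(y) + P(t) sum P(y|t)logP(y|t)
    #         + P(~t) sum P(y|~t)logP(y|~t)
    return (-plogp(data)
            + len(present) / n * plogp(present)
            + len(absent) / n * plogp(absent))

data = [(["theft", "bank"], "A"), (["theft"], "A"),
        (["contract"], "B"), (["contract", "bank"], "B")]
ig = info_gain(data, "theft", ["A", "B"])
```

In this toy corpus "theft" perfectly separates the two classes and gets the maximal gain of 1 bit, whereas "bank", which occurs equally in both classes, gets a gain of 0.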
Step 3-3 may instead use the chi-square statistic for feature selection. The null hypothesis is that a feature word and a category are independent; the further the test statistic computed from the CHI distribution deviates from the critical value, the more confidently the null hypothesis can be rejected in favour of the alternative hypothesis, namely that the feature word is highly correlated with the category.
Let A be the number of documents that contain feature word t and belong to category y, B the number that contain t but do not belong to y, C the number that do not contain t but belong to y, D the number that neither contain t nor belong to y, and N the total number of documents. The chi-square statistic χ²(t, y) of feature word t and category y is defined as

χ²(t, y) = N · (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)).

When t and y are independent, the statistic is 0. For each feature word, its chi-square statistic with respect to every category is computed, and then the mean χ²avg(t) and the maximum χ²max(t) are taken; considering both, roughly 100 of the most discriminative feature words are selected:
χ²avg(t) = Σ_{i=1}^{q} Pr(yi) χ²(t, yi),
χ²max(t) = max_{i=1,...,q} χ²(t, yi).
Here Pr(yi) is the probability that category yi occurs. The main advantage of chi-square feature selection over mutual information is that the statistic is a normalized value, so it compares different feature words within the same category more fairly.
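The chi-square score can be computed directly from the A, B, C, D document counts defined above (the counts below are toy values, and the formula in the comment is the standard two-by-two chi-square):

```python
def chi_square(A, B, C, D):
    """chi2(t, y) = N * (A*D - C*B)**2 / ((A+C) * (B+D) * (A+B) * (C+D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# e.g. 100 documents in which the term and the category co-occur strongly
score = chi_square(40, 10, 10, 40)
```

When the term and the category are independent (A·D = C·B), the score is 0, matching the remark above.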
In Step 4, when searching for the k nearest neighbours, the distance d(x, xi) between the unseen instance x and a sample (xi, hi) is measured by the reciprocal of the cosine similarity of their feature vectors. The cosine similarity cos(γ, λ) of the unseen instance's feature vector γ and a neighbour sample's feature vector λ is computed as

cos(γ, λ) = Σ_{s=1}^{S} γs λs / ( sqrt(Σ_{s=1}^{S} γs²) · sqrt(Σ_{s=1}^{S} λs²) ),

where s indexes the vector components (i.e. the position of a component within the vector), S is the dimension of the vectors, and γs and λs are the s-th components of γ and λ respectively.
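The cosine similarity and the derived distance (its reciprocal) can be sketched as follows; the handling of zero-similarity pairs as infinitely distant is an assumption of this sketch, not stated in the patent:

```python
import math

def cosine(g, l):
    """cos(gamma, lambda) = sum g_s*l_s / (||g|| * ||l||)."""
    dot = sum(gs * ls for gs, ls in zip(g, l))
    ng = math.sqrt(sum(gs * gs for gs in g))
    nl = math.sqrt(sum(ls * ls for ls in l))
    return dot / (ng * nl) if ng and nl else 0.0

def distance(g, l):
    c = cosine(g, l)
    return 1.0 / c if c > 0 else float("inf")  # orthogonal -> infinitely far

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```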
In Step 4, with d(x, xi) denoting the distance between instance x and sample (xi, hi), the classification weight wij of a neighbour sample ((xi, hi) ∈ N(x)) for a category yj in hi is computed by either the full-label distance weight method or the entropy-label distance weight method:
Full-label distance weight method for wij:
Entropy-label distance weight method for wij:
实例属于类别yj的置信度c(x,yj)计算公式如下:The calculation formula of the confidence c(x,y j ) that the instance belongs to the category y j is as follows:
其中r表示第r个类别,wir表示hi的第r个类别yr的分类权重;where r represents the r-th category, and w ir represents the classification weight of the r-th category y r of hi;
预测未见实例x的类别标签集合h(x)为:The category label set h(x) of the predicted unseen instance x is:
选择0.5作为决策阈值,当未见实例属于各个类别的置信度都小于决策阈值时,返回置信度最大的类别作为未见实例所属的类别。Choose 0.5 as the decision threshold, when the confidence of each category that the unseen instance belongs to is less than the decision threshold, return the category with the highest confidence as the category to which the unseen instance belongs.
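Assuming the confidences c(x, y_j) have already been computed, the decision rule above can be sketched as follows (the function name is illustrative):

```python
def predict_labels(confidence, threshold=0.5):
    """Return h(x): every label whose confidence exceeds the threshold.

    If no label clears the threshold, fall back to the single most
    confident label, as the decision rule in the text prescribes.
    """
    h = {y for y, c in confidence.items() if c > threshold}
    if not h:
        h = {max(confidence, key=confidence.get)}
    return h
```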
As a hierarchical multi-label classification method, its predictions must satisfy the hierarchy constraint; that is, for every y′ ∈ h(x) and every y″ ∈ Y with y′ < y″, it holds that y″ ∈ h(x). The proof is as follows. By the confidence formula, if the algorithm predicts that the unseen instance x has category label y_a (y_a ∈ Y), then the confidence c(x, y_a) that x belongs to y_a is either greater than the threshold t or the maximum over all categories. Consider an ancestor category y_b of y_a (y_b ∈ Y, y_a < y_b). If y_b corresponds to the virtual root node of the category hierarchy, then x carrying label y_a obviously satisfies the hierarchy constraint. Otherwise, for any neighbor sample (x_i, Y_i) ∈ N(x) of x, if y_a ∈ Y_i then y_b ∈ Y_i as well, while the converse does not necessarily hold; the label expansion of the training set guarantees this. Therefore, under both the full-label distance weight method and the entropy-label distance weight method, it can be deduced that the numerator of the confidence can only grow while the denominator remains unchanged, so the confidence c(x, y_b) that x belongs to category y_b is not less than the confidence c(x, y_a) that x belongs to category y_a. If c(x, y_a) > t, then necessarily c(x, y_b) > t, so the prediction satisfies the hierarchy constraint.
Finally, the performance of this learning method is evaluated with hierarchical metrics: hierarchical precision (hP), hierarchical recall (hR), and the hierarchical F-measure (hF), defined as:

hP = Σ_i |P̂_i ∩ T̂_i| / Σ_i |P̂_i|,  hR = Σ_i |P̂_i ∩ T̂_i| / Σ_i |T̂_i|,  hF = 2·hP·hR / (hP + hR),

where P̂_i is the set of categories test sample i is predicted to belong to together with their ancestor categories, T̂_i is the set of categories test sample i actually belongs to together with their ancestor categories, and the summations run over all test samples.
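A sketch of these hierarchical metrics, assuming the standard set-overlap definitions:

```python
def hierarchical_prf(predicted, actual):
    """Hierarchical precision/recall/F over a test set.

    predicted, actual: lists of sets, one pair per test sample; each set
    holds the sample's categories together with all their ancestors.
    """
    overlap = sum(len(p & t) for p, t in zip(predicted, actual))
    hp = overlap / sum(len(p) for p in predicted)
    hr = overlap / sum(len(t) for t in actual)
    hf = 2 * hp * hr / (hp + hr)
    return hp, hr, hf
```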
To make identification of a case's applicable law more practical, the target categories predicted by the algorithm should preferably be specific legal clauses rather than only broad laws, so this method considers prediction performance in two settings: the target categories are all legal provisions, or only the specific legal clauses. Below, hP_all, hR_all, and hF_all denote the hierarchical precision, recall, and F-measure of the system when the target categories are all legal provisions, and hP_partial, hR_partial, and hF_partial denote the hierarchical precision, recall, and F-measure of the algorithm when the target categories are the specific legal clauses.
In addition to the hierarchical metrics, precision, recall, and F-measure can also be computed per category, and their means over all categories used as system-level metrics, i.e. the macro-averaged (Macro-averaging) precision, recall, and F-measure. For each category, let TP be the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives; the macro-averaged precision, recall, and F-measure, Macro-P, Macro-R, and Macro-F, are computed as:

Macro-P = (1/q)·Σ_{i=1}^{q} TP_i/(TP_i+FP_i),  Macro-R = (1/q)·Σ_{i=1}^{q} TP_i/(TP_i+FN_i),  Macro-F = (1/q)·Σ_{i=1}^{q} 2·P_i·R_i/(P_i+R_i).
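A sketch of the macro-averaged metrics (the input layout is an assumption; TN is not needed by these formulas):

```python
def macro_prf(counts):
    """Macro-averaged precision/recall/F.

    counts: list of (TP, FP, FN) tuples, one per category.
    """
    ps, rs, fs = [], [], []
    for tp, fp, fn in counts:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
        ps.append(p)
        rs.append(r)
    n = len(counts)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```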
The present invention is a global hierarchical multi-label classification method: it considers the hierarchical structure of the category labels as a whole and guarantees that the predictions also satisfy the hierarchy constraint. The learning method is a lazy learning algorithm; it builds no explicit prediction model on the training set and merely stores the original multi-label samples after label expansion, and therefore supports incremental learning. In the prediction stage, it first finds the k nearest neighbor samples of an unseen instance in the training set, uses the classification weights of these neighbors for each category to determine the confidence that the instance belongs to each category, and then predicts the categories of the unseen instance. The model is simple and supports incremental learning, so it applies well to hierarchical multi-label classification problems with massive and continually growing data, such as automatic identification of the law applicable to a case.

Beneficial effects: the hierarchical multi-label classification method for legal identification provided by the present invention fully considers, as a whole, the tree-shaped hierarchy of the label space of legal provisions, so the predictions satisfy the hierarchy constraint and require no additional correction. At the same time, the model is simple, supports incremental learning, and applies well to hierarchical multi-label classification problems with massive and continually growing data, such as automatic identification of the law applicable to a case.
Description of drawings

The present invention is further described below in conjunction with the accompanying drawings and specific embodiments; the above and other advantages of the present invention will become clearer.

Fig. 1: Main flow chart of the present invention.

Fig. 2: Sample judgment document.

Fig. 3: Tree structure of the label space of legal provisions.

Fig. 4: Frequency distribution of combinations of legal provisions.

Fig. 5: Performance comparison of the hierarchical metrics for different numbers of neighbors.

Fig. 6: Performance comparison of the macro-averaged metrics for different numbers of neighbors.

Fig. 7: Performance comparison of each metric under different weighting strategies.

Detailed description

The present invention is further described below in conjunction with the accompanying drawings and embodiments.
The invention discloses a hierarchical multi-label classification method for legal identification, comprising the following steps:

Step 1: use jsoup-based crawler technology to crawl the required raw text data set of judgment documents from the Internet and randomly split it into a training set and a test set at a ratio of 7:3. Then preprocess the judgment documents, which mainly involves the following tasks:

extracting the case facts and the applicable legal provisions according to the structure of the judgment document, the former used to generate the feature vector of a case sample and the latter used as the sample's category labels, thereby converting the raw text data set into semi-structured multi-label training and test sets;

correcting errors and format inconsistencies in the legal provisions applicable to the cases;

performing word segmentation and part-of-speech tagging on the case fact descriptions using LTP, the language technology platform of Harbin Institute of Technology.

Step 2: since the legal provisions of the legal system are organized as a tree, the label space formed by the category labels of the multi-label training set is correspondingly a tree. Based on the hierarchy of the label space, expand the legal provisions corresponding to the case facts of every sample so that the category label set of each case fact is a subset of the label space and satisfies the hierarchy constraint.

Step 3: perform feature selection on the word segmentation results of the original training set from step 1, selecting feature words that adequately represent the case facts to build the feature vectors; after text representation, structured extended multi-label training and test sets Tr and Te are obtained.

Step 4: build the prediction model. Find the set N(x) of k nearest neighbor samples in the extended multi-label training set Tr of an unseen instance x from the extended multi-label test set Te, set a weight for each neighbor sample, compute from the k neighbors' classification weights for each category the confidence that the unseen instance belongs to each category in the label space, and predict the category label set h(x) of the unseen instance, where h(x) satisfies the hierarchy constraint. Finally, according to the tree structure of the label space, remove the hierarchy-induced labels from the predicted category set h(x) (the inverse of label expansion) to obtain the specific legal provisions applicable to the unseen instance.
Step 2 includes:

Step 2-1: in the hierarchical multi-label classification problem, given a d-dimensional instance space X and a label space Y = {y_1, y_2, …, y_q} containing q categories, where y_i denotes the i-th category, the category label hierarchy can be represented by the pair (Y, <), where < is a partial order on the category labels. The partial order < can be read as the "belongs to" relation: if y_i, y_j ∈ Y and y_i < y_j, then category y_i belongs to category y_j, y_i is a descendant category of y_j, and y_j is an ancestor category of y_i. The partial order < is asymmetric, irreflexive, and transitive, and can be described by the following four properties:

e) the unique root node of the category label hierarchy is represented by a virtual category label R, and y_i < R for every y_i ∈ Y;

f) for any y_i, y_j ∈ Y, if y_i < y_j, then it is not the case that y_j < y_i;

g) for any y_i ∈ Y, it is not the case that y_i < y_i;

h) for any y_i, y_j, y_k ∈ Y, if y_i < y_j and y_j < y_k, then y_i < y_k.

Any multi-label classification problem whose category labels are organized in a structure satisfying the above four properties can be regarded as a hierarchical multi-label classification problem. From this formal definition, in a hierarchical category label space, all other category nodes (excluding the starting node) on the unique path traced upward from any category node to the root node are ancestor category nodes of that category. Therefore, if a sample has category label c_i, the sample also implicitly has all ancestor category labels of c_i; this requires that the classifier's predicted category set h(x) for an unseen instance also satisfy the hierarchy constraint, i.e., for every y′ ∈ h(x) and every y″ ∈ Y with y′ < y″, it holds that y″ ∈ h(x).
Step 2-2: for any training sample (x_i, y_i) (1 ≤ i ≤ m), where m is the total number of judgment document samples obtained, x_i ∈ X is a d-dimensional feature vector and y_i is the set of category labels associated with x_i. Let the expanded category label set be y_i′; then y_i′ contains every category label in y_i together with all of its ancestor category labels. Formally,

y_i′ = y_i ∪ { y ∈ Y : ∃ y_a ∈ y_i, y_a < y }.

The label expansion process makes the hierarchical relations among category labels explicit in the samples' category labels: if a sample is labeled with certain categories, then after label expansion the ancestor categories of those categories are also explicitly assigned to the sample. The category label set of each sample can therefore be viewed as a subtree of the label space tree, with the root node at the top of each subtree. It follows that if y_i, y_j ∈ Y and y_i < y_j, then among the k nearest neighbor samples of an unseen instance in the expanded multi-label training set, the number of samples carrying label y_j cannot be smaller than the number carrying label y_i. Label expansion is the key step that guarantees the predictions of this learning algorithm satisfy the hierarchy constraint.
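The label expansion of step 2-2 can be sketched as follows (the parent map and the choice to exclude the virtual root "R" from the expanded set are assumptions of this sketch):

```python
def expand_labels(labels, parent):
    """Close a label set upward under the ancestor relation.

    labels: set of category labels attached to a sample
    parent: dict mapping each label to its parent; "R" is the virtual root
    """
    expanded = set()
    for y in labels:
        while y != "R":        # stop below the virtual root
            expanded.add(y)
            y = parent[y]
    return expanded
```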
Step 3 includes the following steps:

Step 3-1: the purpose of feature selection is dimensionality reduction. Since common text feature selection algorithms cannot handle multi-label data sets directly, the multi-label data must first be converted to single-label data. The conversion is as follows: for each multi-label sample (x, h), let |h| denote the number of category labels in the label set h, and replace the sample with |h| new single-label samples (x, y_i) (1 ≤ i ≤ |h|, y_i ∈ h), where the class y_i of each new sample is one category label from the original multi-label sample's label set h. Table 1 gives an example of converting a multi-label sample to single-label samples by this strategy.
Table 1: Multi-label sample conversion process
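The conversion described above can be sketched in a single comprehension:

```python
def to_single_label(samples):
    """Replace each multi-label sample (x, h) with |h| single-label copies (x, y_i)."""
    return [(x, y) for x, h in samples for y in h]
```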
Step 3-2: after the conversion in step 3-1, the multi-label case samples have become single-label case samples, and a general feature selection algorithm can be applied to the word segmentation results of the original training set from step 1 to select roughly 100 of the most discriminative feature words, which form the feature space. The case fact part of each case sample is represented with feature words from the feature space, and the attribute value of each feature word, i.e. its feature weight, is computed with the common TF-IDF algorithm. Treating the case fact part of each sample as an already segmented document, the case fact parts of all samples form a document collection. The feature weight tf-idf_ij of the j-th feature in the i-th document is defined as:

tf-idf_ij = tf_ij·log(N/n_j) / √( Σ_{j′} ( tf_{ij′}·log(N/n_{j′}) )² ),

where tf_ij is the frequency of feature word t_j in document d_i, idf_j = log(N/n_j) is the inverse document frequency of t_j in the collection, N is the total number of documents in the collection, n_j is the document frequency of t_j, i.e. the number of documents in the collection containing t_j, and the denominator is a normalization factor.
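A sketch of this TF-IDF weighting for one document, with the normalizing denominator taken as the Euclidean norm of the unnormalized weights (an assumption of this sketch):

```python
import math

def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """TF-IDF weights for one document.

    term_freqs: dict feature word -> raw frequency tf_ij in this document
    doc_freqs:  dict feature word -> document frequency n_j in the collection
    n_docs:     total number of documents N
    """
    raw = {t: tf * math.log(n_docs / doc_freqs[t]) for t, tf in term_freqs.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # normalization factor
    return {t: w / norm for t, w in raw.items()}
```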
Step 3-3: perform feature selection on the word segmentation results of the original training set from step 1, selecting a certain number of discriminative feature words to form the feature vector. Common text feature selection methods are mainly based on measures such as document frequency (DF), mutual information (MI), information gain (IG), and the chi-square statistic (χ² statistic, CHI). Document-frequency-based selection is too simplistic and often fails to pick the feature words carrying the most classification information, while mutual information is easily distorted by the marginal probabilities of feature words; this hierarchical multi-label classification method therefore uses the information gain or chi-square statistic algorithm for feature selection.

Step 3-3 may use the information gain algorithm for feature selection. The information gain IG(t) of a feature word t is defined as:

IG(t) = −Σ_{i=1}^{q} P_r(y_i)·log P_r(y_i) + P_r(t)·Σ_{i=1}^{q} P_r(y_i|t)·log P_r(y_i|t) + P_r(t̄)·Σ_{i=1}^{q} P_r(y_i|t̄)·log P_r(y_i|t̄),

where P_r(y_i) is the probability that category y_i occurs, P_r(t) is the probability that feature t occurs, P_r(y_i|t) is the probability of category y_i given that feature t occurs, P_r(t̄) is the probability that feature t does not occur, and P_r(y_i|t̄) is the probability of category y_i given that feature t does not occur. The information gain of every feature word in the document collection is computed, and feature words whose information gain falls below a set threshold are excluded from the feature space.
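The information gain computation can be sketched as follows (the natural logarithm is assumed, since the log base is not stated):

```python
import math

def information_gain(p_y, p_t, p_y_given_t, p_y_given_not_t):
    """IG(t): how much observing feature t reduces category uncertainty.

    p_y:             list of P_r(y_i)
    p_t:             P_r(t)
    p_y_given_t:     list of P_r(y_i | t present)
    p_y_given_not_t: list of P_r(y_i | t absent)
    """
    def plogp(ps):
        # sum of p*log(p), skipping zero probabilities
        return sum(p * math.log(p) for p in ps if p > 0)

    return (-plogp(p_y)
            + p_t * plogp(p_y_given_t)
            + (1 - p_t) * plogp(p_y_given_not_t))
```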
Step 3-3 may instead use the chi-square statistic algorithm to select features from the case fact texts of the training set. First assume that a feature word and a category are independent; the further the test value computed from the CHI distribution deviates from the threshold, the more confidently the null hypothesis can be rejected in favor of the alternative hypothesis, namely that the feature word is highly correlated with the category.

Let A be the number of documents that contain feature word t and belong to category y, B the number that contain t but do not belong to y, C the number that do not contain t but belong to y, and D the number that neither contain t nor belong to y, and let N be the total number of documents. The chi-square statistic χ²(t, y) of feature word t and category y is defined as:

χ²(t, y) = N·(A·D − C·B)² / ( (A+C)·(B+D)·(A+B)·(C+D) ).
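The χ²(t, y) statistic built from the four document counts can be computed as:

```python
def chi_square(a, b, c, d):
    """chi2(t, y) = N * (A*D - C*B)**2 / ((A+C) * (B+D) * (A+B) * (C+D)).

    a: docs containing t, in y        b: docs containing t, not in y
    c: docs without t, in y           d: docs without t, not in y
    """
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))
```

When t and y are independent (A·D = C·B), the statistic is 0, matching the independence remark in the text.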
When feature word t and category y are independent, their chi-square statistic is 0. For each feature word, compute its chi-square statistic with respect to every category, then compute the mean χ²_avg(t) and the maximum χ²_max(t); considering both measures together, select the most discriminative feature words:

χ²_avg(t) = Σ_{i=1}^{q} P_r(y_i)·χ²(t, y_i),

χ²_max(t) = max_{i=1,…,q} χ²(t, y_i).

P_r(y_i) denotes the probability that category y_i occurs. The main advantage of the chi-square feature selection algorithm over mutual information is that the statistic is normalized, so different feature words within the same category can be compared more reliably.
In step 4, when finding the k nearest neighbors, the distance d(x, x_i) between an unseen instance x and a sample (x_i, h_i) is measured by the reciprocal of the cosine similarity of their feature vectors. The cosine similarity cos(γ, λ) between the feature vector γ of the unseen instance and the feature vector λ of a neighbor sample is computed as:

cos(γ, λ) = ( Σ_{s=1}^{S} γ_s·λ_s ) / ( √(Σ_{s=1}^{S} γ_s²) · √(Σ_{s=1}^{S} λ_s²) ),

where s is the index of a vector component, i.e. its position within the vector, S is the dimension of the vectors, γ_s is the s-th component of γ, and λ_s is the s-th component of λ.
In step 4, let d(x, x_i) denote the distance between instance x and sample (x_i, h_i). The classification weight w_ij of a sample (x_i, h_i) ∈ N(x) for category y_j is computed by the full-label distance weight method or by the entropy-label distance weight method. From these weights, the confidence c(x, y_j) that the unseen instance belongs to category y_j is computed, and the predicted category label set h(x) of the unseen instance x is derived from the confidences. A decision threshold of 0.5 is chosen; when the unseen instance's confidence for every category falls below the decision threshold, the category with the highest confidence is returned as the category of the unseen instance.
Example

As shown in Fig. 1, the steps of the present invention are:
Step 1: use jsoup-based crawler technology to crawl the required raw text data set of judgment documents from the Internet and randomly split it into a training set and a test set at a ratio of 7:3. Then preprocess the judgment documents, which mainly involves the following tasks:

extracting the case facts and the applicable legal provisions according to the structure of the judgment document, the former used to generate the feature vector of a case sample and the latter used as the sample's category labels, thereby converting the raw text data set into semi-structured multi-label training and test sets;

correcting errors and format inconsistencies in the legal provisions applicable to the cases;

performing word segmentation and part-of-speech tagging on the case fact descriptions using LTP, the language technology platform of Harbin Institute of Technology.

Step 2: based on the hierarchy of the label space, expand the legal provisions corresponding to the case facts of every sample so that the category label set of each case fact is a subset of the label space and satisfies the hierarchy constraint.

Step 3: perform feature selection on the word segmentation results of the original training set from step 1, selecting feature words that adequately represent the case facts to build the feature vectors; after text representation, structured extended multi-label training and test sets Tr and Te are obtained.

Step 4: build the prediction model. First find the set N(x) of k nearest neighbor samples in the extended multi-label training set Tr of an unseen instance x from the extended multi-label test set Te, set a weight for each neighbor sample, compute from the k neighbors' classification weights for each category the confidence that the unseen instance belongs to each category in the label space, and predict the category label set h(x) of the unseen instance, where h(x) satisfies the hierarchy constraint. Finally, according to the tree structure of the label space, remove the hierarchy-induced labels from the predicted category set h(x) (the inverse of label expansion) to obtain the specific legal provisions applicable to the unseen instance.
The data of this embodiment are taken from the judgment documents of people's courts at all levels in Zhejiang Province, published on the Zhejiang Court Open Network.

Fig. 2 shows a sample judgment document, in which the part marked with a straight underline is the case facts and the part marked with a wavy underline is the legal provisions applicable to the case. The case facts and their legal provisions are extracted according to the writing conventions of judgment documents. Preprocessing mainly consists of cleaning and correcting the applicable-law part of each case.

Fig. 3 shows the tree structure of the label space of legal provisions. Based on this hierarchy, the legal provisions corresponding to the facts of each case are label-expanded.

Fig. 4 shows the frequency distribution of combinations of legal provisions. According to how often each legal provision is cited, 26 frequently cited laws, such as the Civil Procedure Law of the People's Republic of China and the Contract Law of the People's Republic of China, together with the 451 specific legal clauses they contain, were selected as the category labels forming the label space; the dimension of the label space is therefore 477. The category label set of each case sample is represented as a label vector, each dimension of which corresponds to one category label in the label space, i.e. one complete legal provision. If a legal provision applies to a case, then in the case's label vector the entries for that provision and for every legal provision containing it are 1, and otherwise 0. Each sample's label vector thus corresponds to one combination of legal provisions; the frequency of each combination is the number of corresponding case samples, and the frequency distribution of the combinations reflects some properties of the case sample collection. Fig. 4 is obtained by counting the frequency of each combination and arranging the more frequent combinations in descending order. The figure shows that the frequencies of the combinations roughly follow a long-tail distribution: a few combinations occur extremely often, indicating that a large number of case samples use those combinations, while the frequencies of most other combinations are fairly balanced.
步骤三选择信息增益算法进行特征选择。通过计算各个特征词的信息增益可以发现,具有较高信息增益的词大多为动词或名词,表2中显示了信息增益值最高的特征词中动词和名词所占比例,可见在适用法律识别问题中名词和动词相比其他性质的词更具有区分能力,也从另一方面说明可以通过词性标注,去除文本中动词名词之外的词,从而减少文本中词的数量,简化后续计算。Step 3 Select the information gain algorithm for feature selection. By calculating the information gain of each characteristic word, it can be found that the words with higher information gain are mostly verbs or nouns. Table 2 shows the proportion of verbs and nouns in the characteristic words with the highest information gain value, which can be seen in the identification of applicable laws Middle nouns and verbs are more distinguishable than words of other natures. On the other hand, it also shows that words other than verbs and nouns in the text can be removed through part-of-speech tagging, thereby reducing the number of words in the text and simplifying subsequent calculations.
表2特征词中动词名词比例:The proportion of verbs and nouns in the feature words in Table 2:
表3实验训练集和测试集的概况:Table 3 Overview of the experimental training set and test set:
图5和图6分别是取不同近邻个数时层次化指标和宏平均指标性能的比较。Figure 5 and Figure 6 are the performance comparisons of the hierarchical index and the macro-averaged index when different numbers of neighbors are taken.
从图5中可知:当近邻个数为偶数时,算法的精度较高,而召回率较低;当近邻个数为奇数时,算法的精度较低,而召回率较高。随着近邻个数的增大,这种区别逐渐变小。通过对算法的原理进行分析,可以对这种现象进行解释:算法设定的决策阈值为0.5,而当近邻个数为偶数时,由于加入了平滑参数,只有出现次数超过k=2的类别标签会预测为未见实例的类别标签,而出现次数恰好为k=2的类别标签则不会赋予未见实例。因此,当近邻个数为偶数时,各个类别标签赋予未见实例的条件更为严苛,导致算法的预测精度偏高,而相应地召回率就偏低。当近邻个数不断增大后,这种影响逐渐减弱,因此这种区别也就变小。从图中还可以看出目标类别为全部法律条文时,算法的各项预测指标都高于目标类别为具体法律条款时。这是因为更为宽泛的法律类别包含更多的案件样本,从而使得模型在这些类别上有更好的预测能力。综合来看,当近邻个数k值为5时,算法的综合预测性能最好。It can be seen from Figure 5 that when the number of neighbors is even, the precision of the algorithm is high, but the recall rate is low; when the number of neighbors is odd, the precision of the algorithm is low, but the recall rate is high. As the number of neighbors increases, this difference gradually becomes smaller. By analyzing the principle of the algorithm, this phenomenon can be explained: the decision threshold set by the algorithm is 0.5, and when the number of neighbors is even, due to the addition of smoothing parameters, only category labels with occurrences exceeding k=2 The class label of an unseen instance is predicted, while a class label with exactly k=2 occurrences is not assigned to an unseen instance. Therefore, when the number of neighbors is an even number, the conditions for each category label to assign unseen instances are more stringent, resulting in a higher prediction accuracy of the algorithm and a correspondingly lower recall rate. When the number of neighbors increases, this effect gradually weakens, so this difference becomes smaller. It can also be seen from the figure that when the target category is all legal provisions, all the predictive indicators of the algorithm are higher than when the target category is specific legal provisions. This is because broader legal categories contain a larger sample of cases, giving the model better predictive power in those categories. On the whole, when the number of neighbors k is 5, the comprehensive prediction performance of the algorithm is the best.
Figure 6 shows that as the number of neighbors increases, the algorithm's macro-averaged precision, recall, and F-measure all decrease. A likely reason is that with more neighbors, categories with few samples find it harder to reach the decision threshold, so prediction performance drops for most categories, which in turn lowers the corresponding macro-averaged performance.
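A sketch of how macro-averaged metrics are computed (the standard definitions, not anything specific to the patent) makes clear why small categories dominate this trend — every label, however rare, contributes with equal weight:

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over multi-label predictions.

    y_true, y_pred: lists of label sets, one per test instance.
    Per-label precision/recall are averaged with equal weight, so a few
    hard-to-predict rare categories pull the macro averages down as much
    as frequent categories pull them up.
    """
    labels = set().union(*y_true, *y_pred)
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

For instance, predicting a frequent label perfectly while missing a rare one entirely yields macro scores of only 0.5 across the board.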
Figure 7 shows the algorithm's performance on each evaluation metric with the number of neighbors fixed at 5, under the full-label distance-weighting strategy and the entropy-label distance-weighting strategy, respectively. Overall, on both the hierarchical and the macro-averaged metrics, the entropy-label distance-weighting strategy achieves better precision, while the full-label distance-weighting strategy achieves better recall and F-measure. The reason is that the entropy-label weighting strategy favors samples with fewer category labels, whereas in the expanded hierarchical multi-label samples, the more specific a sample's category, the more category labels it carries, so it receives a smaller weight under the entropy strategy. Predictions under the entropy-label weighting strategy therefore lean toward higher-level categories, yielding a larger generalization error. Although the algorithm's performance drops when the target categories are specific legal clauses, it still attains close to 80% hierarchical precision and over 65% hierarchical recall, showing that identifying the law applicable to a case with this hierarchical multi-label classification algorithm is effective.
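The patent defines the two weighting strategies elsewhere, so the sketch below only illustrates the qualitative behavior described in this paragraph; both formulas here are hypothetical stand-ins. The key property is that the entropy-style weight shrinks as a training sample carries more labels, biasing predictions toward samples with fewer, higher-level labels:

```python
import math

def full_label_distance_weight(distance: float) -> float:
    # Illustrative: weight depends only on distance; all of a neighbor's
    # labels count equally regardless of how many it has.
    return 1.0 / (1.0 + distance)

def entropy_label_distance_weight(distance: float, n_labels: int) -> float:
    # Illustrative: an entropy-like penalty (log of the label count,
    # as if uniform over n_labels) that decreases the weight of
    # label-rich samples. In an ancestor-expanded hierarchy, deeper,
    # more specific samples carry more labels and thus get smaller
    # weight -- matching the bias toward upper-level categories that
    # the text attributes to the entropy strategy.
    return 1.0 / ((1.0 + distance) * (1.0 + math.log(n_labels)))
```

Under these stand-ins, a single-label neighbor is weighted identically by both strategies, while a deep, many-label neighbor is down-weighted only by the entropy variant.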
Two cases are considered: the target categories are all legal provisions, or specific legal clauses. In the present invention, mP_all, mR_all, and mF_all denote the algorithm's macro-averaged precision, recall, and F-measure when the target categories are all legal provisions, and mP_partial, mR_partial, and mF_partial denote the macro-averaged precision, recall, and F-measure when the target categories are specific legal clauses.
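Alongside the macro-averaged metrics, the hierarchical metrics reported above can be computed over ancestor-expanded label sets. The sketch below follows the common definition of hierarchical precision/recall (sums of intersections over ancestor-augmented sets); the patent's exact metric may differ in detail:

```python
def hierarchical_prf(y_true_aug, y_pred_aug):
    """Hierarchical precision, recall and F over ancestor-augmented labels.

    Each true and predicted label set is assumed to already include all
    ancestors of its labels in the class hierarchy; micro-style sums of
    the set intersections then give hP and hR. Predicting only a correct
    upper-level category therefore earns partial recall credit.
    """
    inter = sum(len(t & p) for t, p in zip(y_true_aug, y_pred_aug))
    pred = sum(len(p) for p in y_pred_aug)
    true = sum(len(t) for t in y_true_aug)
    hp = inter / pred if pred else 0.0
    hr = inter / true if true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf
```

For example, predicting only the broad category `{"law"}` for a case whose expanded true set is `{"law", "law/art1"}` gives perfect hierarchical precision but only 50% hierarchical recall, mirroring the precision/recall gap reported for specific legal clauses.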
This embodiment selects two commonly used hierarchical multi-label classification algorithms, the local TreeBoost.MH algorithm and the global Clus-HMC algorithm, and compares their prediction performance with that of the present hierarchical multi-label classification algorithm. Table 5 compares their performance on the hierarchical metrics, and Table 6 compares their prediction performance on the macro-averaged metrics.
Table 5. Hierarchical-metric performance comparison of the algorithms:
Table 6. Macro-averaged performance comparison of the algorithms:
The results demonstrate that the present hierarchical multi-label classification algorithm achieves better prediction performance than the existing methods. Combined with the Lazy-HMC algorithm's support for incremental learning, Lazy-HMC can be used to build an effective and practical system for automatically identifying the law applicable to a case.
The present invention provides a hierarchical multi-label classification method suitable for legal identification. There are many ways to implement this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention. Any component not specified in this embodiment can be realized with existing technology.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710832304.8A CN107577785B (en) | 2017-09-15 | 2017-09-15 | Hierarchical multi-label classification method suitable for legal identification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710832304.8A CN107577785B (en) | 2017-09-15 | 2017-09-15 | Hierarchical multi-label classification method suitable for legal identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107577785A true CN107577785A (en) | 2018-01-12 |
CN107577785B CN107577785B (en) | 2020-02-07 |
Family
ID=61035969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710832304.8A Active CN107577785B (en) | 2017-09-15 | 2017-09-15 | Hierarchical multi-label classification method suitable for legal identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107577785B (en) |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304386A (en) * | 2018-03-05 | 2018-07-20 | 上海思贤信息技术股份有限公司 | A kind of logic-based rule infers the method and device of legal documents court verdict |
CN108334500A (en) * | 2018-03-05 | 2018-07-27 | 上海思贤信息技术股份有限公司 | A kind of judgement document's mask method and device based on machine learning algorithm |
CN108664924A (en) * | 2018-05-10 | 2018-10-16 | 东南大学 | A kind of multi-tag object identification method based on convolutional neural networks |
CN108763361A (en) * | 2018-05-17 | 2018-11-06 | 南京大学 | A kind of multi-tag taxonomy model method based on topic model |
CN109543178A (en) * | 2018-11-01 | 2019-03-29 | 银江股份有限公司 | A kind of judicial style label system construction method and system |
CN109685158A (en) * | 2019-01-08 | 2019-04-26 | 东北大学 | A kind of cluster result semantic feature extraction and method for visualizing based on strong point collection |
CN109919368A (en) * | 2019-02-26 | 2019-06-21 | 西安交通大学 | A system and method for recommendation prediction based on association graph |
CN109961094A (en) * | 2019-03-07 | 2019-07-02 | 北京达佳互联信息技术有限公司 | Sample acquiring method, device, electronic equipment and readable storage medium storing program for executing |
CN110046256A (en) * | 2019-04-22 | 2019-07-23 | 成都四方伟业软件股份有限公司 | The prediction technique and device of case differentiation result |
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110163849A (en) * | 2019-04-28 | 2019-08-23 | 上海鹰瞳医疗科技有限公司 | Training data processing method, classification model training method and equipment |
CN110245907A (en) * | 2018-03-09 | 2019-09-17 | 北京国双科技有限公司 | The generation method and device of court's trial notes content |
CN110245229A (en) * | 2019-04-30 | 2019-09-17 | 中山大学 | A deep learning topic sentiment classification method based on data augmentation |
CN110287287A (en) * | 2019-06-18 | 2019-09-27 | 北京百度网讯科技有限公司 | Case by prediction technique, device and server |
CN110347839A (en) * | 2019-07-18 | 2019-10-18 | 湖南数定智能科技有限公司 | A kind of file classification method based on production multi-task learning model |
CN110442722A (en) * | 2019-08-13 | 2019-11-12 | 北京金山数字娱乐科技有限公司 | Method and device for training classification model and method and device for data classification |
CN110543634A (en) * | 2019-09-02 | 2019-12-06 | 北京邮电大学 | Processing method, device, electronic device and storage medium of corpus data set |
CN110633365A (en) * | 2019-07-25 | 2019-12-31 | 北京国信利斯特科技有限公司 | A hierarchical multi-label text classification method and system based on word vectors |
CN110751188A (en) * | 2019-09-26 | 2020-02-04 | 华南师范大学 | User label prediction method, system and storage medium based on multi-label learning |
CN110781650A (en) * | 2020-01-02 | 2020-02-11 | 四川大学 | Method and system for automatically generating referee document based on deep learning |
CN110825879A (en) * | 2019-09-18 | 2020-02-21 | 平安科技(深圳)有限公司 | Case decision result determination method, device and equipment and computer readable storage medium |
CN110837735A (en) * | 2019-11-17 | 2020-02-25 | 太原蓝知科技有限公司 | Intelligent data analysis and identification method and system |
CN110851596A (en) * | 2019-10-11 | 2020-02-28 | 平安科技(深圳)有限公司 | Text classification method and device and computer readable storage medium |
CN110895703A (en) * | 2018-09-12 | 2020-03-20 | 北京国双科技有限公司 | Legal document routing identification method and device |
CN110909157A (en) * | 2018-09-18 | 2020-03-24 | 阿里巴巴集团控股有限公司 | Text classification method and device, computing equipment and readable storage medium |
CN110968693A (en) * | 2019-11-08 | 2020-04-07 | 华北电力大学 | A computational method for multi-label text classification based on ensemble learning |
CN111126053A (en) * | 2018-10-31 | 2020-05-08 | 北京国双科技有限公司 | Information processing method and related equipment |
CN111143569A (en) * | 2019-12-31 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Data processing method and device and computer readable storage medium |
CN111475647A (en) * | 2020-03-19 | 2020-07-31 | 平安国际智慧城市科技股份有限公司 | Document processing method and device and server |
CN111540468A (en) * | 2020-04-21 | 2020-08-14 | 重庆大学 | ICD automatic coding method and system for visualization of diagnosis reason |
CN111723208A (en) * | 2020-06-28 | 2020-09-29 | 西南财经大学 | Conditional classification tree-based legal decision document multi-classification method and device and terminal |
CN111737479A (en) * | 2020-08-28 | 2020-10-02 | 深圳追一科技有限公司 | Data acquisition method and device, electronic equipment and storage medium |
CN111738303A (en) * | 2020-05-28 | 2020-10-02 | 华南理工大学 | A Hierarchical Learning-Based Image Recognition Method for Long-tailed Distribution |
CN111930944A (en) * | 2020-08-12 | 2020-11-13 | 中国银行股份有限公司 | File label classification method and device |
CN112016430A (en) * | 2020-08-24 | 2020-12-01 | 郑州轻工业大学 | A hierarchical action recognition method for multiple mobile phone wearing positions |
CN112131884A (en) * | 2020-10-15 | 2020-12-25 | 腾讯科技(深圳)有限公司 | Method and device for entity classification and method and device for entity presentation |
CN112182213A (en) * | 2020-09-27 | 2021-01-05 | 中润普达(十堰)大数据中心有限公司 | Modeling method based on abnormal lacrimation feature cognition |
CN112232524A (en) * | 2020-12-14 | 2021-01-15 | 北京沃东天骏信息技术有限公司 | Multi-label information identification method and device, electronic equipment and readable storage medium |
CN112464973A (en) * | 2020-08-13 | 2021-03-09 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
CN113407727A (en) * | 2021-03-22 | 2021-09-17 | 天津汇智星源信息技术有限公司 | Qualitative measure and era recommendation method based on legal knowledge graph and related equipment |
CN114117040A (en) * | 2021-11-08 | 2022-03-01 | 重庆邮电大学 | Multi-label classification of text data based on label-specific features and correlations |
US11379758B2 (en) | 2019-12-06 | 2022-07-05 | International Business Machines Corporation | Automatic multilabel classification using machine learning |
CN114860892A (en) * | 2022-07-06 | 2022-08-05 | 腾讯科技(深圳)有限公司 | Hierarchical category prediction method, device, equipment and medium |
CN117216688A (en) * | 2023-11-07 | 2023-12-12 | 西南科技大学 | Enterprise industry identification method and system based on hierarchical label tree and neural network |
CN118210926A (en) * | 2024-05-21 | 2024-06-18 | 山东云海国创云计算装备产业创新中心有限公司 | Text label prediction method and device, electronic equipment and storage medium |
CN118313376A (en) * | 2024-06-07 | 2024-07-09 | 腾讯科技(深圳)有限公司 | Text processing method, device, equipment, storage medium and product |
CN119537898A (en) * | 2025-01-21 | 2025-02-28 | 武夷学院 | Hierarchical feature selection method and system based on label correlation and instance correlation |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199857A (en) * | 2014-08-14 | 2014-12-10 | 西安交通大学 | Tax document hierarchical classification method based on multi-tag classification |
US20150161198A1 (en) * | 2013-12-05 | 2015-06-11 | Sony Corporation | Computer ecosystem with automatically curated content using searchable hierarchical tags |
CN104881689A (en) * | 2015-06-17 | 2015-09-02 | 苏州大学张家港工业技术研究院 | Method and system for multi-label active learning classification |
CN105868773A (en) * | 2016-03-23 | 2016-08-17 | 华南理工大学 | Hierarchical random forest based multi-tag classification method |
CN106126972A (en) * | 2016-06-21 | 2016-11-16 | 哈尔滨工业大学 | A kind of level multi-tag sorting technique for protein function prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |