CN107577785A - A Hierarchical Multi-Label Classification Approach for Legal Identification - Google Patents
- Publication number: CN107577785A (application CN201710832304.8A)
- Authority: CN (China)
- Legal status: Granted
Description
Technical Field
The invention belongs to the field of computer data analysis and mining, and relates to a hierarchical multi-label classification method suitable for legal identification.
Background Art
Hierarchical multi-label classification is a special case of multi-label classification. Unlike general multi-label classification, in a hierarchical multi-label classification problem each sample may carry multiple category labels, and the label space is organized hierarchically as a tree or a directed acyclic graph (DAG). In a DAG a node may have several parents, which makes it more complex than a tree and harder to design algorithms for; current research on hierarchical multi-label classification therefore focuses mainly on tree-shaped label structures. According to how an algorithm examines the category hierarchy, hierarchical multi-label classification algorithms can be divided into local algorithms and global algorithms.
Local algorithms examine the local classification information at each internal node of the category hierarchy one by one, turning the hierarchical multi-label classification problem into several multi-label classification problems. When training the multi-label classifier at an internal node, a suitable local sample set must be selected, and in the prediction stage a top-down scheme is used so that the predictions satisfy the hierarchy constraint. Esuli et al. (ESULI A, FAGNI T, SEBASTIANI F. TreeBoost.MH: A boosting algorithm for multi-label hierarchical text categorization. String Processing and Information Retrieval, 2006: 13-24) proposed the TreeBoost.MH algorithm for hierarchical multi-label text classification. It recursively trains a multi-label classifier at every non-leaf node of the category label tree, using AdaBoost.MH as the base classifier; feature selection and training-sample selection are performed locally for each classifier. Experiments show that TreeBoost.MH outperforms AdaBoost.MH in both time efficiency and predictive performance.
Cerri et al. (CERRI R, BARROS R C, DE CARVALHO A C. Hierarchical multi-label classification using local neural networks. Journal of Computer and System Sciences, 2014, 80(1): 39-56) proposed a local hierarchical multi-label classification algorithm based on multi-layer perceptrons: one network is trained per level of the category hierarchy and predicts the labels at that level, and the predictions of one level are fed as input to the network of the next level. Because every level's network is trained on the same sample set, the predictions may violate the hierarchy constraint and must be post-processed to enforce it.
Local algorithms have two drawbacks. First, they require training many classifiers, which makes the model complex and hurts its interpretability. Second, a blocking problem arises during prediction: samples misclassified at an upper level never reach the classifiers at lower levels. Although three strategies (lowering the threshold, restricted voting, and extended multiplicative thresholds) have been proposed to mitigate blocking, local algorithms often remain unsatisfactory in prediction accuracy.
Global algorithms consider the category hierarchy as a whole and train a single hierarchical multi-label classifier to predict unseen instances. They can be grouped by how they handle the label hierarchy. One family uses category clustering: the similarity between a test sample and each category is computed first, and the sample is then assigned to the nearest category. Another family converts the hierarchical multi-label problem into an ordinary multi-label problem: Kiritchenko et al. (KIRITCHENKO S, MATWIN S, FAMILI F. Functional annotation of genes using hierarchical text categorization, 2005) extend the label set of each training sample with its ancestor labels and then apply a flat multi-label learner.
In the test phase, because the multi-label algorithm they adopt (AdaBoost.MH) ignores the category hierarchy, this approach faces the same problem as local algorithms: predictions may be hierarchically inconsistent, and the model output must be corrected to enforce the hierarchy constraint. A third family adapts existing non-hierarchical algorithms so that they can process hierarchy information directly and exploit it to improve performance. Vens et al. (VENS C, STRUYF J, SCHIETGAT L, et al. Decision trees for hierarchical multilabel classification. Machine Learning, 2008, 73(2): 185-214) proposed the Clus-HMC algorithm based on predictive clustering trees (PCT), training a single decision tree for the hierarchical multi-label problem, and compared it with Clus-HSC and Clus-SC: Clus-SC ignores the label hierarchy and trains an independent classifier per label, while Clus-HSC is a hierarchical version of Clus-SC whose predictions satisfy the hierarchy constraint. Experiments show that the global Clus-HMC algorithm beats Clus-SC and Clus-HSC in both predictive performance and time efficiency.
In general, global algorithms have two characteristics: they consider the category hierarchy as a whole in one pass, and they lack the modularity peculiar to local algorithms. The key difference between global and local algorithms lies in the training process; in the test phase a global algorithm may even predict the categories of unseen instances top-down, just like a local algorithm.
Because the category labels of a hierarchical multi-label classification problem are organized hierarchically, a sample carrying label ci implicitly carries all ancestor labels of ci. Conversely, when predicting the categories of an unseen instance, the hierarchy constraint must hold: an unseen instance cannot be predicted to belong to a category without also belonging to all ancestor categories of that category. Ordinary hierarchical multi-label algorithms often cannot guarantee that their predictions satisfy this constraint, or they fail to reach the best attainable performance because they ignore the hierarchical structure of the label space. A hierarchical multi-label algorithm should therefore both exploit the correlations and hierarchy among category labels to improve the predictive performance of the classification model and ensure that its predictions satisfy the hierarchy constraint.
Automatically identifying the law applicable to a case is in essence a hierarchical multi-label classification problem: the category labels of a sample, i.e. the legal provisions applicable to the case, are organized as a tree; a case may be governed by several provisions, and those provisions may differ in specificity. A hierarchical multi-label classification algorithm for this task must handle a tree-shaped category hierarchy and perform non-mandatory leaf-node prediction, i.e. a predicted category label may correspond to any node of the hierarchy, not only a leaf.
Summary of the Invention
Purpose of the invention: the technical problem to be solved by the present invention is to provide, in view of the deficiencies of the prior art, an effective hierarchical multi-label classification method suitable for legal identification.
Technical solution: the present invention discloses a hierarchical multi-label classification method suitable for legal identification, comprising the following steps:
Step 1. Crawl the required corpus of judgment documents from the Internet with a jsoup-based crawler; one judgment document corresponds to one sample. Randomly split the corpus into a training set and a test set at a ratio of 7:3. Then preprocess the judgment documents: following the standard structure of a judgment document, extract the case facts and the legal provisions applied; the case facts are used to generate the sample's feature vector, and the applied provisions serve as its category labels. This turns the raw text corpus into semi-structured multi-label training and test sets whose samples have the form (case-fact description, legal-provision text). Errors and inconsistent formats in the cited provisions are corrected. The Language Technology Platform (LTP) of Harbin Institute of Technology is used for word segmentation and part-of-speech tagging of the case-fact descriptions. (LTP is a complete Chinese language processing system: it defines an XML-based representation of processing results and, on top of it, provides a rich and efficient set of bottom-up Chinese processing modules covering six core technologies including lexical, syntactic and semantic analysis, together with DLL-based application programming interfaces and visualization tools, and it can also be used as a web service.)
Step 2. Because the legal provisions of a legal system are organized as a tree, the label space formed by the category labels of the multi-label training set is correspondingly tree-shaped. Based on the hierarchical structure of this label space, expand the legal provisions associated with the case facts of every sample, so that the label set of each case fact is a subset of the label space that satisfies the hierarchy constraint.
Step 3. Perform feature selection on the segmentation results of the training set from Step 1 (i.e. the segmented case-fact part of the semi-structured multi-label training set), choosing feature words that adequately represent the case facts to build feature vectors. After text representation this yields a structured expanded multi-label training set Tr and test set Te.
Step 4. Build the prediction model: for an unseen instance x from the expanded multi-label test set Te (an unseen instance is a case fact to be classified), find its k-nearest-neighbour sample set N(x) in the expanded multi-label training set Tr; assign a weight to each neighbour; from the classification weights of the k neighbours for each category of the label space, compute the confidence that x belongs to each category; and predict the label set h(x) of x, with h(x) satisfying the hierarchy constraint. Finally, using the tree structure of the label space, remove the hierarchy closure from the predicted label set h(x) (the inverse of label expansion) to obtain the specific legal provisions applicable to the unseen instance.
Step 2 includes:
Step 2-1. In a hierarchical multi-label classification problem, given a d-dimensional instance space X ⊆ R^d (R being the set of real numbers) and a label space Y = {y1, y2, ..., yq} of q categories, where yi denotes the i-th category, the hierarchical label space can be represented by the pair (Y, <), where < is a partial order on the category labels expressing the "belongs to" relation: if yi, yj ∈ Y and yi < yj, then category yi belongs to category yj, yi is a descendant category of yj, and yj is an ancestor category of yi. The partial order < is asymmetric, irreflexive and transitive, and can be described by the following four properties:
a) the unique root of the label hierarchy is a virtual category label R, and yi < R for every yi ∈ Y;
b) for any yi, yj ∈ Y, if yi < yj, then yj < yi does not hold (asymmetry);
c) for any yi ∈ Y, yi < yi does not hold (irreflexivity);
d) for any yi, yj, yk ∈ Y, if yi < yj and yj < yk, then yi < yk (transitivity).
Any multi-label classification problem whose label organization satisfies the above four properties can be regarded as a hierarchical multi-label classification problem. From this formal definition, in a hierarchical label space every other category node (excluding the start node) on the unique path from any category node up to the root is an ancestor of that node. Hence, if a sample has label yi, it implicitly also has every ancestor label of yi, and the classifier's predicted label set h(x) for an unseen instance must likewise satisfy the hierarchy constraint: for every y′ ∈ h(x) and every y″ with y′ < y″, it must hold that y″ ∈ h(x), where y′ is a category in h(x) and y″ is one of its ancestor categories.
Step 2-2. For any training sample (xi, hi) (1 ≤ i ≤ m), where m is the number of judgment-document samples obtained, xi ∈ X is the d-dimensional feature vector representing the case-fact part, and hi ⊆ Y is the label set associated with xi, i.e. the legal provisions applied to xi. Let the expanded label set be hi′; then hi′ contains every label in hi together with all of its ancestor labels. Formally, hi′ = {y ∈ Y | y ∈ hi, or y′ < y for some y′ ∈ hi}.
The label-expansion step expresses the hierarchy of the category labels explicitly in each sample's labels: if a sample is marked with certain categories, after expansion the ancestor categories of those categories are also explicitly assigned to it. Each sample's label set can therefore be viewed as a subtree of the label-space tree whose top is the root node. Consequently, if yi, yj ∈ Y and yi < yj, then among the k nearest neighbours of an unseen instance in the expanded training set, the number of samples carrying the ancestor label yj is never smaller than the number carrying label yi. Label expansion is an essential step in guaranteeing that the predictions of this learning algorithm satisfy the hierarchy constraint.
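The label expansion of Step 2-2 can be sketched as follows, assuming the label tree is given as a child-to-parent map and "R" denotes the virtual root (the map and the label names are illustrative, not from the patent):

```python
def expand_labels(labels, parent):
    """Return the ancestor closure h' of a label set h: every label in h
    plus all of its ancestors up to, but excluding, the virtual root R."""
    expanded = set()
    for y in labels:
        node = y
        while node != "R":
            expanded.add(node)
            node = parent[node]   # climb one level towards the root
    return expanded

# Toy hierarchy: Art.196 < Criminal Law Ch.5 < Criminal Law < R
parent = {"Criminal Law": "R",
          "Criminal Law Ch.5": "Criminal Law",
          "Art.196": "Criminal Law Ch.5"}
print(sorted(expand_labels({"Art.196"}, parent)))
```

After expansion, a sample labelled only with "Art.196" explicitly carries its whole ancestor chain, so each label set is indeed a root-anchored subtree of the label space.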
Step 3 comprises the following steps:
Step 3-1. The purpose of feature selection is dimensionality reduction. Because ordinary text feature selection algorithms cannot handle multi-label data sets directly, the multi-label data must first be converted to single-label data. The conversion is as follows: each multi-label sample (x, h), where |h| denotes the number of labels in the label set h, is replaced by |h| new single-label samples (x, yi) (1 ≤ i ≤ |h|, yi ∈ h); the class yi of each new sample is one category label from the original sample's label set h. Table 1 gives an example of converting a multi-label sample into single-label samples under this strategy.
Table 1. Multi-label sample conversion process
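The conversion of Step 3-1 can be sketched as follows: each multi-label sample (x, h) is replaced by |h| single-label copies so that an ordinary single-label feature selector can be applied (the sample contents are illustrative):

```python
def to_single_label(samples):
    """samples: list of (x, label_set) -> list of (x, single_label)."""
    out = []
    for x, h in samples:
        for y in sorted(h):          # one new sample per label in h
            out.append((x, y))
    return out

multi = [("case 1 facts", {"Art.196", "Art.264"})]
print(to_single_label(multi))
```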
Step 3-2. After the conversion of Step 3-1, each multi-label case sample has become several single-label case samples, and an ordinary feature selection algorithm can be applied to the segmentation results of the original training set from Step 1. A certain number of discriminative feature words are selected to form the feature space (the number usually depends on the corpus; for instance, when using information gain, the selected words should capture as much of the total information gain as possible without the vocabulary becoming too large, and at least 100 feature words are generally kept). The case-fact part of each case sample is then represented by feature words from this feature space. The attribute value of each feature word, i.e. its feature weight, is computed with the standard TF-IDF scheme. Treating the case-fact part of each sample as a segmented document, the case-fact parts of all samples form a document collection. The feature weight tf-idfij of the j-th feature in the i-th document of the collection is defined as

tf-idf_ij = tf_ij · log(N / n_j) / sqrt( Σ_s ( tf_is · log(N / n_s) )² ),

where tfij is the frequency of feature word tj in document di, idfj = log(N/nj) is the inverse document frequency of tj in the collection, N is the total number of documents, nj is the document frequency of tj (the number of documents containing tj), and the denominator is a normalization factor.
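The TF-IDF weighting of Step 3-2, with the L2 normalization implied by the "normalization factor" remark, can be sketched on a toy corpus (the corpus, vocabulary and function names are illustrative):

```python
import math

def tfidf_matrix(docs, vocab):
    """docs: list of token lists; returns one L2-normalized tf-idf row per doc."""
    n_docs = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}  # n_j per feature
    rows = []
    for d in docs:
        raw = [d.count(t) * math.log(n_docs / df[t]) if df[t] else 0.0
               for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0   # normalization factor
        rows.append([w / norm for w in raw])
    return rows

docs = [["theft", "bank", "branch"], ["theft", "large", "amount"]]
vocab = ["theft", "bank", "large"]
m = tfidf_matrix(docs, vocab)
```

Note that a word occurring in every document ("theft" above) gets idf = log(1) = 0 and thus carries no weight, which is the intended behaviour of the scheme.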
Step 3-3. Perform feature selection on the segmentation results of the original training set from Step 1 and select roughly 100 of the most discriminative feature words to form the feature vector. Common text feature selection methods are based on measures such as document frequency (DF), mutual information (MI), information gain (IG) and the chi-square statistic (χ² statistic, CHI). Selection by document frequency alone is too crude and often fails to pick the most informative words, while mutual information is easily biased by the marginal probabilities of feature words; this hierarchical multi-label classification method therefore uses information gain or the chi-square statistic for feature selection.
Step 3-3 may use the information gain algorithm for feature selection. The information gain IG(t) of a feature word t is defined as

IG(t) = − Σ_{i=1}^{q} Pr(yi) log Pr(yi) + Pr(t) Σ_{i=1}^{q} Pr(yi|t) log Pr(yi|t) + Pr(t̄) Σ_{i=1}^{q} Pr(yi|t̄) log Pr(yi|t̄),

where Pr(yi) is the probability that category yi occurs, Pr(t) is the probability that feature t occurs, Pr(yi|t) is the probability of category yi given that t occurs, Pr(t̄) is the probability that t does not occur, and Pr(yi|t̄) is the probability of yi given that t does not occur. The information gain is computed for every feature word in the document collection, and words whose information gain falls below a preset threshold (e.g. 0.15; the threshold should keep the total information gain of the selected words as large as possible without the vocabulary becoming too large) are excluded from the feature space.
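The information-gain criterion can be sketched on a toy corpus of (token list, class) pairs; the function and data names are illustrative:

```python
import math

def info_gain(data, term, classes):
    n = len(data)
    present = [(d, y) for d, y in data if term in d]
    absent = [(d, y) for d, y in data if term not in d]

    def plogp(subset):
        # returns sum_i P(y_i | subset) * log2 P(y_i | subset)
        if not subset:
            return 0.0
        s = 0.0
        for c in classes:
            p = sum(1 for _, y in subset if y == c) / len(subset)
            if p > 0:
                s += p * math.log2(p)
        return s

    # IG(t) = -sum P(y)logP(y) + P(t) sum P(y|t)logP(y|t)
    #         + P(~t) sum P(y|~t)logP(y|~t)
    return (-plogp(data)
            + len(present) / n * plogp(present)
            + len(absent) / n * plogp(absent))

data = [(["theft", "bank"], "A"), (["theft"], "A"),
        (["contract"], "B"), (["contract", "bank"], "B")]
ig = info_gain(data, "theft", ["A", "B"])
```

In this toy corpus "theft" perfectly separates the two classes and gets the maximal gain of 1 bit, whereas "bank", which occurs equally in both classes, gets a gain of 0.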
Step 3-3 may instead use the chi-square statistic for feature selection. The null hypothesis is that a feature word and a category are independent; the further the test statistic computed from the CHI distribution deviates from the critical value, the more confidently the null hypothesis can be rejected in favour of the alternative hypothesis, namely that the feature word is highly correlated with the category.
Let A be the number of documents that contain feature word t and belong to category y, B the number that contain t but do not belong to y, C the number that do not contain t but belong to y, D the number that neither contain t nor belong to y, and N the total number of documents. The chi-square statistic χ²(t, y) of feature word t and category y is defined as

χ²(t, y) = N · (AD − CB)² / ((A + C)(B + D)(A + B)(C + D)).

When t and y are independent, the statistic is 0. For each feature word, its chi-square statistic with respect to every category is computed, and then the mean χ²avg(t) and the maximum χ²max(t) are taken; considering both, roughly 100 of the most discriminative feature words are selected:
χ²avg(t) = Σ_{i=1}^{q} Pr(yi) χ²(t, yi),
χ²max(t) = max_{i=1,...,q} χ²(t, yi).
Here Pr(yi) is the probability that category yi occurs. The main advantage of chi-square feature selection over mutual information is that the statistic is a normalized value, so it compares different feature words within the same category more fairly.
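The chi-square score can be computed directly from the A, B, C, D document counts defined above (the counts below are toy values, and the formula in the comment is the standard two-by-two chi-square):

```python
def chi_square(A, B, C, D):
    """chi2(t, y) = N * (A*D - C*B)**2 / ((A+C) * (B+D) * (A+B) * (C+D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

# e.g. 100 documents in which the term and the category co-occur strongly
score = chi_square(40, 10, 10, 40)
```

When the term and the category are independent (A·D = C·B), the score is 0, matching the remark above.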
In Step 4, when searching for the k nearest neighbours, the distance d(x, xi) between the unseen instance x and a sample (xi, hi) is measured by the reciprocal of the cosine similarity of their feature vectors. The cosine similarity cos(γ, λ) of the unseen instance's feature vector γ and a neighbour sample's feature vector λ is computed as

cos(γ, λ) = Σ_{s=1}^{S} γs λs / ( sqrt(Σ_{s=1}^{S} γs²) · sqrt(Σ_{s=1}^{S} λs²) ),

where s indexes the vector components (i.e. the position of a component within the vector), S is the dimension of the vectors, and γs and λs are the s-th components of γ and λ respectively.
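The cosine similarity and the derived distance (its reciprocal) can be sketched as follows; the handling of zero-similarity pairs as infinitely distant is an assumption of this sketch, not stated in the patent:

```python
import math

def cosine(g, l):
    """cos(gamma, lambda) = sum g_s*l_s / (||g|| * ||l||)."""
    dot = sum(gs * ls for gs, ls in zip(g, l))
    ng = math.sqrt(sum(gs * gs for gs in g))
    nl = math.sqrt(sum(ls * ls for ls in l))
    return dot / (ng * nl) if ng and nl else 0.0

def distance(g, l):
    c = cosine(g, l)
    return 1.0 / c if c > 0 else float("inf")  # orthogonal -> infinitely far

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```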
In Step 4, with d(x, xi) denoting the distance between instance x and sample (xi, hi), the classification weight wij of a neighbour sample ((xi, hi) ∈ N(x)) for a category yj in hi is computed by either the full-label distance weight method or the entropy-label distance weight method:
Full-label distance weight method for wij:
Entropy-label distance weight method for wij:
实例属于类别yj的置信度c(x,yj)计算公式如下:The calculation formula of the confidence c(x,y j ) that the instance belongs to the category y j is as follows:
其中r表示第r个类别,wir表示hi的第r个类别yr的分类权重;where r represents the r-th category, and w ir represents the classification weight of the r-th category y r of hi;
预测未见实例x的类别标签集合h(x)为:The category label set h(x) of the predicted unseen instance x is:
选择0.5作为决策阈值,当未见实例属于各个类别的置信度都小于决策阈值时,返回置信度最大的类别作为未见实例所属的类别。Choose 0.5 as the decision threshold, when the confidence of each category that the unseen instance belongs to is less than the decision threshold, return the category with the highest confidence as the category to which the unseen instance belongs.
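Assuming the confidences c(x, y_j) have already been computed, the decision rule above can be sketched as follows (the function name is illustrative):

```python
def predict_labels(confidence, threshold=0.5):
    """Return h(x): every label whose confidence exceeds the threshold.

    If no label clears the threshold, fall back to the single most
    confident label, as the decision rule in the text prescribes.
    """
    h = {y for y, c in confidence.items() if c > threshold}
    if not h:
        h = {max(confidence, key=confidence.get)}
    return h
```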
As a hierarchical multi-label classification method, its predictions must satisfy the hierarchy constraint; that is, for every y′ ∈ h(x) and every y″ ∈ Y with y′ < y″, it holds that y″ ∈ h(x). The proof is as follows. By the confidence formula, if the algorithm predicts that the unseen instance x has category label y_a (y_a ∈ Y), then the confidence c(x, y_a) that x belongs to y_a is either greater than the threshold t or the maximum over all categories. Consider an ancestor category y_b of y_a (y_b ∈ Y, y_a < y_b). If y_b corresponds to the virtual root node of the category hierarchy, then x carrying label y_a obviously satisfies the hierarchy constraint. Otherwise, for any neighbor sample (x_i, Y_i) ∈ N(x) of x, if y_a ∈ Y_i then y_b ∈ Y_i as well, while the converse does not necessarily hold; the label expansion of the training set guarantees this. Therefore, under both the full-label distance weight method and the entropy-label distance weight method, it can be deduced that the numerator of the confidence can only grow while the denominator remains unchanged, so the confidence c(x, y_b) that x belongs to category y_b is not less than the confidence c(x, y_a) that x belongs to category y_a. If c(x, y_a) > t, then necessarily c(x, y_b) > t, so the prediction satisfies the hierarchy constraint.
Finally, the performance of this learning method is evaluated with hierarchical metrics: hierarchical precision (hP), hierarchical recall (hR), and the hierarchical F-measure (hF), defined as:

hP = Σ_i |P̂_i ∩ T̂_i| / Σ_i |P̂_i|,  hR = Σ_i |P̂_i ∩ T̂_i| / Σ_i |T̂_i|,  hF = 2·hP·hR / (hP + hR),

where P̂_i is the set of categories test sample i is predicted to belong to together with their ancestor categories, T̂_i is the set of categories test sample i actually belongs to together with their ancestor categories, and the summations run over all test samples.
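A sketch of these hierarchical metrics, assuming the standard set-overlap definitions:

```python
def hierarchical_prf(predicted, actual):
    """Hierarchical precision/recall/F over a test set.

    predicted, actual: lists of sets, one pair per test sample; each set
    holds the sample's categories together with all their ancestors.
    """
    overlap = sum(len(p & t) for p, t in zip(predicted, actual))
    hp = overlap / sum(len(p) for p in predicted)
    hr = overlap / sum(len(t) for t in actual)
    hf = 2 * hp * hr / (hp + hr)
    return hp, hr, hf
```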
To make identification of a case's applicable law more practical, the target categories predicted by the algorithm should preferably be specific legal clauses rather than only broad laws, so this method considers prediction performance in two settings: the target categories are all legal provisions, or only the specific legal clauses. Below, hP_all, hR_all, and hF_all denote the hierarchical precision, recall, and F-measure of the system when the target categories are all legal provisions, and hP_partial, hR_partial, and hF_partial denote the hierarchical precision, recall, and F-measure of the algorithm when the target categories are the specific legal clauses.
In addition to the hierarchical metrics, precision, recall, and F-measure can also be computed per category, and their means over all categories used as system-level metrics, i.e. the macro-averaged (Macro-averaging) precision, recall, and F-measure. For each category, let TP be the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives; the macro-averaged precision, recall, and F-measure, Macro-P, Macro-R, and Macro-F, are computed as:

Macro-P = (1/q)·Σ_{i=1}^{q} TP_i/(TP_i+FP_i),  Macro-R = (1/q)·Σ_{i=1}^{q} TP_i/(TP_i+FN_i),  Macro-F = (1/q)·Σ_{i=1}^{q} 2·P_i·R_i/(P_i+R_i).
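A sketch of the macro-averaged metrics (the input layout is an assumption; TN is not needed by these formulas):

```python
def macro_prf(counts):
    """Macro-averaged precision/recall/F.

    counts: list of (TP, FP, FN) tuples, one per category.
    """
    ps, rs, fs = [], [], []
    for tp, fp, fn in counts:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
        ps.append(p)
        rs.append(r)
    n = len(counts)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```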
The present invention is a global hierarchical multi-label classification method: it considers the hierarchical structure of the category labels as a whole and guarantees that the predictions also satisfy the hierarchy constraint. The learning method is a lazy learning algorithm; it builds no explicit prediction model on the training set and merely stores the original multi-label samples after label expansion, and therefore supports incremental learning. In the prediction stage, it first finds the k nearest neighbor samples of an unseen instance in the training set, uses the classification weights of these neighbors for each category to determine the confidence that the instance belongs to each category, and then predicts the categories of the unseen instance. The model is simple and supports incremental learning, so it applies well to hierarchical multi-label classification problems with massive and continually growing data, such as automatic identification of the law applicable to a case.

Beneficial effects: the hierarchical multi-label classification method for legal identification provided by the present invention fully considers, as a whole, the tree-shaped hierarchy of the label space of legal provisions, so the predictions satisfy the hierarchy constraint and require no additional correction. At the same time, the model is simple, supports incremental learning, and applies well to hierarchical multi-label classification problems with massive and continually growing data, such as automatic identification of the law applicable to a case.
Description of drawings

The present invention is further described below in conjunction with the accompanying drawings and specific embodiments; the above and other advantages of the present invention will become clearer.

Fig. 1: Main flow chart of the present invention.

Fig. 2: Sample judgment document.

Fig. 3: Tree structure of the label space of legal provisions.

Fig. 4: Frequency distribution of combinations of legal provisions.

Fig. 5: Performance comparison of the hierarchical metrics for different numbers of neighbors.

Fig. 6: Performance comparison of the macro-averaged metrics for different numbers of neighbors.

Fig. 7: Performance comparison of each metric under different weighting strategies.

Detailed description

The present invention is further described below in conjunction with the accompanying drawings and embodiments.
The invention discloses a hierarchical multi-label classification method for legal identification, comprising the following steps:

Step 1: use jsoup-based crawler technology to crawl the required raw text data set of judgment documents from the Internet and randomly split it into a training set and a test set at a ratio of 7:3. Then preprocess the judgment documents, which mainly involves the following tasks:

extracting the case facts and the applicable legal provisions according to the structure of the judgment document, the former used to generate the feature vector of a case sample and the latter used as the sample's category labels, thereby converting the raw text data set into semi-structured multi-label training and test sets;

correcting errors and format inconsistencies in the legal provisions applicable to the cases;

performing word segmentation and part-of-speech tagging on the case fact descriptions using LTP, the language technology platform of Harbin Institute of Technology.

Step 2: since the legal provisions of the legal system are organized as a tree, the label space formed by the category labels of the multi-label training set is correspondingly a tree. Based on the hierarchy of the label space, expand the legal provisions corresponding to the case facts of every sample so that the category label set of each case fact is a subset of the label space and satisfies the hierarchy constraint.

Step 3: perform feature selection on the word segmentation results of the original training set from step 1, selecting feature words that adequately represent the case facts to build the feature vectors; after text representation, structured extended multi-label training and test sets Tr and Te are obtained.

Step 4: build the prediction model. Find the set N(x) of k nearest neighbor samples in the extended multi-label training set Tr of an unseen instance x from the extended multi-label test set Te, set a weight for each neighbor sample, compute from the k neighbors' classification weights for each category the confidence that the unseen instance belongs to each category in the label space, and predict the category label set h(x) of the unseen instance, where h(x) satisfies the hierarchy constraint. Finally, according to the tree structure of the label space, remove the hierarchy-induced labels from the predicted category set h(x) (the inverse of label expansion) to obtain the specific legal provisions applicable to the unseen instance.
Step 2 includes:

Step 2-1: in the hierarchical multi-label classification problem, given a d-dimensional instance space X and a label space Y = {y_1, y_2, …, y_q} containing q categories, where y_i denotes the i-th category, the category label hierarchy can be represented by the pair (Y, <), where < is a partial order on the category labels. The partial order < can be read as the "belongs to" relation: if y_i, y_j ∈ Y and y_i < y_j, then category y_i belongs to category y_j, y_i is a descendant category of y_j, and y_j is an ancestor category of y_i. The partial order < is asymmetric, irreflexive, and transitive, and can be described by the following four properties:

e) the unique root node of the category label hierarchy is represented by a virtual category label R, and y_i < R for every y_i ∈ Y;

f) for any y_i, y_j ∈ Y, if y_i < y_j, then it is not the case that y_j < y_i;

g) for any y_i ∈ Y, it is not the case that y_i < y_i;

h) for any y_i, y_j, y_k ∈ Y, if y_i < y_j and y_j < y_k, then y_i < y_k.

Any multi-label classification problem whose category labels are organized in a structure satisfying the above four properties can be regarded as a hierarchical multi-label classification problem. From this formal definition, in a hierarchical category label space, all other category nodes (excluding the starting node) on the unique path traced upward from any category node to the root node are ancestor category nodes of that category. Therefore, if a sample has category label c_i, the sample also implicitly has all ancestor category labels of c_i; this requires that the classifier's predicted category set h(x) for an unseen instance also satisfy the hierarchy constraint, i.e., for every y′ ∈ h(x) and every y″ ∈ Y with y′ < y″, it holds that y″ ∈ h(x).
Step 2-2: for any training sample (x_i, y_i) (1 ≤ i ≤ m), where m is the total number of judgment document samples obtained, x_i ∈ X is a d-dimensional feature vector and y_i is the set of category labels associated with x_i. Let the expanded category label set be y_i′; then y_i′ contains every category label in y_i together with all of its ancestor category labels. Formally,

y_i′ = y_i ∪ { y ∈ Y : ∃ y_a ∈ y_i, y_a < y }.

The label expansion process makes the hierarchical relations among category labels explicit in the samples' category labels: if a sample is labeled with certain categories, then after label expansion the ancestor categories of those categories are also explicitly assigned to the sample. The category label set of each sample can therefore be viewed as a subtree of the label space tree, with the root node at the top of each subtree. It follows that if y_i, y_j ∈ Y and y_i < y_j, then among the k nearest neighbor samples of an unseen instance in the expanded multi-label training set, the number of samples carrying label y_j cannot be smaller than the number carrying label y_i. Label expansion is the key step that guarantees the predictions of this learning algorithm satisfy the hierarchy constraint.
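The label expansion of step 2-2 can be sketched as follows (the parent map and the choice to exclude the virtual root "R" from the expanded set are assumptions of this sketch):

```python
def expand_labels(labels, parent):
    """Close a label set upward under the ancestor relation.

    labels: set of category labels attached to a sample
    parent: dict mapping each label to its parent; "R" is the virtual root
    """
    expanded = set()
    for y in labels:
        while y != "R":        # stop below the virtual root
            expanded.add(y)
            y = parent[y]
    return expanded
```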
Step 3 includes the following steps:

Step 3-1: the purpose of feature selection is dimensionality reduction. Since common text feature selection algorithms cannot handle multi-label data sets directly, the multi-label data must first be converted to single-label data. The conversion is as follows: for each multi-label sample (x, h), let |h| denote the number of category labels in the label set h, and replace the sample with |h| new single-label samples (x, y_i) (1 ≤ i ≤ |h|, y_i ∈ h), where the class y_i of each new sample is one category label from the original multi-label sample's label set h. Table 1 gives an example of converting a multi-label sample to single-label samples by this strategy.
Table 1: Multi-label sample conversion process
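The conversion described above can be sketched in a single comprehension:

```python
def to_single_label(samples):
    """Replace each multi-label sample (x, h) with |h| single-label copies (x, y_i)."""
    return [(x, y) for x, h in samples for y in h]
```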
Step 3-2: after the conversion in step 3-1, the multi-label case samples have become single-label case samples, and a general feature selection algorithm can be applied to the word segmentation results of the original training set from step 1 to select roughly 100 of the most discriminative feature words, which form the feature space. The case fact part of each case sample is represented with feature words from the feature space, and the attribute value of each feature word, i.e. its feature weight, is computed with the common TF-IDF algorithm. Treating the case fact part of each sample as an already segmented document, the case fact parts of all samples form a document collection. The feature weight tf-idf_ij of the j-th feature in the i-th document is defined as:

tf-idf_ij = tf_ij·log(N/n_j) / √( Σ_{j′} ( tf_{ij′}·log(N/n_{j′}) )² ),

where tf_ij is the frequency of feature word t_j in document d_i, idf_j = log(N/n_j) is the inverse document frequency of t_j in the collection, N is the total number of documents in the collection, n_j is the document frequency of t_j, i.e. the number of documents in the collection containing t_j, and the denominator is a normalization factor.
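A sketch of this TF-IDF weighting for one document, with the normalizing denominator taken as the Euclidean norm of the unnormalized weights (an assumption of this sketch):

```python
import math

def tfidf_vector(term_freqs, doc_freqs, n_docs):
    """TF-IDF weights for one document.

    term_freqs: dict feature word -> raw frequency tf_ij in this document
    doc_freqs:  dict feature word -> document frequency n_j in the collection
    n_docs:     total number of documents N
    """
    raw = {t: tf * math.log(n_docs / doc_freqs[t]) for t, tf in term_freqs.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))  # normalization factor
    return {t: w / norm for t, w in raw.items()}
```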
Step 3-3: perform feature selection on the word segmentation results of the original training set from step 1, selecting a certain number of discriminative feature words to form the feature vector. Common text feature selection methods are mainly based on measures such as document frequency (DF), mutual information (MI), information gain (IG), and the chi-square statistic (χ² statistic, CHI). Document-frequency-based selection is too simplistic and often fails to pick the feature words carrying the most classification information, while mutual information is easily distorted by the marginal probabilities of feature words; this hierarchical multi-label classification method therefore uses the information gain or chi-square statistic algorithm for feature selection.

Step 3-3 may use the information gain algorithm for feature selection. The information gain IG(t) of a feature word t is defined as:

IG(t) = −Σ_{i=1}^{q} P_r(y_i)·log P_r(y_i) + P_r(t)·Σ_{i=1}^{q} P_r(y_i|t)·log P_r(y_i|t) + P_r(t̄)·Σ_{i=1}^{q} P_r(y_i|t̄)·log P_r(y_i|t̄),

where P_r(y_i) is the probability that category y_i occurs, P_r(t) is the probability that feature t occurs, P_r(y_i|t) is the probability of category y_i given that feature t occurs, P_r(t̄) is the probability that feature t does not occur, and P_r(y_i|t̄) is the probability of category y_i given that feature t does not occur. The information gain of every feature word in the document collection is computed, and feature words whose information gain falls below a set threshold are excluded from the feature space.
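The information gain computation can be sketched as follows (the natural logarithm is assumed, since the log base is not stated):

```python
import math

def information_gain(p_y, p_t, p_y_given_t, p_y_given_not_t):
    """IG(t): how much observing feature t reduces category uncertainty.

    p_y:             list of P_r(y_i)
    p_t:             P_r(t)
    p_y_given_t:     list of P_r(y_i | t present)
    p_y_given_not_t: list of P_r(y_i | t absent)
    """
    def plogp(ps):
        # sum of p*log(p), skipping zero probabilities
        return sum(p * math.log(p) for p in ps if p > 0)

    return (-plogp(p_y)
            + p_t * plogp(p_y_given_t)
            + (1 - p_t) * plogp(p_y_given_not_t))
```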
Step 3-3 may instead use the chi-square statistic algorithm to select features from the case fact texts of the training set. First assume that a feature word and a category are independent; the further the test value computed from the CHI distribution deviates from the threshold, the more confidently the null hypothesis can be rejected in favor of the alternative hypothesis, namely that the feature word is highly correlated with the category.

Let A be the number of documents that contain feature word t and belong to category y, B the number that contain t but do not belong to y, C the number that do not contain t but belong to y, and D the number that neither contain t nor belong to y, and let N be the total number of documents. The chi-square statistic χ²(t, y) of feature word t and category y is defined as:

χ²(t, y) = N·(A·D − C·B)² / ( (A+C)·(B+D)·(A+B)·(C+D) ).
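The χ²(t, y) statistic built from the four document counts can be computed as:

```python
def chi_square(a, b, c, d):
    """chi2(t, y) = N * (A*D - C*B)**2 / ((A+C) * (B+D) * (A+B) * (C+D)).

    a: docs containing t, in y        b: docs containing t, not in y
    c: docs without t, in y           d: docs without t, not in y
    """
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))
```

When t and y are independent (A·D = C·B), the statistic is 0, matching the independence remark in the text.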
When feature word t and category y are independent, their chi-square statistic is 0. For each feature word, compute its chi-square statistic with respect to every category, then compute the mean χ²_avg(t) and the maximum χ²_max(t); considering both measures together, select the most discriminative feature words:

χ²_avg(t) = Σ_{i=1}^{q} P_r(y_i)·χ²(t, y_i),

χ²_max(t) = max_{i=1,…,q} χ²(t, y_i).

P_r(y_i) denotes the probability that category y_i occurs. The main advantage of the chi-square feature selection algorithm over mutual information is that the statistic is normalized, so different feature words within the same category can be compared more reliably.
In step 4, when finding the k nearest neighbors, the distance d(x, x_i) between an unseen instance x and a sample (x_i, h_i) is measured by the reciprocal of the cosine similarity of their feature vectors. The cosine similarity cos(γ, λ) between the feature vector γ of the unseen instance and the feature vector λ of a neighbor sample is computed as:

cos(γ, λ) = ( Σ_{s=1}^{S} γ_s·λ_s ) / ( √(Σ_{s=1}^{S} γ_s²) · √(Σ_{s=1}^{S} λ_s²) ),

where s is the index of a vector component, i.e. its position within the vector, S is the dimension of the vectors, γ_s is the s-th component of γ, and λ_s is the s-th component of λ.
In step 4, let d(x, x_i) denote the distance between instance x and sample (x_i, h_i). The classification weight w_ij of a sample (x_i, h_i) ∈ N(x) for category y_j is computed by the full-label distance weight method or by the entropy-label distance weight method. From these weights, the confidence c(x, y_j) that the unseen instance belongs to category y_j is computed, and the predicted category label set h(x) of the unseen instance x is derived from the confidences. A decision threshold of 0.5 is chosen; when the unseen instance's confidence for every category falls below the decision threshold, the category with the highest confidence is returned as the category of the unseen instance.
Example

As shown in Fig. 1, the steps of the present invention are:
Step 1: use jsoup-based crawler technology to crawl the required raw text data set of judgment documents from the Internet and randomly split it into a training set and a test set at a ratio of 7:3. Then preprocess the judgment documents, which mainly involves the following tasks:

extracting the case facts and the applicable legal provisions according to the structure of the judgment document, the former used to generate the feature vector of a case sample and the latter used as the sample's category labels, thereby converting the raw text data set into semi-structured multi-label training and test sets;

correcting errors and format inconsistencies in the legal provisions applicable to the cases;

performing word segmentation and part-of-speech tagging on the case fact descriptions using LTP, the language technology platform of Harbin Institute of Technology.

Step 2: based on the hierarchy of the label space, expand the legal provisions corresponding to the case facts of every sample so that the category label set of each case fact is a subset of the label space and satisfies the hierarchy constraint.

Step 3: perform feature selection on the word segmentation results of the original training set from step 1, selecting feature words that adequately represent the case facts to build the feature vectors; after text representation, structured extended multi-label training and test sets Tr and Te are obtained.

Step 4: build the prediction model. First find the set N(x) of k nearest neighbor samples in the extended multi-label training set Tr of an unseen instance x from the extended multi-label test set Te, set a weight for each neighbor sample, compute from the k neighbors' classification weights for each category the confidence that the unseen instance belongs to each category in the label space, and predict the category label set h(x) of the unseen instance, where h(x) satisfies the hierarchy constraint. Finally, according to the tree structure of the label space, remove the hierarchy-induced labels from the predicted category set h(x) (the inverse of label expansion) to obtain the specific legal provisions applicable to the unseen instance.
The data of this embodiment are taken from the judgment documents of people's courts at all levels in Zhejiang Province, published on the Zhejiang Court Open Network.

Fig. 2 shows a sample judgment document, in which the part marked with a straight underline is the case facts and the part marked with a wavy underline is the legal provisions applicable to the case. The case facts and their legal provisions are extracted according to the writing conventions of judgment documents. Preprocessing mainly consists of cleaning and correcting the applicable-law part of each case.

Fig. 3 shows the tree structure of the label space of legal provisions. Based on this hierarchy, the legal provisions corresponding to the facts of each case are label-expanded.

Fig. 4 shows the frequency distribution of combinations of legal provisions. According to how often each legal provision is cited, 26 frequently cited laws, such as the Civil Procedure Law of the People's Republic of China and the Contract Law of the People's Republic of China, together with the 451 specific legal clauses they contain, were selected as the category labels forming the label space; the dimension of the label space is therefore 477. The category label set of each case sample is represented as a label vector, each dimension of which corresponds to one category label in the label space, i.e. one complete legal provision. If a legal provision applies to a case, then in the case's label vector the entries for that provision and for every legal provision containing it are 1, and otherwise 0. Each sample's label vector thus corresponds to one combination of legal provisions; the frequency of each combination is the number of corresponding case samples, and the frequency distribution of the combinations reflects some properties of the case sample collection. Fig. 4 is obtained by counting the frequency of each combination and arranging the more frequent combinations in descending order. The figure shows that the frequencies of the combinations roughly follow a long-tail distribution: a few combinations occur extremely often, indicating that a large number of case samples use those combinations, while the frequencies of most other combinations are fairly balanced.
步骤三选择信息增益算法进行特征选择。通过计算各个特征词的信息增益可以发现,具有较高信息增益的词大多为动词或名词,表2中显示了信息增益值最高的特征词中动词和名词所占比例,可见在适用法律识别问题中名词和动词相比其他性质的词更具有区分能力,也从另一方面说明可以通过词性标注,去除文本中动词名词之外的词,从而减少文本中词的数量,简化后续计算。Step 3 Select the information gain algorithm for feature selection. By calculating the information gain of each characteristic word, it can be found that the words with higher information gain are mostly verbs or nouns. Table 2 shows the proportion of verbs and nouns in the characteristic words with the highest information gain value, which can be seen in the identification of applicable laws Middle nouns and verbs are more distinguishable than words of other natures. On the other hand, it also shows that words other than verbs and nouns in the text can be removed through part-of-speech tagging, thereby reducing the number of words in the text and simplifying subsequent calculations.
表2特征词中动词名词比例:The proportion of verbs and nouns in the feature words in Table 2:
表3实验训练集和测试集的概况:Table 3 Overview of the experimental training set and test set:
图5和图6分别是取不同近邻个数时层次化指标和宏平均指标性能的比较。Figure 5 and Figure 6 are the performance comparisons of the hierarchical index and the macro-averaged index when different numbers of neighbors are taken.
从图5中可知:当近邻个数为偶数时,算法的精度较高,而召回率较低;当近邻个数为奇数时,算法的精度较低,而召回率较高。随着近邻个数的增大,这种区别逐渐变小。通过对算法的原理进行分析,可以对这种现象进行解释:算法设定的决策阈值为0.5,而当近邻个数为偶数时,由于加入了平滑参数,只有出现次数超过k=2的类别标签会预测为未见实例的类别标签,而出现次数恰好为k=2的类别标签则不会赋予未见实例。因此,当近邻个数为偶数时,各个类别标签赋予未见实例的条件更为严苛,导致算法的预测精度偏高,而相应地召回率就偏低。当近邻个数不断增大后,这种影响逐渐减弱,因此这种区别也就变小。从图中还可以看出目标类别为全部法律条文时,算法的各项预测指标都高于目标类别为具体法律条款时。这是因为更为宽泛的法律类别包含更多的案件样本,从而使得模型在这些类别上有更好的预测能力。综合来看,当近邻个数k值为5时,算法的综合预测性能最好。It can be seen from Figure 5 that when the number of neighbors is even, the precision of the algorithm is high, but the recall rate is low; when the number of neighbors is odd, the precision of the algorithm is low, but the recall rate is high. As the number of neighbors increases, this difference gradually becomes smaller. By analyzing the principle of the algorithm, this phenomenon can be explained: the decision threshold set by the algorithm is 0.5, and when the number of neighbors is even, due to the addition of smoothing parameters, only category labels with occurrences exceeding k=2 The class label of an unseen instance is predicted, while a class label with exactly k=2 occurrences is not assigned to an unseen instance. Therefore, when the number of neighbors is an even number, the conditions for each category label to assign unseen instances are more stringent, resulting in a higher prediction accuracy of the algorithm and a correspondingly lower recall rate. When the number of neighbors increases, this effect gradually weakens, so this difference becomes smaller. It can also be seen from the figure that when the target category is all legal provisions, all the predictive indicators of the algorithm are higher than when the target category is specific legal provisions. This is because broader legal categories contain a larger sample of cases, giving the model better predictive power in those categories. On the whole, when the number of neighbors k is 5, the comprehensive prediction performance of the algorithm is the best.
Figure 6 shows that as the number of neighbors increases, the algorithm's macro-averaged precision, recall, and F-measure all decrease. A likely reason is that with more neighbors, categories with few samples find it harder to reach the decision threshold, so prediction performance drops for most categories, which in turn lowers the corresponding macro-averaged performance.
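A sketch of how macro-averaged metrics are computed (the standard definitions, not anything specific to the patent) makes clear why small categories dominate this trend — every label, however rare, contributes with equal weight:

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over multi-label predictions.

    y_true, y_pred: lists of label sets, one per test instance.
    Per-label precision/recall are averaged with equal weight, so a few
    hard-to-predict rare categories pull the macro averages down as much
    as frequent categories pull them up.
    """
    labels = set().union(*y_true, *y_pred)
    ps, rs, fs = [], [], []
    for lab in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if lab in t and lab not in p)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

For instance, predicting a frequent label perfectly while missing a rare one entirely yields macro scores of only 0.5 across the board.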
Figure 7 shows the algorithm's performance on each evaluation metric with the number of neighbors fixed at 5, under the full-label distance-weighting strategy and the entropy-label distance-weighting strategy, respectively. Overall, on both the hierarchical and the macro-averaged metrics, the entropy-label distance-weighting strategy achieves better precision, while the full-label distance-weighting strategy achieves better recall and F-measure. The reason is that the entropy-label weighting strategy favors samples with fewer category labels, whereas in the expanded hierarchical multi-label samples, the more specific a sample's category, the more category labels it carries, so it receives a smaller weight under the entropy strategy. Predictions under the entropy-label weighting strategy therefore lean toward higher-level categories, yielding a larger generalization error. Although the algorithm's performance drops when the target categories are specific legal clauses, it still attains close to 80% hierarchical precision and over 65% hierarchical recall, showing that identifying the law applicable to a case with this hierarchical multi-label classification algorithm is effective.
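The patent defines the two weighting strategies elsewhere, so the sketch below only illustrates the qualitative behavior described in this paragraph; both formulas here are hypothetical stand-ins. The key property is that the entropy-style weight shrinks as a training sample carries more labels, biasing predictions toward samples with fewer, higher-level labels:

```python
import math

def full_label_distance_weight(distance: float) -> float:
    # Illustrative: weight depends only on distance; all of a neighbor's
    # labels count equally regardless of how many it has.
    return 1.0 / (1.0 + distance)

def entropy_label_distance_weight(distance: float, n_labels: int) -> float:
    # Illustrative: an entropy-like penalty (log of the label count,
    # as if uniform over n_labels) that decreases the weight of
    # label-rich samples. In an ancestor-expanded hierarchy, deeper,
    # more specific samples carry more labels and thus get smaller
    # weight -- matching the bias toward upper-level categories that
    # the text attributes to the entropy strategy.
    return 1.0 / ((1.0 + distance) * (1.0 + math.log(n_labels)))
```

Under these stand-ins, a single-label neighbor is weighted identically by both strategies, while a deep, many-label neighbor is down-weighted only by the entropy variant.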
Two cases are considered: the target categories are all legal provisions, or specific legal clauses. In the present invention, mP_all, mR_all, and mF_all denote the algorithm's macro-averaged precision, recall, and F-measure when the target categories are all legal provisions, and mP_partial, mR_partial, and mF_partial denote the macro-averaged precision, recall, and F-measure when the target categories are specific legal clauses.
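Alongside the macro-averaged metrics, the hierarchical metrics reported above can be computed over ancestor-expanded label sets. The sketch below follows the common definition of hierarchical precision/recall (sums of intersections over ancestor-augmented sets); the patent's exact metric may differ in detail:

```python
def hierarchical_prf(y_true_aug, y_pred_aug):
    """Hierarchical precision, recall and F over ancestor-augmented labels.

    Each true and predicted label set is assumed to already include all
    ancestors of its labels in the class hierarchy; micro-style sums of
    the set intersections then give hP and hR. Predicting only a correct
    upper-level category therefore earns partial recall credit.
    """
    inter = sum(len(t & p) for t, p in zip(y_true_aug, y_pred_aug))
    pred = sum(len(p) for p in y_pred_aug)
    true = sum(len(t) for t in y_true_aug)
    hp = inter / pred if pred else 0.0
    hr = inter / true if true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf
```

For example, predicting only the broad category `{"law"}` for a case whose expanded true set is `{"law", "law/art1"}` gives perfect hierarchical precision but only 50% hierarchical recall, mirroring the precision/recall gap reported for specific legal clauses.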
This embodiment selects two commonly used hierarchical multi-label classification algorithms, the local TreeBoost.MH algorithm and the global Clus-HMC algorithm, and compares their prediction performance with that of the present hierarchical multi-label classification algorithm. Table 5 compares their performance on the hierarchical metrics, and Table 6 compares their prediction performance on the macro-averaged metrics.
Table 5. Hierarchical-metric performance comparison of the algorithms:
Table 6. Macro-averaged performance comparison of the algorithms:
The results demonstrate that the present hierarchical multi-label classification algorithm achieves better prediction performance than the existing methods. Combined with the Lazy-HMC algorithm's support for incremental learning, Lazy-HMC can be used to build an effective and practical system for automatically identifying the law applicable to a case.
The present invention provides a hierarchical multi-label classification method suitable for legal identification. There are many ways to implement this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention. Any component not specified in this embodiment can be realized with existing technology.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710832304.8A CN107577785B (en) | 2017-09-15 | 2017-09-15 | Hierarchical multi-label classification method suitable for legal identification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710832304.8A CN107577785B (en) | 2017-09-15 | 2017-09-15 | Hierarchical multi-label classification method suitable for legal identification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107577785A true CN107577785A (en) | 2018-01-12 |
CN107577785B CN107577785B (en) | 2020-02-07 |
Family
ID=61035969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710832304.8A Active CN107577785B (en) | 2017-09-15 | 2017-09-15 | Hierarchical multi-label classification method suitable for legal identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107577785B (en) |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304386A (en) * | 2018-03-05 | 2018-07-20 | 上海思贤信息技术股份有限公司 | A kind of logic-based rule infers the method and device of legal documents court verdict |
CN108334500A (en) * | 2018-03-05 | 2018-07-27 | 上海思贤信息技术股份有限公司 | A kind of judgement document's mask method and device based on machine learning algorithm |
CN108664924A (en) * | 2018-05-10 | 2018-10-16 | 东南大学 | A kind of multi-tag object identification method based on convolutional neural networks |
CN108763361A (en) * | 2018-05-17 | 2018-11-06 | 南京大学 | A kind of multi-tag taxonomy model method based on topic model |
CN109543178A (en) * | 2018-11-01 | 2019-03-29 | 银江股份有限公司 | A kind of judicial style label system construction method and system |
CN109685158A (en) * | 2019-01-08 | 2019-04-26 | 东北大学 | A kind of cluster result semantic feature extraction and method for visualizing based on strong point collection |
CN109919368A (en) * | 2019-02-26 | 2019-06-21 | 西安交通大学 | A system and method for recommendation prediction based on association graph |
CN109961094A (en) * | 2019-03-07 | 2019-07-02 | 北京达佳互联信息技术有限公司 | Sample acquiring method, device, electronic equipment and readable storage medium storing program for executing |
CN110046256A (en) * | 2019-04-22 | 2019-07-23 | 成都四方伟业软件股份有限公司 | The prediction technique and device of case differentiation result |
CN110135592A (en) * | 2019-05-16 | 2019-08-16 | 腾讯科技(深圳)有限公司 | Classifying quality determines method, apparatus, intelligent terminal and storage medium |
CN110163849A (en) * | 2019-04-28 | 2019-08-23 | 上海鹰瞳医疗科技有限公司 | Training data processing method, classification model training method and equipment |
CN110245907A (en) * | 2018-03-09 | 2019-09-17 | 北京国双科技有限公司 | The generation method and device of court's trial notes content |
CN110245229A (en) * | 2019-04-30 | 2019-09-17 | 中山大学 | A deep learning topic sentiment classification method based on data augmentation |
CN110287287A (en) * | 2019-06-18 | 2019-09-27 | 北京百度网讯科技有限公司 | Case by prediction technique, device and server |
CN110347839A (en) * | 2019-07-18 | 2019-10-18 | 湖南数定智能科技有限公司 | A kind of file classification method based on production multi-task learning model |
CN110442722A (en) * | 2019-08-13 | 2019-11-12 | 北京金山数字娱乐科技有限公司 | Method and device for training classification model and method and device for data classification |
CN110543634A (en) * | 2019-09-02 | 2019-12-06 | 北京邮电大学 | Processing method, device, electronic device and storage medium of corpus data set |
CN110633365A (en) * | 2019-07-25 | 2019-12-31 | 北京国信利斯特科技有限公司 | A hierarchical multi-label text classification method and system based on word vectors |
CN110751188A (en) * | 2019-09-26 | 2020-02-04 | 华南师范大学 | User label prediction method, system and storage medium based on multi-label learning |
CN110781650A (en) * | 2020-01-02 | 2020-02-11 | 四川大学 | Method and system for automatically generating referee document based on deep learning |
CN110825879A (en) * | 2019-09-18 | 2020-02-21 | 平安科技(深圳)有限公司 | Case decision result determination method, device and equipment and computer readable storage medium |
CN110837735A (en) * | 2019-11-17 | 2020-02-25 | 太原蓝知科技有限公司 | Intelligent data analysis and identification method and system |
CN110851596A (en) * | 2019-10-11 | 2020-02-28 | 平安科技(深圳)有限公司 | Text classification method and device and computer readable storage medium |
CN110895703A (en) * | 2018-09-12 | 2020-03-20 | 北京国双科技有限公司 | Legal document routing identification method and device |
CN110909157A (en) * | 2018-09-18 | 2020-03-24 | 阿里巴巴集团控股有限公司 | Text classification method and device, computing equipment and readable storage medium |
CN110968693A (en) * | 2019-11-08 | 2020-04-07 | 华北电力大学 | A computational method for multi-label text classification based on ensemble learning |
CN111126053A (en) * | 2018-10-31 | 2020-05-08 | 北京国双科技有限公司 | Information processing method and related equipment |
CN111143569A (en) * | 2019-12-31 | 2020-05-12 | 腾讯科技(深圳)有限公司 | Data processing method and device and computer readable storage medium |
CN111475647A (en) * | 2020-03-19 | 2020-07-31 | 平安国际智慧城市科技股份有限公司 | Document processing method and device and server |
CN111540468A (en) * | 2020-04-21 | 2020-08-14 | 重庆大学 | ICD automatic coding method and system for visualization of diagnosis reason |
CN111723208A (en) * | 2020-06-28 | 2020-09-29 | 西南财经大学 | Conditional classification tree-based legal decision document multi-classification method and device and terminal |
CN111737479A (en) * | 2020-08-28 | 2020-10-02 | 深圳追一科技有限公司 | Data acquisition method and device, electronic equipment and storage medium |
CN111738303A (en) * | 2020-05-28 | 2020-10-02 | 华南理工大学 | A Hierarchical Learning-Based Image Recognition Method for Long-tailed Distribution |
CN111930944A (en) * | 2020-08-12 | 2020-11-13 | 中国银行股份有限公司 | File label classification method and device |
CN112016430A (en) * | 2020-08-24 | 2020-12-01 | 郑州轻工业大学 | A hierarchical action recognition method for multiple mobile phone wearing positions |
CN112131884A (en) * | 2020-10-15 | 2020-12-25 | 腾讯科技(深圳)有限公司 | Method and device for entity classification and method and device for entity presentation |
CN112182213A (en) * | 2020-09-27 | 2021-01-05 | 中润普达(十堰)大数据中心有限公司 | Modeling method based on abnormal lacrimation feature cognition |
CN112232524A (en) * | 2020-12-14 | 2021-01-15 | 北京沃东天骏信息技术有限公司 | Multi-label information identification method and device, electronic equipment and readable storage medium |
CN112464973A (en) * | 2020-08-13 | 2021-03-09 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
CN113407727A (en) * | 2021-03-22 | 2021-09-17 | 天津汇智星源信息技术有限公司 | Qualitative measure and era recommendation method based on legal knowledge graph and related equipment |
CN114117040A (en) * | 2021-11-08 | 2022-03-01 | 重庆邮电大学 | Multi-label classification of text data based on label-specific features and correlations |
US11379758B2 (en) | 2019-12-06 | 2022-07-05 | International Business Machines Corporation | Automatic multilabel classification using machine learning |
CN114860892A (en) * | 2022-07-06 | 2022-08-05 | 腾讯科技(深圳)有限公司 | Hierarchical category prediction method, device, equipment and medium |
CN117216688A (en) * | 2023-11-07 | 2023-12-12 | 西南科技大学 | Enterprise industry identification method and system based on hierarchical label tree and neural network |
CN118210926A (en) * | 2024-05-21 | 2024-06-18 | 山东云海国创云计算装备产业创新中心有限公司 | Text label prediction method and device, electronic equipment and storage medium |
CN118313376A (en) * | 2024-06-07 | 2024-07-09 | 腾讯科技(深圳)有限公司 | Text processing method, device, equipment, storage medium and product |
CN119537898A (en) * | 2025-01-21 | 2025-02-28 | 武夷学院 | Hierarchical feature selection method and system based on label correlation and instance correlation |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199857A (en) * | 2014-08-14 | 2014-12-10 | 西安交通大学 | Tax document hierarchical classification method based on multi-tag classification |
US20150161198A1 (en) * | 2013-12-05 | 2015-06-11 | Sony Corporation | Computer ecosystem with automatically curated content using searchable hierarchical tags |
CN104881689A (en) * | 2015-06-17 | 2015-09-02 | 苏州大学张家港工业技术研究院 | Method and system for multi-label active learning classification |
CN105868773A (en) * | 2016-03-23 | 2016-08-17 | 华南理工大学 | Hierarchical random forest based multi-tag classification method |
CN106126972A (en) * | 2016-06-21 | 2016-11-16 | 哈尔滨工业大学 | A kind of level multi-tag sorting technique for protein function prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |