CN115878777A

CN115878777A - Judicial document index extraction method based on few-sample comparative learning

Info

Publication number: CN115878777A
Application number: CN202211708714.9A
Authority: CN
Inventors: 唐欢容; 左辉阳; 欧阳建权
Original assignee: Xiangtan University
Current assignee: Xiangtan University
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2023-03-31

Abstract

A method for extracting element indexes aiming at judicial texts comprises the following steps: 1) Acquiring judicial writing data, cleaning the judicial writing data, and constructing training data based on the referee writing. 2) A structured language (JDISL) of the judicial literature indexes is provided, the JDISL is adopted to conduct prompt guidance, a structural form of the indexes based on the prompt is obtained, and the training corpus is further processed into the prompt training corpus. 3) And carrying out few-sample negative sampling amplification on the prompt training corpus. 4) On the basis of an open-source Unilm basic pre-training model, a prompt training corpus is used as input. 5) And (4) taking a Unilm model hidden layer vector, and processing the hidden layer vector into Gaussian embedding. 6) And calculating contrast learning loss of Gaussian embedding, and performing iterative updating to obtain a final model. 7) And inputting the judicial documents of the indexes to be extracted into the trained model, and extracting information by the index extraction model to obtain the information extraction result of each label type.

Description

Judicial document index extraction method based on few-sample comparative learning

技术领域technical field

本发明属于裁判文书数据处理技术领域，具体设计一种基于少样本对比学习(即，基于知识图谱)的司法文书指标提取方法。The invention belongs to the technical field of judgment document data processing, and specifically designs a judicial document index extraction method based on few-sample comparative learning (that is, based on knowledge graph).

背景技术Background technique

信息时代下互联网数据量呈指数增长，随着互联网技术的飞速发展，网络上的信息呈现爆炸式地增长，不仅信息规模不断扩大，信息种类也不断增多。与此同时，大量数据在各个领域的成功应用宣告了大数据时代的到来，大数据在社会的发展中起着越来越重要的作用，其价值已得到了社会的普遍认可。近年来，随着我国法制化建设的不断深入，司法案件的审理也变得愈发透明，裁判文书在网上的公开就是一个典型例子。裁判文书作为承载法院案件审理过程以及审判结果的“司法产品”，其蕴含着丰富的司法信息，包括判决法院、案号、当事人诉讼请求、案件名称、判决结果、适用法律等，这些恰恰聚集了法院“大数据”的核心要素。通过对这些信息进行深度挖掘，可以总结案件审判规律，预测审判趋势，提升司法公信力，为实现司法公正，建设法制社会提供技术支撑。然而，裁判文书是一种半结构化的领域文本，其既有程式化的法言法语，也有日常的普通用语，同时裁判文书的书写很大程度上决定于法官，这使得裁判文书具有多态性、异构性以及随意性等一系列特征。因此如何在这类特殊的文本中抽取有价值的信息是一项具有重要价值及意义的课题。In the information age, the amount of Internet data is growing exponentially. With the rapid development of Internet technology, the information on the network is growing explosively. Not only the scale of information is expanding, but also the types of information are increasing. At the same time, the successful application of a large amount of data in various fields has announced the arrival of the era of big data. Big data plays an increasingly important role in the development of society, and its value has been generally recognized by the society. In recent years, with the continuous deepening of legalization in our country, the trial of judicial cases has become more transparent. The disclosure of judgment documents on the Internet is a typical example. Judgment documents, as "judicial products" that carry the trial process and trial results of court cases, contain rich judicial information, including the court of judgment, case number, litigation claims of the parties, case name, judgment result, applicable law, etc., which precisely gather The core elements of court "big data". Through in-depth mining of this information, it is possible to summarize the law of trial trials, predict trial trends, enhance judicial credibility, and provide technical support for the realization of judicial justice and the construction of a legal society. However, adjudication documents are semi-structured field texts, which include both stylized legal language and everyday common language. At the same time, the writing of adjudication documents is largely determined by judges, which makes adjudication documents polymorphic. A series of characteristics such as sex, heterogeneity and randomness. Therefore, how to extract valuable information from such special texts is a subject of great value and significance.

司法领域的大量司法文本存在广泛的指标提取需要，用于对司法程序或司法审判改革的成效评判等。随着我国法治社会的建设和信息技术的发展，司法领域的司法文书数量呈指数级增长。案多人少的情况难以满足案件处理的需求，司法人员专业素养的差异性也会对案件判决造成一定程度的影响。因此，司法文书关键要素智能化的抽取能够为司法人员提供参考，辅助办案，提高工作效率，同时也是后续进行深入分析和高效判案的关键。A large number of judicial texts in the judicial field have a wide range of index extraction needs, which are used to evaluate the effectiveness of judicial procedures or judicial trial reforms. With the construction of a society ruled by law and the development of information technology in our country, the number of judicial documents in the judicial field has grown exponentially. It is difficult to meet the needs of case handling when there are too few cases, and the differences in the professionalism of judicial personnel will also have a certain degree of impact on case judgments. Therefore, the intelligent extraction of key elements of judicial documents can provide reference for judicial personnel, assist in handling cases, improve work efficiency, and is also the key to subsequent in-depth analysis and efficient judgment.

目前司法文书领域可用标注数据稀少，标注难度大，存在可用数据资源稀缺的问题，属于低资源的少样本应用场景；司法文书指标提取任务包括实体抽取、关系抽取、事件抽取等多个子任务，不同的子任务所需要的标注数据不同、算法模型不同，这导致在低资源的少样本场景下，还面临着训练多个子任务模型的应用需求。综上可知，司法文书指标提取面临着多任务算法模型在低资源的少样本应用场景下训练难以收敛、抽取精度低、泛化性差的问题，这样的司法文书指标提取模型是无法满足辅助司法人员办案，提高工作效率这一目标的。因此，本发明提出一种基于少样本对比学习的司法文书指标提取方法有效的解决了上述问题。At present, the available annotation data in the field of judicial documents is scarce, the annotation is difficult, and there is a problem of scarcity of available data resources. It belongs to the low-resource and few-sample application scenario; the judicial document index extraction task includes multiple subtasks such as entity extraction, relationship extraction, and event extraction. Different subtasks require different labeled data and different algorithm models, which leads to the application requirements of training multiple subtask models in low-resource and few-sample scenarios. In summary, the extraction of judicial document indicators faces the problems of multi-task algorithm model training in low-resource and few-sample application scenarios, which is difficult to converge, low extraction accuracy, and poor generalization. Such a judicial document indicator extraction model cannot satisfy the needs of judicial personnel. The goal of handling cases and improving work efficiency. Therefore, the present invention proposes a judicial document index extraction method based on comparative learning with few samples, which effectively solves the above-mentioned problems.

发明内容Contents of the invention

针对现有技术的不足，本发明提供了一种基于少样本对比学习(即，基于知识图谱)的司法文书指标提取方法。首先获取司法文书数据，对司法文书数据进行数据清洗，构建基于裁判文书的训练数据。然后使用设计的司法文书指标结构化语言(JDISL)对司法文书指标进行prompt引导，进一步将训练语料处理为prompt训练语料。随后将prompt训练语料进行少样本负采样扩增。在开源的UniLM基础预训练模型上，将prompt训练语料作为输入，取UniLM模型隐藏层向量，处理为高斯嵌入，计算高斯嵌入的对比学习损失，迭代更新得到最终模型。将待提取指标的司法文书输入到训练好的模型中，得到各个标签类型的信息抽取结果。Aiming at the deficiencies of the prior art, the present invention provides a judicial document index extraction method based on few-sample comparative learning (that is, based on knowledge graph). Firstly, obtain judicial document data, perform data cleaning on judicial document data, and construct training data based on judgment documents. Then use the designed Judicial Document Index Structured Language (JDISL) to conduct prompt guidance on the judicial document indicators, and further process the training corpus into prompt training corpus. Then the prompt training corpus is amplified by few-sample negative sampling. On the open-source UniLM basic pre-training model, the prompt training corpus is used as input, the hidden layer vector of the UniLM model is taken, processed into a Gaussian embedding, the contrastive learning loss of the Gaussian embedding is calculated, and the final model is iteratively updated. Input the judicial documents of the indicators to be extracted into the trained model, and obtain the information extraction results of each label type.

针对现有技术的不足，本发明提供了一种基于少样本对比学习的司法文书指标提取方法。Aiming at the deficiencies of the prior art, the present invention provides a judicial document index extraction method based on comparative learning with few samples.

根据本发明的一个实施方案，提供一种基于少样本对比学习的司法文书指标提取方法，该方法包括以下步骤：According to one embodiment of the present invention, there is provided a method for extracting judicial document indicators based on few-sample contrastive learning, the method comprising the following steps:

1)获取司法文书数据(例如在网上公开的裁判文书或裁判文书数据)，对司法文书数据进行数据清洗，构建基于裁判文书(即，司法文书)的包含训练语料的训练数据；1) Obtain judicial document data (such as judgment documents or judgment document data published on the Internet), perform data cleaning on judicial document data, and construct training data containing training corpus based on judgment documents (that is, judicial documents);

2)对司法文书指标提出了一种司法文书指标结构化语言(JDISL)，采用JDISL进行prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料；2) A Judicial Document Indicator Structured Language (JDISL) is proposed for judicial document indicators. JDISL is used for prompt guidance, and the prompt-based construction form of indicators is obtained, and the training corpus is further processed into prompt training corpus;

3)将prompt训练语料进行少样本负采样扩增，得到扩增后的prompt训练语料；3) Amplify the prompt training corpus by negative sampling with few samples to obtain the amplified prompt training corpus;

4)在开源的Unilm基础预训练模型的基础上，将扩增后的prompt训练语料作为输入；4) On the basis of the open source Unilm basic pre-training model, the amplified prompt training corpus is used as input;

5)取Unilm模型隐藏层向量，处理为高斯嵌入；5) Take the hidden layer vector of the Unilm model and process it as a Gaussian embedding;

6)计算高斯嵌入的对比学习损失，迭代更新得到最终模型；6) Calculate the contrastive learning loss of Gaussian embedding, and iteratively update to obtain the final model;

7)将待提取指标的司法文书输入到训练好的所述最终模型中，指标抽取模型进行信息抽取，得到各个标签类型的信息抽取结果。7) Input the judicial documents of the indicators to be extracted into the trained final model, and the indicator extraction model performs information extraction to obtain the information extraction results of each label type.

优选，步骤2)中对司法文书指标采用prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料的过程包括以下2.1)和2.2)两个子步骤：Preferably, step 2) adopts prompt guidance to the judicial document index, obtains the construction form based on prompt of the index, and further processes the training corpus into the prompt training corpus and includes the following two substeps of 2.1) and 2.2):

步骤2.1)利用包括实体抽取、关系抽取、事件抽取和响应抽取的信息抽取结构化生成手段(或信息提取手段)，对司法文书中的庭审要素指标进行prompt构造。以便统一建模为由文本到prompt结构的生成框架，对司法文书中的实体指标、关系指标、事件指标、响应指标实现了统一的建模。Step 2.1) Prompt construction of court trial element indicators in judicial documents by using information extraction structured generation means (or information extraction means) including entity extraction, relationship extraction, event extraction and response extraction. In order to create a framework from text to prompt structure for unified modeling, unified modeling is realized for entity indicators, relationship indicators, event indicators, and response indicators in judicial documents.

信息提取受到不同目标、异构结构和特定需求模式的影响，分为实体抽取、关系抽取、事件抽取、响应抽取等。对庭审要素指标进行prompt构造。统一建模为由文本到prompt结构的生成框架，对司法文书中的实体指标、关系指标、事件指标、响应指标实现了统一的建模。Information extraction is affected by different goals, heterogeneous structures, and specific demand patterns, and is divided into entity extraction, relationship extraction, event extraction, and response extraction. Prompt construction is carried out on the index of court trial elements. Unified modeling is a generation framework from text to prompt structure, and realizes unified modeling of entity indicators, relationship indicators, event indicators, and response indicators in judicial documents.

信息抽取结构化生成可以分解为两个原子操作。The structured generation of information extraction can be decomposed into two atomic operations.

1)定位：从句子中定位目标信息片段，该信息片段包括例如司法文书中的实体和事件等。1) Locating: locating target information fragments from sentences, such information fragments include, for example, entities and events in judicial documents.

2)关联：根据所需的关联连接不同的信息片段，如(一个)实体与(另一个)实体之间的连接，实体与事件之间的连接，或(一个)事件与(另一个)事件之间的连接。2) Association: Connecting different pieces of information according to the required association, such as a connection between (one) entity and (another) entity, between an entity and an event, or between (one) event and (another) event the connection between.

不同的信息抽取任务结构都可以由原子操作组合生成。Different information extraction task structures can be generated by combining atomic operations.

根据司法文书指标的结构特定，设计了一种司法文书指标结构化语言(JDISL)，其中JDISL的符号“：”表示从信息到其定位或关联片段的映射，符号“[]、()”用于表示抽取指标信息之间的层次结构。According to the specific structure of judicial document indicators, a judicial document indicator structured language (JDISL) is designed, in which the symbol ":" of JDISL represents the mapping from information to its location or associated fragments, and the symbols "[], ()" are used It is used to express the hierarchical structure between extracted index information.

JDISL由三部分组成：JDISL consists of three parts:

1)info_name:表示源文本种存在的一个特定信息片段1) info_name: Indicates a specific piece of information that exists in the source text

2)relation_name:表示源文本存在的一个特定信息片段，该信息片段与结构中的上层info信息具有关联。2) relation_name: Indicates a specific piece of information that exists in the source text, and this piece of information is associated with the upper-level info information in the structure.

3)info_span:表示源文本种的特定信息或关联片段相对应的文本片段跨度。3) info_span: Indicates the specific information of the source text or the text segment span corresponding to the associated segment.

通过JDISL可以对不同的信息抽取结构进行统一编码，将不同的司法文书指标信息抽取任务统一建模为相同的由文本到结构化的生成过程。Through JDISL, different information extraction structures can be uniformly encoded, and different judicial document index information extraction tasks can be uniformly modeled as the same generation process from text to structure.

对于实体识别和事件检测任务可以建模为：For entity recognition and event detection tasks can be modeled as:

(info_name:info_span)，(info_name:info_span),

对于关系提取和事件提取任务可以统一建模为：For relation extraction and event extraction tasks, it can be modeled as:

(info_name:info_span(relation_name:info_span),…)，(info_name:info_span(relation_name:info_span),...),

以庭审要素为例，则所要抽取的指标包括证人、原告、被告、证据、法院、法官，指标之间存在的关系包括(原告)起诉(被告)。对其进行prompt引导，对于实体，将其并列表示为：Taking court trial elements as an example, the indicators to be extracted include witnesses, plaintiffs, defendants, evidence, courts, and judges, and the relationship between indicators includes (plaintiff) suing (defendant). It is prompted to guide it, and for the entity, it is expressed in parallel as:

[证人，原告，被告，法官][witness, plaintiff, defendant, judge]

对于属性与关系依赖，将其表示为：For attribute and relationship dependencies, express it as:

[法院：证据，证人：(证词)，[原告：(起诉)]，被告，法官][court: evidence, witness: (testimony), [plaintiff: (suit)], defendant, judge]

步骤2.2)在对司法文书中的庭审要素指标进行prompt构造之后，将包含训练语料的训练数据基于prompt构造进行处理，从而得到prompt训练语料。Step 2.2) After the prompt construction is performed on the court trial element indicators in the judicial documents, the training data containing the training corpus is processed based on the prompt construction, so as to obtain the prompt training corpus.

在对庭审要素指标进行prompt构造之后，需要将训练语料基于prompt构造进行处理，从而得到prompt训练语料。After the prompt construction of the court trial element indicators, the training corpus needs to be processed based on the prompt construction to obtain the prompt training corpus.

如对于原语料：For example, for the original corpus:

Content:“原告崔某与被告张某劳动合同纠纷一案，证人刘某的证词表明崔某存在过错。”Content: "In the labor contract dispute case between the plaintiff Cui and the defendant Zhang, the testimony of the witness Liu showed that Cui was at fault."

Label:[2,3,“原告”]，[7,8,“被告”]，[20,21,“证人”]，[27,32,“证词”]，[(2,3),(7,8),“起诉”]Label: [2,3, "Plaintiff"], [7,8, "Defendant"], [20,21, "Witness"], [27,32, "Testimony"], [(2,3),( 7,8), "Sue"]

基于prompt构造处理为prompt结构化语料：Based on the prompt structure, it is processed into a prompt structured corpus:

Content:“原告崔某与被告张某劳动合同纠纷一案”Content: "A labor contract dispute between the plaintiff Cui and the defendant Zhang"

Result:[(2,3,“崔某”):(“起诉”)，(7,8,“张某”)：(“被起诉”)，(20,21，“刘某”)：(27,32，“崔某存在过错”)]Result: [(2,3, "Cui"): ("Sued"), (7,8, "Zhang"): ("Sued"), (20,21, "Liu"): ( 27,32,"Cui was at fault")]

Prompt构造作为预定义模式，对训练文本处理为prompt结构化语料形式。Prompt is constructed as a predefined pattern, and the training text is processed into a prompt structured corpus.

优选，步骤3)中将prompt训练语料进行少样本负采样扩增的过程更具体地包括以下步骤：Preferably, in step 3), the process of carrying out the few-sample negative sampling amplification of the prompt training corpus more specifically includes the following steps:

步骤3.1)对于prompt构造语料：Step 3.1) For prompt construction corpus:

Content:textContent: text

Result:[[info_type_1],[info_type_2],…,[info_type_n]]Result: [[info_type_1],[info_type_2],…,[info_type_n]]

其中[info_type_n]表示所抽取的标签类型为n的结构形式为[[info_name],[relation_name],[info_span]]的信息集合。Among them, [info_type_n] means that the extracted information set whose label type is n has the structure form [[info_name], [relation_name], [info_span]].

将Result基于promt处理为多个result子集：Process Result into multiple result subsets based on promt:

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_2:[info_type_2][prompt:label_2]Result_2:[info_type_2][prompt:label_2]

……

Result_n:[info_type_n][prompt:label_n]Result_n:[info_type_n][prompt:label_n]

其中label_n表示[info_type_n]集合所对应的标签类型n；Where label_n represents the label type n corresponding to the [info_type_n] set;

随后基于prompt类型进行负采样，将数据量由n扩充为n^2,如将Result_1扩充为n倍：Then perform negative sampling based on the prompt type, and expand the amount of data from n to n^2, such as expanding Result_1 to n times:

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_1:[info_type_1][prompt:label_2]Result_1:[info_type_1][prompt:label_2]

……

Result_1:[info_type_1][prompt:label_n]。Result_1:[info_type_1][prompt:label_n].

从而得到扩增后的prompt训练语料。Thus, the amplified prompt training corpus is obtained.

优选，步骤4)中在开源的Unilm基础预训练模型的基础上，将prompt训练语料作为输入的过程更具体地包括：Preferably, in step 4), on the basis of the Unilm basic pre-training model of open source, the process of using the prompt training corpus as input more specifically includes:

步骤4.1)将prompt训练语料处理为Unilm模型所需要的输入形式。Unilm模型所能接受的输入形式为：Step 4.1) Process the prompt training corpus into the input form required by the Unilm model. The input forms accepted by the Unilm model are:

{x₁，x₂，x₃，...，x_m-1，x_m}{x ₁ , x ₂ , x ₃ ,..., x _m-1 , x _m }

由JDISL得到的prompt结构化信息为[[info_name],[relation_name],[text]],其中text是在源文本中info_span跨度之内所指代的信息。The prompt structured information obtained by JDISL is [[info_name], [relation_name], [text]], where text is the information referred to within the info_span span in the source text.

随后将prompt结构化信息作为前缀与片段信息所关联，得到模型输入语料：{info_name_1,…info_name_n,…,relation_name_1,…,relation_name_n,…,text_1,…,text_n，x₁，x₂，x₃，...，x_m-1，x_m}；Then the prompt structured information is used as a prefix to associate with the segment information to obtain the model input corpus: {info_name_1,...info_name_n,...,relation_name_1,...,relation_name_n,...,text_1,...,text_n, x ₁ , x ₂ , x ₃ , ..., x _m-1 , x _m };

以上，n的数值指的是实体/关系片段信息的数量。m的数值指的是输入文书的序列长度。Above, the value of n refers to the number of entity/relationship fragment information. The value of m refers to the sequence length of the input document.

Unilm即可以用于生成任务，也可以用于分类任务。输入之后，首先使用Unilm预训练的tokenizer对输入进行分词，随后进行向量嵌入，该向量嵌入包括token嵌入，段嵌入，和位置嵌入。Unilm can be used for both generation and classification tasks. After the input, the tokenizer pre-trained by Unilm is used to segment the input, and then the vector embedding is performed. The vector embedding includes token embedding, segment embedding, and position embedding.

Token嵌入指在每个句子的句首插入[CLS]标记，句尾插入[SEP]标记，其中[CLS]标记代表当前句子的向量，[SEP]标记代表分句用于切分文本中的句子。Token embedding refers to inserting the [CLS] tag at the beginning of each sentence and the [SEP] tag at the end of the sentence, where the [CLS] tag represents the vector of the current sentence, and the [SEP] tag represents the sentence used to segment the sentence in the text .

段嵌入用于区分两个句子，不同句子之前分别为A和B标记，所以输入的句子表示为(E_A，E_B，E_A，E_B,……)。Segment embeddings are used to distinguish two sentences. Different sentences are marked with A and B respectively, so the input sentence is expressed as (E _A , E _B , E _A , E _B ,…).

位置嵌入则是根据句子的下标索引相对位置添加位置信息，即或如[0,1,2,3…]。Position embedding is to add position information according to the relative position of the subscript index of the sentence, that is, or such as [0,1,2,3...].

经过嵌入之后，会经过24个串联的transformer结构层，通过多头注意力机制计算得到隐藏层输出向量After embedding, it will go through 24 serial transformer structure layers, and calculate the hidden layer output vector through the multi-head attention mechanism.

h＝{H；h₁，h₂，h₃，...，h_m-1，h_m}h={H; h ₁ , h ₂ , h ₃ , . . . , h _m-1 , h _m }

其中，h_m代表第m个司法文书序列的隐藏层状态，H代表prompt结构化信息的状态集合。Among them, h _m represents the hidden layer state of the m-th judicial document sequence, and H represents the state set of prompt structured information.

作为优选，步骤5)中取Unilm模型隐藏层向量，处理为高斯嵌入的过程更具体地包括：As preferably, in step 5), get Unilm model hidden layer vector, the process that is processed as Gaussian embedding comprises more specifically:

根据步骤5.1)使用由全连接层与非线性激活函数组成的投影网络f_u和f_∑来生成高斯分布参数：According to step 5.1), use the projection network f _u and f _∑ consisting of fully connected layers and nonlinear activation functions to generate Gaussian distribution parameters:

u_i＝f_u(h_i)；u _i =f _u (h _i );

Σ_i＝ELu(f_∑(h_i))+(1+∈)Σ _i ＝ELu(f _∑ (h _i ))+(1+∈)

其中u_i，Σ_i分别代表高斯嵌入的平均协方差和对角协方差(非零元素仅沿矩阵的对角)，f_u和f_∑都由单层线性层组成，以Relu作为激活函数，为了稳定性∈取e^(-14)。where u _i , Σ _i represent the average covariance and diagonal covariance of the Gaussian embedding respectively (non-zero elements are only along the diagonal of the matrix), both f _u and f _∑ are composed of a single-layer linear layer, with Relu as the activation function, For stability ∈ take e^(-14).

将隐藏层输出向量处理为了高斯嵌入(u_i，Σ_i)。Process the hidden layer output vector into a Gaussian embedding (u _i , Σ _i ).

优选，步骤6)中计算高斯嵌入的对比学习损失，迭代更新得到最终模型的过程更具体地包括以下6.1)和6.2)两个子步骤：Preferably, in step 6), the contrastive learning loss of Gaussian embedding is calculated, and the process of iteratively updating to obtain the final model more specifically includes the following two substeps of 6.1) and 6.2):

步骤6.1)为了计算对比度损失，考虑采样语料中所有有效token对之间KL发散，如果两个标记x_q和x_p具有相同的标签y_q＝y_p，则它们被视为正面示例，对于它们的高斯嵌入N(u_p，Σ_p)和N(u_q，∑_q)，对它的KL散度进行计算：Step 6.1) To compute the contrast loss, consider the KL divergence between all valid token pairs in the sampled corpus, two tokens x _q and x _p are considered positive examples if they have the same label y _q = y _p , for which The Gaussian embedding N(u _p , Σ _p ) and N(u _q , ∑ _q ), calculate its KL divergence:

KL发散的两个方向都是计算出来的，因为它不是对称的。Both directions of KL divergence are calculated since it is not symmetrical.

d(p，q)＝1/2(D_KL|N_p||N_q|+D_KL|N_q||N_p|)d(p,q)＝1/2(D _KL |N _p ||N _q |+D _KL |N _q ||N _p |)

对于正样本x_p，计算x_p相对于批次中其它有效令牌的高斯嵌入损失：For a positive sample x _p , compute the Gaussian embedding loss of x _p with respect to other valid tokens in the batch:

χ_p＝{(χ_q，y_q)∈χ|y_p＝y_q·p≠q}χ _p ＝{(χ _q ，y _q )∈χ|y _p ＝y _q ·p≠q}

最后利用KL散度和高斯嵌入损失对对比学习损失进行计算：Finally, the contrastive learning loss is calculated using KL divergence and Gaussian embedding loss:

步骤6.2)基于对比学习损失函数进行反向传播，对UniLM模型参数进行迭代更新，得到最终可用于司法文书指标提取的信息抽取模型，即最终模型。Step 6.2) Backpropagation is performed based on the contrastive learning loss function, and the parameters of the UniLM model are iteratively updated to obtain an information extraction model that can finally be used for index extraction of judicial documents, that is, the final model.

目前司法文书领域可用标注数据稀少，标注难度大，存在可用数据资源稀缺的问题，属于低资源的少样本应用场景；司法文书指标提取任务包括实体抽取、关系抽取、事件抽取等多个子任务，不同的子任务所需要的标注数据不同、算法模型不同，这导致在低资源的少样本场景下，还面临着训练多个子任务模型的应用需求。因此，亟需一种司法文书指标提取方法，以解决多任务算法模型在低资源的少样本应用场景下训练难以收敛、抽取精度低、泛化性差的问题。At present, the available annotation data in the field of judicial documents is scarce, the annotation is difficult, and there is a problem of scarcity of available data resources. It belongs to the low-resource and few-sample application scenario; the judicial document index extraction task includes multiple subtasks such as entity extraction, relationship extraction, and event extraction. Different subtasks require different labeled data and different algorithm models, which leads to the application requirements of training multiple subtask models in low-resource and few-sample scenarios. Therefore, there is an urgent need for a judicial document index extraction method to solve the problems of difficult convergence, low extraction accuracy, and poor generalization of multi-task algorithm models in low-resource and few-sample application scenarios.

在本发明中，对司法文书指标采用prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料。In the present invention, prompt guidance is adopted for the indicators of judicial documents, the prompt-based construction form of the indicators is obtained, and the training corpus is further processed into prompt training corpus.

1)定位：从句子中定位目标信息片段，例如文书中的实体和事件等。1) Locating: Locate target information fragments from sentences, such as entities and events in documents.

2)关联：根据所需的关联连接不同的信息片段，如实体与实体之间的连接，实体与事件之间的连接。2) Association: Connect different pieces of information according to the required association, such as the connection between entities and entities and the connections between entities and events.

根据司法文书指标的结构特定，设计了一种司法文书指标结构化语言(JDISL)，JDISL的：表示从信息到其定位或关联片段的映射，[]、()用于表示抽取指标信息之间的层次结构。According to the specific structure of judicial document indicators, a judicial document indicator structured language (JDISL) is designed. JDISL: represents the mapping from information to its location or associated fragments, [], () are used to represent the extracted index information hierarchy.

JDISL由三部分组成：JDISL consists of three parts:

(info_name:info_span)，(info_name:info_span),

[证人，原告，被告，法官][witness, plaintiff, defendant, judge]

如对于原语料：For example, for the original corpus:

Result:[(2,3,“崔某”):(“起诉”)，(7,8，“张某”)：(“被起诉”)，(20,21，“刘某”)：(27,32，“崔某存在过错”)]Result: [(2,3,"Cui"):("Sued"), (7,8,"Zhang"):("Sued"), (20,21,"Liu"):( 27,32,"Cui was at fault")]

在本发明中，将prompt训练语料进行少样本负采样扩增。In the present invention, the prompt training corpus is subjected to few-sample negative sampling amplification.

对于prompt构造语料：For prompt construction corpus:

Content:textContent: text

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_2:[info_type_2][prompt:label_2]Result_2:[info_type_2][prompt:label_2]

……

Result_n:[info_type_n][prompt:label_n]Result_n:[info_type_n][prompt:label_n]

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_1:[info_type_1][prompt:label_2]Result_1:[info_type_1][prompt:label_2]

……

以上得到扩增后的prompt训练语料。The amplified prompt training corpus is obtained above.

在本发明中，在开源的Unilm基础预训练模型的基础上，将prompt训练语料作为输入。In the present invention, on the basis of the open-source Unilm basic pre-training model, the prompt training corpus is used as input.

将prompt训练语料处理为Unilm模型所需要的输入形式。Unilm模型所能接受的输入形式为：Process the prompt training corpus into the input form required by the Unilm model. The input forms accepted by the Unilm model are:

由JDISL得到的prompt结构化信息为[[info_name]，[relation_name]，[text]]，其中text是在源文本中info_span跨度之内所指代的信息。The prompt structured information obtained by JDISL is [[info_name], [relation_name], [text]], where text is the information referred to within the info_span span in the source text.

随后将prompt结构化信息作为前缀与片段信息所关联，得到模型输入语料：Then associate the prompt structured information as a prefix with the fragment information to obtain the model input corpus:

{info_name_1,…info_name_n,…,relation_name_1,…,relation_name_n,…,text_1,…,text_n,x₁，x₂，x₃，...，x_m-1，x_m}{info_name_1,...info_name_n,...,relation_name_1,...,relation_name_n,...,text_1,...,text_n,x ₁ ,x ₂ ,x ₃ ,...,x _m-1 ,x _m }

Unilm即可以用于生成任务，也可以用于分类任务，输入之后，首先使用Unilm预训练的tokenizer对输入进行分词，随后进行向量嵌入，该向量嵌入包括token嵌入，段嵌入，和位置嵌入。Unilm can be used for both generation tasks and classification tasks. After the input, the tokenizer pre-trained by Unilm is used to segment the input, and then vector embedding is performed. The vector embedding includes token embedding, segment embedding, and position embedding.

位置嵌入则是根据句子的下标索引相对位置添加位置信息，如[0,1,2,3…]。Position embedding is to add position information according to the relative position of the subscript index of the sentence, such as [0,1,2,3...].

在本发明中，取Unilm模型隐藏层向量，处理为高斯嵌入。In the present invention, the hidden layer vector of the Unilm model is taken and processed as a Gaussian embedding.

使用由全连接层与非线性激活函数组成的投影网络f_u和f_∑来生成高斯分布参数：Use a projection network f _u and f _∑ consisting of fully connected layers and nonlinear activation functions to generate Gaussian distribution parameters:

u_i＝f_u(h_i)；u _i =f _u (h _i );

∑_i＝ELU(f_∑(h_i))+(1+∈)∑ _i ＝ ELU(f _∑ (h _i ))+(1+∈)

其中u_i，∑_i分别代表高斯嵌入的平均协方差和对角协方差(非零元素仅沿矩阵的对角)，f_u和f_∑都由单层线性层组成，以Relu作为激活函数，为了稳定性∈取e^(-14)。where u _i , ∑ _i represent the average covariance and diagonal covariance of the Gaussian embedding respectively (non-zero elements are only along the diagonal of the matrix), both f _u and f _∑ are composed of a single linear layer, with Relu as the activation function, For stability ∈ take e^(-14).

在本发明中，计算高斯嵌入的对比学习损失，迭代更新得到最终模型。In the present invention, the contrastive learning loss of the Gaussian embedding is calculated, and iteratively updated to obtain the final model.

为了计算对比度损失，考虑采样语料中所有有效token对之间KL发散，如果两个标记x_q和x_p具有相同的标签y_q＝y_p，则它们被视为正面示例，对于它们的高斯嵌入N(u_p，∑_p)和N(u_q，Σ_q)，对它的KL散度进行计算：To compute the contrast loss, consider the KL divergence between all valid token pairs in the sampled corpus, two tokens x _q and x _p are considered positive examples if they have the same label y _q = y _p , for their Gaussian embeddings N(u _p , ∑ _p ) and N(u _q , Σ _q ), calculate its KL divergence:

d(p,q)＝1/2(D_KL|N_p||N_q|+D_KL|N_q||N_P|)d(p,q)＝1/2(D _KL |N _p ||N _q |+D _KL |N _q ||N _P |)

χ_p＝{(χ_q，y_g)∈χ[y_p＝y_q，p≠q)χ _p ={(χ _q , y _g )∈χ[y _p =y _q , p≠q)

基于对比学习损失函数进行反向传播，对UniLM模型参数进行迭代更新，得到最终可用于司法文书指标提取的信息抽取模型。Based on the backpropagation of the contrastive learning loss function, the parameters of the UniLM model are iteratively updated to obtain an information extraction model that can finally be used for the extraction of judicial document indicators.

与现有技术相比，本发明的技术方案具有以下有益技术效果：Compared with the prior art, the technical solution of the present invention has the following beneficial technical effects:

1、在本发明中，通过提出一种司法文书指标结构化语言(JDISL)，采用JDISL进行prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料，有效的对不同的信息抽取结构进行了统一的编码，从而能够进行有效的联合抽取，有效解决了司法文书指标提取多任务算法模型多个抽取子任务的统一建模困难，训练成本过大的问题。1, in the present invention, by proposing a kind of judicial document index structured language (JDISL), adopt JDISL to carry out prompt guidance, obtain the index based on the structural form of prompt, further train corpus to be processed as prompt training corpus, effectively to different The information extraction structure is uniformly encoded, so that effective joint extraction can be carried out, which effectively solves the difficulty of unified modeling of multiple extraction subtasks of the judicial document index extraction multi-task algorithm model, and the problem of excessive training costs.

2、在本发明中，将prompt训练语料进行少样本负采样扩增，有效解决了司法文书低资源领域的可用标注数据过少导致的少样本学习困难问题。2. In the present invention, the prompt training corpus is amplified by negative sampling with few samples, which effectively solves the problem of difficult learning with few samples caused by too little available labeling data in the low-resource field of judicial documents.

3、在本发明中，取Unilm模型隐藏层向量，处理为高斯嵌入，通过高斯嵌入对信息类分布进行明确的建模，促进了各类指标的广义特征表示，有助于少样本目标域的自适应，有效解决了少样本学习模型的泛化性能差、抽取精度低的问题。3. In the present invention, the hidden layer vector of the Unilm model is taken and processed as Gaussian embedding, and the information class distribution is clearly modeled by Gaussian embedding, which promotes the generalized feature representation of various indicators and contributes to the development of a few-sample target domain. Adaptive, which effectively solves the problems of poor generalization performance and low extraction accuracy of the few-shot learning model.

4、在本发明中，使用对比学习损失函数计算损失训练模型，迭代更新得到最终模型，有效解决了模型训练收敛困难，抽取精度低的问题。4. In the present invention, the comparative learning loss function is used to calculate the loss training model, and iteratively updated to obtain the final model, which effectively solves the problems of difficult model training convergence and low extraction accuracy.

附图说明Description of drawings

图1本发明基于少样本对比学习的司法文书指标提取方法结构示意图。Fig. 1 is a schematic diagram of the structure of the method for extracting judicial document indicators based on few-sample comparative learning in the present invention.

图2本发明基于少样本对比学习的司法文书指标提取方法的少样本数据处理模块结构图。Fig. 2 is a structure diagram of a few-sample data processing module of the judicial document index extraction method based on few-sample contrastive learning in the present invention.

图3本发明基于少样本对比学习的司法文书指标提取方法的少样本对比学习网络结构图。Fig. 3 is a network structure diagram of a few-sample comparative learning based on a judicial document index extraction method based on a few-sample comparative learning of the present invention.

具体实施方式Detailed ways

下面对本发明的技术方案进行举例说明，本发明请求保护的范围包括但不限于实施例。The technical solutions of the present invention are illustrated below, and the protection scope of the present invention includes but is not limited to the embodiments.

一种基于少样本对比学习的司法文书指标提取方法，该方法包括以下几个步骤：A method for extracting judicial document indicators based on comparative learning with few samples, the method includes the following steps:

1)获取司法文书数据，对司法文书数据进行数据清洗，构建基于裁判文书的训练数据。1) Obtain judicial document data, perform data cleaning on judicial document data, and construct training data based on judgment documents.

2)对司法文书指标提出了一种司法文书指标结构化语言(JDISL)，采用JDISL进行prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料。2) A Judicial Document Indicator Structured Language (JDISL) is proposed for judicial document indicators. JDISL is used for prompt guidance, and the prompt-based construction form of indicators is obtained, and the training corpus is further processed into prompt training corpus.

3)将prompt训练语料进行少样本负采样扩增。3) Amplify the prompt training corpus by negative sampling with few samples.

4)在开源的Unilm基础预训练模型的基础上，将prompt训练语料作为输入。4) Based on the open source Unilm basic pre-training model, the prompt training corpus is used as input.

5)取Unilm模型隐藏层向量，处理为高斯嵌入。5) Take the hidden layer vector of the Unilm model and process it as a Gaussian embedding.

6)计算高斯嵌入的对比学习损失，迭代更新得到最终模型。6) Calculate the contrastive learning loss of the Gaussian embedding, and update iteratively to obtain the final model.

7)将待提取指标的司法文书输入到训练好的模型中，指标抽取模型进行信息抽取，得到各个标签类型的信息抽取结果。7) Input the judicial documents of the indicators to be extracted into the trained model, and the indicator extraction model performs information extraction to obtain the information extraction results of each label type.

作为优选，步骤2)中对司法文书指标采用prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料包括：As preferably, step 2) adopts prompt guidance to judicial document index, obtains index based on the structural form of prompt, and further training corpus is processed as prompt training corpus and includes:

步骤2.1)信息提取受到不同目标、异构结构和特定需求模式的影响，分为实体抽取、关系抽取、事件抽取、响应抽取等。对庭审要素指标进行prompt构造。统一建模为由文本到prompt结构的生成框架，对司法文书中的实体指标、关系指标、事件指标、响应指标实现了统一的建模。Step 2.1) Information extraction is affected by different objectives, heterogeneous structures, and specific demand patterns, and is divided into entity extraction, relation extraction, event extraction, response extraction, etc. Prompt construction is carried out on the index of court trial elements. Unified modeling is a generation framework from text to prompt structure, and realizes unified modeling of entity indicators, relationship indicators, event indicators, and response indicators in judicial documents.

JDISL由三部分组成：JDISL consists of three parts:

(info_name:info_span)(info_name:info_span)

(info_name:info_span(relation_name:info_span),…)(info_name:info_span(relation_name:info_span),...)

[证人，原告，被告，法官][witness, plaintiff, defendant, judge]

步骤2.2)在对庭审要素指标进行prompt构造之后，需要将训练语料基于prompt构造进行处理，从而得到prompt训练语料。Step 2.2) After the prompt construction of the court trial element indicators, the training corpus needs to be processed based on the prompt construction, so as to obtain the prompt training corpus.

如对于原语料：For example, for the original corpus:

作为优选，步骤3)将prompt训练语料进行少样本负采样扩增：As a preference, step 3) carries out the few-sample negative sampling amplification of the prompt training corpus:

步骤3.1)对于prompt构造语料：Step 3.1) For prompt construction corpus:

Content:textContent: text

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_2:[info_type_2][prompt:label_2]Result_2:[info_type_2][prompt:label_2]

……

Result_n:[info_type_n][prompt:label_n]Result_n:[info_type_n][prompt:label_n]

其中label_n表示[info_type_n]集合所对应的标签类型nWhere label_n represents the label type n corresponding to the [info_type_n] collection

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_1:[info_type_1][prompt:label_2]Result_1:[info_type_1][prompt:label_2]

……

Result_1:[info_type_1][prompt:label_n]Result_1:[info_type_1][prompt:label_n]

作为优选，步骤4)在开源的Unilm基础预训练模型的基础上，将prompt训练语料作为输入：As a preference, step 4) uses the prompt training corpus as input on the basis of the open source Unilm basic pre-training model:

Unilm即可以用于生成任务，也可以用于分类任务，输入之后，首先使用Unilm预训练的tokenizer对输入进行分词，随后进行向量嵌入，包括token嵌入，段嵌入，位置嵌入。Unilm can be used for both generation tasks and classification tasks. After the input, first use Unilm pre-trained tokenizer to segment the input, and then perform vector embedding, including token embedding, segment embedding, and position embedding.

作为优选，步骤5)取Unilm模型隐藏层向量，处理为高斯嵌入：Preferably, step 5) takes the hidden layer vector of the Unilm model and processes it as a Gaussian embedding:

u_i＝f_u(h_i)；u _i =f _u (h _i );

∑_i＝ELU(f_∑(h_i))+(1+∈)∑ _i ＝ ELU(f _∑ (h _i ))+(1+∈)

其中u_i，∑_i分别代表高斯嵌入的平均协方差和对角协方差(非零元素仅沿矩阵的对角)，f_u和f_Σ都由单层线性层组成，以Relu作为激活函数，为了稳定性∈取e^(-14)。where u _i , ∑ _i represent the average covariance and diagonal covariance of the Gaussian embedding respectively (non-zero elements are only along the diagonal of the matrix), both f _u and f _Σ are composed of a single-layer linear layer, with Relu as the activation function, For stability ∈ take e^(-14).

便将隐藏层输出向量处理为了高斯嵌入(u_i，Σ_i)。Then the hidden layer output vector is processed into a Gaussian embedding (u _i , Σ _i ).

作为优选，步骤6)计算高斯嵌入的对比学习损失，迭代更新得到最终模型：As a preference, step 6) calculates the contrastive learning loss of Gaussian embedding, and iteratively updates to obtain the final model:

步骤6.1)为了计算对比度损失，考虑采样语料中所有有效t_oken对之间KL发散，如果两个标记x_q和x_p具有相同的标签y_q＝y_p，则它们被视为正面示例，对于它们的高斯嵌入N(u_p，Σ_p)和N(u_q，∑_q)，对它的KL散度进行计算：Step 6.1) To compute the contrast loss, consider the KL divergence between all valid _token pairs in the sampled corpus, two tokens x _q and x _p are considered as positive examples if they have the same label y _q = y _p , For their Gaussian embeddings N(u _p , Σ _p ) and N(u _q , ∑ _q ), calculate its KL divergence:

χ_p＝((χ_q，y_q)∈χ|y_p＝y_q，p≠q}χ _p ＝((χ _q ，y _q )∈χ|y _p ＝y _q ，p≠q}

步骤6.2)基于对比学习损失函数进行反向传播，对UniLM模型参数进行迭代更新，得到最终可用于司法文书指标提取的信息抽取模型。Step 6.2) Backpropagation is performed based on the contrastive learning loss function, and the parameters of the UniLM model are iteratively updated to obtain an information extraction model that can finally be used for the extraction of judicial document indicators.

实施例1Example 1

如图1所示，一种基于少样本对比学习的司法文书指标提取方法，该方法包括以下几个步骤：As shown in Figure 1, a judicial document index extraction method based on few-sample comparative learning, the method includes the following steps:

实施例2Example 2

重复实施例1，只是步骤2)中对司法文书指标采用prompt引导，得到指标基于prompt的构造形式，进一步将训练语料处理为prompt训练语料包括，如图2所示：Repeat embodiment 1, just step 2) adopts prompt guidance to the judicial document index, obtains the index based on the construction form of prompt, further training corpus is processed as prompt training corpus and includes, as shown in Figure 2:

JDISL由三部分组成：JDISL consists of three parts:

(info_name:info_span)(info_name:info_span)

[证人，原告，被告，法官][witness, plaintiff, defendant, judge]

如对于原语料：For example, for the original corpus:

实施例3Example 3

重复实施例2，只是步骤3)将prompt训练语料进行少样本负采样扩增，如图2所示：Repeat embodiment 2, just step 3) the prompt training corpus is carried out few sample negative sampling amplification, as shown in Figure 2:

步骤3.1)对于prompt构造语料：Step 3.1) For prompt construction corpus:

Content:textContent: text

其中[info_type_n]表示所抽取的标签类型为n的结构形式为[[info_name]，[relation_name]，[info_span]]的信息集合。Among them, [info_type_n] indicates that the extracted information set whose label type is n has the structure form [[info_name], [relation_name], [info_span]].

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_2:[info_type_2][prompt:label_2]Result_2:[info_type_2][prompt:label_2]

……

Result_n:[info_type_n][prompt:label_n]Result_n:[info_type_n][prompt:label_n]

随后基于prompt类型进行负采样，将数据量由n扩充为n^2，如将Result_1扩充为n倍：Then perform negative sampling based on the prompt type, and expand the amount of data from n to n^2, such as expanding Result_1 to n times:

Result_1:[info_type_1][prompt:label_1]Result_1:[info_type_1][prompt:label_1]

Result_1:[info_type_1][prompt:label_2]Result_1:[info_type_1][prompt:label_2]

……

Result_1:[info_type_1][prompt:label_n]Result_1:[info_type_1][prompt:label_n]

实施例4Example 4

重复实施例3，如图2所示，只是步骤4)在开源的Unilm基础预训练模型的基础上，将prompt训练语料作为输入，如图3所示：Repeat embodiment 3, as shown in Figure 2, just step 4) on the basis of the Unilm basic pre-training model of open source, prompt training corpus is used as input, as shown in Figure 3:

位置嵌入则是根据句子的下标索引相对位置添加位置信息，即或例如[0,1,2,3…]。Position embedding is to add position information according to the relative position of the subscript index of the sentence, ie or for example [0,1,2,3...].

实施例5Example 5

重复实施例4，如图3所示，只是步骤5)中取Unilm模型隐藏层向量，处理为高斯嵌入：Repeat embodiment 4, as shown in Figure 3, just get Unilm model hidden layer vector in step 5), be processed as Gaussian embedding:

根据步骤5.1)使用由全连接层与非线性激活函数组成的投影网络f_u和f_∑来生成高斯分布参数，如图3所示：According to step 5.1), use the projection network f _u and f _∑ composed of fully connected layers and nonlinear activation functions to generate Gaussian distribution parameters, as shown in Figure 3:

u_i＝f_u(h_i)；u _i =f _u (h _i );

∑_i＝ELU(f_∑(h₁)+(1+∈)∑ _i ＝ ELU(f _∑ (h ₁ )+(1+∈)

便将隐藏层输出向量处理为了高斯嵌入(u_i，∑_i)。Then the hidden layer output vector is processed into a Gaussian embedding (u _i , ∑ _i ).

实施例6Example 6

重复实施例5，如图3所示，只是步骤6)中计算高斯嵌入的对比学习损失，迭代更新得到最终模型：Repeat embodiment 5, as shown in Figure 3, only step 6) calculates the comparative learning loss of Gaussian embedding, and iteratively updates to obtain the final model:

步骤6.1)为了计算对比度损失，考虑采样语料中所有有效token对之间KL发散，如果两个标记x_q和x_p具有相同的标签y_q＝y_q，则它们被视为正面示例，对于它们的高斯嵌入N(u_p，∑_p)和N(u_q，∑_q)，对它的KL散度进行计算：Step 6.1) To calculate the contrast loss, consider the KL divergence between all valid token pairs in the sampled corpus, two tokens x _q and x _p are considered as positive examples if they have the same label y _q = y _q , for which Gaussian embedding N(u _p , ∑ _p ) and N(u _q , ∑ _q ), calculate its KL divergence:

x_p＝{(x_q，y_q)∈χ|yp=y_q，p≠q}x _p ＝{(x _q ，y _q )∈χ|yp=y _q ，p≠q}

步骤6.2)，如图3所示，基于对比学习损失函数进行反向传播，对UniLM模型参数进行迭代更新，得到最终可用于司法文书指标提取的信息抽取模型。Step 6.2), as shown in Figure 3, performs backpropagation based on the comparative learning loss function, and iteratively updates the parameters of the UniLM model to obtain an information extraction model that can finally be used for the extraction of judicial document indicators.

Claims

1. A method for extracting judicial document indicators based on few-sample contrastive learning, the method comprising the following steps:

1) Obtain judicial document data, perform data cleaning on judicial document data, and construct training data containing training corpus based on judicial documents;

2) A Judicial Document Indicator Structured Language (JDISL) is proposed for judicial document indicators. JDISL is used for prompt guidance, and the prompt-based construction form of indicators is obtained, and the training corpus is further processed into prompt training corpus;

3) Amplify the prompt training corpus by negative sampling with few samples to obtain the amplified prompt training corpus;

4) On the basis of the open source Unilm basic pre-training model, the amplified prompt training corpus is used as input;

5) Take the hidden layer vector of the Unilm model and process it as a Gaussian embedding;

6) Calculate the contrastive learning loss of Gaussian embedding, and iteratively update to obtain the final model;

7) Input the judicial documents of the indicators to be extracted into the trained final model, and the indicator extraction model performs information extraction to obtain the information extraction results of each label type.

2. the judicial document index extraction method according to claim 1, it is characterized in that, step 2) adopts prompt guidance to judicial document index, obtains index based on the construction form of prompt, further training corpus is processed as the process of prompt training corpus include:

Step 2.1) Use the information extraction structured generation means or information extraction means including entity extraction, relationship extraction, event extraction and response extraction to carry out prompt construction on the court trial element indicators in judicial documents, so as to uniformly model from text to prompt structure The generation framework realizes unified modeling of entity indicators, relationship indicators, event indicators, and response indicators in judicial documents;

Among them, the structured generation of information extraction is decomposed into two atomic operations:

1) Locate: Locate the target information fragment from the sentence, the information fragment includes entities and events in judicial documents;

2) Association: connect different pieces of information according to the required association, that is, the connection between entities, the connection between entities and events, or the connection between events and events;

According to the specific structure of judicial document indicators, a judicial document indicator structured language (JDISL) is designed. The symbol ":" of JDISL indicates the mapping from information to its location or associated fragments, and the symbols "[], ()" are used for Represents the hierarchical structure between the extracted indicator information;

Among them, JDISL consists of three parts:

1) info_name: represents a specific piece of information that exists in the source text;

2) relation_name: represents a specific information fragment that exists in the source text, and this information fragment is associated with the upper-level info information in the structure;

3) info_span: represents the specific information of the source text or the corresponding text fragment span of the associated fragment;

Through JDISL, different information extraction structures are uniformly encoded, and different judicial document index information extraction tasks are uniformly modeled as the same generation process from text to structure;

For entity recognition and event detection tasks modeled as:

(info_name:info_span)

For relation extraction and event extraction tasks, the unified modeling is:

(info_name:info_span(relation_name:info_span),...);

Step 2.2) After carrying out the prompt construction to the court trial element index in the judicial document, the training data containing the training corpus is processed based on the prompt construction, thereby obtaining the prompt training corpus; preferably, the Prompt construction is used as a predefined pattern, and the training text is processed as prompt Structured corpus form.

3. according to claim 1 or 2 described judicial document index extracting methods, it is characterized in that, step 3) in the process that prompt training corpus is carried out few sample negative sampling amplification comprises:

Step 3.1) For prompt construction corpus:

Content: text

Result: [[info_type_1],[info_type_2],…,[info_type_n]]

Among them, [info_type_n] indicates that the extracted label type is n and the structure form is [[info_name], [relation_name], [info_span]] information set;

Process Result into multiple result subsets based on promt:

Result_1:[info_type_1][prompt:label_1]

Result_2:[info_type_2][prompt:label_2]

…

Result_n:[info_type_n][prompt:label_n]

Where label_n represents the label type n corresponding to the [info_type_n] set;

Then perform negative sampling based on the prompt type, and expand the amount of data from n to n^2, such as expanding Result_1 to n times:

Result_1:[info_type_1][prompt:label_1]

Result_1:[info_type_1][prompt:label_2]

…

Result_1:[info_type_1][prompt:label_n],

Thus, the amplified prompt training corpus is obtained.

4. according to the judicial document index extracting method of any one in claim 1-3, it is characterized in that, step 4) on the basis of the Unilm basic pre-training model of open source, the process with prompt training corpus as input comprises:

Step 4.1) Process the prompt training corpus into the input form required by the Unilm model. The input forms accepted by the Unilm model are:

{x ₁ , x ₂ , x ₃ ,..., x _m-1 , x _m }

The prompt structured information obtained by JDISL is [[info_name], [relation_name], [text]], where text is the information referred to within the info_span span in the source text;

Then associate the prompt structured information as a prefix with the fragment information to obtain the model input corpus:

{info_name_1,...info_name_n,...,relation_name_1,...,relation_name_n,...,text_1,...,text_n, _x1 , _x2 , _x3 ,...,xm _-1 , _xm };

Use Unilm for both generation tasks and classification tasks; after the input, first use the Unilm pre-trained tokenizer to segment the input, and then perform vector embedding, which includes token embedding, segment embedding, and position embedding;

Among them, Token embedding refers to inserting the [CLS] tag at the beginning of each sentence, and inserting the [SEP] tag at the end of each sentence, where the [CLS] tag represents the vector of the current sentence, and the [SEP] tag represents the sentence used to segment the text. sentence;

Among them, the segment embedding is used to distinguish two sentences, and the different sentences are marked by A and B respectively, so the input sentence is expressed as (E _A , E _B , E _A , E _B ,…);

Among them, position embedding is to add position information according to the relative position of the subscript index of the sentence (ie or for example, [0,1,2,3...]);

After embedding, it will go through 24 serial transformer structure layers, and calculate the hidden layer output vector through the multi-head attention mechanism.

h={H; h ₁ , h ₂ , h ₃ , . . . , h _m-1 , h _m }

Among them, h _m represents the hidden layer state of the m-th judicial document sequence, and H represents the state set of prompt structured information.

5. according to the judicial document index extracting method described in any one in claim 1-4, it is characterized in that, in step 5), get Unilm model hidden layer vector, be processed as the process of Gaussian embedding and comprise:

Step 5.1) Use a projection network f _u and f _∑ consisting of fully connected layers and nonlinear activation functions to generate Gaussian distribution parameters:

u _i =f _u (h _i );

∑ _i = ELU(f _∑ (h _i ))+(1+∈);

where u _i , ∑ _i represent the average covariance and diagonal covariance of the Gaussian embedding, respectively, that is, the non-zero elements are only along the diagonal of the matrix, both f _u and f _∑ are composed of a single-layer linear layer, with Relu as the activation function, For stability ∈ take e^(-14); thus the hidden layer output vector is processed into a Gaussian embedding (u _i , ∑ _i ).

6. According to the judicial document index extraction method described in any one of claims 1-5, it is characterized in that, in step 6), the contrastive learning loss of Gaussian embedding is calculated, and the process of iteratively updating to obtain the final model includes the following 6.1) and 6.2 ) two sub-steps:

Step 6.1) To compute the contrast loss, considering the KL divergence between all valid token pairs in the sampled corpus, two tokens x _q and x _p are considered positive examples if they have the same label y _q = _y p , for which Gaussian embedding N(u _p , ∑ _p ) and N(u _q , ∑ _q ), calculate its KL divergence:

Both directions of KL divergence are calculated since it is not symmetric:

d(p,q)＝1/2(D _KL |N _p ||N _q |+D _KL |N _q ||N _p |)

For a positive sample x _p , compute the Gaussian embedding loss of x _p with respect to other valid tokens in the batch:

χ _p ＝{(χ _q ，y _q )∈χ|y _p ＝y _q ，p≠q}

Finally, the contrastive learning loss is calculated using KL divergence and Gaussian embedding loss:

Step 6.2) Backpropagation is performed based on the contrastive learning loss function, and the parameters of the UniLM model are iteratively updated to obtain an information extraction model that can finally be used for the extraction of judicial document indicators.