CN111309900B

CN111309900B - A method for judging and pushing the similarity of similar legal cases

Info

Publication number: CN111309900B
Application number: CN202010055473.7A
Authority: CN
Inventors: 陈欢欢; 何慧敏
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2022-09-06
Anticipated expiration: 2040-01-17
Also published as: CN111309900A

Abstract

The invention discloses a method for judging the similarity of similar legal cases and a method for pushing them. The judging method comprises: classifying a target legal case and, according to the resulting case category, extracting historical cases of the same category from a historical case database to form a candidate set; representing the target legal case and each same-category historical case in the candidate set as an event sequence; calculating, according to an event sequence metric model, the distance between the event sequence of the target legal case and the event sequence of each historical case in the candidate set; and calculating the similarity between the target legal case and the historical cases in the candidate set based on the event sequence distance combined with a scoring function. The method enables more comprehensive and accurate identification of similar cases. In addition, legal documents are represented as time-ordered event sequences, similarity is computed in an unsupervised manner, and the historical cases with the highest scores are selected for pushing, which greatly reduces manual effort and makes the pushing more intelligent.

Description

A method for judging and pushing the similarity of similar legal cases

Technical Field

The invention relates to the field of legal intelligence, and in particular to a method for judging the similarity of similar legal cases and pushing them.

Background

The theory and technology of artificial intelligence are maturing and their range of application keeps expanding. In 2017, the national artificial intelligence strategy, the "New Generation Artificial Intelligence Development Plan", proposed building smart courts, promoting the application of artificial intelligence in evidence collection, case analysis, and the reading and analysis of legal documents, and making the court trial system and trial capability intelligent. Within this effort, using artificial intelligence to achieve similar judgments for similar cases has become an important research topic that is close to judges' needs.

Similar-case judgment is an auxiliary tool whose purpose is to find cases similar or even identical to the one a judge is handling, so as to inspire and broaden the judge's reasoning, help the judge rule correctly, and keep the deviation between rulings in identical or similar cases small. However, existing similar-case retrieval systems push cases imprecisely and do not actually meet judges' needs. For example, the pushed cases are not the "same case", and sometimes not even of the "same category"; and too many cases are pushed, so the judge's time is not really saved and extensive manual screening is still required.

Most legal case records are electronic documents in the form of natural-language text, so similar-case identification can be regarded as an application of text similarity measurement. Existing natural language processing methods can identify similar cases to a certain extent, but they still struggle to pinpoint the core distinguishing points among case elements. The main problems are as follows:

1) Keyword matching is not accurate enough. Keyword retrieval is in effect "sampling verification", and conclusions drawn from a small number of samples are incomplete. The method also returns too many similar cases, making it difficult for judges to pick out the cases with real reference value.

2) Methods that represent words as vectors with word2vec and build a neural network on top require a large labeled, structured training corpus, whereas the legal field currently lacks large volumes of detailed labeled legal data and also lacks people who understand both law and technology.

3) The main reference value of similar cases lies in pushing, for specific legal details or difficulties in a case, the reasoning and practice of judges in similar historical cases. However, there is currently no similarity measurement model for legal documents designed around the characteristics of the legal profession.

Summary of the Invention

The purpose of the invention is to provide a method for judging the similarity of similar legal cases and pushing them, which overcomes the shortcomings of existing methods: the need for extensive manual annotation, inaccurate pushing of similar cases, redundant information, and a lack of focus on the legal issues at hand.

This purpose is achieved through the following technical solution:

A method for judging the similarity of similar legal cases, comprising:

classifying a target legal case and, according to the resulting case category, extracting historical cases of the same category from a historical case database to form a candidate set;

representing the target legal case and each same-category historical case in the candidate set as an event sequence;

calculating, according to an event sequence metric model, the distance between the event sequence of the target legal case and the event sequence of each historical case in the candidate set;

calculating the similarity between the target legal case and the historical cases in the candidate set based on the event sequence distance combined with a scoring function.

A method for pushing similar legal cases, comprising: calculating the similarity between the target legal case and the historical cases in the candidate set using the foregoing method, sorting the historical cases in the candidate set in descending order of similarity score, and extracting the top M historical cases for pushing.

As the above technical solution shows, by classifying legal documents by case type and analyzing the topic distribution of cases, the historical cases whose semantic content is most similar to that of the target case can be selected while ensuring that the case category is the same, achieving more comprehensive and accurate identification of similar cases. In addition, by representing legal documents as time-ordered event sequences, computing similarity in an unsupervised manner, and selecting the highest-scoring historical cases for pushing, manual effort is greatly reduced and the pushing becomes more intelligent.

Description of Drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for judging the similarity of similar legal cases provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of an event time chain formed by extracting the event corresponding to each time period, provided by an embodiment of the invention;

FIG. 3 is a schematic diagram of the event time chain extracted from a specific case, provided by an embodiment of the invention;

FIG. 4 is a schematic diagram of aligning the time points of event sequences, provided by an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.

An embodiment of the invention provides a method for judging the similarity of similar legal cases which, as shown in FIG. 1, comprises:

Step 1. Classify the target legal case and, according to the resulting case category, extract historical cases of the same category from the historical case database to form a candidate set.

Similar-case identification aims to find, among historical cases of the same category, cases that are similar or even identical to the target case, so the case must first be classified.

Each case in the existing historical case database carries a category label. According to the distribution of case categories, a certain number of historical cases can be selected for each category to build a data set. A mature deep learning text classification algorithm from natural language processing is used as the classifier to obtain the category of the target case. According to that category, the historical cases of the same category are extracted from the historical case database to form the candidate set.
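The candidate-set construction in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `build_candidate_set`, the `category` field, and the toy `keyword_classifier` are hypothetical names introduced here, and the keyword rule merely stands in for the trained deep learning text classifier the text describes.

```python
from typing import Callable, Dict, List


def build_candidate_set(
    target_text: str,
    history: List[Dict[str, str]],
    classify: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep the historical cases whose stored label matches the predicted category."""
    category = classify(target_text)
    return [case for case in history if case["category"] == category]


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for a trained text classifier."""
    return "misappropriation of funds" if "misappropriated" in text else "other"


history = [
    {"id": "h1", "category": "misappropriation of funds"},
    {"id": "h2", "category": "theft"},
    {"id": "h3", "category": "misappropriation of funds"},
]
candidates = build_candidate_set(
    "The defendant misappropriated project receivables.", history, keyword_classifier
)
candidate_ids = [case["id"] for case in candidates]
```

Passing the classifier in as a function keeps the filtering logic independent of whichever model is eventually used.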

Step 2. Represent the target legal case and each same-category historical case in the candidate set as an event sequence.

This step designs a text representation model around the characteristics of legal cases. The model can effectively express the relationships between events and show how the case evolves over time, which gives it greater practical value. A preferred implementation of this step is as follows:

1) For any legal case, use information extraction technology to extract the case elements from the legal case document. The case elements include at least: the defendant's position, the defendant's attitude after the offense, time, core words, and the amount involved.

The core words are extracted by dependency syntactic analysis. Put simply, the text of the legal case document is split into sentences at periods, exclamation marks and question marks; if this yields n sentences, dependency analysis is performed on the n sentences to obtain n core words.

Dependency syntactic analysis is a relatively mature technique. It expresses the structure of a whole sentence through the dependency relations between words; these relations capture the semantic dependencies among the components of the sentence, and together they form a syntax tree. The root node of the tree is the core predicate of the sentence, which expresses the core content of the whole sentence. This core predicate is the core word of the sentence.
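The sentence splitting and root selection described above can be sketched as follows. This is an illustrative outline only: the dependency parse is assumed to be supplied by some external parser as `(token, head_index)` pairs with `-1` marking the root, which is a hypothetical format chosen here and not the output format of any particular library.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Split on Chinese periods, exclamation marks and question marks."""
    parts = re.split(r"[。！？]", text)
    return [part.strip() for part in parts if part.strip()]


def core_word(parsed: List[Tuple[str, int]]) -> str:
    """Return the token at the root of the dependency tree (head index -1)."""
    for token, head in parsed:
        if head == -1:
            return token
    raise ValueError("parse has no root token")


sentences = split_sentences("被告人收取工程款。款项归个人使用！是否退还？")

# Hypothetical parse of "defendant received payment": both other tokens depend on "received".
root = core_word([("defendant", 1), ("received", -1), ("payment", 1)])
```

One core word per sentence falls out directly: n sentences from `split_sentences` give n calls to `core_word`.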

2) Retain separately the two case elements of the defendant's position and the defendant's attitude after the offense.

3) Express the temporal relationships between events for the remaining case elements, qualitatively, according to their different time nodes. Events whose occurrence times neither cross nor overlap are treated as independent events and represented separately; in all other situations the events are merged. Together these form the event chain of the case facts.

Those skilled in the art will understand that legal case documents usually describe events in chronological order. As in the two cases given later, non-overlapping time periods correspond to different events. An event is first located by its time, and then the sentences describing the event within that time are found. Event extraction (extracting the individual case elements) then turns the unstructured descriptive sentences into a structured combination of case elements.
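The rule in item 3), separate events when their time spans do not cross or overlap and merge them otherwise, can be sketched as a standard interval merge. This is one possible reading, under assumptions: the `(start, end, elements)` tuple layout is hypothetical, the day-level boundaries assigned to each month are illustrative, and spans that merely touch on the same day are treated as overlapping.

```python
from datetime import date
from typing import List, Tuple

Event = Tuple[date, date, List[str]]  # (start, end, case elements)


def merge_events(events: List[Event]) -> List[Event]:
    """Merge events whose time spans cross or overlap; keep the rest separate."""
    chain: List[Event] = []
    for start, end, elements in sorted(events, key=lambda e: (e[0], e[1])):
        if chain and start <= chain[-1][1]:
            prev_start, prev_end, prev_elements = chain[-1]
            chain[-1] = (prev_start, max(prev_end, end), prev_elements + list(elements))
        else:
            chain.append((start, end, list(elements)))
    return chain


# Three non-overlapping periods, given out of order: they stay independent.
case_one = [
    (date(2017, 10, 1), date(2017, 10, 31), ["collect", "project payment", "58000"]),
    (date(2017, 4, 1), date(2017, 7, 31), ["collect", "project payment", "31000"]),
    (date(2017, 12, 1), date(2017, 12, 31), ["collect", "project payment", "113000"]),
]
chain = merge_events(case_one)

# Two hypothetical overlapping periods: they collapse into one merged event.
overlapping = [
    (date(2018, 1, 1), date(2018, 1, 20), ["a"]),
    (date(2018, 1, 15), date(2018, 2, 1), ["b"]),
]
merged = merge_events(overlapping)
```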

4) Represent numerically the case elements of the event occurring at time i, obtaining $\mathrm{event}_i = (e_{i1}, e_{i2}, \ldots, e_{in})$, where n is the number of case elements of the event occurring at time i.

5) Initialize the weight vector $\mathrm{weight} = (w_1, w_2, \ldots, w_n)$, where [formula not reproduced in the source text]. Multiply the weight vector element-wise with $\mathrm{event}_i$ to obtain the final representation of the event at time i, $\mathrm{vector}_i = (w_1 e_{i1}, w_2 e_{i2}, \ldots, w_n e_{in})$.

In the embodiments of the invention, the weight vector is prior knowledge and is initialized in advance from experience. In the subsequent learning process it can be updated in light of the results; the specific update method can be chosen by the user according to the situation or experience.

6) Connect the event representations of all times to obtain the vectorized time-ordered event chain $\mathrm{EventSequence} = (\mathrm{vector}_1, \mathrm{vector}_2, \ldots, \mathrm{vector}_m)$, where $\mathrm{vector}_i$ is the vector representation of the event occurring at time i, and m is the total number of independent events and merged events extracted from the legal case document.
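Items 4) to 6) can be sketched as below. This is a sketch under stated assumptions rather than the patent's implementation: each case element is assumed to be a small numeric vector (for example a word embedding), the weighted element vectors are concatenated into one event vector, every event is assumed to have the same number of elements, and the toy weights are chosen only so the arithmetic is exact. The function names are hypothetical.

```python
from typing import List

Vector = List[float]


def weight_event(element_vectors: List[Vector], weights: List[float]) -> Vector:
    """Scale each case element's vector by its weight and concatenate the results."""
    if len(element_vectors) != len(weights):
        raise ValueError("one weight is required per case element")
    weighted: Vector = []
    for w, vec in zip(weights, element_vectors):
        weighted.extend(w * x for x in vec)
    return weighted


def build_event_sequence(events: List[List[Vector]], weights: List[float]) -> List[Vector]:
    """Apply the same weight vector to every event, preserving time order."""
    return [weight_event(event, weights) for event in events]


# Two events, each with two case elements embedded in two dimensions (toy values).
events = [
    [[1.0, 2.0], [4.0, 0.0]],
    [[2.0, 2.0], [0.0, 8.0]],
]
weights = [0.5, 0.25]
event_sequence = build_event_sequence(events, weights)
```

If the case elements were scalars instead of vectors, the same code applies with one-element lists.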

Step 3. Calculate, according to the event sequence metric model, the distance between the event sequence of the target legal case and the event sequence of each historical case in the candidate set.

Sequence distance is measured on the event sequences obtained in step 2. When measuring the similarity of event sequences, the sequences being compared may differ in length, so they must be aligned. Using dynamic time warping (DTW), the event sequence of the target legal case is matched against the event sequence of each historical case, starting from the first point; at each point reached, the distance between the two corresponding points is computed and added to the accumulated distance of all points passed so far; finally, the smallest accumulated distance is taken as the event sequence distance $\mathrm{Distance}_{EventSequence}$. By searching for the correspondence between points, this method finds the point-to-point matching that minimizes the distance between the two sequences.
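The accumulation described above is the classic DTW dynamic program, sketched below. The text does not say how the distance between two matched points is measured, so Euclidean distance is an assumption here, as are the function names.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-to-point distance (assumed; the source does not specify one)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: List[Sequence[float]], seq_b: List[Sequence[float]]) -> float:
    """Smallest accumulated point distance over all monotonic alignments."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Sequences of different lengths can still be compared.
distance = dtw_distance([[0.0], [1.0], [2.0]], [[0.0], [2.0]])
```

The table has (n + 1) × (m + 1) entries, so the cost grows with the product of the two sequence lengths, which is small for the handful of events in a typical case document.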

Step 4. Calculate the similarity between the target legal case and the historical cases in the candidate set based on the event sequence distance combined with a scoring function.

The embodiments of the invention take into account the topic similarity, the event sequence distance obtained in step 3, and the similarity of the defendant's position and of the defendant's attitude after the offense obtained in step 2, as follows:

1) Use a topic analysis model to perform topic analysis on the target legal case and the historical cases in the candidate set, obtaining the corresponding topic probability distributions. From these distributions, calculate the semantic distance between the target legal case and each historical case in the candidate set, which serves as the topic similarity $\mathrm{Distance}_{topic}$.

2) During event sequence representation, the defendant's position and the defendant's attitude after the offense were extracted from both the target legal case and the historical cases in the candidate set. Use cosine distance to calculate the similarity of the defendant's position and of the defendant's attitude after the offense between the target legal case and each historical case, denoted $\mathrm{Distance}_{position}$ and $\mathrm{Distance}_{attitude}$ respectively.

3) The event sequence distance between the target legal case and each historical case in the candidate set is denoted $\mathrm{Distance}_{EventSequence}$.

4) Calculate the similarity between the target legal case and each historical case in the candidate set using the following formula:

$$\mathrm{score} = \alpha_1 \mathrm{Distance}_{topic} + \alpha_2 \mathrm{Distance}_{EventSequence} + \alpha_3 \mathrm{Distance}_{position} + \alpha_4 \mathrm{Distance}_{attitude}$$

where $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are weights and $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
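The scoring function can be sketched as below. This is a hedged illustration: the KL divergence used for the topic term follows the worked example later in the description, the small smoothing constant `eps` is an addition made here to avoid division by zero, and all function names are hypothetical. The text does not specify how the four terms are scaled relative to one another, so no normalization is attempted.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two topic distributions; note that it is not symmetric."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def score(d_topic: float, d_sequence: float, d_position: float,
          d_attitude: float, alphas: Sequence[float]) -> float:
    """Weighted sum of the four terms; the weights must sum to 1."""
    if len(alphas) != 4 or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("exactly four weights summing to 1 are required")
    a1, a2, a3, a4 = alphas
    return a1 * d_topic + a2 * d_sequence + a3 * d_position + a4 * d_attitude


total = score(0.2, 1.0, 0.0, 0.4, (0.25, 0.25, 0.25, 0.25))
```

Because DTW distances are unbounded while cosine distances lie in [0, 2], the choice of the weights strongly affects which term dominates in practice.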

For ease of understanding, a concrete example is described below. The case types and case information in the example are illustrative only.

Case 1: While working as a salesperson for Company A, the defendant Sun repeatedly misappropriated project payments receivable by Company A for personal use. The specific facts are as follows: 1. Between April and July 2017, the defendant Sun collected 96,000 yuan paid by Wu for the B1 residential community project, handed 65,000 yuan to Company A, and kept the remaining 31,000 yuan for personal use. 2. In October 2017, the defendant Sun collected 58,000 yuan paid by Wang for the B2 residential community project, did not hand it to Company A, and kept it for personal use. 3. In December 2017, the defendant Sun collected 113,000 yuan paid by Zhou for the B3 residential community project, did not hand it to Company A, and kept it for personal use. After the offense, the defendant Sun truthfully confessed to the crime, returned RMB 100,000 to the victim entity (Company A), and obtained its forgiveness.

Case 2: While working as a sales consultant for an automobile sales and service company (hereinafter Company C), the defendant Ye took advantage of his position to sell company vehicles to customers privately on multiple occasions, keeping part of the payments for his own use instead of handing them to Company C. The specific criminal facts are as follows: 1. On November 15, 2017, the defendant Ye collected a car payment of RMB 130,500 from the customer Li and kept it for his own use. 2. On July 31, 2018, the defendant Ye collected a car payment of RMB 52,678 from the customer Zhou and kept it for his own use. 3. On August 4, 2018, the defendant Ye collected a car purchase deposit of RMB 20,000 from the customer Dai and kept it for his own use. After the offense, the defendant Ye compensated Company C for all of its economic losses and obtained the company's written forgiveness.

Following the approach described above:

Step 1. First classify the case. A deep learning text classification algorithm such as BERT, FastText or DPCNN is used as the classifier to classify Case 1, giving the category "crime of misappropriation of funds". All "crime of misappropriation of funds" cases are then selected from the historical case database to form the candidate set, and the cases in the candidate set are compared with Case 1 for similarity one by one. Case 2 in the candidate set is used as the example here.

Step 2. Represent the event sequences as follows:

1) Use information extraction to extract the case elements in structured form. For Case 1, the defendant's position is "salesperson" and the attitude after the offense is "returned"; for Case 2, the defendant's position is "sales consultant" and the attitude after the offense is "compensated". The time, core word, amount involved and purpose for each case are as follows. Case 1: April to July 2017: collect, project payment, 31,000 yuan, personal use; October 2017: collect, project payment, 58,000 yuan, personal use; December 2017: collect, project payment, 113,000 yuan, personal use. Case 2: November 15, 2017: collect, car payment, 130,500 yuan, own use; July 31, 2018: collect, car payment, 52,678 yuan, own use; August 4, 2018: collect, car purchase deposit, 20,000 yuan, own use.

2) Based on the extraction results, retain the two elements separately: for Case 1, the defendant's position "salesperson" and the attitude after the offense "returned"; for Case 2, the defendant's position "sales consultant" and the attitude after the offense "compensated".

3) As shown in FIG. 2, organize the time-ordered case elements according to their qualitative temporal relationships. Events whose occurrence times neither cross nor overlap are treated as independent events and represented separately; in all other situations the events are merged. This yields the event chain shown in FIG. 3, which contains case elements such as time, trigger word, motive, and amount.

4) Use the word2vec word vector training method in gensim to represent the case elements in the event chain numerically.

5) Compute the vectors according to the weights. For example, for the October 2017 event in Case 1, $\mathrm{event}_{\text{Oct. 2017}} = (\text{collect}, \text{project payment}, 58000, \text{own use})$; assuming the weights are initialized as $\mathrm{weight} = (0.3, 0.2, 0.2, 0.3)$, the vectorized representation of the event is $\mathrm{vector}_{\text{Oct. 2017}} = (0.3\,e_{\text{collect}}, 0.2\,e_{\text{project payment}}, 0.2\,e_{58000}, 0.3\,e_{\text{own use}})$, where $e_{\text{element}}$ denotes the vector value of the given case element.

6) Connect the event representations of all times to obtain the vectorized time-ordered event chain.

Step 3. Measure the sequence distance on the event sequences obtained in step 2. Using the dynamic time warping method shown in FIG. 4, time-point alignment and distance calculation are performed on the event sequence of Case 1, $\mathrm{EventSequence}_{\text{Case 1}} = (\mathrm{vector}_{\text{Apr. to Jul. 2017}}, \mathrm{vector}_{\text{Oct. 2017}}, \mathrm{vector}_{\text{Dec. 2017}})$, and the event sequence of Case 2, $\mathrm{EventSequence}_{\text{Case 2}} = (\mathrm{vector}_{\text{Nov. 15, 2017}}, \mathrm{vector}_{\text{Jul. 31, 2018}}, \mathrm{vector}_{\text{Aug. 4, 2018}})$. Here $\mathrm{vector}_{\text{Apr. to Jul. 2017}}$ in Case 1 corresponds to $\mathrm{vector}_{\text{Nov. 15, 2017}}$ in Case 2, and so on. This yields the event sequence distance $\mathrm{Distance}_{EventSequence}$.

Step 4. Score the overall similarity of the sequences as follows:

1) Calculate the topic similarity $\mathrm{Distance}_{topic}$ of the two cases. Using the LDA topic analysis model in gensim, input the legal documents of Case 1 and Case 2 to obtain their topic distributions. From the topic probability distributions, use the KL distance to calculate the semantic distance between the two legal documents, which is the topic similarity $\mathrm{Distance}_{topic}$;

2) Use cosine distance to calculate the similarity $\mathrm{Distance}_{position}$ between the positions "salesperson" and "sales consultant" in the two cases, and the similarity $\mathrm{Distance}_{attitude}$ between the attitudes after the offense, "returned" and "compensated";

3) Combine the topic similarity, the event sequence distance, and the similarity of position and of attitude after the offense into a score:

$$\mathrm{score} = \alpha_1 \mathrm{Distance}_{topic} + \alpha_2 \mathrm{Distance}_{EventSequence} + \alpha_3 \mathrm{Distance}_{position} + \alpha_4 \mathrm{Distance}_{attitude}$$

Another embodiment of the invention provides a method for pushing similar legal cases. The method uses the similarity judging method of the foregoing embodiment to calculate the similarity between the target legal case and the historical cases in the candidate set, sorts the historical cases in the candidate set in descending order of similarity score, and extracts the top M historical cases for pushing. The value of M can be set as needed, for example M = 10.
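The pushing step amounts to ranking the candidate set by score and keeping the first M entries, as sketched below. The descending order follows the text as written; `top_m` and the sample identifiers and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def top_m(scores: Dict[str, float], m: int = 10) -> List[Tuple[str, float]]:
    """Return the M historical cases with the highest similarity scores."""
    if m < 0:
        raise ValueError("m must be non-negative")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]


# Illustrative scores for three historical cases in the candidate set.
scores = {"h1": 0.4, "h2": 0.9, "h3": 0.7}
pushed = top_m(scores, m=2)
```

When the candidate set holds fewer than M cases, the slice simply returns all of them.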

From the description of the above embodiments, those skilled in the art will clearly understand that the embodiments can be implemented in software, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solutions of the above embodiments can be embodied as a software product. The software product can be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive or portable hard disk) and includes instructions that cause a computer device (such as a personal computer, server or network device) to execute the methods described in the embodiments of the invention.

The above are only preferred specific embodiments of the invention, and the protection scope of the invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the invention. The protection scope of the invention shall therefore be determined by the protection scope of the claims.

Claims

1. A method for judging the similarity of similar legal cases, characterized by comprising:

classifying a target legal case and, according to the resulting case category, extracting historical cases of the same category from a historical case database to form a candidate set;

representing the target legal case and each same-category historical case in the candidate set as an event sequence;

calculating, according to an event sequence metric model, the distance between the event sequence of the target legal case and the event sequence of each historical case in the candidate set;

calculating the similarity between the target legal case and the historical cases in the candidate set based on the event sequence distance combined with a scoring function;

wherein representing the target legal case and each same-category historical case in the candidate set as an event sequence comprises:

for any legal case, extracting case elements from the legal case document using information extraction technology, the case elements comprising at least: the defendant's position, the defendant's attitude after the offense, time, core words, and the amount involved; wherein the core words are extracted by dependency syntactic analysis, namely, the text of the legal case document is split into sentences at periods, exclamation marks and question marks, and dependency analysis is performed on the resulting sentences to obtain the corresponding core words;

retaining separately the two case elements of the defendant's position and the defendant's attitude after the offense;

expressing the temporal relationships between events for the remaining case elements, qualitatively, according to their different time nodes, wherein events whose occurrence times neither cross nor overlap are treated as independent events and represented separately, and events in all other situations are merged, thereby forming the event chain of the case facts;

numerically expressing the case elements of the event occurring at time $i$ to obtain $event_i = (e_{i1}, e_{i2}, \ldots, e_{in})$, wherein $n$ is the number of case elements of the event occurring at time $i$;

initializing a weight vector $weight = (w_1, w_2, \ldots, w_n)$, wherein [condition on the weights missing from the source text];

multiplying the weight vector element by element with $event_i$ to obtain the final representation vector of the event at time $i$: $vector_i = (w_1 e_{i1}, w_2 e_{i2}, \ldots, w_n e_{in})$;

concatenating the event representations of all times to obtain a vectorized sequential event chain $(vector_1, vector_2, \ldots, vector_m)$, wherein $vector_i$ is the representation vector of the $i$-th event and $m$ is the total number of independent events and merged events extracted from the legal case document.
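The vectorization step of claim 1 can be illustrated with a minimal Python sketch. The function names, the numeric encodings of the case elements, and the weight values are all hypothetical, and the sketch assumes that every event has the same number n of case elements so that a single weight vector applies to all events; the claim does not state how elements are encoded or how the weights are initialized.

```python
from typing import List, Sequence


def weight_event(event: Sequence[float], weight: Sequence[float]) -> List[float]:
    """Multiply one numerically encoded event element-wise by the weight vector."""
    if len(event) != len(weight):
        raise ValueError("event and weight must have the same length n")
    return [w * e for w, e in zip(weight, event)]


def build_event_chain(events: Sequence[Sequence[float]],
                      weight: Sequence[float]) -> List[List[float]]:
    """Return (vector_1, ..., vector_m) for events already ordered by time."""
    return [weight_event(event, weight) for event in events]


# Hypothetical encodings: (time index, core-word id, amount involved).
events = [
    [1.0, 3.0, 5000.0],
    [2.0, 7.0, 12000.0],
]
weight = [0.5, 0.25, 0.25]  # assumed values, for illustration only

chain = build_event_chain(events, weight)
print(chain)
```

Here `chain` holds one weighted vector per event, in time order, which is the form consumed by the distance computation.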

2. The method for judging the similarity of similar legal cases according to claim 1, wherein classifying the target legal case comprises: classifying the target legal case using a deep learning text classification algorithm from natural language processing as the classifier, to obtain the case category corresponding to the target legal case.

3. The method for judging the similarity of similar legal cases according to claim 1, wherein calculating the distance between the event sequence corresponding to the target legal case and the event sequence corresponding to each historical case in the candidate set according to the event sequence measurement model comprises:

using dynamic time warping (DTW), matching the event sequence corresponding to the target legal case against the event sequence corresponding to each historical case, starting from the starting point; at each point reached, calculating the distance between the two corresponding points and adding it to the accumulated distance of all previously traversed points; and finally selecting the minimum accumulated distance as the event sequence distance $Distance_{EventSequence}$.
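The accumulation described in claim 3 corresponds to the standard dynamic-programming form of DTW, sketched below. This is a sketch under stated assumptions, not the patented implementation: the claim does not name the point-to-point distance, so Euclidean distance is assumed here, all event vectors are assumed to share one dimension, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Point-to-point distance between two event vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a: Sequence[Sequence[float]],
                 seq_b: Sequence[Sequence[float]]) -> float:
    """Minimum accumulated distance over all monotone alignments of two sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both event sequences must be non-empty")
    # acc[i][j]: minimum accumulated distance aligning seq_a[:i] with seq_b[:j].
    acc: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


target = [[0.0], [1.0], [2.0]]
history = [[0.0], [1.0], [1.0], [2.0]]  # same events, one of them repeated
print(dtw_distance(target, history))
```

Because DTW allows one point to align with several points of the other sequence, two event chains of different lengths can still be compared, which is why the sequences above differ in length.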

4. The method for judging the similarity of similar legal cases according to claim 1, wherein calculating the similarity between the target legal case and each historical case in the candidate set based on the event sequence distance and the scoring function comprises:

performing topic analysis on the target legal case and on each historical case in the candidate set using a topic analysis model to obtain the corresponding topic probability distributions, and, from these distributions, calculating the semantic distance between the target legal case and the historical case as the topic similarity, denoted $Distance_{topic}$;

during event sequence representation, extracting the two case elements of the defendant's position and the defendant's post-case attitude from the target legal case and from each historical case in the candidate set, and calculating, using the cosine distance, the similarity of the defendant's position and of the defendant's post-case attitude between the target legal case and the historical case, denoted $Distance_{position}$ and $Distance_{attitude}$ respectively;

denoting the event sequence distance between the target legal case and the historical case in the candidate set as $Distance_{EventSequence}$;

calculating the similarity between the target legal case and the historical case in the candidate set using the following formula:

$score = \alpha_1\, Distance_{topic} + \alpha_2\, Distance_{EventSequence} + \alpha_3\, Distance_{position} + \alpha_4\, Distance_{attitude}$

wherein $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are all weights, and $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$.
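The scoring function of claim 4 can be sketched as follows. The function names and the equal default weights are assumptions made for illustration; the claim only requires that the four weights sum to 1. The topic distance and the event sequence distance are taken as precomputed inputs, since the claim does not specify how the distance between topic distributions is measured.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def combined_score(d_topic: float, d_event_sequence: float,
                   d_position: float, d_attitude: float,
                   alphas: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four distances; the weights must sum to 1."""
    if len(alphas) != 4 or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must be four weights summing to 1")
    a1, a2, a3, a4 = alphas
    return (a1 * d_topic + a2 * d_event_sequence
            + a3 * d_position + a4 * d_attitude)


# Hypothetical encodings of the defendant's position and post-case attitude.
d_position = cosine_distance([1.0, 0.0], [1.0, 0.0])
d_attitude = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(combined_score(0.2, 0.4, d_position, d_attitude))
```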

5. A method for pushing similar legal cases, characterized by comprising: calculating the similarity between the target legal case and the historical cases in the candidate set using the method of any one of claims 1 to 4, then ranking the historical cases in the candidate set by similarity score from high to low, and extracting the M top-ranked historical cases for pushing.
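The ranking step of claim 5 amounts to a top-M selection, sketched below. The function name, the case identifiers and the score values are hypothetical; the sketch simply follows the claim's wording and orders by score from high to low.

```python
from typing import Dict, List, Tuple


def top_m_cases(scores: Dict[str, float], m: int) -> List[Tuple[str, float]]:
    """Rank historical cases by score, highest first, and keep the first m."""
    if m < 0:
        raise ValueError("m must be non-negative")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]


# Hypothetical scores for three historical cases in the candidate set.
scores = {"case_A": 0.31, "case_B": 0.72, "case_C": 0.55}
pushed = top_m_cases(scores, 2)
print(pushed)
```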