CN114881028A - Case similarity matching method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114881028A
CN114881028A (application CN202210646944.0A)
Authority
CN
China
Prior art keywords
text
case
case text
processing
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210646944.0A
Other languages
Chinese (zh)
Other versions
CN114881028B (en)
Inventor
胡懋成
王秋阳
郑博超
凤阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sunwin Intelligent Co Ltd
Original Assignee
Shenzhen Sunwin Intelligent Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sunwin Intelligent Co Ltd filed Critical Shenzhen Sunwin Intelligent Co Ltd
Priority to CN202210646944.0A priority Critical patent/CN114881028B/en
Publication of CN114881028A publication Critical patent/CN114881028A/en
Application granted granted Critical
Publication of CN114881028B publication Critical patent/CN114881028B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract



An embodiment of the present invention discloses a case similarity matching method, device, computer equipment and storage medium. The method includes: obtaining case judgment texts from a case database; collecting stop words and proper-noun vocabulary from the case judgment texts, and generating a stop-word lexicon and a proper-word lexicon; selecting, from the case judgment texts, a first case text and a second case text that require similarity matching; inputting the first case text and the second case text into a Siamese network for processing to obtain a similarity probability value of the two texts; and, if the similarity probability value meets a set similarity threshold, determining that the first case text and the second case text are similar cases. The present invention improves the effectiveness and accuracy of case similarity matching.


Description

Case similarity matching method, device, computer equipment and storage medium

Technical Field

The present invention relates to the technical field of data retrieval, and more particularly to a case similarity matching method, device, computer equipment and storage medium.

Background

With the development of the times, the number of court cases has surged, and the number of case judgments in case databases keeps growing. When analyzing a case, it is often necessary to find two or more similar cases in the database for comparison. The current approaches to case similarity search are as follows.

The first approach retrieves the cases from the case database, extracts the person, vehicle and object attribute elements from each case object model and adds them to the corresponding comparison containers, computes the similarity of the person, vehicle and object attributes across the arrays to be compared, records the most similar attribute element objects (at least two) together with their similarity values as key-value pairs in a similarity map, and finally ranks and displays the person, vehicle and object attribute elements of each case by similarity according to that map. Because this method matches documents only on attribute key-value pairs, it ignores a large amount of information unrelated to persons, vehicles and objects; for such cases, or cases involving other domains, the information loss is severe and seriously degrades the matching result.

The second approach uses the layout and key points of the document as constraints and applies an automatic extraction algorithm to extract three segments of the document: the case facts, the focus of the dispute, and the judgment result. Based on a domain vocabulary, a topic model extracts the subject words of each document segment, yielding subject-word and non-subject-word chunks for each segment, and an inverted feature index is built from the subject words and the feature words among the non-subject words of each segment. The inverted feature index is mapped to a feature vector, a topic similarity model computes the similarity between the query sentence and each document in the document set, and the documents are ranked by similarity and the ranking output to complete the retrieval. Because this method judges similarity only by subject words and ignores the text content itself, it can hardly make fine-grained distinctions between cases with similar topics.

The third approach pre-trains word2vec word vectors on the legal texts, represents keywords as word vectors, and computes the similarity between cases with cosine similarity. After obtaining several cases related to the case at hand, it finds their judgment results with keyword extraction, intelligently derives a reasonable range for the judgment of the present case, and raises an intelligent alert when the actual judgment deviates too far from the recommended range. Because word2vec assigns each word a single static representation, this method cannot represent the same word differently in different contexts, and domain-specific vocabulary in particular is poorly expressed, so the similarity matching is inaccurate and of limited practical use.

SUMMARY OF THE INVENTION

The purpose of the present invention is to overcome the deficiencies of the prior art and to provide a case similarity matching method, device, computer equipment and storage medium that can improve the effectiveness and accuracy of case similarity matching.

To achieve the above object, the present invention adopts the following technical solutions:

In a first aspect, a case similarity matching method includes:

obtaining case judgment texts from a case database;

collecting stop words and proper-noun vocabulary from the case judgment texts, and generating a stop-word lexicon and a proper-word lexicon;

selecting, from the case judgment texts, a first case text and a second case text that require similarity matching;

inputting the first case text and the second case text into a Siamese network for processing to obtain a similarity probability value of the first case text and the second case text;

if the similarity probability value of the first case text and the second case text meets a set similarity threshold, determining that the first case text and the second case text are similar cases.
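The method steps above can be sketched as a minimal decision pipeline. `score_fn` is a hypothetical stand-in for the trained Siamese network (the patent describes the network but gives no code), and the toy Jaccard score below is only there to make the sketch runnable:

```python
from typing import Callable

def match_cases(text_a: str, text_b: str,
                score_fn: Callable[[str, str], float],
                threshold: float = 0.5) -> bool:
    """Score a pair of case texts with the Siamese network (score_fn) and
    judge them similar when the probability meets the set threshold."""
    probability = score_fn(text_a, text_b)
    return probability >= threshold

# Toy stand-in score: character-set Jaccard overlap, NOT the patented network.
def toy_score(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

In practice `score_fn` would wrap the trained Siamese network of the embodiments below, and `threshold` would be tuned on validation data.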

In a further technical solution, when the first case text and the second case text are input into the Siamese network for processing to obtain their similarity probability value, the Siamese network comprises a network model based on ERNIE text vectors, a network model based on WordGCN graph text vectors, and a network model based on topic-word text vectors.

In a further technical solution, inputting the first case text and the second case text into the Siamese network for processing to obtain the similarity probability value of the first case text and the second case text includes:

inputting the first case text and the second case text into the network model based on ERNIE text vectors for processing to obtain first processing features of the two texts;

inputting the first case text and the second case text into the network model based on WordGCN graph text vectors for processing to obtain second processing features of the two texts;

inputting the first case text and the second case text into the network model based on topic-word text vectors for processing to obtain third processing features of the two texts;

concatenating the first processing features and the second processing features to obtain merged features of the first case text and the second case text;

feeding the merged features into a fully connected layer to obtain fully-connected-layer features of the two texts;

multiplying the fully-connected-layer features element-wise by the third processing features to obtain text semantic representation features of the two texts;

applying a fully connected layer and an activation function to the text semantic representation features to obtain abstract semantic representations of the two texts;

passing the abstract semantic representations through a fully connected layer of dimension 1 and a sigmoid activation function to obtain the similarity probability value of the first case text and the second case text.
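The fusion steps above (concatenate, fully connected layer, element-wise topic weighting, fully connected layer with activation, dimension-1 output with sigmoid) can be sketched with random, untrained weights. The hidden dimension of 64 and the requirement that the topic feature match that dimension are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_head(f_ernie, f_gcn, f_topic, d_hidden=64):
    """Sketch of the fusion steps: concat the ERNIE and WordGCN features,
    apply a fully connected layer, weight element-wise by the topic
    feature, apply a second fully connected layer with ReLU, then map to
    one similarity probability with a dim-1 layer and a sigmoid."""
    merged = np.concatenate([f_ernie, f_gcn])              # concat merge
    w1 = rng.standard_normal((d_hidden, merged.size)) * 0.1
    h = w1 @ merged                                        # fully connected
    h = h * f_topic                                        # topic weighting
    w2 = rng.standard_normal((d_hidden, d_hidden)) * 0.1
    h = np.maximum(w2 @ h, 0.0)                            # FC + ReLU
    w3 = rng.standard_normal((1, d_hidden)) * 0.1
    return float(sigmoid(w3 @ h)[0])                       # dim-1 FC + sigmoid
```

With trained weight matrices in place of the random ones, the returned value is the similarity probability compared against the threshold.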

In a further technical solution, inputting the first case text and the second case text into the network model based on ERNIE text vectors for processing to obtain the first processing features includes:

segmenting the text content of the first case text and the second case text into sentences according to its sentence-break punctuation;

tokenizing the sentences with a word segmentation tool combined with the stop-word lexicon and the proper-word lexicon to obtain word segmentation data;

processing the word segmentation data with ERNIE based on MLM to obtain a word vector for each word;

summing the word vectors of the words in each sentence to obtain the feature vector of the sentence vector;

fusing the feature vectors of all sentence vectors of the text content by concatenation through a Bi-LSTM to obtain the first processing features of the first case text and the second case text.
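The word-vector summation step can be sketched as follows; the vectors in the test are toy values, not ERNIE outputs:

```python
import numpy as np

def sentence_vector(word_vectors):
    """A sentence vector is the element-wise sum of the word vectors of
    the words the sentence contains, as in the step above."""
    return np.sum(np.asarray(word_vectors, dtype=float), axis=0)
```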

In a further technical solution, inputting the first case text and the second case text into the network model based on WordGCN graph text vectors for processing to obtain the second processing features includes:

encoding the words of the first case text and the second case text through the word-to-word relationships at the sentence level and the corpus level in the WordGCN model to obtain word vectors;

constructing sentence vectors from the word vectors;

inputting the sentence vectors into a Bi-GRU for processing to obtain the second processing features of the first case text and the second case text.

In a further technical solution, inputting the first case text and the second case text into the network model based on topic-word text vectors for processing to obtain the third processing features includes:

filtering the stop words in the first case text and the second case text with the stop-word lexicon;

extracting the subject words of the filtered texts;

recording the position index and importance degree of each extracted subject word;

extracting the proper nouns in the first case text and the second case text with the proper-word lexicon;

recording the position index and importance degree of each extracted proper noun;

adding the importance degrees of the subject words and the proper nouns to obtain the third processing features.

In a further technical solution, the subject words of the filtered texts are extracted with the BERTopic model combined with the LDA model.

In a second aspect, a case similarity matching device includes an acquisition unit, a generation unit, a selection unit, a processing unit and a determination unit;

the acquisition unit is configured to obtain case judgment texts from a case database;

the generation unit is configured to collect stop words and proper-noun vocabulary from the case judgment texts and to generate a stop-word lexicon and a proper-word lexicon;

the selection unit is configured to select, from the case judgment texts, a first case text and a second case text that require similarity matching;

the processing unit is configured to input the first case text and the second case text into a Siamese network for processing to obtain a similarity probability value of the two texts;

the determination unit is configured to determine that the first case text and the second case text are similar cases if their similarity probability value meets a set similarity threshold.

In a third aspect, a computer device includes a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above case similarity matching method when executing the computer program.

In a fourth aspect, a computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the steps of the above case similarity matching method.

Compared with the prior art, the present invention has the following beneficial effects: by adopting the ERNIE technology, polysemous words can be understood in their textual context; case similarity analysis is performed with a Siamese neural network, and an attention mechanism focuses the feature analysis on the domain-specific terms of the case, which deepens the model's understanding of the case content and improves the effectiveness and accuracy of case similarity matching.

The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly, it can be implemented in accordance with the content of the description; and in order to make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below.

Description of Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a case similarity matching method provided by a specific embodiment of the present invention;

FIG. 2 is a schematic block diagram of a case similarity matching device provided by a specific embodiment of the present invention;

FIG. 3 is a schematic block diagram of a computer device provided by a specific embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

FIG. 1 is a flowchart of a case similarity matching method provided by a specific embodiment of the present invention.

As shown in FIG. 1, the case similarity matching method includes the following steps: S10-S50.

S10. Obtain case judgment texts from a case database.

In this embodiment, the case database refers to a database of court case judgments.

S20. Collect stop words and proper-noun vocabulary from the case judgment texts, and generate a stop-word lexicon and a proper-word lexicon.

In this embodiment, due to factors such as the revision of laws and statutes, some terms fall out of use; in addition, as technology develops, new proper nouns appear. Therefore, stop words and proper-noun vocabulary can be collected from the case judgment texts and compiled into a stop-word lexicon and a proper-word lexicon to facilitate subsequent text similarity analysis.

S30. Select, from the case judgment texts, a first case text and a second case text that require similarity matching.

Before selection, selection conditions can be set; for example, the texts to be matched can be selected according to content related to the case description and the case conclusion.

It should be noted that the first case text and the second case text do not only refer to matching one first case text against one second case text; one first case text may also be matched against multiple second case texts. Matching one first case text against one second case text determines whether the two are similar; matching one first case text against multiple second case texts yields how close each second case text is to the first, and the results can be ranked by similarity to facilitate case analysis by personnel.
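The one-against-many matching just described can be sketched as a simple ranking; `score_fn` is again a hypothetical stand-in for the Siamese network's similarity probability:

```python
def rank_candidates(query, candidates, score_fn):
    """Score every candidate second case text against the query first case
    text and return (candidate, probability) pairs sorted in descending
    order of similarity, so analysts review the closest cases first."""
    scored = [(cand, score_fn(query, cand)) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```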

S40. Input the first case text and the second case text into the Siamese network for processing to obtain the similarity probability value of the first case text and the second case text.

In this embodiment, the Siamese network consists of two identical, parallel left and right sub-networks, each of which includes a network model based on ERNIE text vectors, a network model based on WordGCN graph text vectors, and a network model based on topic-word text vectors. The left sub-network processes the first case text and the right sub-network processes the second case text.

In this embodiment, step S40 specifically includes the following steps: S401-S408.

S401. Input the first case text and the second case text into the network model based on ERNIE text vectors for processing to obtain the first processing features of the two texts.

The network model based on ERNIE text vectors is obtained by training the ERNIE model on all cases to get the word vector of each word. The word vectors are trained by fine-tuning.

In one embodiment, step S401 specifically includes the following steps: S4011-S4015.

S4011. Segment the text content of the first case text and the second case text into sentences according to its sentence-break punctuation.

S4012. Tokenize the sentences with a word segmentation tool combined with the stop-word lexicon and the proper-word lexicon to obtain word segmentation data.

S4013. Process the word segmentation data with ERNIE based on MLM to obtain a word vector for each word.

S4014. Sum the word vectors of the words in each sentence to obtain the feature vector of the sentence vector.

S4015. Concatenate and fuse the feature vectors of all sentence vectors of the text content through a Bi-LSTM to obtain the first processing features of the first case text and the second case text.

For steps S4011-S4015, in this embodiment, sentences are first segmented at the sentence-break punctuation of the text, and each sentence is then tokenized with a word segmentation tool. The sequence length of a sentence is fixed at 16 words: text sequences longer than this are truncated, and shorter ones are padded. At the same time, note that proper nouns must be segmented with the help of the proper-noun lexicon so that they are not split internally. For the positional encoding in the transformer mechanism of ERNIE, the even dimensions use

PE(position, 2i) = sin(position / 10000^(2i/d))

and the odd dimensions use

PE(position, 2i+1) = cos(position / 10000^(2i/d))

where position denotes the position of the word in the sentence, i ranges over the word-vector dimensions (0 to 64), and d is the word-vector dimension.
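Assuming the standard sinusoidal formulation above with d = 64 (taken from the stated dimension range), the encoding can be sketched as:

```python
import numpy as np

def positional_encoding(position, d_model=64):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on
    odd dimensions; `position` is the word's position in the sentence."""
    pe = np.zeros(d_model)
    even_i = np.arange(0, d_model, 2)                # even dimension indices 2i
    angles = position / np.power(10000.0, even_i / d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe
```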

Training the ERNIE model with the MLM (Masked Language Model) method yields the word vector of each word. A sentence contains multiple words; the feature vector of the sentence vector is obtained by summing the word vectors, and all sentence vectors are concatenated and fused into the text vector. The text-vector length is chosen as 500, i.e. 500 sentences are selected and longer texts are truncated, yielding a sentence-vector matrix embedding_ernie-sentence of length 500 and dimension 64 built from the word vectors. Feeding embedding_ernie-sentence into the Bi-LSTM produces the text vector embedding_ernie-doc.
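The fixed 500-sentence, 64-dimensional document matrix described above can be sketched as follows; truncation and zero-padding are as stated, and the function name is ours:

```python
import numpy as np

def document_matrix(sentence_vectors, max_sentences=500, dim=64):
    """Stack sentence vectors into a fixed (max_sentences, dim) matrix:
    longer documents are truncated, shorter ones zero-padded, ready to be
    fed to the Bi-LSTM (or Bi-GRU in the WordGCN branch)."""
    doc = np.zeros((max_sentences, dim))
    kept = np.asarray(sentence_vectors, dtype=float)[:max_sentences]
    if kept.size:
        doc[:len(kept)] = kept
    return doc
```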

S402. Input the first case text and the second case text into the network model based on WordGCN graph text vectors for processing to obtain the second processing features of the two texts.

In one embodiment, step S402 specifically includes the following steps: S4021-S4023.

S4021. Encode the words of the first case text and the second case text through the word-to-word relationships at the sentence level and the corpus level in the WordGCN model to obtain word vectors.

S4022. Construct sentence vectors from the word vectors.

S4023. Input the sentence vectors into a Bi-GRU for processing to obtain the second processing features of the first case text and the second case text.

For steps S4021-S4023, in this embodiment, the WordGCN model is chosen to represent the sentence features. Words are encoded through the word-to-word relationships at the sentence level and the corpus level. Sentences are encoded by tokenization with a fixed sentence length of 16 words: longer sentences are truncated, shorter ones are padded, and the sentence vector is still obtained by summing the word vectors. The text-vector length is chosen as 500, with longer texts truncated, yielding a sentence-vector matrix embedding_words of length 500 and dimension 64 built from the word vectors. Feeding embedding_words into the Bi-GRU produces the text vector embedding_graph-doc.

S403. Input the first case text and the second case text into the subject-word-based text vector network model for processing, to obtain the third processing features of the first case text and the second case text.

In one embodiment, step S403 specifically includes the following steps: S4031-S4036.

S4031. Filter the stop words in the texts of the first case text and the second case text using the stop-word vocabulary database.

S4032. Extract the subject words from the filtered text.

S4033. Record the position index and importance degree of each extracted subject word.

S4034. Extract the proper nouns in the texts of the first case text and the second case text using the proper-word vocabulary database.

S4035. Record the position index and importance degree of each extracted proper noun.

S4036. Add the importance degrees of the subject words and the proper nouns, to obtain the third processing feature.

For steps S4031-S4036: in this embodiment, the stop words in the text are first filtered out using the stop-word vocabulary database; then, based on the segmented records described above, a subject-word model is used to extract the subject words of the case, yielding the subject words of the case facts. Subject-word extraction uses BERTopic, which internally uses c-TF-IDF and extracts keywords based on the number of case categories rather than the number of documents; an LDA model is also used to extract subject words from the case content. For the subject words extracted by the two methods, the position index of each subject word and its importance degree are recorded, the importance degree ranging from 0 to 1. The proper nouns in the case text are then extracted using the case proper-word vocabulary database, and the position index of each proper noun is recorded; the importance degree of proper nouns is uniformly set to 0.5. A special-word position index is generated as the union of the subject-word position indexes and the proper-noun position indexes. The importance degrees of these words are added to obtain the attention weight matrix embedding_attention, whose size is 16x500.
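The assembly of the attention weight matrix described above can be sketched as follows; the input layout (a mapping from (word position, sentence position) to importance degree) and the function name are hypothetical simplifications of whatever index records the patent actually keeps:

```python
import numpy as np

def build_attention_matrix(topic_importance, proper_positions,
                           sent_len=16, doc_len=500, proper_weight=0.5):
    """Add topic-word importances (0..1) and the fixed proper-noun weight
    at their position indexes, yielding a (sent_len, doc_len) matrix."""
    att = np.zeros((sent_len, doc_len))
    for (w, s), imp in topic_importance.items():   # (word pos, sentence pos) -> importance
        att[w, s] += imp
    for (w, s) in proper_positions:                # every proper noun weighs 0.5 uniformly
        att[w, s] += proper_weight
    return att
```

Positions listed by both the topic model and the proper-noun lexicon simply accumulate both weights, which matches the "add the importance degrees" step.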

S404. Concatenate the first processing features of the first case text and the second case text with their second processing features, to obtain the merged features of the first case text and the second case text.

In this embodiment, embedding_ernie-doc and embedding_graph-doc are concatenated to obtain the feature embedding_merge.

S405. Input the merged features of the first case text and the second case text into a fully connected layer for processing, to obtain the fully-connected-layer processing features of the first case text and the second case text.

In this embodiment, embedding_merge is input into the fully connected layer to obtain the feature embedding_merge-fc.

S406. Multiply the fully-connected-layer processing features of the first case text and the second case text by their third processing features, to obtain the text semantic representation features of the first case text and the second case text.

In this embodiment, the embedding_merge-fc matrix is multiplied by the attention matrix embedding_attention to obtain the final text semantic representation embedding_doc.

S407. Apply fully-connected-layer and activation-function processing to the text semantic representation features of the first case text and the second case text, to obtain the abstract text semantic representations of the first case text and the second case text.

In this embodiment, the semantic representation embedding_doc is passed through a fully connected layer and an activation function, followed by another fully connected layer, to obtain the abstract text semantic representation embedding_abs-doc.

S408. Process the abstract text semantic representations of the first case text and the second case text through a fully connected layer matrix of dimension 1 and a sigmoid activation function, to obtain the similarity probability value of the first case text and the second case text.

In this embodiment, the abstract text semantic representations embedding_abs-doc obtained from the left and right sub-networks are passed through a fully connected layer matrix of dimension 1 followed by a sigmoid activation function, finally outputting a similarity probability value between 0 and 1.
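Steps S404 to S408 can be sketched end to end as follows. The flattened feature shapes, the random weight initialization, and the ReLU activation are illustrative assumptions (the patent does not specify the inner activation function or exact layer shapes), so this is a sketch of the data flow rather than the trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def similarity_head(emb_ernie_doc, emb_graph_doc, attention, hidden=32, seed=0):
    """S404-S408 over flattened features: concat -> fc -> attention
    weighting -> fc + activation -> dim-1 fc -> sigmoid (shapes simplified)."""
    rng = np.random.default_rng(seed)
    merge = np.concatenate([emb_ernie_doc, emb_graph_doc])     # S404: embedding_merge
    w1 = rng.normal(size=(merge.size, attention.size))
    merge_fc = merge @ w1                                      # S405: embedding_merge-fc
    doc = merge_fc * attention.ravel()                         # S406: attention multiply
    w2 = rng.normal(size=(doc.size, hidden))
    abs_doc = np.maximum(doc @ w2, 0.0)                        # S407: fc + ReLU (assumed)
    w3 = rng.normal(size=(hidden, 1))
    return float(sigmoid(abs_doc @ w3)[0])                     # S408: dim-1 fc + sigmoid
```

The final sigmoid maps the scalar logit into the 0-to-1 similarity probability that the threshold in S50 is applied to.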

S50. If the similarity probability value of the first case text and the second case text meets the set similarity threshold, determine that the first case text and the second case text are similar cases.

In this embodiment, the similarity threshold is 0.5. If the similarity probability value between the second case text and the first case text is greater than 0.5, the case texts of the two cases are considered similar; if it is less than 0.5, the case contents of the two cases are considered dissimilar.

By adopting the ERNIE technology, the present invention can understand the meaning of polysemous words in the text in combination with their context, performs case similarity analysis based on a siamese neural network, and uses an attention mechanism to focus feature analysis on case-specific domain nouns. This deepens the model's understanding of the case content and improves the effectiveness and accuracy of case similarity matching.

FIG. 2 is a schematic block diagram of a case similarity matching apparatus 100 provided by a specific embodiment of the present invention. Corresponding to the above case similarity matching method, a specific embodiment of the present invention further provides a case similarity matching apparatus 100. The case similarity matching apparatus 100 includes units for executing the above case similarity matching method, and the apparatus may be configured in a server.

As shown in FIG. 2, the case similarity matching apparatus 100 includes an acquisition unit 110, a generation unit 120, a selection unit 130, a processing unit 140 and a determination unit 150.

The acquisition unit 110 is configured to acquire case judgment texts from a case database.

In this embodiment, the case database refers to a database of court case judgments.

The generation unit 120 is configured to collect stop words and proper-noun vocabulary from the case judgment texts, and to generate a stop-word vocabulary database and a proper-word vocabulary database.

In this embodiment, due to factors such as the revision of laws and statutes, some words fall out of use; in addition, as technology develops, new proper nouns appear. Therefore, stop words and proper-noun vocabulary can be collected from the case judgment texts, and a stop-word vocabulary database and a proper-word vocabulary database can be generated for them, to facilitate the subsequent text similarity analysis.

The selection unit 130 is configured to select, from the case judgment texts, a first case text and a second case text that require similarity matching.

Before selection, selection conditions may be set; for example, the texts for similarity matching may be selected according to the content related to the case description and the case conclusion.

It should be noted that the first case text and the second case text do not refer only to similarity matching between one first case text and one second case text; one first case text may also be matched against multiple second case texts. If one first case text is matched against one second case text, it can be determined whether the two are similar; if one first case text is matched against multiple second case texts, the degree of similarity between each second case text and the first case text can be obtained, and the results can be sorted by that degree of similarity to facilitate case analysis by personnel.
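The one-to-many matching and sorting just described can be sketched as follows; `score_fn` stands in for the siamese network's similarity probability output and is a placeholder, not part of the patent:

```python
def rank_similar_cases(first_case, second_cases, score_fn):
    """Score each candidate second case against the query case and
    sort by descending similarity probability."""
    scored = [(case, score_fn(first_case, case)) for case in second_cases]
    return sorted(scored, key=lambda cs: cs[1], reverse=True)
```

The sorted list puts the most similar precedents first, which is the ordering personnel would review during case analysis.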

The processing unit 140 is configured to input the first case text and the second case text into a siamese network for processing, to obtain the similarity probability value of the first case text and the second case text.

In this embodiment, the siamese network is divided into two identical, parallel left and right sub-networks. Both sub-networks include a network model based on ERNIE text vectors, a network model based on WordGCN-graph text vectors, and a network model based on subject-word text vectors. The first case text can be processed by the left sub-network, and the second case text by the right sub-network.

In one embodiment, the processing unit 140 includes a first processing module, a second processing module, a third processing module, a merging module, a fourth processing module, an operation module, a fifth processing module and a sixth processing module.

The first processing module is configured to input the first case text and the second case text into the ERNIE-based text vector network model for processing, to obtain the first processing features of the first case text and the second case text.

The ERNIE-based text vector network model is obtained by training the ERNIE model on all cases to obtain the word vector of each word. The word vectors are trained by fine-tuning.

In one embodiment, the first processing module includes a segmentation sub-module, a word-segmentation sub-module, a first processing sub-module, an operation sub-module and a fusion sub-module.

The segmentation sub-module is configured to segment the text content of the first case text and the second case text into sentences according to the sentence-break punctuation.

The word-segmentation sub-module is configured to segment the sentences into words using a word-segmentation tool in combination with the stop-word vocabulary database and the proper-word vocabulary database, to obtain word-segmentation data.

The first processing sub-module is configured to process the word-segmentation data using ERNIE based on MLM, to obtain the word vector of each word.

The operation sub-module is configured to sum the word vectors of the words in each sentence, to obtain the feature vector of the sentence vector.

The fusion sub-module is configured to concatenate and fuse the feature vectors of all sentence vectors of the text content through a Bi-LSTM, to obtain the first processing features of the first case text and the second case text.

For the segmentation sub-module, the word-segmentation sub-module, the first processing sub-module, the operation sub-module and the fusion sub-module: in this embodiment, the text is first segmented into sentences based on the sentence-break punctuation, and the sentences are then segmented into words with a word-segmentation tool. The sequence length of a sentence is set to 16 words; text sequences exceeding this length are truncated, and sequences shorter than it are padded. Note that the analysis of proper nouns must combine the proper-noun vocabulary database during word segmentation, to avoid splitting a proper noun internally. For the position encoding in the transformer mechanism of ERNIE, when the position of the word in the text is even, the following is used:

Figure BDA0003684411260000131

and when the position of the word in the text is odd:

Figure BDA0003684411260000141

Here, position denotes the position of the word in the sentence, and i is the dimension index of the word vector, ranging over dimensions 0 to 64.

The word vector of each word is obtained by training the ERNIE model with the MLM (Masked Language Model) method. A sentence contains multiple words, and the feature vector of the sentence vector is obtained by summing these word vectors; all sentence vectors are then concatenated and fused to obtain the text vector. The text-vector length is set to 500, that is, 500 sentences are selected, and sentences beyond this length are truncated, yielding the sentence-vector matrix embedding_ernie-sentence, of length 500 and dimension 64, built from the word vectors. The embedding_ernie-sentence is input into the Bi-LSTM to obtain the text vector embedding_ernie-doc.
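The aggregation just described, summing each sentence's word vectors and truncating or padding the document to 500 sentence vectors of dimension 64 before the Bi-LSTM, can be sketched as follows; the zero-vector padding and the function name are illustrative assumptions:

```python
import numpy as np

def doc_matrix(sent_word_vecs, max_sents=500, dim=64):
    """Sum each sentence's word vectors, then truncate/pad to a fixed
    number of sentences: the (max_sents, dim) input for the Bi-LSTM."""
    sents = [np.sum(ws, axis=0) for ws in sent_word_vecs[:max_sents]]
    while len(sents) < max_sents:
        sents.append(np.zeros(dim))       # pad short documents with zero sentence vectors
    return np.stack(sents)
```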

The second processing module is configured to input the first case text and the second case text into the WordGCN-graph-based text vector network model for processing, to obtain the second processing features of the first case text and the second case text.

In one embodiment, the second processing module includes an encoding sub-module, a construction sub-module and a second processing sub-module.

The encoding sub-module is configured to encode the words of the first case text and the second case text using the word-to-word relationships at the sentence level and the corpus level in the WordGCN model, to obtain word vectors.

The construction sub-module is configured to construct sentence vectors from the word vectors.

The second processing sub-module is configured to input the sentence vectors into a Bi-GRU for processing, to obtain the second processing features of the first case text and the second case text.

For the encoding sub-module, the construction sub-module and the second processing sub-module: in this embodiment, the WordGCN model is selected to produce the feature representation of the sentences. Words are encoded using the word-to-word relationships at both the sentence level and the corpus level. Sentences are encoded by word segmentation with a fixed sentence length: the sentence length is set to 16 words, sentences longer than this are truncated, and shorter ones are padded; the sentence vector is again obtained by summing the word vectors. The text-vector length is set to 500, and sentences beyond this length are truncated, yielding the sentence-vector matrix embedding_words, of length 500 and dimension 64, built from the word vectors. The embedding_words is input into the Bi-GRU to obtain the text vector embedding_graph-doc.

The third processing module is configured to input the first case text and the second case text into the subject-word-based text vector network model for processing, to obtain the third processing features of the first case text and the second case text.

In one embodiment, the third processing module includes a filtering sub-module, a first extraction sub-module, a first recording sub-module, a second extraction sub-module, a second recording sub-module and a calculation sub-module.

The filtering sub-module is configured to filter the stop words in the texts of the first case text and the second case text using the stop-word vocabulary database.

The first extraction sub-module is configured to extract the subject words from the filtered text.

The first recording sub-module is configured to record the position index and importance degree of each extracted subject word.

The second extraction sub-module is configured to extract the proper nouns in the texts of the first case text and the second case text using the proper-word vocabulary database.

The second recording sub-module is configured to record the position index and importance degree of each extracted proper noun.

The calculation sub-module is configured to add the importance degrees of the subject words and the proper nouns, to obtain the third processing feature.

For the filtering sub-module, the first extraction sub-module, the first recording sub-module, the second extraction sub-module, the second recording sub-module and the calculation sub-module: in this embodiment, the stop words in the text are first filtered out using the stop-word vocabulary database; then, based on the segmented records described above, a subject-word model is used to extract the subject words of the case, yielding the subject words of the case facts. Subject-word extraction uses BERTopic, which internally uses c-TF-IDF and extracts keywords based on the number of case categories rather than the number of documents; an LDA model is also used to extract subject words from the case content. For the subject words extracted by the two methods, the position index of each subject word and its importance degree are recorded, the importance degree ranging from 0 to 1. The proper nouns in the case text are then extracted using the case proper-word vocabulary database, and the position index of each proper noun is recorded; the importance degree of proper nouns is uniformly set to 0.5. A special-word position index is generated as the union of the subject-word position indexes and the proper-noun position indexes. The importance degrees of these words are added to obtain the attention weight matrix embedding_attention, whose size is 16x500.

The merging module is configured to concatenate the first processing features of the first case text and the second case text with their second processing features, to obtain the merged features of the first case text and the second case text.

In this embodiment, embedding_ernie-doc and embedding_graph-doc are concatenated to obtain the feature embedding_merge.

The fourth processing module is configured to input the merged features of the first case text and the second case text into a fully connected layer for processing, to obtain the fully-connected-layer processing features of the first case text and the second case text.

In this embodiment, embedding_merge is input into the fully connected layer to obtain the feature embedding_merge-fc.

The operation module is configured to multiply the fully-connected-layer processing features of the first case text and the second case text by their third processing features, to obtain the text semantic representation features of the first case text and the second case text.

In this embodiment, the embedding_merge-fc matrix is multiplied by the attention matrix embedding_attention to obtain the final text semantic representation embedding_doc.

The fifth processing module is configured to apply fully-connected-layer and activation-function processing to the text semantic representation features of the first case text and the second case text, to obtain the abstract text semantic representations of the first case text and the second case text.

In this embodiment, the semantic representation embedding_doc is passed through a fully connected layer and an activation function, followed by another fully connected layer, to obtain the abstract text semantic representation embedding_abs-doc.

The sixth processing module is configured to process the abstract text semantic representations of the first case text and the second case text through a fully connected layer matrix of dimension 1 and a sigmoid activation function, to obtain the similarity probability value of the first case text and the second case text.

In this embodiment, the abstract text semantic representations embedding_abs-doc obtained from the left and right sub-networks are passed through a fully connected layer matrix of dimension 1 followed by a sigmoid activation function, finally outputting a similarity probability value between 0 and 1.

The determination unit 150 is configured to determine that the first case text and the second case text are similar cases if their similarity probability value meets the set similarity threshold.

In this embodiment, the similarity threshold is 0.5. If the similarity probability value between the second case text and the first case text is greater than 0.5, the case texts of the two cases are considered similar; if it is less than 0.5, the case contents of the two cases are considered dissimilar.

By adopting the ERNIE technology, the present invention can understand the meaning of polysemous words in the text in combination with their context, performs case similarity analysis based on a siamese neural network, and uses an attention mechanism to focus feature analysis on case-specific domain nouns. This deepens the model's understanding of the case content and improves the effectiveness and accuracy of case similarity matching.

The above case similarity matching apparatus 100 may be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 3.

Please refer to FIG. 3, which is a schematic block diagram of a computer device provided by an embodiment of the present application. The computer device 500 may be a server, where the server may be an independent server or a server cluster composed of multiple servers.

As shown in FIG. 3, the computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the above case similarity matching method are implemented.

The computer device 700 may be a terminal or a server. The computer device 700 includes a processor 720, a memory and a network interface 750 connected through a system bus 710, where the memory may include a non-volatile storage medium 730 and an internal memory 740.

The non-volatile storage medium 730 may store an operating system 731 and a computer program 732. When the computer program 732 is executed, it can cause the processor 720 to execute any one of the case similarity matching methods.

The processor 720 is configured to provide computing and control capabilities to support the operation of the entire computer device 700.

The internal memory 740 provides an environment for running the computer program 732 in the non-volatile storage medium 730; when the computer program 732 is executed by the processor 720, the processor 720 can execute any one of the case similarity matching methods.

The network interface 750 is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 3 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 700 to which the solution of the present application is applied; the specific computer device 700 may include more or fewer components than shown, combine certain components, or have a different arrangement of components. The processor 720 is configured to run the program code stored in the memory to implement the following steps:

acquiring case judgment texts from a case database;

collecting stop words and proper-noun vocabulary from the case judgment texts, and generating a stop-word vocabulary database and a proper-word vocabulary database;

selecting, from the case judgment texts, a first case text and a second case text that require similarity matching;

inputting the first case text and the second case text into a siamese network for processing, to obtain the similarity probability value of the first case text and the second case text;

if the similarity probability value of the first case text and the second case text meets the set similarity threshold, determining that the first case text and the second case text are similar cases.

在一实施例中:所述将第一案件文本和第二案件文本输入到孪生网络中进行处理,以得到第一案件文本和第二案件文本的相似概率值,所述孪生网络包括基于ERNIE的文本向量的网络模型、基于WordGCN图的文本向量的网络模型以及基于主题词的文本向量的网络模型。In one embodiment: the first case text and the second case text are input into a twin network for processing to obtain similar probability values of the first case text and the second case text, and the twin network includes an ERNIE-based The network model of text vector, the network model of text vector based on WordGCN graph, and the network model of text vector based on subject word.

在一实施例中:所述所述将第一案件文本和第二案件文本输入到孪生网络中进行处理,以得到第一案件文本和第二案件文本的相似概率值,包括:In one embodiment: the inputting the first case text and the second case text into the twin network for processing to obtain similar probability values of the first case text and the second case text, including:

将第一案件文本和第二案件文本输入到基于ERNIE的文本向量的网络模型中进行处理,以得到第一案件文本和第二案件文本的第一处理特征;Inputting the first case text and the second case text into the ERNIE-based text vector network model for processing to obtain the first processing features of the first case text and the second case text;

将第一案件文本和第二案件文本输入到基于WordGCN图的文本向量的网络模型进行处理,以得到第一案件文本和第二案件文本的第二处理特征;Inputting the first case text and the second case text into the network model based on the text vector of the WordGCN graph for processing to obtain the second processing features of the first case text and the second case text;

将第一案件文本和第二案件文本输入到基于主题词的文本向量的网络模型进行处理,以得到第一案件文本和第二案件文本的第三处理特征;Inputting the first case text and the second case text into the network model based on the text vector of the subject word for processing, so as to obtain the third processing feature of the first case text and the second case text;

将第一案件文本和第二案件文本的第一处理特征和第一案件文本和第二案件文本的第二处理特征进行concate合并处理,以得到第一案件文本和第二案件文本的合并特征;Perform concate merge processing on the first processing feature of the first case text and the second case text and the second processing feature of the first case text and the second case text to obtain the merged feature of the first case text and the second case text;

inputting the merged features of the first case text and the second case text into a fully connected layer for processing to obtain fully-connected-layer features of the first case text and the second case text;

multiplying the fully-connected-layer features of the first case text and the second case text by the third processing features of the first case text and the second case text to obtain text semantic representation features of the first case text and the second case text;

processing the text semantic representation features of the first case text and the second case text with a fully connected layer and an activation function to obtain abstract text semantic representations of the first case text and the second case text;

processing the abstract text semantic representations of the first case text and the second case text through the matrix of a fully connected layer of dimension 1 and a sigmoid activation function to obtain the similarity probability value of the first case text and the second case text.
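The fusion head described above can be sketched as follows. This is a minimal, hypothetical illustration with made-up dimensions and a tanh intermediate activation (the patent does not name that activation); `f1`, `f2`, `f3` stand for the first, second, and third processing features of one text pair, and all weight names are invented for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_head(f1, f2, f3, W_merge, W_sem, w_out):
    """Hypothetical fusion head of the twin network:
    concat -> fully connected -> elementwise multiply with the
    subject-word features -> fully connected + activation ->
    dimension-1 fully connected + sigmoid."""
    merged = np.concatenate([f1, f2])   # concat of ERNIE and WordGCN features
    fc = merged @ W_merge               # fully connected layer on merged features
    sem = fc * f3                       # multiply with subject-word (third) features
    abstract = np.tanh(sem @ W_sem)     # fully connected layer + activation (tanh assumed)
    return float(sigmoid(abstract @ w_out))  # dimension-1 layer + sigmoid -> probability
```

With random weights the output is an untrained but well-formed probability in (0, 1); in practice the weights would be learned end to end with the three sub-models.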

In one embodiment, the inputting of the first case text and the second case text into the ERNIE-based text-vector network model for processing to obtain the first processing features of the first case text and the second case text includes:

segmenting sentences according to the sentence-ending punctuation of the text content in the first case text and the second case text;

tokenizing the sentences with a word segmentation tool combined with the stop-word lexicon and the proprietary-word lexicon to obtain word segmentation data;

processing the word segmentation data with ERNIE based on MLM (masked language modeling) to obtain a word vector for each word;

summing the word vectors of the words in each sentence to obtain the feature vector of the sentence;

fusing (concat) the feature vectors of all sentence vectors of the text content through a Bi-LSTM to obtain the first processing features of the first case text and the second case text.
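The sentence-splitting and word-vector-summation steps above can be sketched in a few lines. The punctuation set is an illustrative assumption, and the ERNIE encoder and Bi-LSTM themselves are not reproduced here:

```python
import re
import numpy as np

def split_sentences(text):
    # split on common Chinese/Western sentence-ending punctuation (illustrative set)
    return [s for s in re.split(r"[。！？!?.]", text) if s.strip()]

def sentence_vector(word_vectors):
    # sum the word vectors of one sentence into a single sentence feature vector
    return np.sum(np.stack(word_vectors), axis=0)
```

The resulting per-sentence vectors would then be fed, in order, to the Bi-LSTM for the concat fusion step.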

In one embodiment, the inputting of the first case text and the second case text into the WordGCN-graph-based text-vector network model for processing to obtain the second processing features of the first case text and the second case text includes:

encoding the words of the first case text and the second case text through the word-to-word relations at the sentence level and the corpus level in the WordGCN model to obtain word vectors;

constructing sentence vectors from the word vectors;

inputting the sentence vectors into a Bi-GRU for processing to obtain the second processing features of the first case text and the second case text.
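The Bi-GRU read-out can be sketched with a hand-rolled GRU cell: the sentence-vector sequence is read forwards and backwards and the two final hidden states are concatenated. This is an assumption-laden illustration (standard GRU equations, untrained weights), not the actual model:

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    # one standard GRU update: update gate z, reset gate r, candidate state
    z = _sigmoid(x @ Wz + h @ Uz)
    r = _sigmoid(x @ Wr + h @ Ur)
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1.0 - z) * h + z * h_cand

def bi_gru(sentence_vectors, params_fwd, params_bwd, hidden):
    # forward pass over the sequence
    h_f = np.zeros(hidden)
    for x in sentence_vectors:
        h_f = gru_step(h_f, x, *params_fwd)
    # backward pass over the reversed sequence
    h_b = np.zeros(hidden)
    for x in reversed(sentence_vectors):
        h_b = gru_step(h_b, x, *params_bwd)
    # concatenate the two final states as the sequence representation
    return np.concatenate([h_f, h_b])
```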

In one embodiment, the inputting of the first case text and the second case text into the subject-word-based text-vector network model for processing to obtain the third processing features of the first case text and the second case text includes:

filtering the stop words in the texts of the first case text and the second case text through the stop-word lexicon;

extracting subject words from the filtered texts;

recording the position index and the importance degree corresponding to each extracted subject word;

extracting proper nouns in the texts of the first case text and the second case text through the proprietary-word lexicon;

recording the position index and the importance degree corresponding to each extracted proper noun;

adding the importance degrees of the subject words and the importance degrees of the proper nouns to obtain the third processing features.
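The importance-summing step can be sketched in plain Python. The dictionary layout ({word: (position_index, importance)}) is a hypothetical representation of the records described above; the patent does not fix a concrete data structure:

```python
def third_processing_feature(subject_words, proper_nouns):
    """subject_words / proper_nouns: {word: (position_index, importance)}.
    Sum the importance scores of subject words and proper nouns per
    position index into a sparse feature map (hypothetical layout)."""
    feature = {}
    for table in (subject_words, proper_nouns):
        for _word, (pos, importance) in table.items():
            feature[pos] = feature.get(pos, 0.0) + importance
    return feature
```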

In one embodiment, the extraction of the subject words of the filtered text is performed by a BERTopic model combined with an LDA model.

It should be understood that, in this embodiment of the present application, the processor 720 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor.

Those skilled in the art can understand that the structure of the computer device 700 shown in FIG. 3 does not constitute a limitation on the computer device 700; it may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

Another embodiment of the present invention provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the case similarity matching method disclosed in the embodiments of the present invention.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate this interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.

Claims (10)

1. The case similarity matching method is characterized by comprising the following steps:
acquiring case judgment text in a case database;
collecting stop words and special noun words from case judgment text and generating a stop word library and a special word library;
selecting a first case text and a second case text which need to be subjected to similarity matching from case judgment text;
inputting the first case text and the second case text into a twin network for processing to obtain a similar probability value of the first case text and the second case text;
and if the similarity probability values of the first case text and the second case text meet the set similarity threshold, judging that the first case text and the second case text are similar cases.
2. The case similarity matching method according to claim 1, wherein the first case text and the second case text are input into a twin network for processing, so as to obtain the similarity probability values of the first case text and the second case text, and the twin network comprises a network model of an ERNIE-based text vector, a network model of a WordGCN-based text vector and a network model of a subject word-based text vector.
3. The case similarity matching method according to claim 2, wherein the inputting the first case text and the second case text into the twin network for processing to obtain the similarity probability value of the first case text and the second case text comprises:
inputting the first case text and the second case text into a network model of a text vector based on ERNIE for processing to obtain first processing characteristics of the first case text and the second case text;
inputting the first case text and the second case text into a network model based on a text vector of a WordGCN graph for processing to obtain second processing characteristics of the first case text and the second case text;
inputting the first case text and the second case text into a network model based on text vectors of subject words for processing to obtain third processing characteristics of the first case text and the second case text;
performing concat merging processing on the first processing characteristics of the first case text and the second case text and the second processing characteristics of the first case text and the second case text to obtain merging characteristics of the first case text and the second case text;
inputting the combined features of the first case text and the second case text into a full-link layer for processing to obtain full-link layer processing features of the first case text and the second case text;
carrying out multiplication operation on the fully connected layer processing characteristics of the first case text and the second case text and the third processing characteristics of the first case text and the second case text to obtain text semantic representation characteristics of the first case text and the second case text;
carrying out full-connection layer and activation function processing on text semantic representation characteristics of the first case text and the second case text to obtain text abstract semantic representations of the first case text and the second case text;
and processing the text abstract semantic representations of the first case text and the second case text through a matrix of a full connection layer with the dimension of 1 and a sigmoid activation function to obtain the similar probability values of the first case text and the second case text.
4. The case similarity matching method according to claim 3, wherein the step of inputting the first case text and the second case text into a network model based on ERNIE text vectors for processing to obtain the first processing features of the first case text and the second case text comprises:
sentence segmentation is carried out according to sentence breaking symbols of text contents in the first case text and the second case text;
performing word segmentation on the sentence by a word segmentation tool in combination with a stop word vocabulary base and a special word vocabulary base to obtain word segmentation data;
processing the word segmentation data based on MLM through ERNIE to obtain a word vector of each word;
summing the word vectors of each word in each sentence to obtain the feature vectors of the sentence vectors;
and carrying out concat fusion on the feature vectors of all sentence vectors of the text content through Bi-LSTM to obtain the first processing features of the first case text and the second case text.
5. The case similarity matching method according to claim 3, wherein the step of inputting the first case text and the second case text into a network model based on text vectors of WordGCN graphs for processing to obtain second processing features of the first case text and the second case text comprises:
coding words of the first case text and the second case text through the relation between words in a sentence level and a corpus level in a WordGCN model to obtain word vectors;
constructing a statement vector according to the word vector;
and inputting the sentence vectors into the Bi-GRU for processing to obtain second processing characteristics of the first case text and the second case text.
6. The case similarity matching method according to claim 3, wherein the step of inputting the first case text and the second case text into a network model based on text vectors of subject words for processing to obtain third processing features of the first case text and the second case text comprises:
filtering stop words in the texts of the first case text and the second case text through a stop word vocabulary library;
extracting the subject terms of the filtered text;
recording the position index and the importance degree corresponding to the extracted subject term;
extracting proper nouns in the texts of the first case text and the second case text through a proper word vocabulary library;
recording the position index and the importance degree corresponding to the extracted proper nouns;
and adding the importance degree of the subject term and the importance degree of the proper noun to obtain a third processing characteristic.
7. The case similarity matching method according to claim 6, wherein the subject words of the filtered text are extracted by combining a BERTopic model with an LDA model.
8. The case similarity matching device is characterized by comprising an acquisition unit, a generation unit, a selection unit, a processing unit and a judgment unit;
the acquisition unit is used for acquiring case judgment text in a case database;
the generating unit is used for collecting the stop words and special noun words from the case judgment text and generating a stop word library and a special word library;
the selecting unit is used for selecting a first case text and a second case text which need to be subjected to similarity matching from case judgment text;
the processing unit is used for inputting the first case text and the second case text into the twin network for processing to obtain the similar probability values of the first case text and the second case text;
and the judging unit is used for judging that the first case text and the second case text are similar cases if the similarity probability value of the first case text and the second case text meets the set similarity threshold.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the case similarity matching method steps according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the case similarity matching method steps according to any one of claims 1 to 7.
CN202210646944.0A 2022-06-08 2022-06-08 Case similarity matching method, device, computer equipment and storage medium Active CN114881028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210646944.0A CN114881028B (en) 2022-06-08 2022-06-08 Case similarity matching method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN114881028A true CN114881028A (en) 2022-08-09
CN114881028B CN114881028B (en) 2024-11-12

Family

ID=82682212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210646944.0A Active CN114881028B (en) 2022-06-08 2022-06-08 Case similarity matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114881028B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413988A (en) * 2019-06-17 2019-11-05 平安科技(深圳)有限公司 Method, apparatus, server and the storage medium of text information matching measurement
CN110580281A (en) * 2019-09-11 2019-12-17 江苏鸿信系统集成有限公司 similar case matching method based on semantic similarity
CN110717332A (en) * 2019-07-26 2020-01-21 昆明理工大学 Similarity calculation method of news and case based on asymmetric twin network
CN111737954A (en) * 2020-06-12 2020-10-02 百度在线网络技术(北京)有限公司 Text similarity determination method, device, equipment and medium
CN112329429A (en) * 2020-11-30 2021-02-05 北京百度网讯科技有限公司 Text similarity learning method, device, equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119226505A (en) * 2024-11-29 2024-12-31 工信人本(北京)管理咨询有限公司 Intelligent content production method and system based on user needs
CN119226505B (en) * 2024-11-29 2025-03-28 工信人本(北京)管理咨询有限公司 Intelligent content production method and system based on user needs

Also Published As

Publication number Publication date
CN114881028B (en) 2024-11-12

Similar Documents

Publication Publication Date Title
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US9792280B2 (en) Context based synonym filtering for natural language processing systems
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN110347894A (en) Knowledge mapping processing method, device, computer equipment and storage medium based on crawler
CN112905768A (en) Data interaction method, device and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
CN118296120A (en) Large-scale language model retrieval enhancement generation method for multi-mode multi-scale multi-channel recall
CN112395875A (en) Keyword extraction method, device, terminal and storage medium
CN110019820B (en) Method for detecting time consistency of complaints and symptoms of current medical history in medical records
CN112612892B (en) Special field corpus model construction method, computer equipment and storage medium
CN115795030A (en) Text classification method, device, computer equipment and storage medium
CN118333157B (en) Domain word vector construction method and system for HAZOP knowledge graph analysis
CN113761125B (en) Dynamic summary determination method and device, computing device and computer storage medium
Kore et al. Legal document summarization using nlp and ml techniques
US11989500B2 (en) Framework agnostic summarization of multi-channel communication
CN113361252B (en) Text depression tendency detection system based on multi-modal features and emotion dictionary
CN117390169A (en) Form data question-answering method, device, equipment and storage medium
CN117828042A (en) Question and answer processing method, device, equipment and medium for financial service
CN114881028A (en) Case similarity matching method, device, computer equipment and storage medium
CN118245564B (en) Method and device for constructing feature comparison library supporting semantic review and repayment
CN114706954B (en) A method, device, equipment and readable storage medium for analyzing sentiment polarity
Rodrigues et al. Mining online product reviews and extracting product features using unsupervised method
CN115688771A (en) Document content comparison performance improving method and system
Rahman et al. ChartSumm: A large scale benchmark for Chart to Text Summarization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant