CN110795926B

CN110795926B - Judgment document similarity judgment method and system based on legal knowledge graph

Info

Publication number: CN110795926B
Application number: CN202010004494.6A
Authority: CN
Inventors: 翁洋; 王竹; 李鑫; other inventors requested that their names not be disclosed
Original assignee: Chengdu Xingyun Law Technology Co Ltd; Sichuan University
Current assignee: Chengdu Xingyun Law Technology Co Ltd; Sichuan University
Priority date: 2020-01-03
Filing date: 2020-01-03
Publication date: 2020-04-07
Anticipated expiration: 2040-01-03
Also published as: CN110795926A

Abstract

The invention discloses a judgment document similarity judgment method and system based on a legal knowledge graph. The method comprises the following steps: acquiring a judgment document A and a judgment document B; constructing a legal knowledge graph A' corresponding to the case of judgment document A and a legal knowledge graph B' corresponding to the case of judgment document B; converting the legal knowledge graph A' into a vector A'' and the legal knowledge graph B' into a vector B''; and comparing the vector A'' with the vector B'' to obtain a similarity judgment result between judgment document A and judgment document B. Through a knowledge embedding algorithm, the method and system convert the original legal knowledge graph from a graph structure that is difficult to compute on into a mathematical vector, support pairwise comparison of judgment documents based on semantic similarity, and facilitate more accurate and efficient judgment of the similarity between judgment documents.

Description

Judgment document similarity judgment method and system based on legal knowledge graph

Technical Field

The invention relates to the field of natural language processing, and in particular to a judgment document similarity judgment method and system based on a legal knowledge graph.

Background

In recent years, knowledge graph applications have become a hallmark of the big data era. A knowledge graph is essentially a semantic network: a graph-based data structure consisting of nodes (points) and edges. Each node represents an "entity" that exists in the real world, and each edge represents a "relationship" between entities. A knowledge graph is one of the most effective ways to represent relationships. The key to a knowledge graph is to collect a large volume of structured or unstructured data, analyze and model the data using domain expertise, discover rules (usually the rules of that domain) through machine computation, and finally enable the machine to recognize those rules and learn to generate computation rules for related data.

At present there are many methods for constructing a knowledge graph. Graphs are generally constructed by web crawling, by mining search logs (query logs), or by Bootstrapping-based multi-class collaborative pattern learning, and these methods are currently applied mainly to optimizing existing search engines. Up to 80% of the world's data is unstructured, and most prior-art techniques cannot identify and analyze such data.

In the legal field, judgment documents are unstructured data, and the prior art cannot judge the similarity between judgment documents directly from the legal knowledge graphs corresponding to them.

Disclosure of Invention

Through a knowledge embedding method, the method and system of the invention convert the original legal knowledge graph from a graph structure that is difficult to compute on into a mathematical vector representation, support pairwise comparison of judgment documents based on semantic similarity, and facilitate more accurate and efficient judgment of the similarity between judgment documents.

In order to achieve the above object, the present invention provides a judgment document similarity judgment method based on a legal knowledge graph, comprising:

step 1: acquiring a judgment document A and a judgment document B;

step 2: constructing a legal knowledge graph A' corresponding to the case of judgment document A and a legal knowledge graph B' corresponding to the case of judgment document B;

step 3: converting the legal knowledge graph A' into a vector A'' and the legal knowledge graph B' into a vector B'';

step 4: calculating the cosine similarity of the vector A'' and the vector B'' to obtain a similarity judgment result for judgment document A and judgment document B.
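
Step 4 relies on cosine similarity. A minimal Python sketch of that calculation is given below for illustration only. The function name `cosine_similarity`, the toy three-dimensional vectors, and the choice to return 0.0 when either vector is all zeros are assumptions made here, not details taken from the patent.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for A'' and B''.
vector_a = [0.2, 0.1, 0.4]
vector_b = [0.4, 0.2, 0.8]   # same direction as vector_a
vector_c = [-0.1, 0.2, 0.0]  # orthogonal to vector_a

print(cosine_similarity(vector_a, vector_b))
print(cosine_similarity(vector_a, vector_c))
```

A value near 1 means the two graph vectors point in nearly the same direction. The patent does not state a threshold above which two judgment documents count as similar.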

Overall, the technical scheme of the invention is as follows: first, a legal knowledge graph based on case facts is constructed; then the triples (entity 1, relation, entity 2) that make up the legal knowledge graph are expressed as vectors using a knowledge embedding method (TransE, TransH, TransR, TransG, DistMult, HolE and the like); a vector representation of the case-fact-based legal knowledge graph of each judgment document is then computed. This supports pairwise similarity comparison of judgment documents and solves the problem of comparing judgment documents on the basis of semantic similarity.
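
Of the embedding methods listed, TransE is perhaps the simplest to illustrate: it models a relation as a translation, so that for a true triple the head vector plus the relation vector should lie close to the tail vector. The sketch below shows that scoring idea together with the usual margin-based ranking loss. The two-dimensional toy embeddings and all names are hypothetical, and the patent does not specify which norm or margin is used.

```python
import math


def transe_score(h, r, t):
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Ranking loss: zero once the true triple beats the corrupted one by `margin`."""
    return max(0.0, margin + pos_score - neg_score)


# Hypothetical 2-dimensional embeddings, chosen by hand for illustration.
plaintiff = [0.1, 0.2]
lends_money_to = [0.3, 0.1]
defendant = [0.4, 0.3]      # equals plaintiff + lends_money_to
third_party = [0.9, -0.5]   # used to build a corrupted (false) triple

pos = transe_score(plaintiff, lends_money_to, defendant)
neg = transe_score(plaintiff, lends_money_to, third_party)
print(pos, neg, margin_loss(pos, neg))
```

Here the true triple scores close to zero while the corrupted triple scores higher; the loss is still positive because the gap is smaller than the assumed margin of 1.0.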

In the invention, the legal knowledge graph corresponding to the case of a judgment document can be constructed in various ways. Many methods for constructing knowledge graphs currently exist; a graph is usually constructed by web crawling, by mining search logs (query logs), or by Bootstrapping-based multi-class collaborative pattern learning, and it can also be constructed in the manner of the published patent document CN 201710339258.8.

Preferably, step 3 uses a knowledge embedding algorithm to vectorize the triples that make up the legal knowledge graph.

Preferably, the knowledge embedding algorithm includes, but is not limited to: the TransE algorithm, the TransH algorithm, the TransR algorithm, the TransG algorithm, the DistMult algorithm and the HolE algorithm.

Preferably, step 3 further includes training the knowledge embedding algorithm with a training set, validating it with a validation set, and testing it with a test set.

Preferably, the legal knowledge graph comprises a plurality of triples (h, r, t), where h denotes the head entity, r denotes the relation, and t denotes the tail entity.
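
As an illustration of the triple structure, a case graph could be held in memory as a list of (h, r, t) records. The sketch below is only one possible representation; the `Triple` type and the entity and relation names are hypothetical examples, not data from the patent.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # h: head entity
    relation: str  # r: relation
    tail: str      # t: tail entity


# Hypothetical knowledge graph of a single case.
case_graph = [
    Triple("plaintiff", "lends_money_to", "defendant"),  # a legal relationship
    Triple("defendant", "drives", "vehicle"),            # a factual relationship
    Triple("third_party", "insures", "vehicle"),
]

# Entities are everything that appears as a head or a tail.
entities = {t.head for t in case_graph} | {t.tail for t in case_graph}
relations = {t.relation for t in case_graph}
print(sorted(entities))
print(sorted(relations))
```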

Preferably, step 2 further comprises: collecting sample data comprising a plurality of judgment documents, generalizing the triples of the legal knowledge graphs corresponding to all judgment documents in the sample data, and defining the party roles of each case as plaintiff, defendant and third party.

Preferably, all entities and all relations in the sample data are counted and entered into a newly built entity dictionary and a newly built relation dictionary, with id numbers beginning at 0 and ending at the number of entities; the entity dictionary and the relation dictionary each have two columns and store each entity or relation in one-to-one correspondence with an id number, thereby establishing a first data set and a second data set respectively;

the triples (h, r, t) of the legal knowledge graph are replaced one by one with the numbers from the first data set and the second data set, and are stored in (h, t, r) order in a third data set;

the third data set is divided into a training set, a validation set and a test set;

the knowledge embedding algorithm is trained with the training set, validated with the validation set, and tested with the test set.
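
The dictionary-building, id-replacement and splitting steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: ids are assigned in order of first appearance, the split is a seeded random shuffle, and the default 6:2:2 ratio is the one given in the embodiment. The function names and the sample triples are hypothetical.

```python
import random


def build_dictionaries(triples):
    """Assign consecutive ids, starting at 0, in order of first appearance."""
    entity2id, relation2id = {}, {}
    for h, r, t in triples:
        for entity in (h, t):
            entity2id.setdefault(entity, len(entity2id))
        relation2id.setdefault(r, len(relation2id))
    return entity2id, relation2id


def encode(triples, entity2id, relation2id):
    """Replace names with ids and store each triple in (h, t, r) order."""
    return [(entity2id[h], entity2id[t], relation2id[r]) for h, r, t in triples]


def split(id_triples, ratios=(6, 2, 2), seed=0):
    """Shuffle, then split into training, validation and test sets."""
    items = list(id_triples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_valid = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


# Hypothetical generalized triples in (h, r, t) order.
triples = [
    ("plaintiff", "lends_money_to", "defendant"),
    ("defendant", "drives", "vehicle"),
    ("third_party", "insures", "vehicle"),
    ("plaintiff", "sues", "defendant"),
    ("defendant", "owes", "plaintiff"),
]
entity2id, relation2id = build_dictionaries(triples)
id_triples = encode(triples, entity2id, relation2id)
train_set, valid_set, test_set = split(id_triples)
print(entity2id)
print(id_triples)
```

Note that the stored order (h, t, r) differs from the logical order (h, r, t), so any code that reads the third data set has to unpack the columns accordingly.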

Preferably, a vector representation of each entity and relation in the sample data is obtained from the knowledge embedding algorithm that has passed testing, by setting the relevant dimension parameter.

Preferably, using the tested knowledge embedding algorithm, the entities and relation of each triple in the legal knowledge graph of each judgment document are replaced by their corresponding vectors, and the vectors are averaged to obtain the multidimensional vector representation of that judgment document's legal knowledge graph.
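
One reading of this averaging step, consistent with the 300-dimension example in the embodiment, is to concatenate the head, relation and tail vectors of each triple and then average the concatenated vectors over all triples of a document. The sketch below follows that reading; the function name and the two-dimensional toy vectors are hypothetical, and the patent may intend a different aggregation.

```python
def document_vector(id_triples, entity_vecs, relation_vecs):
    """Average the concatenated (h, r, t) vectors over all triples of one graph.

    id_triples holds (h, t, r) ids; the vectors are plain Python lists.
    """
    if not id_triples:
        raise ValueError("a knowledge graph needs at least one triple")
    rows = [
        entity_vecs[h] + relation_vecs[r] + entity_vecs[t]  # list concatenation
        for h, t, r in id_triples
    ]
    # Average column by column across the triples.
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical 2-dimensional embeddings keyed by id.
entity_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
relation_vecs = {0: [0.5, 0.5], 1: [-0.5, 0.5]}
graph = [(0, 1, 0), (1, 2, 1)]  # two triples in (h, t, r) order

vec = document_vector(graph, entity_vecs, relation_vecs)
print(vec)  # length is 3 * 2 = 6
```

With d-dimensional entity and relation vectors, the resulting document vector has 3d dimensions regardless of how many triples the graph contains, which is what makes documents of different sizes directly comparable.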

In another aspect, the invention also provides a judgment document similarity judgment system based on a legal knowledge graph, comprising:

a judgment document acquisition unit for acquiring a judgment document A and a judgment document B;

a legal knowledge graph construction unit for constructing a legal knowledge graph A' corresponding to the case of judgment document A and a legal knowledge graph B' corresponding to the case of judgment document B;

a vector conversion unit for converting the legal knowledge graph A' into a vector A'' and the legal knowledge graph B' into a vector B'';

and a judgment document similarity judgment unit for calculating the cosine similarity of the vector A'' and the vector B'' to obtain the similarity judgment result for judgment document A and judgment document B.

One or more technical schemes provided by the invention at least have the following technical effects or advantages:

The case-fact-based knowledge representation of the legal knowledge graph renders case facts as a graph. Unlike traditional keyword retrieval, it presents the full circumstances of each case and captures the relationships among entities, including both factual relationships (for example, driving a vehicle) and legal relationships (for example, A and B form a loan relationship). Because each case is presented completely, the shortcomings of current keyword matching (such as the inability to recognize negated descriptions or to describe a complete case) are avoided when comparing the similarity of a large number of cases, and case matching in a substantive sense is achieved. The method supports pairwise comparison of judgment documents based on semantic similarity and facilitates more accurate and efficient judgment of the similarity between judgment documents.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention;

FIG. 1 is a schematic flow chart of the judgment document similarity judgment method based on a legal knowledge graph;

FIG. 2 is a schematic diagram of the judgment document similarity judgment system based on a legal knowledge graph;

FIG. 3 is a schematic flow chart of the method of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflicting with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described and thus the scope of the present invention is not limited by the specific embodiments disclosed below.

The invention first constructs case-fact-based legal knowledge graphs for various causes of action, and converts the constructed legal knowledge graph of a specific cause of action (road traffic, divorce, labor disputes and the like) into a vector representation, so that the internal structure of the entities and relations within the knowledge graph is represented by mathematical vectors and pairwise comparison of judgment documents based on semantic similarity is supported.

Referring to FIG. 1, an embodiment of the present invention provides a judgment document similarity judgment method based on a legal knowledge graph, including:

step 1: acquiring a judgment document A and a judgment document B;

step 4: comparing the vector A'' with the vector B'' to obtain a similarity judgment result for judgment document A and judgment document B.

Referring to FIG. 2, an embodiment of the present invention provides a judgment document similarity judgment system based on a legal knowledge graph, the system comprising:

a judgment document similarity judgment unit for comparing the vector A'' with the vector B'' to obtain a similarity judgment result for judgment document A and judgment document B.

Referring to FIG. 3, the method includes the following steps:

1. the knowledge graph is represented by a plurality of triples (h, r, t); that is, multiple (h, r, t) groups together represent the knowledge graph of one case;

2. all triples are generalized, and the party roles are defined as plaintiff, defendant and third party;

3. all entities and all relations are counted and added to a newly built entity dictionary and a newly built relation dictionary, with id numbers beginning at 0 and ending at the number of entities. Each dictionary has two columns and stores id numbers in one-to-one correspondence with entity or relation names, thereby establishing the entity2id and relation2id data sets respectively;

4. the triples (h, r, t) of the case knowledge graph are replaced one by one with the numbers in entity2id and relation2id and stored in (h, t, r) order. For example, a triple may finally be replaced by (2, 1, 3), where 2 denotes the second entry of the entity2id dictionary, 1 denotes the first entry of the entity2id dictionary, and 3 denotes the third entry of the relation2id dictionary; the triple is then saved to the new data set;

5. the new data set is divided into a training set train2id, a validation set valid2id and a test set test2id in the ratio 6:2:2;

6. using a knowledge embedding algorithm such as TransE, TransH, TransR, TransG, DistMult or HolE, training is performed on the train2id data, validation on the valid2id data, and testing on the test2id data;

7. with the knowledge embedding algorithm, a vector representation of each entity and relation is finally obtained by setting the relevant dimension parameter (for example, 100 or 200 dimensions); if the parameter is set to 100, each entity is represented by a 100-dimensional vector;

8. the entities and relation of each triple in the legal knowledge graph of each judgment document are replaced with the trained vectors and averaged. For example, if the preceding algorithm is set to produce 100-dimensional vectors for entities and relations, the vector formed by (h, r, t) has 300 dimensions, and the legal knowledge graph of each judgment document is finally represented by a 300-dimensional vector;

9. each judgment document is thereby converted into a vector, which supports pairwise similarity comparison of judgment documents based on their knowledge graph vector representations.
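
Steps 6 and 7 can be illustrated with a toy TransE trainer. The sketch below is not the patent's implementation: it assumes a squared L2 distance, tail-only negative sampling, plain stochastic gradient descent, and entity vectors renormalized to unit length after each epoch, and every name and hyperparameter value other than the 100-dimension setting is hypothetical. Validation and testing on valid2id and test2id are omitted for brevity.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def train_transe(train2id, n_entities, n_relations, dim=100,
                 margin=1.0, lr=0.01, epochs=100, seed=0):
    """Toy TransE trainer; train2id holds id triples in (h, t, r) order."""
    rng = random.Random(seed)
    ent = [normalize([rng.uniform(-1, 1) for _ in range(dim)])
           for _ in range(n_entities)]
    rel = [normalize([rng.uniform(-1, 1) for _ in range(dim)])
           for _ in range(n_relations)]
    for _ in range(epochs):
        for h, t, r in train2id:
            t_neg = rng.randrange(n_entities)  # corrupt the tail entity
            if t_neg == t:
                continue
            d_pos = [ent[h][k] + rel[r][k] - ent[t][k] for k in range(dim)]
            d_neg = [ent[h][k] + rel[r][k] - ent[t_neg][k] for k in range(dim)]
            loss = margin + sum(x * x for x in d_pos) - sum(x * x for x in d_neg)
            if loss <= 0:
                continue  # margin already satisfied, no update needed
            for k in range(dim):
                g_pos, g_neg = 2 * d_pos[k], 2 * d_neg[k]
                ent[h][k] -= lr * (g_pos - g_neg)
                rel[r][k] -= lr * (g_pos - g_neg)
                ent[t][k] += lr * g_pos
                ent[t_neg][k] -= lr * g_neg
        ent = [normalize(v) for v in ent]  # keep entity vectors on the unit sphere
    return ent, rel


# Hypothetical id triples in (h, t, r) order, as produced in step 4.
train2id = [(0, 1, 0), (1, 2, 1), (3, 2, 2)]
entity_vecs, relation_vecs = train_transe(train2id, n_entities=4, n_relations=3)
print(len(entity_vecs), len(entity_vecs[0]))
```

The `dim` argument plays the role of the dimension parameter in step 7: with `dim=100`, each entity and each relation is represented by a 100-dimensional vector, which then feeds the averaging in step 8.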

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A judgment document similarity judgment method based on a legal knowledge graph, characterized by comprising the following steps:

step 1: acquiring a judgment document A and a judgment document B;

step 4: comparing the vector A'' with the vector B'' to obtain a similarity judgment result for judgment document A and judgment document B;

step 2 further comprises: collecting sample data comprising a plurality of judgment documents, generalizing the triples in the legal knowledge graphs corresponding to all judgment documents in the sample data, and defining the party roles of each case as plaintiff, defendant and third party;

counting all entities and all relations in the sample data and entering them into a newly built entity dictionary and a newly built relation dictionary, with id numbers beginning at 0 and ending at the actual number; the entity dictionary and the relation dictionary each have two columns and store each entity or relation in one-to-one correspondence with an id number, thereby establishing a first data set and a second data set respectively;

replacing the triples (h, r, t) in the legal knowledge graph one by one with the numbers in the first data set and the second data set, and storing the triples in (h, t, r) order in a third data set;

2. The judgment document similarity judgment method based on a legal knowledge graph as claimed in claim 1, wherein step 3 uses a knowledge embedding algorithm to vectorize the triples that make up the legal knowledge graph.

3. The judgment document similarity judgment method based on a legal knowledge graph as claimed in claim 2, wherein the knowledge embedding algorithm comprises: the TransE algorithm, the TransH algorithm, the TransR algorithm, the TransG algorithm, the DistMult algorithm and the HolE algorithm.

4. The method as claimed in claim 1, wherein step 3 further comprises training the knowledge embedding algorithm with a training set, validating the knowledge embedding algorithm with a validation set, and testing the knowledge embedding algorithm with a test set.

5. The method as claimed in claim 4, wherein the legal knowledge graph comprises a plurality of triples (h, r, t), where h denotes the head entity, r denotes the relation, and t denotes the tail entity.

6. The judgment document similarity judgment method based on a legal knowledge graph as claimed in claim 1, wherein the vector representation of each entity and relation in the sample data is obtained from a tested knowledge embedding algorithm by setting the relevant dimension parameter.

7. The judgment document similarity judgment method based on a legal knowledge graph as claimed in claim 6, wherein the entities and relation in each triple in the legal knowledge graph of each judgment document are converted into corresponding vectors by the tested knowledge embedding algorithm and averaged to obtain the multidimensional vector representation of the legal knowledge graph of that judgment document.