CN113312487A - Knowledge representation learning method facing legal text based on TransE model - Google Patents

Knowledge representation learning method facing legal text based on TransE model

Info

Publication number
CN113312487A
Authority
CN
China
Prior art keywords
training
industry
text
learning
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110058262.3A
Other languages
Chinese (zh)
Inventor
李参宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Netmarch Technologies Co ltd
Original Assignee
Jiangsu Netmarch Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Netmarch Technologies Co ltd filed Critical Jiangsu Netmarch Technologies Co ltd
Priority to CN202110058262.3A priority Critical patent/CN113312487A/en
Publication of CN113312487A publication Critical patent/CN113312487A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/18Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Technology Law (AREA)
  • Animal Behavior & Ethology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a knowledge representation learning method for legal text based on the TransE model, which comprises the following steps: S1: acquiring legal-industry training texts by using a mask language model; S2: dividing entities according to the obtained legal-industry training texts, extracting the corresponding relations among the entities, and storing the defined data in a graph database in triple form; S3: matching the industry-word entities in the training texts against the graph database, defining an objective function for the knowledge-representation-learning TransE model, training the model by fusing the entity vectors in the training texts with the structural information in the graph database, and learning the representations of the entity vectors and relation vectors. The invention solves the problem that traditional knowledge representation learning methods use only structural information and ignore the various kinds of additional information, so that a knowledge representation fused with text information can better express the complex relations in a knowledge base.

Description

Knowledge representation learning method facing legal text based on TransE model
Technical Field
The invention relates to the field of legal knowledge graphs, and in particular to a knowledge representation learning method for legal text based on the TransE model.
Background
Limited by the many shortcomings of current deep learning technology, it can only handle regular data and relatively simple tasks; in scenarios with complex structure, deep learning is difficult to apply, and judgments must still be made from human experience.
Under these circumstances, the rise of knowledge graph technology provides a convenient and efficient solution. A knowledge graph stores an event as entity-relation-entity triple data, which breaks through the limitations of traditional databases and greatly simplifies the retrieval of related data.
A knowledge base is a database in which related data is stored in an organized way. In general, a knowledge base can be represented in network form, with nodes representing entities and edges representing the relations between entities. A network-form representation usually requires specially designed knowledge-graph computation and storage to make use of the knowledge base. Designing such a knowledge graph is not only time-consuming and labor-intensive but also troubled by data sparsity. Representation learning techniques typified by deep learning have therefore attracted attention; representation learning aims to encode the semantic information of a studied object into a dense, low-dimensional, real-valued vector.
Translation-based models are a typical knowledge representation method: they treat a relation as a translation operation between entities, i.e., the relation vector can be represented as the difference between the tail-entity vector and the head-entity vector. When a relation between entities is missing, the relation vector can be computed from the difference of the entity vectors, and the relation corresponding to that vector can be found to complete the missing link. Such models achieve very high accuracy in knowledge-base completion experiments. However, most existing translation-based models use only the structural information in the knowledge base and ignore additional information such as relation paths, type information, and entity descriptions.
In the field of NLP (Natural Language Processing), pre-trained language models have shown excellent results on many NLP tasks. They also perform well on tasks that require real-world description and knowledge reasoning, such as reading comprehension and information extraction, which shows that pre-trained language models have strong knowledge-acquisition ability and can be used to learn better knowledge representations.
Therefore, by combining text information with the powerful knowledge-acquisition and context-analysis capabilities of a pre-trained language model to model the knowledge base, the problem that traditional knowledge representation learning methods use only structural information and ignore the various kinds of additional information can be solved, and the knowledge representation obtained by fusing text information can better express the complex relations in the knowledge base.
Therefore, there is a need to provide a knowledge representation learning method facing legal text based on the TransE model to solve the above problems.
Disclosure of Invention
The invention aims to provide a knowledge representation learning method facing legal texts based on a TransE model, which comprehensively considers the structural information and the text description information of knowledge and improves the accuracy of knowledge representation.
In order to achieve this purpose, the invention provides the following technical scheme: a knowledge representation learning method for legal text based on the TransE model comprises the following steps: S1: acquiring legal-industry training texts by using a mask language model; S2: dividing entities according to the obtained legal-industry training texts, extracting the corresponding relations among the entities, and storing the defined data in a graph database in triple form; S3: matching the industry-word entities in the training texts against the graph database, defining an objective function for the knowledge-representation-learning TransE model, training the model by fusing the entity vectors in the training texts with the structural information in the graph database, and learning the representations of the entity vectors and relation vectors.
Step S3 comprises the following sub-steps: (1) using h_s and t_s to represent the entity vectors in the graph database, build the representations of the head entity and tail entity from the perspective of structural information, in the same way as the TransE model; (2) matching the industry-word entities in the training text against the graph database, obtain the training-text-based entity vector representations h_w and t_w, building the representations of the head and tail entities from the perspective of the training text; (3) form the energy function of the optimized knowledge representation learning model:
E(h,r,t) = ||h_s + r - t_s|| + ||h_w + r - t_w|| + ||h_s + r - t_w|| + ||h_w + r - t_s||
where the first term is an energy function based on the structural representation, the second term is an energy function based on the text information, and the third and fourth terms are energy functions fusing the structural information and the text information; (4) obtain the embedded representations of entities and relations through the objective function of the knowledge representation learning model, which expresses the characteristics of the entities and relations in the knowledge graph; the objective function is:
L = Σ_{(h,r,t)∈T} Σ_{(h',r,t')∈T'} max(γ + E(h,r,t) - E(h',r,t'), 0)
where γ is the margin hyper-parameter, T is the training set, and T' is the negative-sample set of T; (5) take the h_w and t_w obtained from the pre-trained language model, together with randomly initialized h_s, r and t_s, as the initial input of the knowledge representation learning model; following the training procedure of the TransE model, optimize the objective function by stochastic gradient descent, train and solve the model, and learn the representations of the entity vectors and relation vectors.
Compared with the prior art, the knowledge representation learning method for legal text based on the TransE model comprehensively considers both the structural information and the textual description information of knowledge and improves the accuracy of knowledge representation, so as to express the characteristics of the entities and relations in the knowledge graph more fully and obtain more efficient embedded representations of entities and relations.
The method exploits the strong knowledge-acquisition and context-analysis capabilities of the pre-trained language model and fuses text information to model the knowledge base, solving the problem that traditional knowledge representation learning methods use only structural information and ignore the various kinds of additional information, so that the knowledge representation fused with text information can better express the complex relations in the knowledge base.
Drawings
FIG. 1 is a flow chart of the knowledge representation learning method for legal text based on the TransE model;
FIG. 2 is a schematic flow chart of training of knowledge representation learning in the knowledge representation learning method for legal text according to the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a knowledge representation learning method for legal text based on a TransE model, which includes the following steps:
S1: acquiring legal-industry training texts by using a mask language model; specifically, the mask language model is used to mask the industry words in the training text to obtain a masked training text; the masked training text is input into a pre-trained language model, which learns the knowledge representation of the industry words in the legal text;
the customizing step of the mask language model comprises the following steps:
collecting the industry corpora;
performing word segmentation on the industry linguistic data to obtain a word set;
counting the distribution of the words in the word set to obtain a distribution result;
and selecting words from the word set based on the distribution result, and generating an industry dictionary as the mask language model.
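A minimal Python sketch of the four dictionary-customization steps above; the function name, the pluggable tokenizer, and the `general_words` filter (standing in for the manual deletion of general-purpose words described later) are illustrative assumptions, not part of the patent:

```python
from collections import Counter

def build_industry_dictionary(corpora, tokenize, top_k=1000, general_words=frozenset()):
    """Mine an industry dictionary from corpora by word-frequency statistics.

    corpora       -- iterable of raw corpus strings
    tokenize      -- word-segmentation function (e.g. jieba.lcut for Chinese)
    top_k         -- number of most frequent words kept before filtering
    general_words -- general-purpose words to drop (the manual curation step)
    """
    counts = Counter()
    for document in corpora:          # steps 1-2: collect and segment the corpora
        counts.update(tokenize(document))
    # Steps 3-4: rank words by frequency, then drop general-purpose words.
    return [w for w, _ in counts.most_common(top_k) if w not in general_words]
```

For Chinese legal corpora a real segmenter (e.g. jieba) would replace the whitespace split used here; the resulting word list then serves as the masking vocabulary.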
The acquiring of the industry training text comprises:
collecting an industrial question and answer corpus;
taking the industry question-answer corpus as a formal industry training text;
and scattering question sentences and answer sentences in the industry question-answer corpus to generate a negative example industry training text.
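A sketch of this positive/negative example construction, assuming the question-answer corpus is already a list of (question, answer) pairs; the function name and the 1/0 labels are assumptions, and at least two distinct answers are required:

```python
import random

def make_training_pairs(qa_pairs, seed=0):
    """Positive examples: original (question, answer) pairs, labeled 1.
    Negative examples: answers scattered so that each question is paired
    with a wrong answer, labeled 0. Assumes >= 2 distinct answers."""
    rng = random.Random(seed)
    positives = [(q, a, 1) for q, a in qa_pairs]
    answers = [a for _, a in qa_pairs]
    shuffled = answers[:]
    # Reshuffle until no answer remains with its own question (a derangement).
    while any(s == a for s, a in zip(shuffled, answers)):
        rng.shuffle(shuffled)
    negatives = [(q, s, 0) for (q, _), s in zip(qa_pairs, shuffled)]
    return positives + negatives
```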
Here the industry is the legal industry, and acquiring an industry training text comprises:
collecting legal decision corpus; and deleting case routing information in the legal decision book corpus to generate a legal industry training text.
Wherein, the acquiring of the industry training text further comprises:
inserting a first preset character into the head of the legal industry training text, dividing the legal industry training text according to a fixed character length, and inserting a second preset character into the tail of each divided part;
and inputting the mask training text into a pre-training language model, and learning to obtain knowledge representation of the industry words in the industry training text and case relations to which the legal industry training text belongs.
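The preset-character insertion and fixed-length splitting can be sketched as follows; the BERT-style `[CLS]`/`[SEP]` tokens stand in for the unspecified first and second preset characters and are an assumption (here the head token is repeated per part, BERT-style, whereas the patent inserts it once at the head of the text):

```python
def chunk_text(text, max_len=510, head_token="[CLS]", tail_token="[SEP]"):
    """Split a long training text into fixed-length parts, inserting a preset
    head character before each part and a preset tail character after it."""
    return [f"{head_token}{text[i:i + max_len]}{tail_token}"
            for i in range(0, len(text), max_len)]
```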
S2: dividing entities according to the obtained legal-industry training texts, extracting the corresponding relations among the entities, and storing the defined data in a graph database in triple form;
Triples: the entities in a knowledge graph are words with concrete or abstract meanings, and the relations are the associations between different entities; knowledge is typically stored in the form of triples (head entity h, relation r between the head and tail entities, tail entity t);
the knowledge representation of the industry words refers to that the mask language model is used for masking the industry words in the legal industry texts to obtain mask training texts, and then the mask training texts are input into the pre-training language model to obtain the knowledge representation of entity (industry word) vectors in the legal texts.
Specifically, entity division covers five categories of entities: persons, cases (events), articles, places, and organizations.
The graph database is used for storing various relational graphs, each node in the graph represents an entity, and edges between the nodes represent relations, so that defined data can be stored into the graph database in a node-edge-node mode. The step of constructing the graph database comprises:
acquiring all data relevant to solved cases, including the case time, case place, case articles, case personnel, and all personnel data related to the case articles; dividing the data into the five entity categories of persons, cases, articles, places, and organizations; and extracting the relations among the five categories of entities;
and storing each extracted event in entity-relation-entity format as a triple, denoted (h, r, t), where h is the subject entity of the event, r is the relation, and t is the object entity of the event.
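A minimal in-memory stand-in for the graph database illustrates the node-edge-node (triple) storage format; the class and method names are assumptions, and a production system would use a real graph database such as Neo4j:

```python
from collections import defaultdict

class TripleStore:
    """Stores events as (h, r, t) triples: nodes are entities, edges are relations."""

    def __init__(self):
        self.triples = set()
        self.out_edges = defaultdict(list)

    def add(self, h, r, t):
        """Insert one event triple (subject entity, relation, object entity)."""
        if (h, r, t) not in self.triples:
            self.triples.add((h, r, t))
            self.out_edges[h].append((r, t))

    def neighbors(self, h):
        """All (relation, object entity) edges leaving entity h."""
        return list(self.out_edges[h])
```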
S3: matching the industry-word entities in the training text against the graph database, and defining an objective function for the knowledge-representation-learning TransE model; training the model by fusing the entity vectors in the training text with the structural information in the graph database, and learning the representations of the entity vectors and relation vectors.
Fig. 2 is a schematic flow chart of the training of knowledge representation learning in the knowledge representation learning method for legal text according to the present invention. The training of knowledge representation learning adopts the optimized TransE model to fuse the legal-industry training text with the structural information in the graph database, and comprises the following steps:
(1) Using h_s and t_s to represent the entity vectors in the graph database, build the representations of the head entity and tail entity from the perspective of structural information, in the same way as the TransE model;
(2) Matching the industry-word entities in the training text against the graph database, obtain the training-text-based entity vector representations h_w and t_w, building the representations of the head and tail entities from the perspective of the training text;
the method for obtaining the entity vector representation by matching the pre-training language model with the graph database comprises the following steps:
(2.1) Acquire a legal-industry training text. Typically, industry words appear in industry training texts. An industry word is a word unique to an industry, i.e., one that carries a specific meaning only within that industry. For example, "taurine granules" is an industry word of the medical industry.
(2.2) masking the industry words in the legal industry training text by using a mask language model to obtain a mask training text;
(2.3) inputting the mask training text into a pre-training language model, and learning to obtain knowledge representation of the industry words in the industry training text;
(2.4) matching the entities in the graph database against the legal training text to obtain the entity vector representations learned by the pre-trained language model, namely h_w and t_w.
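Step (2.2), masking the dictionary words before the text is fed to the pre-trained language model, reduces to a token-level replacement; the function name and the `[MASK]` token are assumptions borrowed from BERT-style models:

```python
def mask_industry_words(tokens, industry_dict, mask_token="[MASK]"):
    """Replace every industry-dictionary word in a tokenized text with the
    mask token, producing the masked training text of step (2.2)."""
    return [mask_token if tok in industry_dict else tok for tok in tokens]
```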
(3) Forming an energy function of the optimized knowledge representation learning model:
E(h,r,t) = ||h_s + r - t_s|| + ||h_w + r - t_w|| + ||h_s + r - t_w|| + ||h_w + r - t_s||
In this formula, the first term is an energy function based on the structural representation, the second term is an energy function based on the text information, and the third and fourth terms are energy functions fusing the structural information and the text information. The energy function maps the two types of entity representations into the same vector space containing all four energy-function relational representations, which promote one another;
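The four-term energy function above can be written directly in NumPy; `h_s`/`t_s` are the structural entity vectors, `h_w`/`t_w` the text-based ones, and the L1 norm is an assumption, since the source does not fix the norm:

```python
import numpy as np

def energy(h_s, t_s, h_w, t_w, r, norm_ord=1):
    """E(h,r,t) = ||h_s+r-t_s|| + ||h_w+r-t_w|| + ||h_s+r-t_w|| + ||h_w+r-t_s||
    The four terms tie the structural and textual representations together
    in one vector space."""
    parts = (h_s + r - t_s, h_w + r - t_w, h_s + r - t_w, h_w + r - t_s)
    return float(sum(np.linalg.norm(p, ord=norm_ord) for p in parts))
```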
text information can be well fused through the learning model; and further preparing expression entities and relationship vectors.
The mask language model, used to mask industry words, is customized with a large-scale industry dictionary, which can be mined from massive industry corpora using data-mining techniques. The industry-dictionary mining method is as follows:
firstly, collecting industry linguistic data;
secondly, performing word segmentation on the industry linguistic data to obtain a word set;
then, the distribution of the words in the word set is counted to obtain a distribution result; here, the frequency with which each word appears in the industry corpus is counted;
and finally, words are selected from the word set based on the distribution result to generate the industry dictionary, which serves as the mask language model. Here, high-frequency words are selected first, and general-purpose words are then deleted manually to obtain the industry dictionary.
Specifically, the industry words in the legal-industry training text are recognized using the mask language model (masked language model) and are then masked.
(4) The embedded representations of the entities and relations are obtained through the objective function of the knowledge representation learning model, which expresses the characteristics of the entities and relations in the knowledge graph more fully, thereby obtaining more efficient embedded representations of entities and relations. The objective function is as follows:
L = Σ_{(h,r,t)∈T} Σ_{(h',r,t')∈T'} max(γ + E(h,r,t) - E(h',r,t'), 0)
where γ is the margin hyper-parameter, T is the training set, and T' is the negative-sample set of T;
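The margin-based objective in code (a reconstruction of the standard TransE ranking loss, since the original formula appears only as an image in the source):

```python
def margin_loss(pos_energies, neg_energies, gamma=1.0):
    """L = sum over paired (positive, negative) samples of
    max(0, gamma + E(h,r,t) - E(h',r,t')), pushing true triples to lower
    energy than corrupted ones by at least the margin gamma."""
    return sum(max(0.0, gamma + ep - en)
               for ep, en in zip(pos_energies, neg_energies))
```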
(5) Take the h_w and t_w obtained from the pre-trained language model, together with randomly initialized h_s, r and t_s, as the initial input of the knowledge representation learning model; following the training procedure of the TransE model, optimize the objective function by stochastic gradient descent, train and solve the model, and learn the representations of the entity vectors and relation vectors.
The specific training method of the knowledge representation learning model comprises the following steps:
(5.1) Determine the training set, the hyper-parameter γ, the learning rate λ, the embedding dimension k, and the pre-trained text entity vectors h_w and t_w;
(5.2) Initialize the relation vectors and the entity vectors in the graph database: for each dimension of each vector, randomly take a value in the interval [-6/√k, 6/√k], where k is the dimension of the low-dimensional vectors, and normalize all vectors after initialization;
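Step (5.2) as a NumPy sketch; the uniform range [-6/√k, 6/√k] follows the standard TransE initialization and is assumed here, the source showing the interval only as an image:

```python
import numpy as np

def init_embeddings(n_vectors, k, rng=None):
    """Draw each dimension uniformly from [-6/sqrt(k), 6/sqrt(k)], then
    L2-normalize every vector, as required after initialization."""
    rng = rng or np.random.default_rng(0)
    bound = 6.0 / np.sqrt(k)
    emb = rng.uniform(-bound, bound, size=(n_vectors, k))
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```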
(5.3) Enter the loop: minibatch training (training one batch of data at a time) speeds up training; each batch is negatively sampled (one entity of a triple in the training set is randomly replaced). T_batch is initially an empty list, to which tuple pairs (original triple, corrupted triple) are appended:
T_batch=[([h,r,t],[h',r,t']),……]
and training after the T _ batch is obtained, adjusting parameters by adopting gradient descent, and optimizing the target function until the iteration number reaches the preset maximum iteration number, so as to realize the learning of knowledge expression.
By collaboratively learning the structural information and the training-text description information in the knowledge base, the technical scheme of this legal-text-oriented knowledge representation learning method achieves the following beneficial effects:
the knowledge representation learning algorithm is used for embedding entities and relations in the triples, training and learning are carried out on the basis of the established knowledge graph by combining with the text description information, and the knowledge representation learning algorithm has important significance for reasoning work.
Computational efficiency is significantly improved: the prior art computes semantic and reasoning relations among entities by running graph algorithms over the knowledge graph, which has high computational complexity and poor scalability, whereas the distributed representations obtained through representation learning allow operations such as semantic-similarity computation to be carried out efficiently.
The data-sparsity problem is effectively alleviated: representation learning projects the entities into a unified low-dimensional space in which each entity corresponds to a dense vector, so the semantic similarity between any two entities can be measured.
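Once the entities live in one dense low-dimensional space, the similarity measurement described above reduces to a single vector operation, e.g. cosine similarity:

```python
import numpy as np

def cosine_similarity(u, v):
    """Semantic similarity of two learned entity vectors: a dot product over
    their norms, far cheaper than graph-algorithm reasoning on the raw graph."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```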
Heterogeneous information fusion is realized: the representation learning model projects entities from different sources into the same semantic space, establishing a unified representation space that enables semantic-similarity computation among heterogeneous entities and the fusion of multiple knowledge bases.

Claims (7)

1. A knowledge representation learning method facing legal text based on TransE model is characterized by comprising the following steps:
s1: acquiring a legal industry training text by using a mask language model;
s2: dividing entities according to the obtained legal industry training texts, extracting corresponding relations among the entities, and storing the defined data to a graph database in a triple form;
s3: matching the industry-word entities in the training text against the graph database, and defining an objective function for the knowledge-representation-learning TransE model; and training the model by fusing the entity vectors in the training text with the structural information in the graph database, and learning the representations of the entity vectors and relation vectors.
2. The method of learning legal text-oriented knowledge representation based on TransE model according to claim 1, wherein the step S3 includes the steps of,
(1) using h_s and t_s to represent the entity vectors in the graph database, and building the representations of the head entity and tail entity from the perspective of structural information, in the same way as the TransE model;
(2) matching the industry-word entities in the training text against the graph database to obtain the training-text-based entity vector representations h_w and t_w, and building the representations of the head and tail entities from the perspective of the training text;
(3) forming an energy function of the optimized knowledge representation learning model:
E(h,r,t) = ‖h_s + r - t_s‖ + ‖h_w + r - t_w‖ + ‖h_s + r - t_w‖ + ‖h_w + r - t_s‖
wherein the first term is an energy function based on the structural representation, the second term is an energy function based on the text information, and the third and fourth terms are energy functions fusing the structural information and the text information;
(4) obtaining embedded representations of entities and relationships through an objective function of a knowledge representation learning model that exhibits characteristics of the entities and relationships in a knowledge graph, the objective function being:
L = Σ_{(h,r,t)∈T} Σ_{(h',r,t')∈T'} max(γ + E(h,r,t) - E(h',r,t'), 0)
wherein γ is the margin hyper-parameter, T is the training set, and T' is the negative-sample set of T;
(5) taking the h_w and t_w obtained from the pre-trained language model, together with randomly initialized h_s, r and t_s, as the initial input of the knowledge representation learning model; following the training procedure of the TransE model, optimizing the objective function by stochastic gradient descent, training and solving the model, and learning the representations of the entity vectors and relation vectors.
3. The method of learning legal text-oriented knowledge representation based on TransE model according to claim 2, wherein the method of matching a pre-trained language model with a graph database to obtain an entity vector representation comprises the steps of:
obtaining a legal industry training text;
masking the industry words in the legal industry training text by using a mask language model to obtain a mask training text;
inputting the mask training text into a pre-training language model, and learning to obtain knowledge representation of industry words in the industry training text;
matching the entities in the graph database against the legal training text to obtain the entity vector representations learned by the pre-trained language model, namely h_w and t_w.
4. The method of claim 3, wherein the specific training method of the knowledge representation learning model comprises:
determining the training set, the hyper-parameter γ, the learning rate λ, the embedding dimension k, and the pre-trained text entity vectors h_w and t_w;
initializing the relation vectors and the entity vectors in the graph database: for each dimension of each vector, randomly taking a value in the interval [-6/√k, 6/√k], where k is the dimension of the low-dimensional vectors, and normalizing all vectors after initialization;
entering the loop: with minibatch training (training one batch of data at a time to speed up training), negative sampling is performed on each batch of data; T_batch is initially an empty list, to which tuple pairs (original triple, corrupted triple) are appended:
T_batch=[([h,r,t],[h',r,t']),……];
and training after the T _ batch is obtained, adjusting parameters by adopting gradient descent, and optimizing the target function until the iteration number reaches the preset maximum iteration number, so as to realize the learning of knowledge expression.
5. The method of learning legal text-oriented knowledge representation based on TransE model according to claim 1, wherein the step of customizing the mask language model comprises:
collecting the industry corpora;
performing word segmentation on the industry linguistic data to obtain a word set;
counting the distribution of the words in the word set to obtain a distribution result;
and selecting words from the word set based on the distribution result, and generating an industry dictionary as the mask language model.
6. The method of learning legal text-oriented knowledge representation based on TransE model according to claim 1, wherein the obtaining of industry training text comprises:
collecting an industrial question and answer corpus;
taking the industry question-answer corpus as a formal industry training text;
and scattering question sentences and answer sentences in the industry question-answer corpus to generate a negative example industry training text.
7. The method of learning legal text-oriented knowledge representation based on TransE model according to claim 6, wherein the obtaining of industry training text further comprises:
inserting first preset characters into the heads of the positive example industry training texts and the negative example industry training texts, and inserting second preset characters into the ends of the question sentences and the answer sentences;
and inputting the mask training text into a pre-training language model, and learning to obtain the knowledge representation of the industry words in the industry training text and the positive and negative case prediction values of the industry training text.
CN202110058262.3A 2021-01-16 2021-01-16 Knowledge representation learning method facing legal text based on TransE model Pending CN113312487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110058262.3A CN113312487A (en) 2021-01-16 2021-01-16 Knowledge representation learning method facing legal text based on TransE model


Publications (1)

Publication Number Publication Date
CN113312487A true CN113312487A (en) 2021-08-27

Family

ID=77370612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110058262.3A Pending CN113312487A (en) 2021-01-16 2021-01-16 Knowledge representation learning method facing legal text based on TransE model

Country Status (1)

Country Link
CN (1) CN113312487A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180341863A1 (en) * 2017-05-27 2018-11-29 Ricoh Company, Ltd. Knowledge graph processing method and device
CN109033129A (en) * 2018-06-04 2018-12-18 桂林电子科技大学 Multi-source Information Fusion knowledge mapping based on adaptive weighting indicates learning method
CN110334219A (en) * 2019-07-12 2019-10-15 电子科技大学 The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method
CN111008186A (en) * 2019-06-11 2020-04-14 中央民族大学 Expression method of Tibetan knowledge base
US20200218988A1 (en) * 2019-01-08 2020-07-09 International Business Machines Corporation Generating free text representing semantic relationships between linked entities in a knowledge graph
CN111581395A (en) * 2020-05-06 2020-08-25 西安交通大学 Model fusion triple representation learning system and method based on deep learning
CN111680145A (en) * 2020-06-10 2020-09-18 北京百度网讯科技有限公司 Knowledge representation learning method, device, equipment and storage medium
CN111723021A (en) * 2020-07-23 2020-09-29 哈尔滨工业大学 Defect report automatic allocation method based on knowledge base and representation learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO BAO, HAN KAIXU: "Research on Knowledge Representation Learning Based on Topic Model and Word Embedding", Journal of Beibu Gulf University (《北部湾大学学报》) *

Similar Documents

Publication Publication Date Title
CN107766324B (en) Text consistency analysis method based on deep neural network
CN109902159A (en) A kind of intelligent O&M statement similarity matching process based on natural language processing
CN111078836B (en) Machine reading understanding method, system and device based on external knowledge enhancement
Zhang et al. DeepDive: Declarative knowledge base construction
Mansinghka et al. Structured priors for structure learning
CN112883738A (en) Medical entity relation extraction method based on neural network and self-attention mechanism
CN108038205B (en) Viewpoint analysis prototype system for Chinese microblogs
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111767325B (en) Multi-source data deep fusion method based on deep learning
CN112988917A (en) Entity alignment method based on multiple entity contexts
CN111581368A (en) Intelligent expert recommendation-oriented user image drawing method based on convolutional neural network
Panda Developing an efficient text pre-processing method with sparse generative Naive Bayes for text mining
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN115130538A (en) Training method of text classification model, text processing method, equipment and medium
Panicker et al. Image caption generator
CN114373554A (en) Drug interaction relation extraction method using drug knowledge and syntactic dependency relation
CN116720519B (en) Seedling medicine named entity identification method
Kusuma et al. Automatic question generation with classification based on mind map
CN116049376B (en) Method, device and system for retrieving and replying information and creating knowledge
Pujianto et al. Text Difficulty Classification Based on Lexile Levels Using K-Means Clustering and Multinomial Naive Bayes
CN113312487A (en) Knowledge representation learning method facing legal text based on TransE model
Huang et al. An effective method for constructing knowledge graph of online course
CN111581339B (en) Method for extracting gene events of biomedical literature based on tree-shaped LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210827