CN112001171A - Case-related property knowledge base entity identification method based on ensemble learning

Info

Publication number: CN112001171A
Application number: CN202010825763.5A
Authority: CN
Inventors: 林锋; 蒋宗神; 李攀峰; 李元豪
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-08-17
Filing date: 2020-08-17
Publication date: 2020-11-27

Abstract

The invention discloses an ensemble-learning-based entity identification method for a case-related property knowledge base, comprising the following steps: preprocessing, according to entity category, a plurality of randomly selected legal documents related to case-related property to obtain a training set; training T learners on the obtained training set; randomly selecting two legal documents related to case-related property that are not in the test set, and constructing a development set from them; classifying the corpus in the development set with the trained learners, computing the classification accuracy of each learner, and constructing the weight of each learner from its classification accuracy; segmenting the legal documents related to case-related property into words to construct a test set, and classifying the samples in the test set with each learner; and combining the classification results of all learners by weighted voting to obtain the final entity identification result. The method addresses entity identification on a small-scale corpus with a high accuracy requirement, and can automatically complete knowledge fusion for the disposal of case-related property according to existing legal provisions.

Description

Case-related property knowledge base entity identification method based on ensemble learning

Technical Field

The invention belongs to the technical field of entity identification for case-related property knowledge bases, and particularly relates to an ensemble-learning-based entity identification method for a case-related property knowledge base.

Background

A knowledge base describes the concepts and entities of the objective world, and the relations among them, in structured form, enabling the effective organization, management and understanding of massive amounts of information. The potential of knowledge base systems in applications such as knowledge fusion, intelligent question answering and big-data decision making has attracted wide attention. A knowledge base is a large network with entities as nodes, comprising entities, entity attributes and the relations among entities. Entity identification is a core technology for knowledge base construction.

Entity identification refers to identifying entities with specific meanings in text and determining the category of each entity. It plays an important role in many natural language processing applications, such as information extraction, information retrieval, automatic text summarization, machine translation and knowledge bases. Considerable research on entity identification has been conducted at home and abroad, and the methods can be roughly classified into three types: rule-based methods, traditional machine learning methods, and deep learning methods. Rule-based methods rely on a large number of hand-written rules and do not require corpus labeling. However, writing the rules is time-consuming and labor-intensive, and in specialized fields it requires professional knowledge. Rule-based methods also port poorly: achieving good performance on text from a new domain requires updating the rules. As a result, these methods are used less and less. With the development of traditional machine learning, many of its methods have been successfully applied to the entity recognition task, such as hidden Markov models, maximum entropy models and conditional random fields. Besides using a single machine learning algorithm, several methods may be combined to accomplish the task. Deep learning methods, such as bidirectional long short-term memory neural network models, have also been successfully applied to entity recognition. Compared with traditional machine learning methods, deep learning methods do not need elaborate feature engineering, can automatically capture contextual dependencies in the input text, and perform well.

However, entity identification in the construction of a case-related property knowledge base faces challenges that differ from those of general entity identification, which makes it a distinct and difficult problem. Owing to the particularity of this domain knowledge base, entity identification during its construction faces the following challenges: (1) The training corpus is small and homogeneous. The construction goal of the case-related property knowledge base is to complete knowledge extraction automatically on the basis of laws and regulations, and the corpus is drawn mainly from those provisions of formally implemented laws and regulations that concern the disposal of case-related property. The training corpus available for entity identification is therefore far smaller than that of a general knowledge base, and also smaller than that of a general legal knowledge base. (2) The requirement on identification accuracy is high. The knowledge base is intended to support front-line case-handling personnel in judicial practice, which places extremely high demands on the correctness and accuracy of the knowledge it contains. To ensure the correctness of the knowledge base and reduce subsequent work, the identification accuracy required of the entity identification algorithm is much higher than for a general knowledge base.

Disclosure of Invention

To solve the above problems, the invention provides an ensemble-learning-based entity identification method for a case-related property knowledge base, which addresses entity identification on a small-scale corpus with a high accuracy requirement, and which can automatically complete knowledge fusion for the disposal of case-related property in criminal cases according to existing legal provisions.

To achieve this purpose, the invention adopts the following technical scheme: an ensemble-learning-based entity identification method for a case-related property knowledge base, comprising acquiring a set of legal documents related to case-related property, constructing a corpus from those documents, and dividing the corpus into a training set, a development set and a test set, wherein the entity identification process comprises the following steps:

Step 1: learner training: preprocessing, according to entity category, a plurality of legal documents randomly selected from the set of legal documents related to case-related property to obtain a training set; and training T learners on the obtained training set to obtain learners h_i, i = 1, ..., T;

Step 2: learner weight determination: randomly selecting two legal documents related to case-related property that are not in the test set, and constructing a development set from them; using the trained learners h_i, i = 1, ..., T, to classify the corpus in the development set and computing the classification accuracy of each learner; and constructing the weight of each learner from its classification accuracy;

Step 3: entity identification: segmenting the legal documents related to case-related property into words to construct a test set, and classifying the samples in the test set with each learner; and combining the classification results of all learners by weighted voting to obtain the final entity identification result.

Further, the entity categories comprise: disposal unit, staff of the disposal unit, case-related person, document, case-related property, disposal action, and legal provision or legal document title.

Further, preprocessing the selected legal documents related to case-related property into a training set comprises: taking the selected legal documents as the training set, segmenting them into words with a Chinese word segmentation tool, and manually labeling the segmented words according to entity category to construct the corpus.
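The labeled corpus produced by this preprocessing can be pictured as a list of (token, category) pairs. The following is a minimal sketch, not the patent's implementation: the tag names, the extra `OTHER` tag for non-entity tokens, the helper `label_tokens`, and the sample tokens are all assumptions made for illustration.

```python
from enum import Enum
from typing import Dict, List, Sequence, Tuple


class EntityCategory(Enum):
    """The seven entity categories named in the text, plus an assumed OTHER tag."""
    DISPOSAL_UNIT = "disposal_unit"
    UNIT_STAFF = "unit_staff"
    CASE_PERSON = "case_person"
    DOCUMENT = "document"
    CASE_PROPERTY = "case_property"
    DISPOSAL_ACTION = "disposal_action"
    LEGAL_PROVISION = "legal_provision"
    OTHER = "other"  # assumption: label for tokens that are not entities


LabeledToken = Tuple[str, EntityCategory]


def label_tokens(tokens: Sequence[str],
                 annotations: Dict[str, EntityCategory]) -> List[LabeledToken]:
    """Attach a manual label to each segmented token; unlabeled tokens get OTHER."""
    return [(tok, annotations.get(tok, EntityCategory.OTHER)) for tok in tokens]


# Hypothetical output of a word segmenter, shown as English glosses.
tokens = ["public security organ", "shall", "freeze", "deposit"]
# Hypothetical manual annotations for that sentence.
annotations = {
    "public security organ": EntityCategory.DISPOSAL_UNIT,
    "freeze": EntityCategory.DISPOSAL_ACTION,
    "deposit": EntityCategory.CASE_PROPERTY,
}
labeled_sentence = label_tokens(tokens, annotations)
```

Keeping the categories in one enumeration means every learner and the voting step share the same label set.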

Further, in the learner training process: T learners are selected; for a given training data set containing m samples, bootstrap sampling is used to obtain T sampling sets each containing m training samples; and one learner h_i, i = 1, ..., T, is then trained on each sampling set.
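The sampling step described above, drawing T sets of m samples from a training set of m samples, corresponds to sampling with replacement. A minimal sketch using only the standard library follows; the function name `bootstrap_sets`, the fixed seed and the placeholder sentences are illustrative assumptions.

```python
import random
from typing import List, Sequence, TypeVar

S = TypeVar("S")


def bootstrap_sets(samples: Sequence[S], T: int, seed: int = 0) -> List[List[S]]:
    """Return T sampling sets, each with len(samples) items drawn with replacement."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    m = len(samples)
    return [[samples[rng.randrange(m)] for _ in range(m)] for _ in range(T)]


# Placeholder training samples standing in for labeled sentences.
training_set = [f"sentence_{k}" for k in range(10)]
sampling_sets = bootstrap_sets(training_set, T=4)
# Each of the 4 sets has 10 items; some sentences repeat and some are left out.
```

Because each set is drawn independently, the T learners see different views of the same small corpus, which is what gives the ensemble its diversity.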

Further, four learners are used in the learner training process: a hidden Markov model, a conditional random field, a maximum entropy model, and a bidirectional long short-term memory (BiLSTM) neural network model.
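For four different model families to be weighted and voted together, it helps if they expose a common interface. The sketch below is one possible shape for that interface, assuming hypothetical `fit` and `predict` method names; `MostFrequentTagLearner` is a deliberately trivial stand-in used only to exercise the interface and is not one of the four models named in the text.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Protocol, Sequence, Tuple

LabeledSentence = Sequence[Tuple[str, str]]  # (token, tag) pairs


class Learner(Protocol):
    """Assumed common interface for every learner in the ensemble."""

    def fit(self, sentences: Sequence[LabeledSentence]) -> None: ...

    def predict(self, tokens: Sequence[str]) -> List[str]: ...


class MostFrequentTagLearner:
    """Hypothetical stand-in: tags each token with its most frequent training tag."""

    def __init__(self, default_tag: str = "O") -> None:
        self.default_tag = default_tag
        self.best_tag: Dict[str, str] = {}

    def fit(self, sentences: Sequence[LabeledSentence]) -> None:
        counts: Dict[str, Counter] = defaultdict(Counter)
        for sentence in sentences:
            for token, tag in sentence:
                counts[token][tag] += 1
        self.best_tag = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

    def predict(self, tokens: Sequence[str]) -> List[str]:
        # Unseen tokens fall back to the default (non-entity) tag.
        return [self.best_tag.get(tok, self.default_tag) for tok in tokens]


learner: Learner = MostFrequentTagLearner()
learner.fit([[("court", "unit"), ("freeze", "action")], [("court", "unit")]])
predicted = learner.predict(["court", "freeze", "unknown"])
```

A real hidden Markov model, conditional random field, maximum entropy model or BiLSTM would replace the stand-in behind the same two methods.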

Further, in the development set construction process: two legal documents related to case-related property that are not in the test set are randomly selected, the selected documents are segmented into words, the segmentation results are manually labeled, and the development set is constructed from them.
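The selection rule above, two random documents that are not in the test set, can be sketched as follows. The document identifiers, the choice of which documents form the test set, and the function name `choose_development_docs` are assumptions for illustration; only the count of two comes from the text.

```python
import random
from typing import List, Sequence, Set


def choose_development_docs(all_docs: Sequence[str], test_docs: Set[str],
                            k: int = 2, seed: int = 0) -> List[str]:
    """Randomly pick k documents that do not appear in the test set."""
    candidates = [doc for doc in all_docs if doc not in test_docs]
    if len(candidates) < k:
        raise ValueError("not enough documents outside the test set")
    return random.Random(seed).sample(candidates, k)


# Hypothetical identifiers for a small collection of legal documents.
documents = [f"doc_{n:02d}" for n in range(1, 15)]
test_docs = {"doc_13", "doc_14"}  # assumed test split
dev_docs = choose_development_docs(documents, test_docs)
```

Excluding test documents keeps the learner weights from being tuned on the data used for the final evaluation.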

Further, the learner weight determination process comprises the following steps:

2.1. randomly selecting two legal documents related to case-related property that are not in the test set, and constructing a development set from them;

2.2. using the trained learners h_i, i = 1, ..., T, to classify the corpus in the development set;

2.3. computing, from the classification results and the manual labels, the classification accuracy of each learner h_i, i = 1, ..., T, on the development set, the classification accuracy being calculated as:

$$p_i = \frac{N - M_i}{N}$$

where N is the total number of samples in the development set and M_i is the number of samples that learner h_i classifies incorrectly;

2.4. constructing the weight w_i of each learner from its classification accuracy p_i, the weight calculation formula being:

[formula not reproduced in the source text]
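Steps 2.2 to 2.4 can be sketched in a few lines. The accuracy follows the definitions of N and M_i given in step 2.3. The weight formula itself is not legible in this text, so the normalization used here (each accuracy divided by the sum of all accuracies) is only an assumption, as are the function names and the sample labels.

```python
from typing import List, Sequence


def classification_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """p_i = 1 - M_i / N, with N samples and M_i misclassified samples."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    errors = sum(p != g for p, g in zip(predicted, gold))  # M_i
    return 1.0 - errors / len(gold)


def normalized_weights(accuracies: Sequence[float]) -> List[float]:
    """ASSUMED weighting: w_i = p_i / sum of all p_k (not taken from the source)."""
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("at least one learner must have non-zero accuracy")
    return [p / total for p in accuracies]


# Hypothetical manual labels for four development-set tokens.
gold = ["unit", "O", "action", "property"]
# Hypothetical outputs of two trained learners on the same tokens.
predictions = [
    ["unit", "O", "action", "O"],        # one error
    ["unit", "unit", "O", "property"],   # two errors
]
accuracies = [classification_accuracy(p, gold) for p in predictions]  # [0.75, 0.5]
weights = normalized_weights(accuracies)  # roughly 0.6 and 0.4
```

Any monotone mapping from accuracy to weight would fit the description in step 2.4; normalization is chosen here only because it keeps the weights summing to one.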

Further, in step 3, the final entity identification result is obtained by weighted voting, calculated as:

$$H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} w_i \, h_i^j(x)}$$

where h_i is the i-th learner, x is the test sample, c_j is the j-th output label, h_i^j(x) is the output of h_i on label c_j, and w_i is the weight of learner h_i.
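The weighted vote described above can be sketched as follows, assuming each learner emits one hard label per token, so that its output on label c_j is 1 for the label it predicts and 0 otherwise. The function names and sample labels are hypothetical, and ties are broken in favor of the label seen first, which the source does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_vote(labels: Sequence[str], weights: Sequence[float]) -> str:
    """Return the label with the largest summed weight over all learners."""
    scores: Dict[str, float] = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight  # hard vote: the learner's output is 1 for this label
    return max(scores, key=lambda lab: scores[lab])


def ensemble_predict(per_learner: Sequence[Sequence[str]],
                     weights: Sequence[float]) -> List[str]:
    """Vote token by token; per_learner[i][t] is learner i's label for token t."""
    return [weighted_vote(column, weights) for column in zip(*per_learner)]


# Hypothetical outputs of three learners on a two-token test sample.
per_learner = [
    ["unit", "action"],
    ["O", "action"],
    ["unit", "O"],
]
final_labels = ensemble_predict(per_learner, [0.4, 0.35, 0.25])
# A single heavily weighted learner can outvote two lighter ones:
single = weighted_vote(["a", "b", "b"], [0.6, 0.2, 0.2])
```

With equal weights this reduces to plain majority voting; the accuracy-based weights let a more reliable learner override a majority of weaker ones.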

The beneficial effects of the technical scheme are as follows:

The method addresses entity identification where the corpus is small and the accuracy requirement is high. In this scheme, learners obtained by training several existing entity recognition algorithms on the training set are combined by weighted voting. The result obtained in this way is better than that of a single entity identification method, improving the recognition effect: several existing recognition schemes each complete the entity identification task, and their results are then integrated in parallel to raise identification accuracy. The invention can automatically complete knowledge fusion for the disposal of case-related property in criminal cases according to existing legal provisions.

Drawings

FIG. 1 is a schematic flow chart of the ensemble-learning-based entity identification method for a case-related property knowledge base;

FIG. 2 is a flow chart of the learner weight determination process in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.

In this embodiment, referring to FIG. 1, the present invention provides an ensemble-learning-based entity identification method for a case-related property knowledge base, which comprises acquiring a set of legal documents related to case-related property, constructing a corpus from those documents, and dividing the corpus into a training set, a development set and a test set. The entity identification process comprises learner training, learner weight determination, and entity identification.

The acquired legal document set related to the property involved in the case comprises 14 policy documents and laws and regulations related to the management and disposal of the property involved in the case.

In the above legal documents, the entities to be identified fall into the following categories: disposal unit, staff of the disposal unit, case-related person, document, case-related property, disposal action, and legal provision or legal document title.

(1) Disposal unit: including but not limited to: People's Courts, People's Procuratorates, public security organs, the Ministry of Public Security, the Ministry of Justice, the Ministry of Finance, finance departments, the Supreme People's Court, basic People's Courts, national security organs, procuratorates, the state treasury, the central treasury, case handling departments, custody departments, case handling units, higher-level political and legal organs, the People's Bank of China, courts, the council, procuratorial committees of People's Procuratorates, judicial committees, legal aid agencies, prisons, custody, community correction agencies, guard houses, customs, and the like.

(2) Staff of the disposal unit: including but not limited to: case handling personnel, custodians, supervisors, procuratorial personnel, judicial staff, trial personnel, investigators, people's assessors, court presidents, chief procurators, persons in charge of public security organs, court clerks, presiding judges, and the like.

(3) Case-related person: including but not limited to: parties, defendants, defendees, offsite, relatives, victims, litigants, prosecutes, criminal suspects, criminals, plaintiffs, attorneys, legal agents, litigation participants, disputes, witnesses, appraisers, translators, conspires, stakeholders, attorneys on duty, guardians, reporters, referees, reporters, critiques, current criminals, major suspects, and the like.

(4) Document: including but not limited to: a decision, a notice, a certificate, a police officer's approval of an arrest, a national institute of quarantine prosecution, a People's Court decision, a detainment, a release certificate, an arrest, a promissory note, a notice, a decision, a case decision, a search certificate, a wanted notice, a legal document, and the like.

(5) Case-related property: including but not limited to: property, property-related, document, mail, telegraph, deposit, remittance, bond, stock share, fund share, property, contraband, legal property, article, automobile, boat, money, case-related, deposit voucher, power certificate, payment voucher, money order, book order, cheque, gold and silver, jewelry, famous calligraphy and painting, valuables, real estate, equipment, precious animals and their products, precious plants and their products, drugs, and the like.

(6) Disposal action: including but not limited to: checking, detaining, freezing, keeping, paying, liability refunding, paying, impounding, and returning.

(7) Legal provision or legal document title: including but not limited to: the Criminal Procedure Law, Articles 24 and 36 of the Criminal Procedure Law, and the like.

As a first refinement of this embodiment, preprocessing the selected legal documents related to case-related property into a training set comprises: taking the selected legal documents as the training set, segmenting them into words with a Chinese word segmentation tool, and manually labeling the segmented words according to entity category to construct the corpus.

In the learner training process: T learners are selected; for a given training data set containing m samples, bootstrap sampling is used to obtain T sampling sets each containing m training samples; and one learner h_i, i = 1, ..., T, is then trained on each sampling set.

Preferably, four learners are used in the learner training process: a hidden Markov model, a conditional random field, a maximum entropy model, and a bidirectional long short-term memory (BiLSTM) neural network model.

As a second refinement of the above embodiment, in the development set construction process: two legal documents related to case-related property that are not in the test set are randomly selected, the selected documents are segmented into words, the segmentation results are manually labeled, and the development set is constructed from them.

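As shown in FIG. 2, the learner weight determination process follows steps 2.1 to 2.4 set out in the Disclosure of Invention above: constructing the development set, classifying its corpus with each trained learner, computing each learner's classification accuracy against the manual labels, and constructing each learner's weight from that accuracy.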

As a third refinement of the above embodiment, in step 3 the final entity identification result is obtained by weighted voting, calculated as:

$$H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} w_i \, h_i^j(x)}$$

where h_i is the i-th learner, x is the test sample, c_j is the j-th output label, h_i^j(x) is the output of h_i on label c_j, and w_i is the weight of learner h_i.

The foregoing shows and describes the general principles and broad features of the present invention and advantages thereof. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. An ensemble-learning-based entity identification method for a case-related property knowledge base, characterized by acquiring a set of legal documents related to case-related property, constructing a corpus from those documents, and dividing the corpus into a training set, a development set and a test set, wherein the entity identification process comprises the following steps:

Step 1: learner training: preprocessing, according to entity category, a plurality of legal documents randomly selected from the set of legal documents related to case-related property to obtain a training set; and training T learners on the obtained training set to obtain learners h_i, i = 1, ..., T;

Step 2: learner weight determination: randomly selecting two legal documents related to case-related property that are not in the test set, and constructing a development set from them; using the trained learners h_i, i = 1, ..., T, to classify the corpus in the development set and computing the classification accuracy of each learner; and constructing the weight of each learner from its classification accuracy;

Step 3: entity identification: segmenting the legal documents related to case-related property into words to construct a test set, and classifying the samples in the test set with each learner; and combining the classification results of all learners by weighted voting to obtain the final entity identification result.

2. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 1, wherein the entity categories comprise: disposal unit, staff of the disposal unit, case-related person, document, case-related property, disposal action, and legal provision or legal document title.

3. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 1, wherein preprocessing the selected legal documents related to case-related property into a training set comprises: taking the selected legal documents as the training set, segmenting them into words with a Chinese word segmentation tool, and manually labeling the segmented words according to entity category to construct the corpus.

4. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 3, wherein in the learner training process: T learners are selected; for a given training data set containing m samples, bootstrap sampling is used to obtain T sampling sets each containing m training samples; and one learner h_i, i = 1, ..., T, is then trained on each sampling set.

5. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 4, wherein four learners are used in the learner training process: a hidden Markov model, a conditional random field, a maximum entropy model, and a bidirectional long short-term memory (BiLSTM) neural network model.

6. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 1, wherein in the development set construction process: two legal documents related to case-related property that are not in the test set are randomly selected, the selected documents are segmented into words, the segmentation results are manually labeled, and the development set is constructed from them.

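7. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 6, wherein the learner weight determination process comprises the following steps: randomly selecting two legal documents related to case-related property that are not in the test set, and constructing a development set from them; using the trained learners h_i, i = 1, ..., T, to classify the corpus in the development set; computing, from the classification results and the manual labels, the classification accuracy p_i of each learner on the development set from the total number of samples N and the number of samples M_i that learner h_i classifies incorrectly; and constructing the weight of each learner from its classification accuracy p_i.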

8. The ensemble-learning-based entity identification method for a case-related property knowledge base as claimed in claim 1, wherein in step 3 the final entity identification result is obtained by weighted voting, calculated as:

$$H(x) = c_{\arg\max_{j} \sum_{i=1}^{T} w_i \, h_i^j(x)}$$

where h_i is the i-th learner, x is the test sample, c_j is the j-th output label, h_i^j(x) is the output of h_i on label c_j, and w_i is the weight of learner h_i.