CN113435164B

CN113435164B - Automatic labeling and extracting method and device for Mongolian arbitration document information

Info

Publication number: CN113435164B
Application number: CN202110532905.3A
Authority: CN
Inventors: 赵小兵; 张亮
Original assignee: Minzu University of China
Current assignee: Minzu University of China
Priority date: 2021-05-17
Filing date: 2021-05-17
Publication date: 2024-02-13
Anticipated expiration: 2041-05-17
Also published as: CN113435164A

Abstract

The invention provides a method and a device for automatically labeling and extracting key information from Mongolian judgment documents, and relates to the technical field of text processing. The method comprises: obtaining original data of Mongolian judgment documents; preprocessing the original data; labeling key elements of the preprocessed data according to a preset attribute tag system to obtain a labeled document, wherein the preset attribute tag system is constructed based on Chinese judgment documents; and extracting information from the labeled document with regular expressions to obtain key information. Because a comprehensive set of attribute labels is difficult to derive directly from Mongolian judgment documents, the invention derives a more comprehensive set of attribute labels from large-scale Chinese judgment documents and builds the tag system from them. The constructed system is then applied to Mongolian judgment documents, which realizes automatic labeling and extraction for Mongolian judgment documents and improves labeling efficiency and accuracy.

Description

Automatic labeling and extracting method and device for Mongolian arbitration document information

Technical Field

The invention relates to the technical field of text processing, in particular to a method and a device for automatically labeling and extracting Mongolian arbitration document information.

Background

With the development of society, the legal system is continuously improved and public legal awareness keeps rising. As the number of cases of all kinds increases, so does the number of judgments and rulings. Legal practitioners must therefore review a large number of related cases, laws, and regulations in order to grasp the actual circumstances of a case before carrying out further work. This places higher demands on practitioners and makes their tasks increasingly heavy, which both hinders efficiency and raises the risk of errors. Labeling judgment documents makes it easier for legal practitioners to understand a case.

Traditionally, Mongolian judgment documents are labeled manually: key information is extracted from the legal text and then marked with the corresponding label attributes. Documents labeled carefully in this way can be accurate and readable. However, on the one hand, manual labeling places high demands on annotators, who need a certain amount of legal knowledge to complete the task. On the other hand, as the volume of legal documents grows, manual labeling consumes time and effort, yields poor consistency, and becomes error-prone. Existing labeling methods for Mongolian judgment documents are therefore inefficient.

Disclosure of Invention

(I) Technical problem to be solved

In view of the deficiencies of the prior art, the invention provides a method and a device for automatically labeling and extracting Mongolian judgment document information, which solve the technical problem that existing labeling methods for Mongolian judgment documents are inefficient.

(II) Technical solution

To achieve the above purpose, the invention is realized through the following technical solution:

In a first aspect, the present invention provides a method for automatically labeling and extracting Mongolian judgment document information, the method comprising:

S1, acquiring original data of a Mongolian judgment document;

S2, preprocessing the original data of the Mongolian judgment document;

S3, labeling key elements of the preprocessed original data of the Mongolian judgment document according to a preset Chinese attribute tag system to obtain a labeled document, wherein the preset Chinese attribute tag system is constructed based on Chinese judgment documents;

S4, extracting information from the labeled document by using regular expressions to obtain key information.

Preferably, the method further comprises:

S5, storing the key information in a rule text with a preset structure.

Preferably, preprocessing the original data of the Mongolian judgment document comprises:

S201, converting the Mongolian judgment document from the Menksoft encoding to the international standard encoding;

S202, uniformly converting the free variation selectors and the additional components;

S203, converting full-width characters into half-width characters, and deleting page numbers and redundant paragraph characters.

Preferably, the construction process of the preset Chinese attribute tag system comprises:

setting fixed category labels for the external attribute labels of Chinese judgment documents, labeling the Chinese judgment documents with the fixed category labels, splitting the labeled Chinese judgment documents according to the external labels, and extracting an attribute tag system from a terminology and statute knowledge base; labeling follows these principles:

automatic labeling by machine;

manual checking on top of the automatic machine labeling;

converting the unstructured parts of Chinese judgment documents into structured text, through the following steps:

a. analyzing the structural features of the head and tail, studying a structure-based method for representing the head and tail attributes of a judgment document, and constructing matching rules for structured attribute tags;

b. analyzing the content features of the basic information and the judgment result in the body text, studying a rule-based method for representing the attributes of a judgment, formulating rules by selecting keyword-related information with the help of a professional terminology library, and constructing matching rules for the attribute tags of unstructured text.

Preferably, extracting information from the labeled document by using regular expressions to obtain key information comprises:

automatically extracting from the labeled document of step S3 by regular-expression string matching to obtain the key information, and forming an XML template file.

Preferably, storing the key information in a rule text with a preset structure comprises:

S501, writing a Python program that extracts the key information from the XML template file by regular-expression matching;

S502, writing the extracted key information into a txt text file.

In a second aspect, the present invention provides a device for automatically labeling and extracting Mongolian judgment document information, the device comprising:

a data acquisition module, configured to acquire original data of a Mongolian judgment document;

a preprocessing module, configured to preprocess the original data of the Mongolian judgment document;

a labeling module, configured to label key elements of the preprocessed original data of the Mongolian judgment document according to a preset Chinese attribute tag system to obtain a labeled document, wherein the preset Chinese attribute tag system is constructed based on Chinese judgment documents;

an extraction module, configured to extract information from the labeled document by using regular expressions to obtain key information.

Preferably, the device further comprises:

a rule text module, configured to store the key information in a rule text with a preset structure.

In a third aspect, the present invention provides a computer-readable storage medium storing a computer program for automatic labeling and extraction of Mongolian judgment document information, wherein the computer program causes a computer to execute the method for automatically labeling and extracting Mongolian judgment document information described above.

In a fourth aspect, the present invention provides an electronic device comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the method for automatically labeling and extracting Mongolian judgment document information described above.

(III) Beneficial effects

The invention provides a method and a device for automatically labeling and extracting Mongolian judgment document information. Compared with the prior art, it has the following beneficial effects:

Because a comprehensive set of attribute labels is difficult to derive directly from Mongolian judgment documents, the invention derives a more comprehensive set of attribute labels from large-scale Chinese judgment documents and builds the tag system from them. The constructed system is then applied to Mongolian judgment documents, which realizes automatic labeling and extraction for Mongolian judgment documents and improves labeling efficiency and accuracy.

Drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings required for describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a block diagram of a method for automatically labeling and extracting Mongolian judgment document information according to an embodiment of the present invention;

FIG. 2 is a partially labeled Mongolian judgment document;

FIG. 3 is a schematic diagram of an XML template file.

Detailed Description

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.

By providing a method and a device for automatically labeling and extracting Mongolian judgment document information, the embodiments of the application solve the technical problem that existing labeling methods for Mongolian judgment documents are inefficient, realize automatic labeling and extraction for Mongolian judgment documents, and improve labeling efficiency and accuracy.

To solve the above technical problem, the general idea of the technical solution in the embodiments of the application is as follows:

Because manually labeling judgment documents is time-consuming and laborious, the embodiment of the invention implements a rule-based automatic labeling and extraction method on top of a designed attribute tag system, and extracts the labeled key information from Chinese and Mongolian judgment documents to form rule text, thereby building a corpus for auxiliary judgment prediction tasks. The attribute tag system obtained from Chinese judgment documents is applied to Mongolian judgment documents. Because a comprehensive set of attribute labels is difficult to derive directly from Mongolian judgment documents, a more comprehensive set of attribute labels is derived from large-scale Chinese judgment documents and the tag system is built from them. The constructed system is then applied to Mongolian judgment documents, which realizes automatic labeling and extraction for Mongolian judgment documents and improves labeling efficiency and accuracy.

In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a method for automatically labeling and extracting Mongolian judgment document information. The method is executed by a computer and, as shown in FIG. 1, comprises the following steps:

S1, acquiring original data of a Mongolian judgment document;

S2, preprocessing the original data of the Mongolian judgment document;

S3, labeling key elements of the preprocessed original data of the Mongolian judgment document according to a preset Chinese attribute tag system to obtain a labeled document;

S4, extracting information from the labeled document by using regular expressions to obtain key information.

The embodiment of the invention applies the attribute tag system of Chinese judgment documents to Mongolian judgment documents. Because a comprehensive set of attribute labels is difficult to derive directly from Mongolian judgment documents, a more comprehensive set of attribute labels is derived from large-scale Chinese judgment documents and the tag system is built from them. The constructed system is then applied to Mongolian judgment documents, which realizes automatic labeling and extraction for Mongolian judgment documents and improves labeling efficiency and accuracy.

The steps are described in detail below.

In step S1, original data of a Mongolian judgment document is acquired. The specific implementation is as follows:

The original data of the Mongolian judgment document is obtained by web crawling or other means. In the embodiment of the invention, Mongolian judgment documents are obtained from the ethnic-language documents column of China Judgments Online (https://wenshu.court.gov.cn/).

In step S2, the original data of the Mongolian judgment document is preprocessed. The specific implementation is as follows:

Preprocessing targets features specific to Mongolian text.

S201, encoding conversion: Mongolian judgment documents use the Menksoft encoding rather than the international standard encoding, so the text must be converted from the Menksoft encoding to the international standard encoding.

S202, correction: Mongolian text contains free variation selectors (U+180B, U+180C, U+180D) and additional components, which must be converted uniformly so that the text conforms to the international encoding standard for Mongolian; some Mongolian words are also corrected by means of a dictionary and rules.

S203, layout cleanup: full-width characters in the Mongolian text are uniformly converted into half-width characters, and page numbers and redundant paragraph characters are deleted directly.
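The cleanup in S203 can be sketched in Python, the language the embodiment names for S501. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the page-number patterns, and the reading of "redundant paragraph characters" as empty lines are illustrative guesses. The Menksoft conversion of S201 depends on a mapping table that the text does not give, so it is not shown; for S202 the sketch only counts the three selectors, because the conversion rule itself is not spelled out.

```python
import re

# Full-width ASCII variants occupy U+FF01..U+FF5E; subtracting 0xFEE0 yields
# the half-width code point. U+3000 is the full-width (ideographic) space.
_FULL_TO_HALF = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
_FULL_TO_HALF[0x3000] = 0x20

# Assumed page-number shapes: a line holding only digits, or "- 3 -" style.
_PAGE_NUMBER = re.compile(r"^\s*-?\s*\d+\s*-?\s*$")

# Mongolian free variation selectors named in S202.
FVS = {"\u180b", "\u180c", "\u180d"}


def to_half_width(text: str) -> str:
    """Convert full-width characters to their half-width equivalents."""
    return text.translate(_FULL_TO_HALF)


def clean_layout(text: str) -> str:
    """Drop page-number lines and empty lines, trimming the lines kept."""
    kept = []
    for line in text.splitlines():
        if _PAGE_NUMBER.match(line) or not line.strip():
            continue
        kept.append(line.strip())
    return "\n".join(kept)


def count_variation_selectors(text: str) -> int:
    """Count free variation selectors, e.g. to audit text before S202."""
    return sum(ch in FVS for ch in text)


def preprocess(text: str) -> str:
    """Apply the S203 cleanup: width conversion, then layout cleanup."""
    return clean_layout(to_half_width(text))
```

A real pipeline would need a stricter page-number rule, since this one also drops any content line that consists only of a number.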

In step S3, key elements of the preprocessed original data of the Mongolian judgment document are labeled according to a preset Chinese attribute tag system to obtain a labeled document. The specific implementation is as follows:

In the embodiment of the invention, the preset Chinese attribute tag system is constructed in advance from Chinese judgment documents, as follows:

First, to make user queries and functions such as statistics easier, fixed category labels are set for the external attribute labels of the judgment document: first-level labels are set for the head, the body text, and the tail, with the body text as the core, and the body text is broken down into labels such as basic information and judgment result. Next, the Chinese judgment documents are labeled, the external labels are split according to the official drafting guidelines for judgment documents of the Chinese courts, and a reasonably complete attribute tag system is extracted from the terminology and statute knowledge base. Labeling mainly follows these principles:

1. Machine labeling is used to preserve the authenticity of the judgment document as far as possible;

2. On top of automatic machine labeling, manual correction is also applied to further improve accuracy;

3. Criminal judgment documents are generally structured text, but unstructured instances exist; the inherent properties of a document (drafting conventions, text structure, wording, and so on) are used to convert such instances from unstructured to structured form. For the structured part, the attribute tag system for criminal judgment documents is used for labeling, and the internal structure is divided by regular-expression matching. The unstructured part must be converted into structured text, through the following steps:

a. analyzing the structural features of the head and tail, studying a structure-based method for representing the head and tail attributes of a judgment document, and constructing matching rules for structured attribute tags;

b. analyzing the content features of the basic information and the judgment result in the body text, studying a rule-based method for representing the attributes of a judgment, formulating rules by selecting keyword-related information with the help of a professional terminology library, and constructing matching rules for the attribute tags of unstructured text.
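The division into head, body text, and tail by structural matching rules can be illustrated with a short Python sketch. Everything specific here is hypothetical: the three marker phrases are English stand-ins for rules that would in practice come from the drafting guidelines and the terminology library, and the function name and the returned keys are chosen only for illustration.

```python
import re
from typing import Dict

# Hypothetical structural markers (English stand-ins for the real rules).
BODY_START = re.compile(r"Prosecutor:")
RESULT_START = re.compile(r"The judgment is as follows")
TAIL_START = re.compile(r"Presiding judge")


def split_sections(doc: str) -> Dict[str, str]:
    """Split a judgment document into head, basic information,
    judgment result, and tail using the first match of each marker."""
    body = BODY_START.search(doc)
    result = RESULT_START.search(doc)
    tail = TAIL_START.search(doc)
    if not (body and result and tail):
        raise ValueError("a structural marker is missing")
    if not (body.start() < result.start() < tail.start()):
        raise ValueError("structural markers are out of order")
    return {
        "head": doc[:body.start()].strip(),
        "basic_information": doc[body.start():result.start()].strip(),
        "judgment_result": doc[result.start():tail.start()].strip(),
        "tail": doc[tail.start():].strip(),
    }
```

Raising an error when a marker is missing or out of order gives a natural hook for the manual correction that principle 2 calls for.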

Then, based on the constructed labeling system for judgment documents, a series of rules is revised manually with input from legal experts, and each entity label in the judgment document is labeled automatically by pattern matching with regular expressions. An example with some labels applied is shown in FIG. 2.
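Rule-based labeling with regular expressions can be sketched as follows. This is an illustrative assumption rather than the rule set used in the embodiment: the two patterns are written for English sample text, the tag names are borrowed from the field names of Table 1, and the inline angle-bracket tag format is a guess at how a labeled document might look.

```python
import re
from typing import List, Tuple

# Hypothetical rules: (tag name, pattern whose group 1 is the span to label).
RULES: List[Tuple[str, re.Pattern]] = [
    ("punish_of_money", re.compile(r"fined (\d+) yuan")),
    ("imprisonment", re.compile(r"sentenced to (\d+) months")),
]


def label(text: str) -> str:
    """Wrap the captured span of every rule match in <tag>...</tag>."""
    for tag, pattern in RULES:
        def wrap(match: re.Match, tag: str = tag) -> str:
            whole = match.group(0)
            # Offsets of group 1 relative to the start of the whole match.
            start = match.start(1) - match.start()
            end = match.end(1) - match.start()
            return f"{whole[:start]}<{tag}>{match.group(1)}</{tag}>{whole[end:]}"
        text = pattern.sub(wrap, text)
    return text
```

Keeping the context words outside the tags means only the value itself is labeled, which simplifies the later extraction step.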

In step S4, information is extracted from the labeled document by using regular expressions to obtain key information. The specific implementation is as follows:

Once the automatic labeling result of FIG. 2 is obtained, the labels in the labeled document of step S3 are extracted automatically by regular-expression string matching to obtain the key information, which finally forms the XML template file shown in FIG. 3.
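As a rough illustration of step S4, the sketch below pulls tagged spans out of a labeled document with one regular expression and assembles them into a flat XML string. The inline tag format, the root element name `document`, and the flat layout are assumptions; the actual template of FIG. 3 is not reproduced in the text.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Matches <tag>value</tag>, where the closing tag must repeat the opening tag.
TAGGED = re.compile(r"<(?P<tag>\w+)>(?P<value>.*?)</(?P=tag)>", re.DOTALL)


def extract_key_information(labeled: str) -> List[Tuple[str, str]]:
    """Return (tag, value) pairs in the order they appear in the document."""
    return [(m.group("tag"), m.group("value").strip())
            for m in TAGGED.finditer(labeled)]


def to_xml(pairs: List[Tuple[str, str]], root_name: str = "document") -> str:
    """Build a flat XML document with one child element per pair."""
    root = ET.Element(root_name)
    for tag, value in pairs:
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")
```

Building the output with `xml.etree.ElementTree` rather than string concatenation escapes characters such as `&` and `<` in the extracted values.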

In an embodiment of the invention, to expand the corpus, the method further comprises: S5, storing the key information in a rule text with a preset structure. The specific implementation is as follows:

To use the key information extracted in step S4 for judgment prediction tasks, part of the key information is extracted and stored in a text file, through the following steps:

S501, writing a Python program that uses regular-expression matching to extract the fields listed in Table 1 from the XML template file of step S4;

S502, writing the extracted Table 1 fields into a txt text file, in which each line represents the information of one judgment document.

Table 1. Meaning of each field in the rule text

Field	Meaning
Fact	Description of the case facts
Meta	Case attributes
punish_of_money	Fine (unit: yuan)
Accusation	Charge (name of the crime)
relevant_articles	Relevant law articles
Criminals	Defendant
term_of_imprisonment	Attributes related to the prison term
death_penalty	Whether the death penalty was imposed
Imprisonment	Length of fixed-term imprisonment (unit: months)
life_imprisonment	Whether life imprisonment was imposed
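Steps S501 and S502 might look like the following Python sketch. It is hedged in several ways: only a subset of the Table 1 fields is used, the XML is assumed to hold each field as a flat element named after the field, and serializing each record as one JSON object per line is an assumption, since the text only states that one line represents one judgment document.

```python
import json
import re
from typing import Dict, Iterable

# Subset of the Table 1 fields, spelled as they appear in the table.
FIELDS = ["Fact", "Accusation", "punish_of_money", "death_penalty"]


def extract_fields(xml_text: str) -> Dict[str, str]:
    """S501: pull each field out of the XML text by regular-expression matching.
    Missing fields are recorded as empty strings."""
    record = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", xml_text, re.DOTALL)
        record[name] = match.group(1).strip() if match else ""
    return record


def write_rule_text(xml_documents: Iterable[str], path: str) -> None:
    """S502: write one line per judgment document to a txt file."""
    with open(path, "w", encoding="utf-8") as out:
        for xml_text in xml_documents:
            record = extract_fields(xml_text)
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Passing `ensure_ascii=False` keeps Mongolian and Chinese text readable in the output file instead of turning it into escape sequences.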

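The embodiment of the invention further provides a device for automatically labeling and extracting Mongolian judgment document information, which comprises a data acquisition module, a preprocessing module, a labeling module, and an extraction module, corresponding to steps S1 to S4 of the method.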

It can be understood that the device for automatically labeling and extracting Mongolian judgment document information provided by the embodiment of the invention corresponds to the method described above; for explanations, examples, beneficial effects, and other related content, refer to the corresponding parts of the method, which are not repeated here.

The embodiment of the invention also provides a computer readable storage medium which stores a computer program for automatic labeling and extracting of Mongolian arbitration document information, wherein the computer program enables a computer to execute the method for automatic labeling and extracting of Mongolian arbitration document information.

The embodiment of the invention also provides electronic equipment, which comprises:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the automatic Mongolian arbitration document information labeling and extraction method as described above.

In summary, compared with the prior art, the method has the following beneficial effects:

1. The embodiment of the invention applies the attribute tag system of Chinese judgment documents to Mongolian judgment documents. Because a comprehensive set of attribute labels is difficult to derive directly from Mongolian judgment documents, a more comprehensive set of attribute labels is derived from large-scale Chinese judgment documents and the tag system is built from them. The constructed system is then applied to Mongolian judgment documents, which realizes automatic labeling and extraction for Mongolian judgment documents and improves labeling efficiency and accuracy.

2. The embodiment of the invention implements a rule-based automatic labeling and extraction method, and extracts the labeled key information from Mongolian judgment documents to form rule text, thereby building a corpus for auxiliary judgment prediction tasks.

It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An automatic labeling and extracting method for Mongolian arbitration document information is characterized by comprising the following steps:

S1, acquiring original data of a Mongolian judgment document;

S2, preprocessing the original data of the Mongolian judgment document;

S4, extracting information from the labeled document by using regular expressions to obtain key information;

the construction process of the preset Chinese attribute label system comprises the following steps:

automatic labeling by a machine;

based on automatic labeling of the machine, a manual checking mode is adopted;

2. The method for automatically labeling and extracting Mongolian judgment document information as recited in claim 1, further comprising:

S5, storing the key information in a rule text with a preset structure.

3. The method for automatically labeling and extracting Mongolian judgment document information according to any one of claims 1 to 2, wherein preprocessing the original data of the Mongolian judgment document comprises:

4. The automatic labeling and extracting method for Mongolian arbitration document information according to any one of claims 1-2, wherein the extracting of information from the labeling document by using regular expressions to obtain key information comprises:

5. The method for automatically labeling and extracting Mongolian judgment document information according to claim 2, wherein storing the key information in a rule text with a preset structure comprises:

s502, writing the extracted key information into a txt text file.

6. An automatic labeling and extracting device for Mongolian arbitration document information, which is characterized by comprising:

the extraction module is used for extracting information from the marked document by adopting the regular expression to obtain key information;

automatic labeling by a machine;

based on automatic labeling of the machine, a manual checking mode is adopted;

7. The device for automatically labeling and extracting Mongolian judgment document information of claim 6, further comprising:

8. A computer-readable storage medium storing a computer program for automatic labeling and extraction of Mongolian judgment document information, wherein the computer program causes a computer to execute the method for automatically labeling and extracting Mongolian judgment document information according to any one of claims 1 to 5.

9. An electronic device, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the method for automatically labeling and extracting Mongolian judgment document information of any one of claims 1 to 5.