CN110765889A - Legal document feature extraction method, related device and storage medium - Google Patents


Info

Publication number
CN110765889A
Authority
CN
China
Prior art keywords
document
paragraph
legal
feature extraction
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910936787.5A
Other languages
Chinese (zh)
Other versions
CN110765889B (en)
Inventor
何芳芳
邵博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Zhitong Consulting Co Ltd Shanghai Branch
Original Assignee
Ping An Zhitong Consulting Co Ltd Shanghai Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Zhitong Consulting Co Ltd Shanghai Branch filed Critical Ping An Zhitong Consulting Co Ltd Shanghai Branch
Priority to CN201910936787.5A priority Critical patent/CN110765889B/en
Publication of CN110765889A publication Critical patent/CN110765889A/en
Application granted granted Critical
Publication of CN110765889B publication Critical patent/CN110765889B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/18Legal services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Technology Law (AREA)
  • Economics (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a legal document feature extraction method, a related device, and a storage medium. The legal document is pre-identified to determine a paragraph division model and a feature extraction model corresponding to the legal document, where the feature extraction model comprises the correspondence between document paragraphs and document elements; the legal document is divided into document paragraphs by the paragraph division model; and the document elements corresponding to each document paragraph are extracted from the divided legal document by the feature extraction model, and the extraction results of the document elements are output.

Description

Legal document feature extraction method, related device and storage medium
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a method for extracting features of a legal document, a related apparatus, and a storage medium.
Background
As China's legal system continues to improve, people's awareness of protecting their rights is growing, and legal services play a very important role in daily life and in every industry. "Internet + law" platforms have sprung up online like bamboo shoots after a spring rain. However, legal services are a highly individualized and specialized industry and therefore place higher demands on "Internet +" offerings.
Legal documents contain rich legal concepts and legal logic. Deconstructing a case document helps the user quickly grasp the key elements of the case.
In the prior art, legal documents are deconstructed only into simple elements, so that only simple document type classification can be achieved; complete legal logic is lacking, and it is difficult to provide effective information for sorting out a case.
Disclosure of Invention
The embodiment of the application provides a legal document feature extraction method, an electronic device and a computer-readable storage medium, which are used for deconstructing the content of specific document elements of a legal document.
A first aspect of the embodiments of the present application provides a method for extracting features of a legal document, including:
pre-identifying a legal document, and determining a paragraph division model and a feature extraction model corresponding to the legal document; wherein, the feature extraction model comprises the corresponding relation between the document paragraphs and the document elements;
performing document paragraph segmentation on the legal document through the paragraph segmentation model;
and extracting, through the feature extraction model, the document elements corresponding to the document paragraphs from the legal document after paragraph division, and outputting the extraction results of the document elements.
In an implementation manner of the embodiment of the present application, before the pre-identifying the legal document, the method further includes:
pre-processing the legal instrument, the pre-processing comprising at least one of:
abnormal line break processing, Chinese monetary amount processing, conversion of Chinese numerals to Arabic numerals, unification of punctuation format, illegal character replacement, and correction of wrongly written characters.
In an implementation manner of the embodiment of the present application, the pre-identifying the legal document and determining the paragraph segmentation model and the feature extraction model corresponding to the legal document includes:
identifying a document title of the legal document;
determining the document type corresponding to the legal document according to the document title;
and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
In an implementation manner of the embodiment of the present application, the extracting, by the feature extraction model, document elements from a legal document obtained by segmenting a document paragraph includes:
obtaining a document paragraph obtained after paragraph division is performed on the legal document, and taking the document paragraph as an input object of the feature extraction model, where the feature extraction model comprises a plurality of document element rules;
segmenting the document paragraph according to punctuation marks into a plurality of sentences to form a sentence sequence;
screening a document element rule corresponding to the document paragraph in the feature extraction model according to the document paragraph after paragraph division;
reading sentences one by one in sequence according to the sentence sequence, and performing feature matching on the read sentences by using the document element rule corresponding to the document paragraphs; and outputting the corresponding document element after matching a document element rule successfully, and matching the next sentence until all sentences in the sentence sequence are matched.
In an implementation manner of the embodiment of the present application, the feature extraction model includes: a TextCNN network, a TextRNN network, and a TextRCNN network;
the TextCNN network, the TextRNN network and the TextRCNN network are arranged in parallel, and the output ends of the three networks are connected;
the input information of the three networks is identical and is a symbolized document paragraph; the output information of each of the three networks is a label identification result, and the three label identification results are added and averaged to obtain the output of the feature extraction model.
A second aspect of the embodiments of the present application provides a feature extraction device for a legal document, including:
a pre-recognition unit, used for pre-identifying a legal document and determining a paragraph division model and a feature extraction model corresponding to the legal document, where the feature extraction model comprises the correspondence between document paragraphs and document elements;
a paragraph dividing unit for performing document paragraph division on the legal document through the paragraph dividing model;
and the feature extraction unit is used for extracting document elements corresponding to the document paragraphs from the legal documents divided by the document paragraphs through the feature extraction model and outputting the extraction results of the document elements.
In an implementation manner of the embodiment of the present application, the apparatus further includes: a pre-processing unit;
the preprocessing unit is used for preprocessing the legal documents, and the preprocessing comprises at least one of the following steps:
abnormal line break processing, Chinese monetary amount processing, conversion of Chinese numerals to Arabic numerals, unification of punctuation format, illegal character replacement, and correction of wrongly written characters.
In an implementation manner of the embodiment of the present application, the feature extraction unit is specifically configured to:
obtaining a document paragraph obtained after paragraph division is performed on the legal document, and taking the document paragraph as an input object of the feature extraction model, where the feature extraction model comprises a plurality of document element rules;
segmenting the document paragraph according to punctuation marks into a plurality of sentences to form a sentence sequence;
reading sentences one by one in sequence according to the sentence sequence, and performing feature matching on the read sentences according to the document element rule; and outputting the corresponding document element after matching a document element rule successfully, and matching the next sentence until all sentences in the sentence sequence are matched.
A third aspect of the embodiments of the present application provides an electronic apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the feature extraction method for a legal document provided by the first aspect of the embodiments of the present application.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for extracting features of a legal document provided in the first aspect of the embodiments of the present application.
In the above scheme, the legal document is pre-identified to determine the paragraph division model and the feature extraction model corresponding to the legal document; the legal document is then divided into document paragraphs by the paragraph division model; and finally document elements are extracted from the divided legal document by the feature extraction model. The embodiments of the present application exploit the strong correlation between document paragraphs and document elements (that is, certain document elements appear in specific document paragraphs with high probability). For complex document element content, paragraph division quickly narrows the search to the document paragraphs in which the element is most likely to appear, and the element is then extracted from those paragraphs in a targeted manner, which improves the feature extraction efficiency for legal documents.
Drawings
Fig. 1 is a schematic flow chart illustrating an implementation of a method for extracting features of a legal document according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a feature extraction device for legal documents according to an embodiment of the present application;
fig. 3 is a schematic diagram of a hardware structure of an electronic device according to another embodiment of the present disclosure.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
The embodiment of the present application provides a method for extracting features of a legal document. The method is applied to an electronic device, which may be a smartphone, a tablet computer, a computer, or another device on which application programs can be installed. The operating system of the electronic device may be iOS, Android, Windows, or another operating system, which is not limited herein.
Referring to fig. 1, the method for extracting features of legal documents mainly includes the following steps:
101. pre-identifying a legal document, and determining a paragraph division model and a feature extraction model corresponding to the legal document;
pre-identifying a legal document, and determining a paragraph division model and a feature extraction model corresponding to the legal document; the feature extraction model comprises the corresponding relation between the document paragraphs and the document elements.
Illustratively, the pre-recognition may be: identifying a document title of the legal document; determining the document type corresponding to the legal document according to the document title; and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
In practice, there are many types of legal documents, such as court case documents (civil, criminal, and administrative), arbitration documents, and judgment documents. Different types of legal documents have different general deconstruction features and individual deconstruction features. Therefore, before legal documents are deconstructed, the embodiment of the application can logically sort them into different types, determine the general deconstruction features corresponding to each type, and set up a corresponding model for deconstruction according to those general features.
In the embodiment of the present application, different legal documents may correspond to different paragraph division models and feature extraction models. Specifically, the embodiment of the present application uses machine learning to train, for each type of legal document, the manner of paragraph division and which document elements are extracted from which document paragraph. A given type of legal document has specific paragraph characteristics (for example, it may be divided into 5 paragraphs, each with characteristic content); these mapping relationships are established through machine learning and stored locally in the processing terminal.
Illustratively, when a legal document to be identified is received, the paragraph division model used to divide it into paragraphs and the feature extraction model (used for extracting document elements) corresponding to that paragraph division model are determined by recognizing a specific position in the legal document (such as the title position). The processing terminal stores two sets of mapping relationships locally; by looking up the recognized text of the document title in these mapping relationships, the paragraph division model and the feature extraction model corresponding to the legal document to be identified can be obtained.
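The title-based lookup described above can be sketched as two dictionary lookups. This is only an illustration under stated assumptions: the title keywords, document type names, and model names are hypothetical, and the first non-empty line is assumed to be the title position.

```python
from typing import NamedTuple, Optional


class ModelPair(NamedTuple):
    paragraph_model: str  # name of the paragraph division model
    feature_model: str    # name of the feature extraction model


# Hypothetical mapping 1: keyword found in the title -> document type.
TITLE_KEYWORD_TO_TYPE = {
    "civil judgment": "civil_judgment",
    "criminal judgment": "criminal_judgment",
    "arbitration award": "arbitration_award",
}

# Hypothetical mapping 2: document type -> the two models to use.
TYPE_TO_MODELS = {
    "civil_judgment": ModelPair("civil_paragraph_rules", "civil_element_rules"),
    "criminal_judgment": ModelPair("criminal_paragraph_rules", "criminal_element_rules"),
    "arbitration_award": ModelPair("arbitration_paragraph_crf", "arbitration_element_rules"),
}


def pre_identify(document: str) -> Optional[ModelPair]:
    """Return the model pair for a document, or None if the title is unknown."""
    # Assumption: the first non-empty line is the title position.
    title = next((ln.strip() for ln in document.splitlines() if ln.strip()), "")
    for keyword, doc_type in TITLE_KEYWORD_TO_TYPE.items():
        if keyword in title.lower():
            return TYPE_TO_MODELS[doc_type]
    return None
```

Returning `None` for an unrecognized title leaves it to the caller to decide whether to fall back to a default model or to reject the document.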
In practical applications, in order to improve the efficiency and accuracy of legal document processing, the legal document may be preprocessed before it is pre-identified, the preprocessing including at least one of: abnormal line break processing, Chinese monetary amount processing, conversion of Chinese numerals to Arabic numerals, unification of punctuation format, illegal character replacement, and correction of wrongly written characters.
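A minimal sketch of some of these preprocessing steps is given below. It assumes Chinese source text, full-width punctuation as the unified form, and numerals below one hundred million; the function names and the character lists are illustrative choices, not taken from the application. Wrongly written character correction is omitted because it would need a dictionary or a model.

```python
import re

# Assumed target format: full-width punctuation.
PUNCT_MAP = str.maketrans({",": "，", ":": "：", ";": "；", "?": "？", "!": "！"})
# Control characters (except tab/newline/carriage return) and zero-width marks.
ILLEGAL = re.compile(r"[\u200b\ufeff\x00-\x08\x0b\x0c\x0e-\x1f]")
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    """Convert a Chinese numeral below 10^8 (e.g. 一百九十六) to an int."""
    total = section = digit = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in UNITS:
            section += (digit or 1) * UNITS[ch]  # a bare 十 counts as 10
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10000
            section = digit = 0
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return total + section + digit


def join_broken_lines(text: str) -> str:
    """Merge a line into the previous one when that line was cut mid-sentence."""
    merged = []
    for line in (ln.strip() for ln in text.splitlines()):
        if not line:
            continue
        if merged and not merged[-1].endswith(("。", "：", "；", "？", "！")):
            merged[-1] += line
        else:
            merged.append(line)
    return "\n".join(merged)


def preprocess(text: str) -> str:
    text = ILLEGAL.sub("", text)      # illegal character replacement
    text = text.translate(PUNCT_MAP)  # punctuation format unification
    return join_broken_lines(text)    # abnormal line break processing
```

One limitation of this sketch is that a title line without closing punctuation is merged with the line that follows it, so a real pipeline would need extra rules for headings.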
102. Performing document paragraph segmentation on the legal document through the paragraph segmentation model;
The legal document, after preprocessing, is divided into document paragraphs through the paragraph division model. The present application is described using judgment documents (first-instance judgments) as an application example.
Illustratively, the paragraph categories to be divided include: title paragraph, litigation subject paragraph, trial process paragraph, litigation paragraph, dialectic paragraph, evidence paragraph, trial finding paragraph, court opinion paragraph, decision result paragraph, and court staff paragraph.
For example, the paragraph division model for the preset paragraph categories may in practice be a rule model, which the application classes as an unsupervised machine learning model. A CRF (conditional random field) model, a Markov probabilistic graphical model widely used for text sequence labeling, may also be used.
Regarding document paragraph division, in practical applications the target paragraph categories to be divided may be preset as N categories (N is an integer greater than one), corresponding to N paragraph extraction rules, with target paragraphs and paragraph extraction rules in one-to-one correspondence. For example, if a judgment document has 5 paragraph types, there are 5 corresponding paragraph extraction rules. An example of a paragraph extraction rule: the paragraph start feature may be "this court holds" (本院认为), and the paragraph end feature may be "the judgment is as follows" (判决如下) or the start feature of another paragraph extraction rule.
In practical applications, the paragraph features of a judgment document are distinct and can be enumerated exhaustively, so the rule extraction model is used preferentially. Paragraph extraction rules may be set in the paragraph division model, each comprising a paragraph start feature and a paragraph end feature. Taking the "this court holds" paragraph as an example, the paragraph start feature may be "this court holds", and the paragraph end feature may be "the judgment is as follows" or the start feature of another paragraph extraction rule.
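The start-feature and end-feature rules described above can be sketched as follows. The English marker strings are illustrative stand-ins for the phrases a real system would match in the original document, and the rule set and names (`ParagraphRule`, `split_paragraphs`) are hypothetical. A paragraph ends at its own end feature if it has one, otherwise at the start feature of the next located rule.

```python
from typing import Dict, List, NamedTuple, Optional


class ParagraphRule(NamedTuple):
    category: str
    start: str                 # paragraph start feature
    end: Optional[str] = None  # paragraph end feature, if the rule defines one


def split_paragraphs(text: str, rules: List[ParagraphRule]) -> Dict[str, str]:
    # Locate the first occurrence of every start feature, in document order.
    found = [(text.find(r.start), r) for r in rules if r.start in text]
    found.sort(key=lambda item: item[0])
    result: Dict[str, str] = {}
    for i, (pos, rule) in enumerate(found):
        # Default end: the start feature of the next rule, else end of text.
        stop = found[i + 1][0] if i + 1 < len(found) else len(text)
        if rule.end is not None:
            end_pos = text.find(rule.end, pos + len(rule.start))
            if end_pos != -1:
                stop = min(stop, end_pos + len(rule.end))
        result[rule.category] = text[pos:stop].strip()
    return result


# Hypothetical rules for two of the paragraph categories.
RULES = [
    ParagraphRule("trial finding", start="It was found at trial"),
    ParagraphRule("court opinion", start="This court holds",
                  end="The judgment is as follows:"),
]
```

Because each rule is located independently, a missing paragraph simply produces no entry in the result instead of shifting the boundaries of the others.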
103. Extracting, through the feature extraction model, the document elements corresponding to the document paragraphs from the legal document after paragraph division, and outputting the extraction results of the document elements.
After the processing of step 102, N paragraphs divided according to the paragraph extraction rule are obtained, and M types of document elements are extracted according to the N paragraphs, where M is an integer greater than N.
Specifically, the document elements are logical elements related to cases in the legal document. Illustratively, M may be 7, and the document elements specifically include: parties, appeal items, dispute items, evidence items, fact items, dispute focus, court opinions, decision items.
Illustratively, the document elements of the parties (citizens, companies, legal entity information, etc.) can be extracted from the litigation subject paragraph; the document elements of the appeal items can be extracted from the litigation paragraph; the document elements of the dispute items can be extracted from the dialectic paragraph; the document elements of the evidence items can be extracted from the evidence paragraph; the document elements of the fact items can be extracted from the trial finding paragraph; the document elements of the dispute focus and the court opinions can be extracted from the court opinion ("this court holds") paragraph; and the document elements of the decision items can be extracted from the decision result paragraph.
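The correspondence between document paragraphs and document elements can be pictured as a simple lookup table. The sketch below uses the paragraph categories and element names listed earlier in this embodiment; the dictionary layout and the helper name `elements_for` are assumptions made for illustration.

```python
from typing import Dict, List

# Paragraph category -> document elements expected in that paragraph.
PARAGRAPH_TO_ELEMENTS: Dict[str, List[str]] = {
    "litigation subject": ["parties"],
    "litigation": ["appeal items"],
    "dialectic": ["dispute items"],
    "evidence": ["evidence items"],
    "trial finding": ["fact items"],
    "court opinion": ["dispute focus", "court opinions"],
    "decision result": ["decision items"],
}


def elements_for(paragraph_category: str) -> List[str]:
    """Return the elements to look for; categories with no elements yield []."""
    return PARAGRAPH_TO_ELEMENTS.get(paragraph_category, [])
```

Restricting each element to the paragraphs listed for it is what lets the extraction step skip the rest of the document.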
Document element extraction in the embodiment of the present application is mainly implemented by a rule-extraction machine learning model (see the second embodiment below). Rule extraction is suited to scenarios where the text features are distinct and can be enumerated exhaustively, such as an explicit paragraph-opening phrase (for example, "this court holds"). For text features that are not distinct and cannot be enumerated, semantic recognition is required, which can be implemented with a supervised neural network model (see the third embodiment below).
In the above scheme, the legal document is pre-identified to determine the paragraph division model and the feature extraction model corresponding to the legal document; the legal document is then divided into document paragraphs by the paragraph division model; and finally document elements are extracted from the divided legal document by the feature extraction model. The embodiments of the present application exploit the strong correlation between document paragraphs and document elements (that is, certain document elements appear in specific document paragraphs with high probability). For complex document element content, paragraph division quickly narrows the search to the document paragraphs in which the element is most likely to appear, and the element is then extracted from those paragraphs in a targeted manner, which improves the feature extraction efficiency for legal documents.
Example two
The embodiment of the present application mainly describes a scheme for implementing document element extraction through a rule-extracted machine learning model, where each document element corresponds to a document element rule, and the document element rule is used to read a feature corresponding to the document element in a sentence.
Step 1, obtaining a document paragraph obtained after paragraph division is performed on the legal document, and taking the document paragraph as an input object of the feature extraction model, where the feature extraction model comprises a plurality of document element rules;
Step 2, segmenting the document paragraph according to punctuation marks into a plurality of sentences to form a sentence sequence;
Step 3, screening, according to the document paragraph obtained after paragraph division, the document element rules corresponding to that document paragraph in the feature extraction model;
Step 4, reading the sentences one by one in the order of the sentence sequence, and performing feature matching on each read sentence using the document element rules corresponding to the document paragraph; when a document element rule is matched successfully, outputting the corresponding document element and moving on to the next sentence, until all sentences in the sentence sequence have been matched.
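Steps 1 to 4 can be sketched with regular expressions standing in for document element rules. The rule table, element names, and English patterns below are hypothetical; a real rule set would be written against the original document language. Note that this simple splitter would also break on decimal points, so it is only a sketch.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical document element rules, grouped by paragraph category.
ELEMENT_RULES: Dict[str, List[Tuple[str, "re.Pattern"]]] = {
    "court opinion": [
        ("citation of laws", re.compile(r"judgment is as follows")),
        ("court view", re.compile(r"\brejects\b")),
    ],
}


def split_sentences(paragraph: str) -> List[str]:
    """Step 2: cut the paragraph at sentence-ending punctuation."""
    parts = re.findall(r"[^.?!。？！]+[.?!。？！]?", paragraph)
    return [p.strip() for p in parts if p.strip()]


def extract_elements(category: str, paragraph: str) -> List[Tuple[str, str]]:
    rules = ELEMENT_RULES.get(category, [])       # step 3: screen the rules
    results = []
    for sentence in split_sentences(paragraph):   # step 4: read one by one
        for element, pattern in rules:
            if pattern.search(sentence):
                results.append((element, sentence))
                break  # a rule matched; continue with the next sentence
    return results
```

Stopping at the first matching rule mirrors the description above, in which a successful match outputs the element and moves on to the next sentence.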
Illustratively, to extract a document element such as "court opinions", the "this court holds" paragraph is split into sentences at the punctuation marks "。", "？", and "！", and the sentences are read one by one against the document element rules for "court opinions". This locates the sentence containing "the judgment is as follows" (one of the "court opinions" document element rules), for example: "After discussion and decision by the judicial committee, in accordance with Article 196 of the Contract Law of the People's Republic of China, Article 6 of the Several Opinions of the Supreme People's Court on the Trial of Loan Cases by People's Courts, and Articles 142, 144, and 152 of the Civil Procedure Law of the People's Republic of China, the judgment is as follows:". In this way the sentence citing laws and regulations in the court opinions can be extracted. Further, the individual provisions can be extracted from that sentence through a combined feature, yielding "Article 196 of the Contract Law of the People's Republic of China", "Article 6 of the Several Opinions of the Supreme People's Court on the Trial of Loan Cases by People's Courts", and "Articles 142, 144, and 152 of the Civil Procedure Law of the People's Republic of China".
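As a rough illustration of extracting individual provisions from such a sentence, the sketch below assumes that, in the original Chinese text, a statute title is enclosed in book-title marks (《》) and each article is written as 第…条. This pairing is an assumption made for the example, and the Chinese sample sentence in the usage note is a reconstruction, not a quotation from the application.

```python
import re
from typing import List, Tuple

# A statute title in book-title marks followed by one or more article numbers.
STATUTE = re.compile(r"《([^》]+)》((?:第[^条、，]+条、?)+)")
ARTICLE = re.compile(r"第([^条、，]+)条")


def extract_citations(sentence: str) -> List[Tuple[str, str]]:
    """Return (statute title, article number) pairs found in the sentence."""
    citations = []
    for title, articles in STATUTE.findall(sentence):
        for number in ARTICLE.findall(articles):
            citations.append((title, number))
    return citations
```

For a sentence such as "依照《中华人民共和国合同法》第一百九十六条、《中华人民共和国民事诉讼法》第一百四十二条、第一百四十四条之规定，判决如下：", the function returns one pair for the Contract Law and two pairs for the Civil Procedure Law.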
Illustratively, consider extraction of the party categories: plaintiff, defendant, plaintiff's agent, defendant's agent, legal representative, and so on. Take the following party paragraph:
"Plaintiff: Tang Xiaolin, female, Han ethnicity.
Authorized agent: Zhong Chunfang, lawyer at a Hubei law firm.
Defendant: Wuhan Lucky Real Estate Development and Construction Co., Ltd., domiciled at Von Families, Hanyang District, Wuhan, Hubei Province.
Legal representative: Liu Zeng, chairman of the company.
Authorized agent: Yun Heng, lawyer at Beijing Yingke (Wuhan) Law Firm.
Defendant: Feng Wangjie, male, Han ethnicity.
Defendant: Zhang Youling, male, Han ethnicity.
Defendant: Liu Zengkai, male, Han ethnicity.
Authorized agent: Yun Heng, lawyer at Beijing Yingke (Wuhan) Law Firm."
The entity category corresponding to "Tang Xiaolin" is "plaintiff". The extraction rule is "paragraph statement segmentation + locating the entity start position in the statement + searching the statement span (first position, entity start position)"; finally, the punctuation in between is removed to confirm that the category is "plaintiff".
If the processing targets are:
"Plaintiff: Ms. Zhou;
Defendant: Mr. Lin"
then "paragraph statement segmentation" refers to identifying the paragraph segmentation features: the introductory descriptions of the two parties above are in separate segments, so the segmentation feature (for example, a separator) must be identified. The "entity start position" is the first position occupied by a character, that is, a punctuation mark is not used as the start position. "Removing the middle punctuation" means, for example, that the symbol ":" is removed from the identified feature so that only the remaining characters are kept.
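The party extraction rule can be sketched as below. It assumes that each statement sits on its own line, that a colon separates the category from the entity, and that the entity name ends at the first comma, semicolon, or period; these are simplifications chosen for the example, and the function name is hypothetical.

```python
import re
from typing import List, Tuple


def extract_parties(paragraph: str) -> List[Tuple[str, str]]:
    """Return (entity, category) pairs, one per statement that has a separator."""
    parties = []
    for statement in paragraph.splitlines():      # paragraph statement segmentation
        statement = statement.strip()
        match = re.search(r"[:：]\s*", statement)   # separator before the entity
        if not match:
            continue
        entity_start = match.end()                # entity start position
        # Span (first position, entity start position) with punctuation removed.
        category = re.sub(r"[^\w\s]", "", statement[:entity_start]).strip()
        entity = re.split(r"[,，;；.。]", statement[entity_start:], maxsplit=1)[0].strip()
        parties.append((entity, category))
    return parties
```

Names that themselves contain a period (for example an abbreviated title) would be cut short by this sketch, which is one reason a production rule set would need more careful entity boundaries.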
Illustratively, for the court view categories (support plaintiff, support defendant, reject plaintiff, reject defendant), extraction is mainly performed by rules. Take the sentence: "Because Tang Xiaolin did not submit evidence to this court proving that the interest of 488,000 yuan consists of interest generated respectively on the loan principal of 30 million yuan and 12.4 million yuan, Tang Xiaolin's litigation request that Liu Huakai bear responsibility for all interest generated on 42.4 million yuan has no factual basis, and this court rejects it according to law." First, the entity types appearing in the sentence are identified: "Tang Xiaolin | plaintiff" and "Liu Huakai | defendant". Then, in the key clause "Tang Xiaolin's litigation request that Liu Huakai bear responsibility for all interest generated on 42.4 million yuan has no factual basis, and this court rejects it according to law", the rule feature "Tang Xiaolin" + "litigation request" + "reject" appears, so the sentence is classified as "reject plaintiff".
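The court view rule (party name + "litigation request" + "reject") can be sketched as follows. The sketch assumes that the requesting party is the first known party named before the phrase "litigation request" and that the English keywords stand in for the original-language features; both are simplifications for illustration.

```python
from typing import Dict, Optional


def classify_court_view(sentence: str, roles: Dict[str, str]) -> Optional[str]:
    """Classify a sentence as e.g. 'reject plaintiff'; None if no rule applies.

    roles maps an already identified entity name to its category,
    for example {"Tang Xiaolin": "plaintiff"}.
    """
    request_pos = sentence.find("litigation request")
    if request_pos == -1:
        return None
    if "reject" in sentence:
        action = "reject"
    elif "support" in sentence:
        action = "support"
    else:
        return None
    # Assumption: the requesting party is the first known party that is
    # named before the phrase "litigation request".
    named = [(sentence.find(name), role) for name, role in roles.items()
             if 0 <= sentence.find(name) < request_pos]
    if not named:
        return None
    _, role = min(named)
    return f"{action} {role}"
```

The entity categories passed in as `roles` are expected to come from the party extraction step, so the two rules work together.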
Example three
The embodiment of the present application mainly describes a scheme for realizing document element extraction by using a supervised neural network model, and specifically includes:
firstly, training a neural network model;
The neural network model of the embodiment of the application integrates three single models. Model 1: based on the TextCNN network, with batch normalization added and two fully connected layers used for classification. Model 2: based on the TextRNN network, using a bidirectional long short-term memory (LSTM) network and applying K-max pooling to the hidden vectors before classification. Model 3: the TextRCNN network. The three networks are fused, that is, their outputs are added, and the model is trained using corresponding legal document samples (labeled with document elements).
The TextCNN network, the TextRNN network and the TextRCNN network are arranged in parallel, and the output ends of the three networks are connected;
the input information of the three networks is identical and is a symbolized document paragraph. For example, in practical applications, before text data is input into the neural network it undergoes sentence breaking, word segmentation, and part-of-speech analysis, and each word is then represented by a numeric symbol, such as "1" for "you", "2" for "at", and "3" for "where". The text "where are you" (word by word in the original: "you", "at", "where") is then symbolized as "123".
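The symbolization step can be sketched as building a vocabulary that assigns each word an integer. The sketch assumes word segmentation has already been done and reserves 0 for unknown words, which is a common convention and not something stated in the application; the function names are hypothetical.

```python
from typing import Dict, List


def build_vocab(tokenized_texts: List[List[str]]) -> Dict[str, int]:
    """Assign ids 1, 2, 3, ... to words in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # 0 is kept for unknown words
    return vocab


def symbolize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace every word by its id; unseen words become 0."""
    return [vocab.get(token, 0) for token in tokens]
```

With the three-word example above, `build_vocab([["you", "at", "where"]])` assigns the ids 1, 2, and 3, and `symbolize` turns the same word list into `[1, 2, 3]`.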
The output of each of the three networks is a label identification result, and the three label identification results are added and averaged to obtain the output of the feature extraction model. For example, assuming the feature extraction model is preset with 6 classes of document elements to be labeled, each label identification result may be a sequence of 6 numbers corresponding respectively to the 6 classes, where each position represents the probability that the currently identified content is a particular document element. Adding the three label identification results means adding the probabilities at corresponding positions; the sum at each position is then averaged.
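The add-and-average fusion can be illustrated with plain Python lists. This is a sketch under stated assumptions: the function name `average_predictions` and the probability values are made up for illustration, and the three lists merely stand in for the outputs of the TextCNN, TextRNN and TextRCNN networks.

```python
from typing import List


def average_predictions(outputs: List[List[float]]) -> List[float]:
    """Add the per-class probabilities of all models position by position,
    then divide each sum by the number of models."""
    if not outputs:
        raise ValueError("at least one model output is required")
    n_classes = len(outputs[0])
    if any(len(o) != n_classes for o in outputs):
        raise ValueError("all model outputs must have the same length")
    n_models = len(outputs)
    return [sum(o[i] for o in outputs) / n_models for i in range(n_classes)]


# Hypothetical 6-class outputs of the three networks for one paragraph.
textcnn_out = [0.9, 0.1, 0.0, 0.0, 0.0, 0.0]
textrnn_out = [0.6, 0.4, 0.0, 0.0, 0.0, 0.0]
textrcnn_out = [0.3, 0.7, 0.0, 0.0, 0.0, 0.0]

fused = average_predictions([textcnn_out, textrnn_out, textrcnn_out])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

Taking the position with the highest averaged probability as the predicted document element is one natural reading; the original text does not state how the final label is chosen.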
Taking the appeal-item document element as an example: first, each sample in the legal document samples is divided into document paragraphs; then the document paragraphs that may contain appeal items are labeled; and the labeled document paragraphs are fed into the neural network model for training.
Second, extracting document elements.
A document paragraph related to the appeal items is acquired and input into the neural network model to extract document elements.
For example, for the category "request repayment of principal", an appeal item of this category in the small set of annotated cases is expressed as "Requests: 1. that the mascot company be ordered to immediately repay Tang Xiaolin the loan principal of 42.4 million yuan and interest of 9.652 million yuan accrued as of the date of filing the suit, and to pay the interest from the date of filing the suit to the date on which the debt is cleared;". 10% of the cases are annotated, the category features in the expressions of the relevant appeal items are learned through a classification algorithm, and the trained feature model is then used to assign category labels to the appeal items of the remaining unannotated cases.
EXAMPLE IV
Referring to fig. 2, an embodiment of the present application provides a feature extraction device for legal documents. The device can be used to implement the feature extraction method for legal documents provided by the embodiment shown in fig. 1. As shown in fig. 2, the feature extraction device for legal documents mainly includes:
the pre-recognition unit 201 is configured to pre-recognize a legal document, and determine a paragraph segmentation model and a feature extraction model corresponding to the legal document; wherein, the feature extraction model comprises the corresponding relation between the document paragraphs and the document elements;
a paragraph dividing unit 202, configured to perform document paragraph division on the legal document through the paragraph dividing model;
a feature extraction unit 203, configured to extract, by using the feature extraction model, a document element corresponding to a document paragraph from a legal document into which the document paragraph is divided, and output an extraction result of the document element.
In an implementation manner of the embodiment of the present application, the apparatus further includes: a preprocessing unit 204;
the preprocessing unit 204 is configured to preprocess the legal document, where the preprocessing includes at least one of:
abnormal line feed processing, Chinese amount processing, Chinese number to Arabic number conversion, punctuation format unification, illegal character replacement and wrongly written or mispronounced character processing.
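Two of the listed preprocessing steps, Chinese-number-to-Arabic-number conversion and punctuation format unification, can be sketched as below. This is a hedged illustration only: the function names, the supported numeral range (values below 100,000,000 written with 十, 百, 千 and 万) and the choice to map full-width punctuation to half-width are all assumptions; the original text does not specify how these steps are carried out.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    """Convert a simple Chinese numeral (below 10^8) to an int."""
    total = 0    # value of completed 万 (ten-thousand) sections
    section = 0  # value accumulated inside the current section
    digit = 0    # most recent digit not yet multiplied by a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10000
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


# Assumed direction: full-width punctuation is mapped to half-width.
PUNCT_MAP = str.maketrans({"，": ",", "；": ";", "：": ":", "（": "(", "）": ")"})


def unify_punctuation(text: str) -> str:
    """Map a few full-width punctuation marks to one consistent form."""
    return text.translate(PUNCT_MAP)


print(chinese_to_int("四千二百四十万"))
print(unify_punctuation("a：b，c；"))
```

Normalizing amounts and punctuation before pre-recognition means the later paragraph division and rule matching only have to handle one surface form.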
In an implementation manner of the embodiment of the present application, the pre-recognition unit 201 is specifically configured to:
identifying a document title of the legal document;
determining the document type corresponding to the legal document according to the document title;
and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
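The pre-recognition steps above amount to a lookup from document title to a pair of models, which can be sketched as follows. Everything concrete here is hypothetical: the `ModelPair` type, the registry contents, the model names and the assumption that the title is the first non-empty line of the document are illustrative choices, not details given in the original text.

```python
from typing import Dict, NamedTuple, Optional


class ModelPair(NamedTuple):
    paragraph_model: str  # name of the paragraph division model
    feature_model: str    # name of the matching feature extraction model


# Hypothetical registry: document type -> models configured for that type.
MODEL_REGISTRY: Dict[str, ModelPair] = {
    "civil judgment": ModelPair("civil_judgment_paragraphs", "civil_judgment_features"),
    "civil ruling": ModelPair("civil_ruling_paragraphs", "civil_ruling_features"),
}


def identify_title(document: str) -> str:
    """Assumption: the title is the first non-empty line of the document."""
    for line in document.splitlines():
        if line.strip():
            return line.strip()
    return ""


def pre_recognize(document: str) -> Optional[ModelPair]:
    """Determine the document type from the title and return its models."""
    title = identify_title(document).lower()
    for doc_type, models in MODEL_REGISTRY.items():
        if doc_type in title:
            return models
    return None  # unknown document type


doc = "\nCivil Judgment of First Instance\nPlaintiff: ..."
print(pre_recognize(doc))
```

Keeping the two models paired per document type reflects the statement that the feature extraction model corresponds to the paragraph division model.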
In an implementation manner of the embodiment of the present application, the feature extraction unit 203 is specifically configured to:
obtaining a document paragraph obtained after paragraph division is performed on the legal document, and taking the document paragraph as an input object of the feature extraction model, wherein the feature extraction model comprises a plurality of document element rules;
segmenting the document paragraph according to punctuation marks into a plurality of sentences to form a sentence sequence;
screening a document element rule corresponding to the document paragraph in the feature extraction model according to the document paragraph after paragraph division;
reading sentences one by one in sequence according to the sentence sequence, and performing feature matching on the read sentences by using the document element rule corresponding to the document paragraphs; and outputting the corresponding document element after matching a document element rule successfully, and matching the next sentence until all sentences in the sentence sequence are matched.
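The rule-based extraction procedure above can be sketched as follows. This is a minimal sketch under assumptions: the rules are modeled as regular expressions, the rule set, paragraph type name and labels are invented for illustration, and sentences are split on a small set of punctuation marks (the period is deliberately excluded so that decimal amounts are not split).

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical rule set: paragraph type -> list of (document element, pattern).
RULES: Dict[str, List[Tuple[str, Pattern[str]]]] = {
    "claims": [
        ("request repayment of principal", re.compile(r"repay.*principal")),
        ("request payment of interest", re.compile(r"pay.*interest")),
    ],
}


def split_sentences(paragraph: str) -> List[str]:
    """Cut the paragraph at punctuation marks to form the sentence sequence."""
    parts = re.split(r"[。；！？;!?]", paragraph)
    return [p.strip() for p in parts if p.strip()]


def extract_elements(paragraph_type: str, paragraph: str) -> List[Tuple[str, str]]:
    """Match each sentence, in order, against the rules for this paragraph type."""
    rules = RULES.get(paragraph_type, [])  # screen rules by paragraph type
    results: List[Tuple[str, str]] = []
    for sentence in split_sentences(paragraph):
        for label, pattern in rules:
            if pattern.search(sentence):
                results.append((label, sentence))
                break  # one rule matched: output it and move to the next sentence
    return results


paragraph = ("order the defendant to repay the loan principal; "
             "order the defendant to pay interest until the debt is cleared; "
             "bear the litigation costs")
for label, sentence in extract_elements("claims", paragraph):
    print(label, "<-", sentence)
```

Screening the rules by paragraph type first keeps the per-sentence matching loop small, since only rules relevant to that paragraph are tried.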
It should be noted that, in the embodiment of the electronic device illustrated in fig. 2, the division of the functional modules is only an example, and in practical applications, the above functions may be distributed by different functional modules according to needs, for example, configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the electronic device is divided into different functional modules to complete all or part of the functions described above. In practical applications, the corresponding functional modules in this embodiment may be implemented by corresponding hardware, or may be implemented by corresponding hardware executing corresponding software. The above description principles can be applied to various embodiments provided in the present specification, and are not described in detail below.
For a specific process of each function module in the electronic device provided in this embodiment to implement each function, please refer to the specific content described in the embodiment shown in fig. 1, which is not described herein again.
EXAMPLE V
An embodiment of the present application provides an electronic device, please refer to fig. 3, which includes:
a memory 301, a processor 302 and a computer program stored in the memory 301 and capable of running on the processor 302, wherein the processor 302 executes the computer program to implement the method for extracting the features of the legal document described in the embodiment shown in fig. 1.
Further, the electronic device further includes:
at least one input device 303 and at least one output device 304.
The memory 301, the processor 302, the input device 303, and the output device 304 are connected via a bus 305.
The input device 303 may be a camera, a touch panel, a physical button, a mouse, or the like. The output device 304 may specifically be a display screen.
The memory 301 may be a random access memory (RAM) or a non-volatile memory, such as a magnetic disk memory. The memory 301 is used to store a set of executable program code, and the processor 302 is coupled to the memory 301.
Further, an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium may be provided in an electronic device in the foregoing embodiments, and the computer-readable storage medium may be the memory in the foregoing embodiment shown in fig. 3. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method of feature extraction for legal documents described in the embodiment shown in fig. 1 above. Further, the computer-readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is merely a logical division, and other divisions are possible in actual implementation; for instance, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be electrical, mechanical or in another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a readable storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned readable storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The foregoing describes the method for extracting features of legal documents, the electronic device and the computer-readable storage medium provided in this application. Those skilled in the art may, in accordance with the teachings of the present application, make changes to the specific embodiments and to the scope of application.

Claims (10)

1. A method for extracting features of legal documents is characterized by comprising the following steps:
pre-identifying a legal document, and determining a paragraph division model and a feature extraction model corresponding to the legal document; wherein, the feature extraction model comprises the corresponding relation between the document paragraphs and the document elements;
performing document paragraph segmentation on the legal document through the paragraph segmentation model;
and extracting document elements corresponding to the document paragraphs from the legal documents divided from the document paragraphs through the feature extraction model, and outputting the extraction results of the document elements.
2. The method of extracting features of legal documents according to claim 1,
before the pre-recognition of the legal document, the method further comprises the following steps:
pre-processing the legal instrument, the pre-processing comprising at least one of:
abnormal line feed processing, Chinese amount processing, Chinese number to Arabic number conversion, punctuation format unification, illegal character replacement and wrongly written or mispronounced character processing.
3. The method of extracting features of legal documents according to claim 1,
the pre-recognition of the legal document and the determination of the paragraph segmentation model and the feature extraction model corresponding to the legal document comprise:
identifying a document title of the legal document;
determining the document type corresponding to the legal document according to the document title;
and determining a paragraph division model corresponding to the document type and a feature extraction model corresponding to the paragraph division model.
4. The method of extracting features of legal documents according to claim 1,
the method for extracting the document elements from the legal document after document paragraph division through the feature extraction model comprises the following steps:
obtaining a document paragraph obtained after paragraph division is performed on the legal document, and taking the document paragraph as an input object of the feature extraction model, wherein the feature extraction model comprises a plurality of document element rules;
segmenting the document paragraph according to punctuation marks into a plurality of sentences to form a sentence sequence;
screening a document element rule corresponding to the document paragraph in the feature extraction model according to the document paragraph after paragraph division;
reading sentences one by one in sequence according to the sentence sequence, and performing feature matching on the read sentences by using the document element rule corresponding to the document paragraphs; and outputting the corresponding document element after matching a document element rule successfully, and matching the next sentence until all sentences in the sentence sequence are matched.
5. The method of extracting features of legal documents according to claim 1,
the feature extraction model includes: a TextCNN network, a TextRNN network, and a TextRCNN network;
the TextCNN network, the TextRNN network and the TextRCNN network are arranged in parallel, and the output ends of the three networks are connected;
the input information of the three networks is consistent and is a symbolic document paragraph; and the output information of the three networks is a label identification result, and the three label identification results are added and averaged to obtain the output of the feature extraction model.
6. A feature extraction device of a legal document, comprising:
a pre-recognition unit, configured to pre-recognize a legal document and determine a paragraph division model and a feature extraction model corresponding to the legal document; wherein the feature extraction model comprises the corresponding relation between the document paragraphs and the document elements;
a paragraph dividing unit for performing document paragraph division on the legal document through the paragraph dividing model;
and the feature extraction unit is used for extracting document elements corresponding to the document paragraphs from the legal documents divided by the document paragraphs through the feature extraction model and outputting the extraction results of the document elements.
7. The legal document feature extraction apparatus of claim 6,
the device further comprises: a pre-processing unit;
the preprocessing unit is used for preprocessing the legal documents, and the preprocessing comprises at least one of the following steps:
abnormal line feed processing, Chinese amount processing, Chinese number to Arabic number conversion, punctuation format unification, illegal character replacement and wrongly written or mispronounced character processing.
8. The legal document feature extraction apparatus of claim 6,
the feature extraction unit is specifically configured to:
obtaining a document paragraph obtained after paragraph division is performed on the legal document, and taking the document paragraph as an input object of the feature extraction model, wherein the feature extraction model comprises a plurality of document element rules;
segmenting the document paragraph according to punctuation marks into a plurality of sentences to form a sentence sequence;
screening a document element rule corresponding to the document paragraph in the feature extraction model according to the document paragraph after paragraph division;
reading sentences one by one in sequence according to the sentence sequence, and performing feature matching on the read sentences by using the document element rule corresponding to the document paragraphs; and outputting the corresponding document element after matching a document element rule successfully, and matching the next sentence until all sentences in the sentence sequence are matched.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 5.
CN201910936787.5A 2019-09-29 2019-09-29 Feature extraction method, related device and storage medium for legal document Active CN110765889B (en)


Publications (2)

Publication Number Publication Date
CN110765889A true CN110765889A (en) 2020-02-07
CN110765889B CN110765889B (en) 2024-06-25

Family

ID=69329135


Country Status (1)

Country Link
CN (1) CN110765889B (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant