CN118504559A - Intelligent extraction method and system for legal and legal annotation files - Google Patents

Info

Publication number
CN118504559A
Authority
CN
China
Prior art keywords
legal
text
matching
feature
annotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410484659.2A
Other languages
Chinese (zh)
Inventor
刘沐元
潘晓岚
李凤娇
朴文玉
李凯
赵晓海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Beida Yinghua Technology Co ltd
Original Assignee
Beijing Beida Yinghua Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Beida Yinghua Technology Co ltd

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Input (AREA)

Abstract

The invention discloses an intelligent extraction method and system for law and regulation annotation files. The method comprises: collecting a text containing a law and regulation annotation basis as the original input text, and preprocessing the data of that text; constructing features based on feature engineering, forming at least title features, table text features, non-table text features and symbolic features; and extracting key information from the original annotation-basis text using the constructed features, then automatically identifying key entity information in the law and regulation text through text scanning, splitting, feature comparison, regular-expression matching and the like according to the extracted key information. Compared with manual screening and sorting, the method and system reduce subjective influence and improve the efficiency and accuracy of information extraction.

Description

Intelligent extraction method and system for legal and legal annotation files
Technical Field
The invention belongs to the technical field of computer-related artificial intelligence and natural language processing, and particularly relates to a generative intelligent agent technology oriented to cross-domain legislative topic perception.
Background
Before automatic extraction techniques became widespread, manual screening and sorting was the primary means of extracting law and regulation annotation files. Faced with complex legal text data, manual screening and sorting takes a great deal of time and effort and may be affected by the subjective preferences of the people doing the screening. Its cost and difficulty increase significantly as the volume of laws and regulations grows. With an automatic text collection and parsing system, large amounts of text can be processed automatically in batches and labor costs are reduced. An information extraction technology that can quickly and accurately extract key information from numerous laws and regulations is therefore urgently needed.
Disclosure of Invention
The invention aims to provide an intelligent extraction method and system for law and regulation annotation files, which automatically capture the text information of law and regulation annotation files and realize intelligent information extraction free of subjective factors.
The invention is realized by the following technical scheme:
An intelligent extraction method for law and regulation annotation files comprises the following steps:
Step 1, collecting a text containing a law and regulation annotation basis as the original law and regulation annotation-basis input text, and performing data preprocessing on that text to form clean, structured data;
Step 2, realizing feature construction based on feature engineering, wherein the constructed features include but are not limited to title features, table text features, non-table text features and symbolic features;
Step 3, extracting key information from the original law and regulation annotation-basis text using the features constructed in step 2, and automatically identifying key entity information in the law and regulation text according to the extracted key information through text scanning, splitting, feature comparison and regular-expression matching.
An intelligent extraction system for law and regulation annotation files comprises a preprocessing module, a feature construction module and an extraction module which are sequentially connected; wherein:
The preprocessing module is used for collecting a text containing a law and regulation annotation basis as the original law and regulation annotation-basis input text, and performing data preprocessing on that text to form clean, structured data;
The feature construction module is used for realizing feature construction based on feature engineering, forming at least the title features, the table text features, the non-table text features and the symbolic features;
The extraction module is used for extracting key information from the original law and regulation annotation-basis text using the constructed features, and automatically identifying key entity information in the law and regulation text through text scanning, splitting, feature comparison, regular-expression matching and the like according to the extracted key information.
Compared with the traditional manual method, the invention can achieve the following beneficial technical effects:
1) Intelligent automatic extraction over large volumes of law and regulation texts is realized, and the subjective influence caused by manual intervention is reduced;
2) Key information of a legal document can be captured automatically from the text, so that manual errors are reduced and accuracy is improved;
3) The text is processed automatically by computer, so that the speed of information extraction and the processing efficiency are greatly improved, and the approach adapts better to large-scale text processing.
Drawings
FIG. 1 is a flow chart of an intelligent extraction method for legal and legal annotation files;
FIG. 2 is a block diagram of an intelligent extraction system for legal and legal annotation files according to the present invention;
FIG. 3 is a block diagram of a model of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention.
Description of related terms:
The law and regulation annotation basis is the basis for interpreting and explaining the provisions of a law or regulation, together with the issuing department, issuing time and timeliness of that law or regulation; a law and regulation annotation is the interpretation of specific terms and words in those provisions.
Timeliness is the current validity status of a law or regulation, such as currently valid, modified, invalid, or partially invalid.
As shown in FIG. 1, the overall flow of the present invention includes the following steps:
Step 1, collecting a text containing a law and regulation annotation basis as the original law and regulation annotation-basis input text, and performing data preprocessing on that text to form clean, structured data, wherein the preprocessing operations include:
(1.1) removing irrelevant symbols such as &nbsp;, &quot; and @ from the input original law and regulation annotation-basis text;
(1.2) removing irrelevant tags such as </span>, </script> and <style> from the input text;
(1.3) marking key nodes of the input text, such as clause breaks and paragraph breaks, with tags such as <table>, </p> and </br>; then either extracting text according to the tag marks using a Pattern expression, or, after splitting the text by the sentence-breaking and paragraph punctuation of conventional text, judging whether each resulting piece matches annotated-file information such as a title or a document number;
(1.4) adding non-printing characters such as \r and \n at key nodes of the input text;
(1.5) converting full-width symbols in the input text, such as the full-width comma, parentheses and square brackets, to their half-width forms;
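The preprocessing operations (1.1), (1.2) and (1.5) can be illustrated with a minimal Java sketch. The class name `AnnotationTextPreprocessor`, the exact symbol and tag lists, and the decision to delete entities rather than replace them with spaces are assumptions made for illustration and are not taken from the patent. The full-width conversion relies on the Unicode layout in which U+FF01 to U+FF5E mirror ASCII 0x21 to 0x7E at a fixed offset of 0xFEE0.

```java
import java.util.regex.Pattern;

final class AnnotationTextPreprocessor {
    // Step (1.1): irrelevant symbols; this particular list is an assumption.
    private static final Pattern IRRELEVANT_SYMBOLS = Pattern.compile("&nbsp;|&quot;|@");
    // Step (1.2): script/style elements together with their content, then span tags.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    private static final Pattern SPAN_TAG = Pattern.compile("(?i)</?span\\b[^>]*>");

    // Step (1.5): map full-width forms to their half-width (ASCII) counterparts.
    static String toHalfWidth(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u3000') {
                out.append(' ');                    // ideographic space
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                out.append((char) (c - 0xFEE0));    // fixed offset to ASCII
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String preprocess(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        s = SPAN_TAG.matcher(s).replaceAll("");
        s = IRRELEVANT_SYMBOLS.matcher(s).replaceAll("");
        return toHalfWidth(s);
    }

    public static void main(String[] args) {
        // \uFF08 and \uFF09 are the full-width parentheses.
        System.out.println(preprocess("<p><span>Notice&nbsp;\uFF082024\uFF09</span></p>"));
    }
}
```

Structural tags such as <p> and <table> are deliberately left in place here, since step (1.3) and the later table features depend on them.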
Step 2, realizing feature construction based on feature engineering: the original law and regulation annotation-basis text is compared against a standard format, that is, the legality of its HTML content is verified using Pattern expressions, so as to separate text that the automatic parsing logic can process from text that it cannot. Text that can be parsed automatically is text that matches existing features, for example a table whose header information (such as "XXX file name" and "XXX document number") carries a timeliness mark (such as the valid and invalid keywords in the expressions); text that cannot be parsed automatically is text that matches nothing in the feature library. For example:
(2.1) comparing the input original law and regulation annotation-basis text according to title features, and screening out the features of text that the automatic parsing logic cannot process; for example, annotation-basis title features include leading blank characters and timeliness words such as invalid, modified, to be modified and applicable, wherein the leading blank characters mark the title, and the timeliness words indicate the timeliness of the annotated file;
The resulting title features are expressed as feature expressions Pattern1 and Pattern2, as follows (keywords shown in English translation):
Pattern1 = Pattern.compile("cancel|failure|revocation|stopping execution|active|continue active|reserved")
Pattern2 = Pattern.compile("part (valid|failure|content revocation|content failure|clause revocation)")
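As a hedged illustration of how two alternation patterns of this kind might be applied to a title, the sketch below checks for a partial-timeliness phrase first and falls back to a whole-document status. The class `TitleTimelinessMatcher` and its English keywords are hypothetical stand-ins; the keywords in the patent's own expressions are Chinese and are not reproduced here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleTimelinessMatcher {
    // Hypothetical English stand-ins for the timeliness keywords.
    private static final Pattern WHOLE = Pattern.compile(
            "repealed|invalid|revoked|execution stopped|currently valid|still valid|reserved");
    private static final Pattern PARTIAL = Pattern.compile(
            "part(?:ially)? (valid|invalid|revoked)");

    // Returns the first timeliness phrase found in the title, if any.
    static Optional<String> timelinessOf(String title) {
        String t = title.trim().toLowerCase();   // drop leading/trailing blanks
        Matcher partial = PARTIAL.matcher(t);
        if (partial.find()) {
            return Optional.of(partial.group());
        }
        Matcher whole = WHOLE.matcher(t);
        return whole.find() ? Optional.of(whole.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(timelinessOf("  Notice on Fees (partially invalid)"));
        System.out.println(timelinessOf("Measures for X"));
    }
}
```

The partial pattern is tried first because a phrase such as "partially invalid" also contains the whole-document keyword "invalid"; checking in the other order would lose the "partial" qualifier.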
(2.2) comparing the table text in the input original law and regulation annotation-basis text according to table text features such as the table tag, pre tag, tr tag and td tag, and verifying the integrity of the tags using the characteristics of the Java stack data structure; the resulting table text features are expressed as feature expressions Pattern3 and Pattern4, as follows:
Pattern3=Pattern.compile("<table.*?</table>")
Pattern4=Pattern.compile("<tr.*?</tr>");Pattern.compile("<td.*?</td>")
In HTML text, for example, a <table> tag indicates that the enclosed content is a table, a <tr> tag represents a row of the table, and a <td> tag represents a cell within a row;
Wherein, for Pattern3 = Pattern.compile("<table.*?</table>"), ".*?" means:
within the table tag, "." matches any character except a line feed, "*" means the preceding element is matched zero or more times, and "?" makes the match non-greedy (as few characters as possible); overall, any characters other than a line feed inside the table tag are matched zero or more times with as few characters as possible, so that the match terminates quickly;
Wherein, for Pattern4 = Pattern.compile("<tr.*?</tr>"); Pattern.compile("<td.*?</td>"), the expressions are defined as follows:
the first ".*?" means:
within the tr tag, "." matches any character except a line feed, "*" means the preceding element is matched zero or more times, and "?" makes the match non-greedy (as few characters as possible); overall, any characters other than a line feed inside the tr tag are matched zero or more times with as few characters as possible, so that the match terminates quickly;
the second ".*?" means:
within the td tag, "." matches any character except a line feed, "*" means the preceding element is matched zero or more times, and "?" makes the match non-greedy (as few characters as possible); overall, any characters other than a line feed inside the td tag are matched zero or more times with as few characters as possible, so that the match terminates quickly;
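The effect of the non-greedy quantifier can be seen in a short Java sketch that reuses the tr and td expressions quoted above to split a table into rows and cells. The class `TableFeatureExtractor`, the helper `count`, and the sample table are illustrative assumptions, not part of the patent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableFeatureExtractor {
    private static final Pattern ROW = Pattern.compile("<tr.*?</tr>");
    private static final Pattern CELL = Pattern.compile("<td.*?</td>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Splits a single-line table into rows, and each row into cell texts.
    static List<List<String>> rows(String tableHtml) {
        List<List<String>> result = new ArrayList<>();
        Matcher row = ROW.matcher(tableHtml);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group());
            while (cell.find()) {
                // Strip the td tags themselves and keep the cell text.
                cells.add(ANY_TAG.matcher(cell.group()).replaceAll("").trim());
            }
            result.add(cells);
        }
        return result;
    }

    // Counts how many times a regular expression matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Title A</td><td>No. 1</td></tr>"
                + "<tr><td>Title B</td><td>No. 2</td></tr></table>";
        System.out.println(rows(html));
        System.out.println(count("<tr.*</tr>", html));   // greedy: 1 match spanning both rows
        System.out.println(count("<tr.*?</tr>", html));  // non-greedy: 2 matches, one per row
    }
}
```

Because "." does not match line terminators by default in `java.util.regex`, a table that spans several lines would need the `Pattern.DOTALL` flag or prior removal of line breaks; whether the patented system does either is not stated.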
(2.3) comparing the non-table text in the input original law and regulation annotation-basis text according to text features: whether the paragraph levels (e.g. 1, 2, 3, 4 or a, b, c, d) can be split by numbers or letters, with timeliness feature words such as valid, reserved, to be modified and continue to execute compared within each paragraph level; the resulting non-table text feature is denoted Pattern5, as follows:
Pattern5 = Pattern. (]? five six eight nine zero ten hundred, o 0123456789] +), a step of + (? u4E00 \\u9FFF [) ]);
Wherein, the components of Pattern5 are defined as follows:
"^" indicates the start position, "[ ]" indicates a set of characters, and the left-bracket set matches either the Chinese (full-width) or the English left bracket;
For the title-mark sub-expressions of Pattern5:
the left title mark "《" matches the opening Chinese title mark; the sub-expression after it indicates zero or one match followed by an arbitrary character (except a line feed) in non-greedy mode, that is, one or zero characters are matched until the next condition is met;
the right title mark "》" matches the closing Chinese title mark; the sub-expression after it again indicates zero or one match followed by an arbitrary character (except a line feed) in non-greedy mode, that is, one or zero characters are matched again until the next condition is met;
Pattern5 overall represents: matching the content contained between Chinese title marks "《》", wherein the content inside the title marks may be empty or contain arbitrary characters, and the whole adopts a non-greedy matching mode;
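The two ideas in step (2.3), splitting by numeric or alphabetic paragraph markers and matching content between Chinese title marks, can be sketched as follows. The patterns `LEVEL_MARKER` and `TITLE_MARKS` are simplified stand-ins written for this example and are not the patent's Pattern5; the class and method names are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParagraphLevelSplitter {
    // Stand-in marker: optional left bracket (ASCII or full-width), digits or one
    // letter, then a right bracket, a dot, or the ideographic comma U+3001.
    private static final Pattern LEVEL_MARKER =
            Pattern.compile("^[(\uFF08]?(\\d+|[a-zA-Z])[)\uFF09.\u3001]\\s*");
    // Non-greedy content between the Chinese title marks U+300A and U+300B.
    private static final Pattern TITLE_MARKS = Pattern.compile("\u300A(.*?)\u300B");

    static boolean startsNewLevel(String line) {
        return LEVEL_MARKER.matcher(line).find();
    }

    // Groups lines into paragraph levels; a marker line opens a new level.
    static List<String> split(String text) {
        List<String> levels = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (startsNewLevel(trimmed) && current.length() > 0) {
                levels.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(trimmed);
        }
        if (current.length() > 0) {
            levels.add(current.toString());
        }
        return levels;
    }

    // Returns the text found inside each pair of title marks (possibly empty).
    static List<String> titlesIn(String text) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE_MARKS.matcher(text);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(split("1. First item\ncontinues here\n2. Second item\n(a) Sub item"));
        System.out.println(titlesIn("See \u300AFee Rules\u300B for details."));
    }
}
```

The marker pattern is intentionally loose: a line beginning with a decimal such as "3.5 percent" would be treated as a new level, so a production rule set would need stricter conditions.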
(2.4) comparing the symbolic features of the input original law and regulation annotation-basis text: paired symbols such as "" and "< >" are used as features for interactive matching, and the integrity of the features is verified using the characteristics of the Java stack data structure. Interactive matching means nested matching of symbols, for example a pair of title marks that contains another pair of title marks or a pair of brackets, or brackets that cross with title marks; parsing proceeds by matching characters according to the symbol features to recover the content. The resulting symbolic feature is denoted Pattern6, as follows:
Pattern6 = Pattern. (number? (]); pattern. Combile @ "[ (] (." [ ("A") ] x (x)
The above symbols are used to parse each sentence of the text and serve as the most basic parsing units; the title and document number of the annotated file are extracted from these units, with title marks and brackets as the primary conditions for identifying the title and the document number.
Wherein, the two expressions of Pattern6 are defined as follows:
The first expression, ending in the character "number":
the character "number" is matched literally, and the whole expression matches a character string that ends with the character "number"; the matched content may be any characters, but as few characters as possible are matched;
The second expression, beginning with a left bracket:
character strings beginning with a Chinese full-width bracket or an English parenthesis are matched, together with any content inside the brackets, but as few characters as possible are matched.
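A stack-based integrity check of nested symbols, in the spirit of step (2.4), might look like the sketch below. It uses `java.util.ArrayDeque` as the stack rather than `java.util.Stack`, and the set of symbol pairs, the class `SymbolNestingChecker` and the helper `firstBracketed` are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SymbolNestingChecker {
    // Closing symbol -> opening symbol; the chosen pairs are an assumption.
    private static final Map<Character, Character> CLOSE_TO_OPEN = Map.of(
            ')', '(',
            '\uFF09', '\uFF08',   // full-width parentheses
            '\u300B', '\u300A',   // Chinese title marks
            ']', '[',
            '>', '<');
    // Non-greedy content of the first bracket pair, ASCII or full-width.
    private static final Pattern BRACKETED =
            Pattern.compile("[(\uFF08](.*?)[)\uFF09]");

    // True when every opening symbol is closed in the right order.
    static boolean isWellNested(String text) {
        Deque<Character> open = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            if (CLOSE_TO_OPEN.containsValue(c)) {
                open.push(c);
            } else if (CLOSE_TO_OPEN.containsKey(c)) {
                // equals() avoids comparing boxed Character references.
                if (open.isEmpty() || !open.pop().equals(CLOSE_TO_OPEN.get(c))) {
                    return false;
                }
            }
        }
        return open.isEmpty();
    }

    static Optional<String> firstBracketed(String text) {
        Matcher m = BRACKETED.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(isWellNested("\u300ARules (trial)\u300B"));   // nested
        System.out.println(isWellNested("\u300ARules (trial\u300B)"));   // crossing
        System.out.println(firstBracketed("Order (No. 12) of 2020 (trial)"));
    }
}
```

This checker rejects crossing pairs outright; the patent describes crossing symbols as a case to be parsed, so a fuller implementation would presumably record the crossing rather than fail.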
Step 3, extracting key information from the original law and regulation annotation-basis text using the features constructed in step 2, specifically: key information is extracted from the text according to the title features, table text features, non-table text features and symbolic features; the key information comprises at least the tags and symbols in the annotation-basis text and their dependency relations, and timeliness words are set as high-weight key information. According to the extracted key information, key entity information in the law and regulation text, such as the title, document number, issuing department, issuing time and issuing source of the relevant law or regulation, is automatically identified through text scanning, splitting, feature comparison, regular-expression matching and the like, and relations among the entities are established through the key features of the annotation-basis text:
(3.1) using the table text features formed in step 2, parsing the key information in each row of the table text of the original law and regulation annotation-basis text, such as the <table>, </tr> and </td> tags and symbols such as "" and "()", and sorting out the dependency relations of the key information, including text position, symbol nesting and crossing relations; here the key information refers to the title and document number of the annotated file, and it is judged whether a matched title and document number, together with other information, belong to the same annotated file;
(3.2) using the timeliness features formed in step 2, parsing the key information in each row of the table text of the original law and regulation annotation-basis text, and identifying high-weight timeliness words such as "no longer executed", "application stopped", "file cleaned up", "currently valid" and "continuing valid";
For example: each line of parsed key information is compared to parse the release time of the annotated file, in forms such as year, month and day (yyyy year MM month dd day, or yyyyMMdd); and each line of parsed key information is compared against the features of the annotated file to parse its key content, namely the title, document number and issuing department, using cues such as file number, title, uniform number, clause, implementation and registration number.
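Parsing the release time from a line of key information can be sketched with a single regular expression that accepts the two date forms mentioned above. The class `ReleaseDateExtractor` is hypothetical, and reading the year-month-day form as the Chinese date characters (U+5E74, U+6708, U+65E5) is an assumption about the untranslated original.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReleaseDateExtractor {
    // Either "yyyy<year>M<month>d<day>" with the Chinese date characters,
    // or a standalone run of exactly eight digits read as yyyyMMdd.
    private static final Pattern DATE = Pattern.compile(
            "(\\d{4})\u5E74(\\d{1,2})\u6708(\\d{1,2})\u65E5"
            + "|(?<!\\d)(\\d{4})(\\d{2})(\\d{2})(?!\\d)");

    static List<LocalDate> datesIn(String line) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher m = DATE.matcher(line);
        while (m.find()) {
            // Groups 1-3 belong to the first alternative, 4-6 to the second.
            int base = m.group(1) != null ? 1 : 4;
            try {
                dates.add(LocalDate.of(
                        Integer.parseInt(m.group(base)),
                        Integer.parseInt(m.group(base + 1)),
                        Integer.parseInt(m.group(base + 2))));
            } catch (DateTimeException e) {
                // Digit runs that are not calendar dates are skipped.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        System.out.println(datesIn("Issued 2023\u5E7410\u670819\u65E5, effective 20240422"));
    }
}
```

Validating through `LocalDate.of` filters out eight-digit runs such as registration numbers that happen to look like dates but are not valid calendar days.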
As shown in FIG. 2, the intelligent extraction system for the legal and legal annotation files comprises a preprocessing module, a feature construction module and an extraction module.
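The sequential connection of the three modules can be pictured as function composition. The sketch below is only a structural illustration: the class `ExtractionPipeline`, the method `connect` and the use of `java.util.function.Function` are assumptions, and the patent does not specify the module interfaces.

```java
import java.util.function.Function;

final class ExtractionPipeline {
    // Chains preprocessing, feature construction and extraction in that order.
    static <A, B, C, D> Function<A, D> connect(
            Function<A, B> preprocessing,
            Function<B, C> featureConstruction,
            Function<C, D> extraction) {
        return preprocessing.andThen(featureConstruction).andThen(extraction);
    }

    public static void main(String[] args) {
        // Toy stand-ins for the three modules.
        Function<String, String> pre = String::trim;
        Function<String, Integer> features = String::length;
        Function<Integer, String> extract = n -> "len=" + n;
        System.out.println(connect(pre, features, extract).apply("  ab "));
    }
}
```

Keeping each module behind a single-input, single-output interface means any one stage can be replaced or tested in isolation without touching the other two.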
In conclusion, the automatic text collection and parsing system overcomes the drawbacks of the existing manual screening and sorting approach, such as low efficiency, limited accuracy, susceptibility to subjective factors and difficulty in adapting to large-scale processing, and improves the efficiency and accuracy of processing key law and regulation information.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the present invention is not limited to the description of the above-described technical solutions, and various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the present invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (9)

1. An intelligent extraction method for law and regulation annotation files, characterized by comprising the following steps:
Step 1, collecting a text containing a law and regulation annotation basis as the original law and regulation annotation-basis input text, and performing data preprocessing on that text to form clean, structured data;
Step 2, realizing feature construction based on feature engineering, wherein the constructed features include but are not limited to title features, table text features, non-table text features and symbolic features;
Step 3, extracting key information from the original law and regulation annotation-basis text using the features constructed in step 2, and automatically identifying key entity information in the law and regulation text according to the extracted key information through text scanning, splitting, feature comparison and regular-expression matching.
2. The intelligent extraction method for law and regulation annotation files according to claim 1, wherein step 2 further comprises the following steps:
Step 2.1, comparing the input original law and regulation annotation-basis text according to title features, and screening out the features of text that the automatic parsing logic cannot process, including but not limited to leading and trailing blank characters, invalid, modified, to be modified and applicable, wherein the resulting title features are expressed as feature expressions Pattern1 and Pattern2;
Step 2.2, comparing the table text in the input original law and regulation annotation-basis text according to table text features including but not limited to the table tag, pre tag, tr tag and td tag, and verifying the integrity of the tags using the characteristics of the Java stack data structure, wherein the resulting table text features are expressed as feature expressions Pattern3 and Pattern4;
Step 2.3, comparing the non-table text in the input original law and regulation annotation-basis text according to text features: whether the paragraph levels can be split by numbers or letters, with timeliness feature words including but not limited to valid, reserved, to be modified and continue to execute compared within each paragraph level; the resulting non-table text feature is denoted Pattern5;
Step 2.4, performing interactive matching on the symbolic features of the input original law and regulation annotation-basis text so as to realize feature comparison, and verifying the integrity of the features using the characteristics of the Java stack data structure; the resulting symbolic feature is denoted Pattern6.
3. The intelligent extraction method for law and regulation annotation files according to claim 2, wherein Pattern3 adopts non-greedy matching: any character other than a line feed is matched zero or more times with as few characters as possible, so that the match terminates quickly.
4. The intelligent extraction method for law and regulation annotation files according to claim 1, wherein Pattern4 adopts non-greedy matching: any character other than a line feed is matched zero or more times with as few characters as possible, so that the match terminates quickly.
5. The intelligent extraction method for law and regulation annotation files according to claim 2, wherein Pattern5 adopts non-greedy matching, and Pattern5 matches the content contained between Chinese title marks, wherein the content inside the title marks is empty or contains arbitrary characters.
6. The intelligent extraction method for law and regulation annotation files according to claim 2, wherein Pattern6 adopts non-greedy matching, and Pattern6 matches character strings beginning with a Chinese full-width bracket or an English parenthesis, together with any content inside the brackets.
7. An intelligent extraction system for law and regulation annotation files, characterized by comprising a preprocessing module, a feature construction module and an extraction module which are sequentially connected; wherein:
The preprocessing module is used for collecting a text containing a law and regulation annotation basis as the original law and regulation annotation-basis input text, and performing data preprocessing on that text to form clean, structured data;
The feature construction module is used for realizing feature construction based on feature engineering, forming at least the title features, the table text features, the non-table text features and the symbolic features;
The extraction module is used for extracting key information from the original law and regulation annotation-basis text using the constructed features, and automatically identifying key entity information in the law and regulation text through text scanning, splitting, feature comparison and regular-expression matching according to the extracted key information.
8. The intelligent extraction system for law and regulation annotation files according to claim 7, wherein the feature construction module further performs:
comparing the input original law and regulation annotation-basis text according to title features, and screening out the features of text that the automatic parsing logic cannot process, wherein the resulting title features are expressed as feature expressions Pattern1 and Pattern2;
comparing the table text in the input original law and regulation annotation-basis text according to table text features, and verifying the integrity of the tags using the characteristics of the Java stack data structure, wherein the resulting table text features are expressed as feature expressions Pattern3 and Pattern4;
comparing the non-table text in the input original law and regulation annotation-basis text according to text features: whether the paragraph levels can be split by numbers or letters, with timeliness features compared within each paragraph level, wherein the resulting non-table text feature is denoted Pattern5;
performing interactive matching on the symbolic features of the input original law and regulation annotation-basis text so as to realize feature comparison, and verifying the integrity of the features using the characteristics of the Java stack data structure; the resulting symbolic feature is denoted Pattern6.
9. The intelligent extraction system for law and regulation annotation files according to claim 7, wherein Pattern3, Pattern4, Pattern5 and Pattern6 all adopt non-greedy matching; Pattern3 matches any character other than a line feed zero or more times with as few characters as possible, so that the match terminates quickly; Pattern4 matches any character other than a line feed zero or more times with as few characters as possible, so that the match terminates quickly; Pattern5 matches the content contained between Chinese title marks, wherein the content inside the title marks is empty or contains arbitrary characters; Pattern6 matches character strings beginning with a Chinese full-width bracket or an English parenthesis, together with any content inside the brackets.
CN202410484659.2A 2023-10-19 2024-04-22 Intelligent extraction method and system for legal and legal annotation files Pending CN118504559A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023113615766 2023-10-19
CN202311361576 2023-10-19

Publications (1)

Publication Number Publication Date
CN118504559A true CN118504559A (en) 2024-08-16

Family

ID=92245814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410484659.2A Pending CN118504559A (en) 2023-10-19 2024-04-22 Intelligent extraction method and system for legal and legal annotation files

Country Status (1)

Country Link
CN (1) CN118504559A (en)

Similar Documents

Publication Publication Date Title
Drobac et al. Optical character recognition with neural networks and post-correction with finite state methods
CN107392143B (en) Resume accurate analysis method based on SVM text classification
CN107145479B (en) Text semantic-based chapter structure analysis method
US7984076B2 (en) Document processing apparatus, document processing method, document processing program and recording medium
CN107358208B (en) A kind of PDF document structured message extracting method and device
CN106446072B (en) The treating method and apparatus of web page contents
CN103530430B (en) A kind of html rich text data containing form across label processing method and system
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
JPH07325827A (en) Automatic hyper text generator
CN110837788A (en) PDF document processing method and device
CN110704570A (en) Continuous page layout document structured information extraction method
CN110770735A (en) Transcoding of documents with embedded mathematical expressions
GB2487600A (en) System for extracting data from an electronic document
CN107145591B (en) Title-based webpage effective metadata content extraction method
Baron Dealing with spelling variation in Early Modern English texts
CN114970502A (en) Text error correction method applied to digital government
CN110688842B (en) Analysis method, device and server for document title level
Couasnon et al. Making handwritten archives documents accessible to public with a generic system of document image analysis
Darģis et al. Lessons learned from creating a balanced corpus from online data
Hocking et al. Optical character recognition for South African languages
CN118504559A (en) Intelligent extraction method and system for legal and legal annotation files
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF
CN112183032B (en) Text processing method and device
Clérice et al. CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond
CN103646058B (en) Method and system for identifying key words in technical documents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination