CN118504559A - Intelligent extraction method and system for legal and legal annotation files - Google Patents
Intelligent extraction method and system for legal and legal annotation files
- Publication number
- CN118504559A (application number CN202410484659.2A)
- Authority
- CN
- China
- Prior art keywords
- legal
- text
- matching
- feature
- annotation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Input (AREA)
Abstract
The invention discloses an intelligent extraction method and system for annotation files of laws and regulations. The method comprises: collecting law-and-regulation annotation-basis text as the original input text and performing data preprocessing on it; constructing features based on feature engineering, the features comprising at least title features, table text features, non-table text features and symbol features; and extracting key information from the original annotation-basis text using the constructed features, then automatically identifying key entity information in the law-and-regulation text from the extracted key information through text scanning, splitting, feature comparison, regular-expression matching and the like. Compared with manual screening and sorting, the invention reduces subjective influence and improves the speed and accuracy of information extraction.
Description
Technical Field
The invention belongs to the computer-related technical fields of artificial intelligence and natural language processing, and particularly relates to a generative intelligent-agent technology oriented to cross-domain legislative topic perception.
Background
Before automatic extraction techniques became widespread, manual screening and sorting was the primary means of extracting annotation files of laws and regulations. With complex legal text data in particular, manual screening and sorting takes a great deal of time and effort and may be affected by the subjective preferences of the screeners. The cost and difficulty of manual work also rise significantly as the volume of laws and regulations grows. With an automatic text collection and parsing system, large amounts of text can be processed automatically in batches, reducing labor cost. An information extraction technology that can quickly and accurately extract key information from numerous laws and regulations is therefore urgently needed.
Disclosure of Invention
The invention aims to provide an intelligent extraction method and system for annotation files of laws and regulations, which automatically capture the text information of such files and extract information from them intelligently, free of subjective factors.
The invention is realized by the following technical scheme:
An intelligent extraction method for annotation files of laws and regulations comprises the following steps:
Step 1, collecting law-and-regulation annotation-basis text as the original input text, and performing data preprocessing on the original annotation-basis text to form clear, structured data;
Step 2, constructing features based on feature engineering, the features including but not limited to title features, table text features, non-table text features and symbol features;
Step 3, extracting key information from the original annotation-basis text using the features constructed in step 2, and automatically identifying key entity information in the law-and-regulation text from the extracted key information through text scanning, splitting, feature comparison and regular-expression matching.
An intelligent extraction system for annotation files of laws and regulations comprises a preprocessing module, a feature construction module and an extraction module which are sequentially connected; wherein:
The preprocessing module is used for collecting law-and-regulation annotation-basis text as the original input text, and performing data preprocessing on the original annotation-basis text to form clear, structured data;
The feature construction module is used for constructing features based on feature engineering, the features comprising at least title features, table text features, non-table text features and symbol features;
The extraction module is used for extracting key information from the original annotation-basis text using the constructed features, and automatically identifying key entity information in the law-and-regulation text from the extracted key information through text scanning, splitting, feature comparison, regular-expression matching and the like.
Compared with the traditional manual method, the invention achieves the following beneficial technical effects:
1) Large volumes of law-and-regulation text are extracted intelligently and automatically, reducing the subjective influence of manual intervention;
2) Key information of legal documents is captured automatically from the text, reducing manual errors and improving accuracy;
3) The text is processed automatically by computer, which greatly increases the speed of information extraction, improves processing efficiency, and scales better to large-volume text processing.
Drawings
FIG. 1 is a flow chart of an intelligent extraction method for legal and legal annotation files;
FIG. 2 is a block diagram of an intelligent extraction system for legal and legal annotation files according to the present invention;
FIG. 3 is a block diagram of a model of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention.
Description of related terms:
The law-and-regulation annotation basis is the basis for interpreting and explaining the provisions of laws and regulations, together with the issuing department, issue time and timeliness of those laws and regulations; a law-and-regulation annotation is an interpretation of specific terms and words in the provisions.
Timeliness is the current validity status of a law or regulation, such as currently valid, modified, invalid, or partially invalid.
As shown in fig. 1, the overall flow of the present invention includes the following steps:
Step 1, collecting law-and-regulation annotation-basis text as the original input text, and performing data preprocessing on the original annotation-basis text to form clear, structured data, wherein the preprocessing operations include:
(1.1) removing irrelevant symbols such as &nbsp;, &quot; and @ from the original annotation-basis text;
(1.2) removing irrelevant tags such as </span>, </script> and <style> from the original annotation-basis text;
(1.3) marking key nodes of the original annotation-basis text with tags that delimit clauses and segments, such as <table>, </p> and </br>, and extracting text according to these tag marks with a Pattern formula; alternatively, splitting the text into sentences and segments by conventional punctuation and then judging whether each piece matches annotated-file information such as a title or document number;
(1.4) adding non-printing characters such as \r and \n at key nodes of the original annotation-basis text;
(1.5) converting full-width symbols in the original annotation-basis text, such as full-width commas, parentheses and square brackets, to their half-width forms;
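The preprocessing operations (1.1), (1.2) and (1.5) can be illustrated with a minimal Java sketch. The class name `PreprocessSketch`, its method names and the exact tag and entity lists are assumptions made for illustration; the source does not specify an implementation. The width conversion relies on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their half-width counterparts.

```java
import java.util.regex.Pattern;

public class PreprocessSketch {
    // Step (1.2): script and style elements are dropped together with their bodies (assumed behaviour).
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1>");
    // Step (1.2): span tags are dropped but their text content is kept.
    private static final Pattern SPAN_TAG = Pattern.compile("(?i)</?span\\b[^>]*>");
    // Step (1.1): entity debris named in the description.
    private static final Pattern ENTITY = Pattern.compile("&(nbsp|quot);");

    /** Step (1.5): maps full-width forms U+FF01..U+FF5E and the ideographic space to half-width. */
    public static String toHalfWidth(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u3000') {
                sb.append(' ');
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                sb.append((char) (c - 0xFEE0));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Applies the removal steps first, then the width conversion. */
    public static String clean(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll("");
        s = SPAN_TAG.matcher(s).replaceAll("");
        s = ENTITY.matcher(s).replaceAll("");
        s = s.replace("@", "");
        return toHalfWidth(s);
    }

    public static void main(String[] args) {
        // The escapes are the full-width left and right parentheses; prints: Notice(trial)
        System.out.println(clean("<span>Notice\uFF08trial\uFF09</span>&nbsp;<script>x()</script>"));
    }
}
```

Full-width square brackets such as U+3010 and U+3011 lie outside the U+FF01 to U+FF5E range, so a real implementation would need an explicit mapping for them.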
Step 2, constructing features based on feature engineering: the original annotation-basis text is compared against a standard format, that is, the content of the HTML-format text is validated against Pattern formulas, so as to separate text that the automatic parsing logic can process from text that it cannot. Text that can be parsed automatically is text that matches existing features, for example a timeliness mark (such as "valid" or "invalid" in the formulas) or table header information (such as "XXX file name" and "XXX document number"); text that cannot be parsed automatically is text that matches nothing in the feature library. For example:
(2.1) comparing the original annotation-basis text against title features and screening out features of text that the automatic parsing logic cannot process; title features include whitespace characters at the beginning and words such as invalid, modified, to be modified and applicable, where the whitespace at the beginning marks a title, and invalid, to be modified, applicable and the like express the timeliness of the annotated file;
The resulting title features are expressed as the feature formulas Pattern1 and Pattern2 (keywords shown in translation), as follows:
Pattern1 = Pattern.compile("cancel|failure|revocation|stopping execution|active|continue active|reserved")
Pattern2 = Pattern.compile("part (valid|failure)|content revocation|content failure|clause revocation")
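As a hedged illustration of how title patterns of this kind might be applied, the sketch below looks for a timeliness keyword in a title. The English keywords are stand-ins for the Chinese vocabulary, and the class `TitleTimelinessSketch`, the method `timelinessOf` and the rule of checking partial-validity phrases first are assumptions, not the patent's stated logic.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleTimelinessSketch {
    // English stand-ins for whole-document timeliness keywords (assumed vocabulary).
    static final Pattern WHOLE = Pattern.compile(
            "cancel|failure|revocation|stopping execution|active|continue active|reserved");
    // English stand-ins for partial-validity phrases (assumed vocabulary).
    static final Pattern PARTIAL = Pattern.compile(
            "part (valid|failure)|content revocation|content failure|clause revocation");

    /** Returns the first timeliness phrase found in the title, or null when none matches. */
    public static String timelinessOf(String title) {
        // Partial phrases are tried first so that "part failure" is not reported as plain "failure".
        Matcher partial = PARTIAL.matcher(title);
        if (partial.find()) {
            return partial.group();
        }
        Matcher whole = WHOLE.matcher(title);
        return whole.find() ? whole.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(timelinessOf("Notice on X (part failure)"));
        System.out.println(timelinessOf("Notice on X (revocation)"));
        System.out.println(timelinessOf("Notice on X"));
    }
}
```

Plain substring alternation also matches inside longer words, so a production vocabulary would likely need delimiters or a dictionary lookup.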
(2.2) comparing table text in the original annotation-basis text against table text features such as the table, pre, tr and td tags, and verifying tag integrity using a Java stack data structure; the resulting table text features are expressed as the feature formulas Pattern3 and Pattern4, as follows:
Pattern3 = Pattern.compile("<table.*?</table>")
Pattern4 = Pattern.compile("<tr.*?</tr>"); Pattern.compile("<td.*?</td>")
In HTML text, for example, a <table> tag indicates that the enclosed content is a table, a <tr> tag represents a row of the table, and a <td> tag represents a cell within a row;
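The behaviour of the row and cell patterns can be checked with a small Java sketch. The class `TableRegexSketch` and the helper `findAll` are hypothetical names, and the `Pattern.DOTALL` flag is an added assumption: without it, "." does not match line breaks, so a table spread over several lines would not be matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableRegexSketch {
    // Same shape as Pattern4; DOTALL is an assumption so that '.' also spans line breaks.
    static final Pattern ROW = Pattern.compile("<tr.*?</tr>", Pattern.DOTALL);
    static final Pattern CELL = Pattern.compile("<td.*?</td>", Pattern.DOTALL);
    // Greedy variant, kept only to contrast with the non-greedy form.
    static final Pattern GREEDY_CELL = Pattern.compile("<td.*</td>", Pattern.DOTALL);

    /** Collects every match of the pattern in the text, in order of appearance. */
    public static List<String> findAll(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>";
        System.out.println(findAll(ROW, html).size());         // 2 rows
        System.out.println(findAll(CELL, html));               // three separate cells
        System.out.println(findAll(GREEDY_CELL, html).size()); // 1: greedy match swallows all cells
    }
}
```

The non-greedy form stops at the first closing tag, which is what lets each row and cell be extracted separately.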
Wherein, in Pattern3 = Pattern.compile("<table.*?</table>"), the ".*?" between the opening and closing table tags means: "." matches any character except a line break, "*" means the preceding element is matched zero or more times, and "?" makes the match non-greedy (as few characters as possible). Overall, any characters other than line breaks inside the table tag are matched zero or more times, as few as possible, so that matching finishes quickly;
Wherein, in Pattern4 = Pattern.compile("<tr.*?</tr>"); Pattern.compile("<td.*?</td>"), the first ".*?" applies the same non-greedy match to the content inside a tr tag, and the second ".*?" applies it to the content inside a td tag;
(2.3) comparing non-table text in the original annotation-basis text against text features: whether the text can be split into paragraph levels by numbers or letters (for example 1, 2, 3, 4 or a, b, c, d, in either case), and whether timeliness feature words such as valid, reserved, to be modified and continue to be executed appear within a paragraph level; the resulting non-table text feature is expressed as Pattern5, as follows:
Pattern5 = Pattern. (]? five six eight nine zero ten hundred, o 0123456789] +), a step of + (? u4E00 \\u9FFF [) ]);
Wherein, Pattern5 is interpreted as follows:
"^" indicates the start position, "[ ]" indicates a character set, and "[(（]" matches an English or Chinese left parenthesis;
the "?" after a bracket set means that the bracket is matched zero times or once, so the bracket is optional;
"《" matches the left Chinese book-title mark, the ".*?" after it matches any characters (except a line break) in non-greedy mode, that is, as few characters as possible until the next condition is met, and "》" matches the right Chinese book-title mark;
Overall, Pattern5 matches the content enclosed between Chinese book-title marks "《》", where the content inside the marks may be empty or contain arbitrary characters, and the whole uses non-greedy matching;
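Splitting non-table text by paragraph-level markers, as described in (2.3), might look like the following Java sketch. The regular expression here is an assumed simplification and is not the patent's Pattern5; the class `ParagraphLevelSketch` and the method `levelMarker` are hypothetical names. Chinese numerals are written as Unicode escapes so that the source stays ASCII-only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParagraphLevelSketch {
    // Chinese numerals one to nine, zero, ten, hundred and the ideographic zero, as escapes.
    private static final String CN_NUM =
            "\u4E00\u4E8C\u4E09\u56DB\u4E94\u516D\u4E03\u516B\u4E5D\u96F6\u5341\u767E\u3007";

    // Assumed shape: optional left parenthesis, a numeral run or a single letter,
    // then a closing parenthesis, ideographic comma (U+3001) or full stop.
    static final Pattern LEVEL = Pattern.compile(
            "^[(\uFF08]?([" + CN_NUM + "0-9]+|[a-zA-Z])[)\uFF09\u3001.]");

    /** Returns the numbering marker at the start of the line, or null if the line has none. */
    public static String levelMarker(String line) {
        Matcher m = LEVEL.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(levelMarker("(1) scope of application"));
        System.out.println(levelMarker("a. definitions"));
        System.out.println(levelMarker("plain sentence without a marker"));
    }
}
```

Lines that return a marker can then be grouped into paragraph levels, and timeliness words can be searched within each group.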
(2.4) comparing symbol features of the original annotation-basis text: symbols such as book-title marks "《》", angle brackets "<>" and brackets are used as features for interactive matching, and the integrity of the features is verified using a Java stack data structure. Interactive matching means nested matching of symbols, for example a book-title mark that contains another book-title mark or a bracket, or a book-title mark that crosses a bracket; the parsing process analyzes the content by matching characters according to these symbol features. The resulting symbol feature is expressed as Pattern6, as follows:
Pattern6 = Pattern.compile(".*?号"); Pattern.compile("^[(（](.*?[0-9]+号)[)）]")
These symbols are used to parse each sentence of the text. Taking the sentence as the most basic parsing unit, symbols such as book-title marks and brackets are extracted from each unit as the primary condition for identifying the title and document number of the annotated file.
Wherein, Pattern6 is interpreted as follows:
".*?号": "号" matches the literal character "号" (the "number" suffix of a document number); overall the expression matches a string ending in "号", where the preceding content may be any characters but as few as possible are matched;
"^[(（](.*?": matches strings that begin with a Chinese full-width or English left parenthesis, and matches any content inside the parentheses with as few characters as possible.
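The stack-based integrity check for nested symbols can be sketched as follows. This is a generic balanced-delimiter check and only an assumption about what "verifying integrity according to the data structure characteristics of the java stack" involves; the class `SymbolNestingSketch`, the method `isWellNested` and the chosen symbol pairs are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SymbolNestingSketch {
    // Opening symbols: U+300A (left book-title mark), '(' and U+FF08 (full-width left parenthesis).
    private static final String OPEN = "\u300A(\uFF08";
    // Closing partners at the same index: U+300B, ')' and U+FF09.
    private static final String CLOSE = "\u300B)\uFF09";

    /** True when every opening symbol is closed by its own partner in properly nested order. */
    public static boolean isWellNested(String text) {
        Deque<Character> stack = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (OPEN.indexOf(c) >= 0) {
                stack.push(c);
            } else {
                int k = CLOSE.indexOf(c);
                if (k >= 0) {
                    // A closer must pair with the most recently opened symbol.
                    if (stack.isEmpty() || stack.pop() != OPEN.charAt(k)) {
                        return false;
                    }
                }
            }
        }
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isWellNested("\u300AMeasures (trial)\u300B")); // nested: true
        System.out.println(isWellNested("\u300AMeasures (trial\u300B)")); // crossed: false
    }
}
```

A crossed pair, such as a book-title mark closed before an inner parenthesis, fails the check, which is one way to flag text for the non-automatic path.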
And 3, extracting key information of the text according to the original legal regulation annotation by utilizing the characteristics constructed in the step 2, wherein the key information is specifically as follows: extracting key information according to the title feature, the table text feature, the non-table text feature and the symbolic feature from the text according to the original legal and legal notes, wherein the key information at least comprises labels, symbols and dependency relations thereof in the text according to the legal and legal notes, and the key information is set as a high-weight timeliness vocabulary; according to the extracted key information, automatically identifying key entity information in legal and legal texts through text scanning, splitting, feature comparison, regular matching and the like, such as titles, letter numbers, issuing departments, issuing time, issuing sources and the like of related legal and legal laws, and establishing a relation to the entity information through annotating key features according to the texts:
(3.1) analyzing the key information such as table >, </tr >, </td > labels, symbols such as "", "()", and the like of each row in the table text according to the original legal and legal notes through the text characteristics of the table formed in the step 2, and arranging the dependency relationship of the key information, including the position, the symbol nesting and the cross relationship of the text; the key information refers to the title and the text number of the annotated document, and whether the same annotated document corresponds to other information such as the matched title and text number is judged;
3.2, analyzing each row of key information in the table text according to the original legal regulation annotation by the timeliness features formed in the step 2, and analyzing timeliness words with high weight such as 'no longer executing', 'stopping applying', 'file cleaning', 'current validity', 'continuing validity' and the like;
For example: comparing each line of analyzed key information, and analyzing the release time of the annotated file, such as (year, month, day, yyyy year, MM, dd, yyyyMMdd) and the like; by comparing each line of parsed key information through the characteristics of the annotated file, key content titles, letter numbers, issuing departments of the annotated file are parsed, such as (file numbers, titles, uniform numbers, clauses, implementations, registration numbers, etc.).
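Issue-time extraction for the two date formats mentioned above can be sketched in Java as follows. The regular expression, the class `IssueDateSketch` and the method `firstDate` are assumptions for illustration; the year, month and day characters are written as the escapes U+5E74, U+6708 and U+65E5.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IssueDateSketch {
    // Either "yyyy<year>M<month>d<day>" with the Chinese date characters, or a bare 8-digit yyyyMMdd run.
    private static final Pattern DATE = Pattern.compile(
            "(\\d{4})\u5E74(\\d{1,2})\u6708(\\d{1,2})\u65E5"
            + "|(?<!\\d)(\\d{4})(\\d{2})(\\d{2})(?!\\d)");

    /** Returns the first valid calendar date found in the text, or null when there is none. */
    public static LocalDate firstDate(String text) {
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            // Groups 1-3 belong to the first alternative, groups 4-6 to the second.
            int base = m.group(1) != null ? 1 : 4;
            try {
                return LocalDate.of(
                        Integer.parseInt(m.group(base)),
                        Integer.parseInt(m.group(base + 1)),
                        Integer.parseInt(m.group(base + 2)));
            } catch (DateTimeException e) {
                // Digits that do not form a real date (for example month 13): keep scanning.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstDate("issued on 2023\u5E7410\u670819\u65E5")); // 2023-10-19
        System.out.println(firstDate("reference 20240422"));                  // 2024-04-22
    }
}
```

Validating through `LocalDate.of` filters out eight-digit identifiers that merely look like dates.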
As shown in FIG. 2, the intelligent extraction system for annotation files of laws and regulations comprises a preprocessing module, a feature construction module and an extraction module.
In conclusion, by adding an automatic text collection and parsing system, the invention overcomes the drawbacks of the existing manual screening and sorting scheme, namely low efficiency, limited accuracy, susceptibility to subjective factors and difficulty in scaling to large volumes, and improves the efficiency and accuracy of processing important information in laws and regulations.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, the present invention is not limited to the description of the above-described technical solutions, and various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the present invention, and such modifications and variations fall within the scope defined by the appended claims.
Claims (9)
1. An intelligent extraction method for annotation files of laws and regulations, characterized by comprising the following steps:
Step 1, collecting law-and-regulation annotation-basis text as the original input text, and performing data preprocessing on the original annotation-basis text to form clear, structured data;
Step 2, constructing features based on feature engineering, the features including but not limited to title features, table text features, non-table text features and symbol features;
Step 3, extracting key information from the original annotation-basis text using the features constructed in step 2, and automatically identifying key entity information in the law-and-regulation text from the extracted key information through text scanning, splitting, feature comparison and regular-expression matching.
2. The intelligent extraction method for annotation files of laws and regulations according to claim 1, wherein step 2 further comprises the following steps:
Step 2.1, comparing the original annotation-basis text against title features and screening out features of text that the automatic parsing logic cannot process, including but not limited to whitespace characters at the beginning and end, invalid, modified, to be modified and applicable, the resulting title features being expressed as the feature formulas Pattern1 and Pattern2;
Step 2.2, comparing table text in the original annotation-basis text against table text features including but not limited to the table, pre, tr and td tags, and verifying tag integrity using a Java stack data structure, the resulting table text features being expressed as the feature formulas Pattern3 and Pattern4;
Step 2.3, comparing non-table text in the original annotation-basis text against text features: whether the text can be split into paragraph levels by numbers or letters, and whether timeliness feature words including but not limited to valid, reserved, to be modified and continue to be executed appear within a paragraph level; the resulting non-table text feature being expressed as Pattern5;
Step 2.4, performing interactive matching on symbol features of the original annotation-basis text to realize feature comparison, and verifying the integrity of the features using a Java stack data structure; the resulting symbol feature being expressed as Pattern6.
3. The intelligent extraction method for annotation files of laws and regulations according to claim 2, wherein Pattern3 uses non-greedy matching, in which any character other than a line break is matched zero or more times, as few times as possible, so that matching finishes quickly.
4. The intelligent extraction method for annotation files of laws and regulations according to claim 1, wherein Pattern4 uses non-greedy matching, in which any character other than a line break is matched zero or more times, as few times as possible, so that matching finishes quickly.
5. The intelligent extraction method for annotation files of laws and regulations according to claim 2, wherein Pattern5 uses non-greedy matching and matches the content enclosed between Chinese book-title marks, the content inside the marks being empty or containing arbitrary characters.
6. The intelligent extraction method for annotation files of laws and regulations according to claim 2, wherein Pattern6 uses non-greedy matching and matches strings beginning with a Chinese full-width or English left parenthesis, together with any content inside the parentheses.
7. An intelligent extraction system for annotation files of laws and regulations, characterized by comprising a preprocessing module, a feature construction module and an extraction module which are sequentially connected; wherein:
The preprocessing module is used for collecting law-and-regulation annotation-basis text as the original input text, and performing data preprocessing on the original annotation-basis text to form clear, structured data;
The feature construction module is used for constructing features based on feature engineering, the features comprising at least title features, table text features, non-table text features and symbol features;
The extraction module is used for extracting key information from the original annotation-basis text using the constructed features, and automatically identifying key entity information in the law-and-regulation text from the extracted key information through text scanning, splitting, feature comparison and regular-expression matching.
8. The intelligent extraction system for annotation files of laws and regulations according to claim 7, wherein the feature construction module further performs:
comparing the original annotation-basis text against title features and screening out features of text that the automatic parsing logic cannot process, the resulting title features being expressed as the feature formulas Pattern1 and Pattern2;
comparing table text in the original annotation-basis text against table text features, and verifying tag integrity using a Java stack data structure, the resulting table text features being expressed as the feature formulas Pattern3 and Pattern4;
comparing non-table text in the original annotation-basis text against text features: whether the text can be split into paragraph levels by numbers or letters, and whether timeliness features appear within a paragraph level, the resulting non-table text feature being expressed as Pattern5; and performing interactive matching on symbol features of the original annotation-basis text to realize feature comparison, and verifying the integrity of the features using a Java stack data structure, the resulting symbol feature being expressed as Pattern6.
9. The intelligent extraction system for annotation files of laws and regulations according to claim 7, wherein Pattern3, Pattern4, Pattern5 and Pattern6 all use non-greedy matching; in Pattern3 and in Pattern4, any character other than a line break is matched zero or more times, as few times as possible, so that matching finishes quickly; Pattern5 matches the content enclosed between Chinese book-title marks, the content inside the marks being empty or containing arbitrary characters; and Pattern6 matches strings beginning with a Chinese full-width or English left parenthesis, together with any content inside the parentheses.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2023113615766 | 2023-10-19 | ||
CN202311361576 | 2023-10-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118504559A (en) | 2024-08-16 |
Family
ID=92245814
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410484659.2A Pending CN118504559A (en) | 2023-10-19 | 2024-04-22 | Intelligent extraction method and system for legal and legal annotation files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118504559A (en) |
-
2024
- 2024-04-22 CN CN202410484659.2A patent/CN118504559A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||