CN111078839A - Structured processing method and processing device for referee document - Google Patents

Structured processing method and processing device for referee document

Info

Publication number
CN111078839A
Authority
CN
China
Prior art keywords
column
text
line
information
extracting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911333386.7A
Other languages
Chinese (zh)
Inventor
王可佳
张树军
尹士朝
谭宁
张贵森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Pci Data Service Co ltd
Original Assignee
Guangzhou Pci Data Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-19
Filing date
2019-12-19
Publication date
2020-04-28
Application filed by Guangzhou Pci Data Service Co ltd
Priority to CN201911333386.7A
Publication of CN111078839A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Health & Medical Sciences (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a structured processing method for a referee document, comprising the following steps: S1: reading a document file to obtain its text content, and preprocessing the text content to obtain text data; S2: setting column analysis rules and initializing a column retrieval data structure to generate a column retrieval library; S3: extracting each line of text contained in the text data to obtain line text module data; S4: extracting the keywords of each column, traversing the line text module data, and matching keywords line by line; S5: after a successful match, extracting the position information of the keyword and determining the column's value-taking rule, so as to obtain the text content and detailed information of the column. The method allows the structured referee document to be stored in a big data platform, giving legal professionals and parties a convenient and fast way to retrieve document information.

Description

Structured processing method and processing device for referee document
Technical Field
The invention relates to the technical field of early warning systems, in particular to a structured processing method and a processing device for a referee document.
Background
When handling cases, criminal-law practitioners often need to consider together the names of the offences, the types of penalty, the severity of the penalty, the legal provisions on which the judgment is based, and so on, and to use these as reference material in their work. This reference material generally comes from the large number of decided cases published by the national courts, and is obtained by big-data analysis and statistics over those cases.
In the related art, when a big data analysis system processes a case, it traverses all the related referee documents on demand to find the keywords they contain. Criminal cases tried by the people's courts involve many penalty-related information points, a large volume of information, complex content, varied forms of expression, and many kinds of legal basis. For example, a criminal case may involve several offences, and the penalty type and the sentence differ from offence to offence. As a result, querying a data set of referee documents in this way puts heavy load on the server and takes a long time, because the full text is searched word by word. Moreover, the results of such ad hoc searches cannot reflect the relationships between keywords (such as the relationships among items of penalty information), which hinders big-data statistical analysis.
Therefore, how to provide a method for processing referee document information that solves the above problems is an issue that those skilled in the art need to address.
Disclosure of Invention
In view of this, the invention provides a structured processing method and a processing device for a referee document. The aim is to analyze referee documents in the legal industry in finer detail, to mine information that covers more dimensions at finer granularity and with higher accuracy, to store that information in a big data platform after structured processing, and to give legal professionals and parties a convenient and fast way to retrieve document information.
In order to achieve the purpose, the invention adopts the following technical scheme:
A structured processing method for referee documents, comprising:
S1: reading a document file to obtain its text content, and preprocessing the text content to obtain text data;
S2: setting column analysis rules and initializing a column retrieval data structure to generate a column retrieval library;
S3: extracting each line of text contained in the text data to obtain line text module data;
S4: extracting the keywords of each column, traversing the line text module data, and matching keywords line by line;
S5: after a successful match, extracting the position information of the keyword and determining the column's value-taking rule, so as to obtain the text content and detailed information of the column.
Preferably, the method further comprises the following steps:
S6: updating the structure information of the line text module data and of the column retrieval data structure;
S7: reading the column storage data structure, storing or updating the column database, and associating the stored columns with the case information and the document file.
Preferably, the step S4 includes:
S41: presetting the keywords corresponding to each column attribute, and analyzing which attributes can be extracted according to the characteristics of each column;
S42: extracting the content of each column of the referee document and preprocessing it.
Preferably, the step S4 further includes:
S43: matching keywords sentence by sentence and extracting column attributes, or analyzing the word segmentation structure and part-of-speech tags and labelling the segmented words or phrases as column attributes;
S44: extracting the text content corresponding to each column attribute and writing it into a database.
Preferably, the line text data structure includes: line number, line text content, line state, search keyword and position of the search keyword.
Preferably, the column retrieval data structure includes: column name, column content, and text position of column content.
An information processing apparatus using any one of the above structured processing methods, comprising:
an acquisition unit for reading the text content of the target referee document;
an extraction unit for extracting at least one target keyword from the matched text content;
a first detection unit for traversing the line text module data and matching line by line; and
a storage unit for storing the at least one target keyword in a database.
According to the above technical solution, compared with the prior art, the disclosed structured processing method and processing device for referee documents preset a column retrieval library so that documents can be processed dynamically: the position of a column's content is found from its keywords, the content is extracted according to the value-taking rule, and the retrieval rules increase matching speed and reduce the matching failure rate.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a structured processing method for referee documents according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention discloses a method for structured processing of a referee document, including:
S1: reading a document file to obtain its text content, and preprocessing the text content to obtain text data;
Here, preprocessing means reading the text of the referee document from a document database and handling special characters.
S2: setting column analysis rules and initializing a column retrieval data structure to generate a column retrieval library;
A column is the general term for an item of valuable information to be extracted from the referee document. Each column has its own characteristics; abstracting these into a column name, keywords, retrieval rules and value-taking rules lets the program handle all columns uniformly. The column analysis rules can be obtained by machine learning.
S3: extracting each line of text contained in the text data to obtain line text module data;
S4: extracting the keywords of each column, traversing the line text module data, and matching keywords line by line;
S5: after a successful match, extracting the position information of the keyword and determining the column's value-taking rule, so as to obtain the text content and detailed information of the column.
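Steps S1, S3 and S5 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names, the choice of which characters count as "special", and the value-taking rule "take the text after the keyword up to the end of the line" are all hypothetical.

```python
import re
import unicodedata
from typing import Optional


def preprocess(raw: str) -> str:
    """S1 (assumed form): normalize full-width characters and drop control characters."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\t", " ")
    # Keep newlines; remove every other character in a Unicode "C" (control/format) category.
    text = "".join(ch for ch in text if ch == "\n" or unicodedata.category(ch)[0] != "C")
    return re.sub(r" +", " ", text)


def split_lines(text: str) -> list:
    """S3: return (line number, line text) pairs, skipping empty lines."""
    return [(n, line.strip()) for n, line in enumerate(text.split("\n"), start=1) if line.strip()]


def take_value_after(line: str, keyword: str) -> Optional[tuple]:
    """S4/S5: locate the keyword, then take the rest of the line as the column value."""
    pos = line.find(keyword)
    if pos < 0:
        return None
    value = line[pos + len(keyword):].lstrip(" :").strip()
    return pos, value


# Invented sample text, used only to exercise the functions.
sample = "Case No.：(2019) X 123\r\nCourt:\tExample District Court\n\n"
lines = split_lines(preprocess(sample))
court = take_value_after(lines[1][1], "Court")
```

Returning the keyword position together with the value keeps the information that step S5 says is extracted after a successful match.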
Keywords can be matched in several ways: matching a single keyword, matching terms that have the same meaning as a keyword, matching a regular expression, and combining two or more keywords into a longer phrase for advanced matching.
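The four matching modes might look like this in code. This is a sketch only: the synonym table is invented, and reading "advanced matching" as "all keywords occur in the given order" is an assumption, since the text does not define it further.

```python
import re

# Hypothetical synonym table: each keyword maps to terms treated as equivalent.
SYNONYMS = {"defendant": ["accused", "respondent"]}


def match_single(line: str, keyword: str) -> bool:
    """Mode 1: the keyword occurs literally in the line."""
    return keyword in line


def match_synonym(line: str, keyword: str) -> bool:
    """Mode 2: the keyword or any term with the same meaning occurs in the line."""
    return any(term in line for term in [keyword, *SYNONYMS.get(keyword, [])])


def match_regex(line: str, pattern: str) -> bool:
    """Mode 3: a regular expression matches somewhere in the line."""
    return re.search(pattern, line) is not None


def match_combined(line: str, keywords: list) -> bool:
    """Mode 4 (assumed meaning): every keyword occurs, in the given order."""
    pos = 0
    for kw in keywords:
        found = line.find(kw, pos)
        if found < 0:
            return False
        pos = found + len(kw)
    return True
```

Keeping each mode as a separate predicate lets a column's retrieval rule name the mode it needs without changing the line-by-line traversal.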
Specifically, the method presets keywords, retrieval rules and value-taking rules for each column and assigns a weight to each keyword. Ambiguity arises when several columns preset the same keyword, when several lines of text match the same keyword, or when one line of text matches the keywords of several columns; in these cases the preset weights effectively improve parsing accuracy.
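One way the preset weights could resolve such conflicts is to sum the weights of the keywords found in a line and pick the highest score. The scoring rule, the `ColumnRule` class and the example weights below are assumptions made for illustration; the text does not specify how weights are combined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnRule:
    name: str
    keywords: dict  # keyword -> preset weight (hypothetical values)


def score(line: str, rule: ColumnRule) -> float:
    """Sum the weights of the rule's keywords that occur in the line."""
    return sum(weight for kw, weight in rule.keywords.items() if kw in line)


def best_column(line: str, rules: list) -> Optional[str]:
    """One line matches several columns: choose the column with the highest score."""
    scored = [(score(line, rule), rule.name) for rule in rules]
    top_score, top_name = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return top_name if top_score > 0 else None


def best_line(lines: list, rule: ColumnRule) -> Optional[str]:
    """Several lines match one column: choose the line with the highest score."""
    top = max(lines, key=lambda line: score(line, rule), default=None)
    return top if top is not None and score(top, rule) > 0 else None


rules = [
    ColumnRule("judgment_result", {"sentenced": 2.0, "judgment": 1.0}),
    ColumnRule("court_opinion", {"court holds": 2.0, "judgment": 0.5}),
]
```

Both rules share the keyword "judgment", so a line containing only that word goes to `judgment_result` (weight 1.0 against 0.5), while a line that also contains "court holds" goes to `court_opinion`.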
In a specific embodiment, the method further comprises the following steps:
S6: updating the structure information of the line text module data and of the column retrieval data structure;
The updated line text module data includes the keywords, keyword positions, keyword occurrence counts and line states; the updated column storage data structure includes the column names, column contents and the text positions of the column contents, so that other columns can be retrieved and have their values taken on the basis of this information.
S7: reading the column storage data structure, storing or updating the column database, and associating the stored columns with the case information and the document file, so that the application system can query them later.
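Step S7 could be realized with any database; the sketch below uses Python's built-in `sqlite3` module purely as a stand-in. The table name, the column layout and the use of a case identifier plus file path as the association are assumptions, not details given in the text.

```python
import sqlite3


def store_columns(conn: sqlite3.Connection, case_id: str, file_path: str, columns: list) -> None:
    """Store or update extracted columns, keyed by case and column name."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS column_data (
               case_id TEXT, file_path TEXT, column_name TEXT,
               content TEXT, line_no INTEGER,
               PRIMARY KEY (case_id, column_name))"""
    )
    # INSERT OR REPLACE overwrites an existing row with the same primary key,
    # which covers both the "storing" and the "updating" case.
    conn.executemany(
        "INSERT OR REPLACE INTO column_data VALUES (?, ?, ?, ?, ?)",
        [(case_id, file_path, c["name"], c["content"], c["line_no"]) for c in columns],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_columns(conn, "case-001", "docs/case-001.txt",
              [{"name": "court", "content": "Example Court", "line_no": 2}])
store_columns(conn, "case-001", "docs/case-001.txt",
              [{"name": "court", "content": "Example District Court", "line_no": 2}])
```

Because each row carries the case identifier and the source file path, a later query can go from a column value back to the case and the original document.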
In a specific embodiment, the step S4 includes:
S41: presetting the keywords corresponding to each column attribute, and analyzing which attributes can be extracted according to the characteristics of each column;
Column attributes may include, for example for the parties, information on each role: the type of party (plaintiff/appellant, defendant/appellee, attorney, counsel, legal representative, and so on), name, gender, date of birth, ethnicity, address, employer or law firm, and so on.
Presetting keywords or word segmentation labels for each column attribute means analyzing which attributes can be extracted from the characteristics of each column and presetting how each attribute is extracted, for example by keyword matching or by word segmentation and part-of-speech tagging. Specifically, the part-of-speech tags include person names, organization names, gender, dates, addresses and the like, and the matching success rate can be raised by expanding the lexicon and the set of tags.
S42: extracting the content of each column of the referee document and preprocessing it.
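As an illustration of keyword-based attribute extraction for the party column, the sketch below parses one invented party line written in the usual style of Chinese referee documents. The role keyword list, the field patterns and the sample sentence are all assumptions made for this example.

```python
import re

# Role keywords, longest first so that a longer role is not shadowed by its prefix:
# appellee, appellant, defendant (criminal), defendant, plaintiff.
ROLE_KEYWORDS = ["被上诉人", "上诉人", "被告人", "被告", "原告"]

BIRTH_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日出生")


def extract_party(line: str) -> dict:
    """Split a party line on commas and classify each field by simple patterns."""
    fields = [f for f in re.split(r"[，,。]", line) if f]
    if not fields:
        return {}
    party = {}
    head = fields[0]
    for role in ROLE_KEYWORDS:
        if head.startswith(role):
            party["role"] = role
            party["name"] = head[len(role):].lstrip("：:")
            break
    for f in fields[1:]:
        birth = BIRTH_PATTERN.fullmatch(f)
        if f in ("男", "女"):            # male / female
            party["gender"] = f
        elif birth:
            year, month, day = birth.groups()
            party["birth_date"] = "{}-{:02d}-{:02d}".format(year, int(month), int(day))
        elif f.endswith("族"):           # ethnicity, e.g. Han
            party["ethnicity"] = f
        elif f.startswith("住"):         # "resides at"
            party["address"] = f[1:]
    return party


# Invented sample: "Defendant Zhang San, male, born 1 January 1980, Han, resides in ..."
sample_party = extract_party("被告人张三，男，1980年1月1日出生，汉族，住广州市天河区。")
```

Real documents vary far more than this, which is presumably why the text also allows word segmentation and tagging as an alternative to fixed keywords.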
In a specific embodiment, the step S4 further includes:
S43: matching keywords sentence by sentence and extracting column attributes, or analyzing the word segmentation structure and part-of-speech tags and labelling the segmented words or phrases as column attributes;
S44: extracting the text content corresponding to each column attribute and writing it into a database.
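The second branch of S43, labelling segmented words as column attributes, can be sketched as a lookup from tags to attributes. Word segmentation itself is out of scope here: the tokens are assumed to arrive already segmented and tagged, and the tag names (`PER`, `ORG`, and so on) and the dictionary standing in for the database are hypothetical.

```python
# Hypothetical mapping from part-of-speech / entity tags to column attributes.
TAG_TO_ATTRIBUTE = {
    "PER": "name",
    "GENDER": "gender",
    "DATE": "birth_date",
    "LOC": "address",
    "ORG": "organization",
}


def label_attributes(tagged_tokens: list) -> dict:
    """S43: map each (token, tag) pair to a column attribute, keeping the first hit."""
    attributes = {}
    for token, tag in tagged_tokens:
        attribute = TAG_TO_ATTRIBUTE.get(tag)
        if attribute and attribute not in attributes:
            attributes[attribute] = token
    return attributes


def write_attributes(database: dict, column: str, attributes: dict) -> None:
    """S44 stand-in: a plain dict plays the role of the database."""
    database.setdefault(column, {}).update(attributes)


# Invented, pre-segmented sample tokens.
tokens = [("Defendant", "O"), ("Zhang San", "PER"), ("male", "GENDER"),
          ("1980-01-01", "DATE"), ("Guangzhou", "LOC")]
db = {}
write_attributes(db, "party", label_attributes(tokens))
```

Expanding `TAG_TO_ATTRIBUTE` corresponds to the lexicon and tag expansion that the description says raises the matching success rate.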
In a specific embodiment, the line text data structure includes: line number, line text content, line state, search keyword and position of the search keyword.
In a specific embodiment, the column retrieval data structure includes: column name, column content, and column content text location.
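The two data structures listed above map naturally onto small record types. The field names, the two line states and the `(line number, character offset)` form of the text position are assumptions; the text only lists what each structure contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineText:
    line_no: int
    content: str
    state: str = "unmatched"           # assumed states: "unmatched" / "matched"
    keyword: Optional[str] = None      # search keyword that matched this line
    keyword_pos: Optional[int] = None  # character offset of the keyword in the line

    def mark_match(self, keyword: str) -> bool:
        """Record the keyword and its position if it occurs in this line."""
        pos = self.content.find(keyword)
        if pos < 0:
            return False
        self.state, self.keyword, self.keyword_pos = "matched", keyword, pos
        return True


@dataclass
class ColumnRecord:
    name: str
    content: str = ""
    text_pos: Optional[tuple] = None   # (line number, character offset)


# Invented sample line.
line = LineText(3, "Presiding judge: Li Si")
if line.mark_match("Presiding judge"):
    value = line.content[line.keyword_pos + len(line.keyword):].lstrip(": ")
    record = ColumnRecord("presiding_judge", value, (line.line_no, line.keyword_pos))
```

Storing the line state on the line itself lets later columns skip lines that have already been consumed.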
An information processing apparatus using the structured processing method of any one of the above embodiments, comprising:
an acquisition unit for reading the text content of the target referee document;
an extraction unit for extracting at least one target keyword from the matched text content;
a first detection unit for traversing the line text module data and matching line by line; and
a storage unit for storing the at least one target keyword in a database.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A structured processing method for a referee document, comprising:
S1: reading a document file to obtain its text content, and preprocessing the text content to obtain text data;
S2: setting column analysis rules and initializing a column retrieval data structure to generate a column retrieval library;
S3: extracting each line of text contained in the text data to obtain line text module data;
S4: extracting the keywords of each column, traversing the line text module data, and matching keywords line by line;
S5: after a successful match, extracting the position information of the keyword and determining the column's value-taking rule, so as to obtain the text content and detailed information of the column.
2. The structured processing method for a referee document according to claim 1, further comprising:
S6: updating the structure information of the line text module data and of the column retrieval data structure;
S7: reading the column storage data structure, storing or updating the column database, and associating the stored columns with the case information and the document file.
3. The structured processing method for a referee document according to claim 1, wherein the step S4 comprises:
S41: presetting the keywords corresponding to each column attribute, and analyzing which attributes can be extracted according to the characteristics of each column;
S42: extracting the content of each column of the referee document and preprocessing it.
4. The structured processing method for a referee document according to claim 3, wherein the step S4 further comprises:
S43: matching keywords sentence by sentence and extracting column attributes, or analyzing the word segmentation structure and part-of-speech tags and labelling the segmented words or phrases as column attributes;
S44: extracting the text content corresponding to each column attribute and writing it into a database.
5. The structured processing method for a referee document according to any one of claims 1 to 4, wherein the line text data structure comprises: line number, line text content, line state, search keyword and position of the search keyword.
6. The structured processing method for a referee document according to any one of claims 1 to 4, wherein the column retrieval data structure comprises: column name, column content, and text position of the column content.
7. An information processing apparatus using the structured processing method according to any one of claims 1 to 6, comprising:
an acquisition unit for reading the text content of the target referee document;
an extraction unit for extracting at least one target keyword from the matched text content;
a first detection unit for traversing the line text module data and matching line by line; and
a storage unit for storing the at least one target keyword in a database.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911333386.7A CN111078839A (en) 2019-12-19 2019-12-19 Structured processing method and processing device for referee document

Publications (1)

Publication Number Publication Date
CN111078839A true CN111078839A (en) 2020-04-28

Family

ID=70316683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911333386.7A Pending CN111078839A (en) 2019-12-19 2019-12-19 Structured processing method and processing device for referee document

Country Status (1)

Country Link
CN (1) CN111078839A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197163A (en) * 2017-12-14 2018-06-22 上海银江智慧智能化技术有限公司 A kind of structuring processing method based on judgement document
CN108763483A (en) * 2018-05-25 2018-11-06 南京大学 A kind of Text Information Extraction method towards judgement document
CN108897770A (en) * 2018-05-25 2018-11-27 南京大学 A kind of law article name authority and case towards judgement document is by being associated with statistical method with law article

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950253A (en) * 2020-08-28 2020-11-17 鼎富智能科技有限公司 Evidence information extraction method and device for referee document
CN111950253B (en) * 2020-08-28 2023-12-08 鼎富智能科技有限公司 Evidence information extraction method and device for referee document
CN113239682A (en) * 2021-05-06 2021-08-10 吉林大学 Method and device for correcting errors of referee documents
CN113239682B (en) * 2021-05-06 2022-11-01 吉林大学 Method and device for correcting errors of referee documents
CN113422671A (en) * 2021-06-30 2021-09-21 北京交通大学 Verification method for judicial public internal and external network data consistency
CN113422671B (en) * 2021-06-30 2022-08-02 北京交通大学 Verification method for judicial public internal and external network data consistency
CN115640380A (en) * 2022-12-14 2023-01-24 北京航空航天大学 Diagnosable hierarchical element information extraction method for fault diagnosis algorithm recommendation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200428