CN111078839A

CN111078839A - Structured processing method and processing device for referee document

Info

Publication number: CN111078839A
Application number: CN201911333386.7A
Authority: CN
Inventors: 王可佳; 张树军; 尹士朝; 谭宁; 张贵森
Original assignee: Guangzhou Pci Data Service Co ltd
Current assignee: Guangzhou Pci Data Service Co ltd
Priority date: 2019-12-19
Filing date: 2019-12-19
Publication date: 2020-04-28

Abstract

The invention discloses a structured processing method for a referee document, comprising the following steps: S1: reading a document file to obtain its text content, and preprocessing the text content to obtain text data; S2: setting column analysis rules and initializing a column retrieval data structure to generate a column retrieval library; S3: extracting each line of text contained in the text data to obtain line text module data; S4: extracting the keywords of each column, traversing the line text module data and performing line-by-line keyword matching; S5: after a match succeeds, extracting the keyword position information and determining the column value-taking rule, so as to obtain the text content and detailed information of the column. The method allows the structured referee document to be stored in a big data platform, giving legal professionals and parties a convenient and fast way to retrieve document information.

Description

Structured processing method and processing device for referee document

Technical Field

The invention relates to the technical field of early warning systems, in particular to a structured processing method and a processing device for a referee document.

Background

When handling cases, criminal law practitioners often need to weigh together the names of the offences, the types of penalty, the sentencing amounts, the legal provisions on which judgments are based, and the like, and use them as reference data in actual work. Such reference data generally comes from the large number of cases already adjudicated and published by the national courts, and is obtained by big data analysis and statistics over those cases.

In the related art, when a big data analysis system analyzes a case, it traverses all legal referee documents related to the case on the fly to obtain the keywords they contain. Criminal cases tried by the people's courts involve many penalty information points, a large amount of information, complex content, diverse forms of expression, and many types of legal provisions on which judgments are based; for example, a criminal case may involve several offences, and the penalty types and sentencing amounts differ from offence to offence. Therefore, when the legal referee document data set is queried in this way, searching the full text word by word puts a heavy load on the server and takes a very long time; moreover, the results of such an ad hoc search cannot reflect the correlations between keywords (such as the correlations between items of penalty information), which hinders big data statistical analysis.

Therefore, how to provide a method for processing referee document information that solves the above problems is a problem to be addressed by those skilled in the art.

Disclosure of Invention

In view of this, the invention provides a structured processing method and a processing device for a referee document. The aim is to analyze referee documents in the legal industry in a more refined way, mine information that covers more dimensions at finer granularity and with higher accuracy, store the information in a big data platform after structured processing, and give legal professionals and parties a convenient and fast way to retrieve document information.

In order to achieve the purpose, the invention adopts the following technical scheme:

A structured processing method for referee documents, comprising:

S1: reading a document file to obtain its text content, and preprocessing the text content to obtain text data;

S2: setting column analysis rules and initializing a column retrieval data structure to generate a column retrieval library;

S3: extracting each line of text contained in the text data to obtain line text module data;

S4: extracting the keywords of each column, traversing the line text module data and performing line-by-line keyword matching;

S5: after a match succeeds, extracting the keyword position information and determining the column value-taking rule, so as to obtain the text content and detailed information of the column.
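Steps S3 to S5 can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the patented implementation: the `COLUMN_LIBRARY` contents, the function name `structure_document`, and the value-taking rule ("take the text after the keyword up to the end of the line") are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical column retrieval library (S2): column name -> preset keywords.
COLUMN_LIBRARY: Dict[str, List[str]] = {
    "case_number": ["Case No."],
    "court": ["Court:"],
}


def structure_document(text_data: str) -> Dict[str, Tuple[int, int, str]]:
    """Return {column name: (line number, keyword position, column content)}."""
    results: Dict[str, Tuple[int, int, str]] = {}
    lines = text_data.splitlines()  # S3: line text module data
    for column, keywords in COLUMN_LIBRARY.items():
        # S4: traverse the lines and match the column keywords line by line.
        for line_no, line in enumerate(lines, start=1):
            hit = next(((kw, line.find(kw)) for kw in keywords if kw in line), None)
            if hit is None:
                continue
            keyword, pos = hit
            # S5: assumed value-taking rule, text after the keyword to end of line.
            content = line[pos + len(keyword):].strip()
            results[column] = (line_no, pos, content)
            break  # first matching line wins in this sketch
    return results
```

Each result keeps the line number and keyword position alongside the content, mirroring the position information the method extracts in S5.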

Preferably, the method further comprises the following steps:

S6: updating the structure information of the line text module data and the column retrieval data structure;

S7: reading the column storage data structure, storing or updating the column database, and associating the stored data with the case information and the document file.

Preferably, the step S4 includes:

S41: presetting a keyword corresponding to each column attribute, and analyzing the extractable attributes according to the characteristics of each column;

S42: extracting the content of each column of the referee document and preprocessing the content.

Preferably, the step S4 further includes:

S43: matching keywords sentence by sentence and extracting the column attributes, or analyzing the word segmentation structure and lexical labels and marking the segmented words or phrases as column attributes;

S44: extracting the text content corresponding to each column attribute, and writing the text content into a database.

Preferably, the line text data structure includes: line number, line text content, line state, search keyword and position of the search keyword.

Preferably, the column retrieval data structure includes: column name, column content, and text position of column content.

An information processing apparatus using any one of the above structuring methods, comprising:

an acquisition unit for reading the text content of the target referee document;

an extraction unit for extracting at least one target keyword from the matched text content;

a first detection unit for traversing the line text module data and performing line-by-line matching; and

a storage unit for storing the at least one target keyword in a database.

As can be seen from the above technical scheme, compared with the prior art, the invention provides a structured processing method and a processing device for a referee document. By presetting a document column retrieval library, documents can be processed dynamically: the position of the column content is obtained from the keywords, the column content is extracted according to the value-taking rule, and the retrieval rules increase the matching speed and reduce the matching failure rate.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flow chart of a structured processing method for referee documents according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, an embodiment of the present invention discloses a structured processing method for a referee document, including:

The preprocessing in step S1 refers to reading the text information of the referee document from a document database and processing special characters.
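The source does not say which special characters are handled, so the following is only a sketch of one plausible preprocessing pass. The function name `preprocess` and the specific normalization choices (Unicode NFKC folding, dropping control and zero-width characters, collapsing spaces, discarding empty lines) are assumptions for illustration.

```python
import re
import unicodedata


def preprocess(raw_text: str) -> str:
    """Normalize raw document text into clean, line-oriented text data."""
    # NFKC folds full-width letters, digits, punctuation and spaces
    # into their ordinary (half-width) forms.
    text = unicodedata.normalize("NFKC", raw_text)
    # Unify line endings and turn tabs into spaces before filtering.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\t", " ")
    # Drop control and format characters (e.g. zero-width space), keep newlines.
    text = "".join(
        ch for ch in text
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces, trim each line, and discard empty lines.
    lines = [re.sub(r" +", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Note that NFKC also converts full-width Chinese punctuation such as "，" to its ASCII counterpart; a real system that matches keywords containing such punctuation would need to normalize its keywords the same way.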

The columns in step S2 are the general names for the valuable information to be extracted from the referee document. Each column has certain characteristics, from which a column name, keywords, retrieval rules and value-taking rules are abstracted so that the program can process all columns uniformly; the column analysis rules can be obtained by a machine learning method.
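One way to picture the abstraction of a column into a name, keywords, a retrieval rule and a value-taking rule is a small record type. The sketch below is hypothetical: `ColumnRule`, `contains`, `text_after_keyword` and `build_column_library` are illustrative names, and representing the rules as plain functions is an assumption rather than something the source specifies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ColumnRule:
    """One entry of the column retrieval library (all field names assumed)."""
    name: str
    keywords: List[str]
    retrieval_rule: Callable[[str, str], bool]   # (line, keyword) -> matched?
    value_rule: Callable[[str, int, str], str]   # (line, position, keyword) -> content


def contains(line: str, keyword: str) -> bool:
    """Simplest retrieval rule: the keyword occurs somewhere in the line."""
    return keyword in line


def text_after_keyword(line: str, position: int, keyword: str) -> str:
    """Simplest value-taking rule: everything after the keyword on the same line."""
    return line[position + len(keyword):].strip(" :")


def build_column_library(rules: List[ColumnRule]) -> Dict[str, ColumnRule]:
    """Index the rules by column name so every column is handled uniformly."""
    return {rule.name: rule for rule in rules}
```

Because every column exposes the same two callables, the matching loop does not need column-specific branches; a new column is added by registering another `ColumnRule`.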

The keyword matching methods include: matching a single keyword; matching terms with the same semantics as a single keyword; matching with a regular expression; and combining two or more keywords into a longer phrase for advanced matching.
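The four matching modes can be sketched as four small functions that each return the match position, or -1 when nothing matches. The `SYNONYMS` table and all function names are hypothetical, and treating "advanced matching" as "all keywords appear in order on the same line" is one possible reading rather than the source's definition.

```python
import re
from typing import Dict, List

# Hypothetical table of semantically equivalent terms.
SYNONYMS: Dict[str, List[str]] = {"defendant": ["accused", "respondent"]}


def match_single(line: str, keyword: str) -> int:
    """Mode 1: plain single-keyword match."""
    return line.find(keyword)


def match_semantic(line: str, keyword: str) -> int:
    """Mode 2: match the keyword or any term with the same semantics."""
    for term in [keyword, *SYNONYMS.get(keyword, [])]:
        pos = line.find(term)
        if pos != -1:
            return pos
    return -1


def match_regex(line: str, pattern: str) -> int:
    """Mode 3: regular-expression match."""
    found = re.search(pattern, line)
    return found.start() if found else -1


def match_combined(line: str, keywords: List[str]) -> int:
    """Mode 4: two or more keywords that must appear in the given order."""
    pattern = ".*?".join(re.escape(k) for k in keywords)
    found = re.search(pattern, line)
    return found.start() if found else -1
```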

Specifically, the method presets keywords, retrieval rules and value-taking rules for each column and assigns weights to the keywords. When several columns preset the same keyword, when several lines of text match the same keyword, or when one line of text matches the keywords of several columns, the preset weights can effectively improve the parsing accuracy.
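The source states that keywords carry weights but not how the weights are combined. As an assumption, the sketch below sums the weights of all keywords found in a line and assigns the line to the highest-scoring column; the weight table, the column names and the function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical preset weights: column -> {keyword: weight}.
# "imprisonment" is shared by both columns, which is the ambiguous case.
COLUMN_KEYWORD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "judgment_result": {"sentenced": 3.0, "imprisonment": 2.0},
    "prosecution_claim": {"accused": 2.0, "imprisonment": 0.5},
}


def score_line(line: str, weights: Dict[str, float]) -> float:
    """Sum the weights of every preset keyword that occurs in the line."""
    return sum(weight for keyword, weight in weights.items() if keyword in line)


def best_column(line: str) -> Optional[str]:
    """Pick the column whose keywords score highest; None if nothing matched."""
    scores = {col: score_line(line, w) for col, w in COLUMN_KEYWORD_WEIGHTS.items()}
    column, score = max(scores.items(), key=lambda item: item[1])
    return column if score > 0 else None
```

With these weights, a line containing both "sentenced" and "imprisonment" resolves to `judgment_result`, while a line containing "accused" and "imprisonment" resolves to `prosecution_claim`, even though both lines hit the shared keyword.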

In a specific embodiment, the method further comprises the following steps:

In step S6, the updated content of the line text module data comprises the keywords, keyword positions, keyword occurrence counts and line states, and the updated content of the column storage data structure comprises the column names, column contents and column content text positions, so that other columns can be retrieved and assigned values based on this information.

S7: reading the column storage data structure, storing or updating the column database, and associating the stored data with the case information and document files, so as to facilitate subsequent queries by application systems.
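A minimal sketch of the store-or-update behavior in S7 follows, using Python's built-in `sqlite3` purely as a stand-in for the big data platform. The table name `column_store`, its schema, and the function `save_columns` are assumptions; the association with the case and the document file is modeled simply as two key columns.

```python
import sqlite3
from typing import Dict, Tuple


def save_columns(conn: sqlite3.Connection, case_id: str, file_name: str,
                 columns: Dict[str, Tuple[str, int]]) -> None:
    """Store or update extracted columns, keyed by case and column name.

    `columns` maps a column name to (content, text position).
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS column_store (
               case_id       TEXT NOT NULL,
               file_name     TEXT NOT NULL,
               column_name   TEXT NOT NULL,
               content       TEXT,
               text_position INTEGER,
               PRIMARY KEY (case_id, column_name)
           )"""
    )
    # INSERT OR REPLACE inserts a new row, or overwrites the existing row
    # that has the same (case_id, column_name) key.
    conn.executemany(
        "INSERT OR REPLACE INTO column_store VALUES (?, ?, ?, ?, ?)",
        [(case_id, file_name, name, content, pos)
         for name, (content, pos) in columns.items()],
    )
    conn.commit()
```

An application can then query by `case_id` to retrieve every column extracted for a case, together with the document file it came from.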

In a specific embodiment, the step S4 includes:

Wherein, for the parties, the column attributes may include the information of each party role, such as the party type (plaintiff/appellant, defendant/appellee, attorney, counsel, legal representative, and the like), name, gender, date of birth, ethnicity, address, organization/legal person, and the like;

Keywords or word segmentation labels corresponding to each column attribute are preset: the extractable attributes are analyzed according to the characteristics of each column, and a mode for extracting each attribute is preset, such as matching and extracting by keyword, or word segmentation and labeling. Specifically, the part-of-speech tags include person names, organization names, genders, dates, addresses and the like, and the matching success rate can be increased by expanding the lexicon and the set of part-of-speech tags.
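The keyword-matching mode of attribute extraction can be sketched with a handful of preset patterns. Everything here is illustrative: the English sentence layout, the patterns in `HEAD` and `ATTRIBUTE_PATTERNS`, and the function `extract_party` are assumptions, and a real system working on Chinese text would more likely rely on the word segmentation and labeling mode that the description also mentions.

```python
import re
from typing import Dict

# Hypothetical pattern for the party type and name at the start of a sentence.
HEAD = re.compile(r"^(Plaintiff|Defendant|Appellant|Appellee)\s+([^,]+)")

# Hypothetical preset keyword patterns, one per column attribute.
ATTRIBUTE_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "gender": re.compile(r"\b(male|female)\b"),
    "birth_date": re.compile(r"born on (\d{4}-\d{2}-\d{2})"),
    "ethnicity": re.compile(r"(\w+) ethnicity"),
    "address": re.compile(r"residing in ([^,.;]+)"),
}


def extract_party(sentence: str) -> Dict[str, str]:
    """Extract party attributes from one sentence by preset keyword patterns."""
    attributes: Dict[str, str] = {}
    head = HEAD.match(sentence)
    if head:
        attributes["party_type"] = head.group(1).lower()
        attributes["name"] = head.group(2).strip()
    for attribute, pattern in ATTRIBUTE_PATTERNS.items():
        found = pattern.search(sentence)
        if found:
            attributes[attribute] = found.group(1)
    return attributes
```

Attributes whose pattern does not match are simply omitted from the result, so a sentence that describes no party yields an empty dictionary.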

In a specific embodiment, the step S4 further includes:

In a specific embodiment, the line text data structure includes: line number, line text content, line state, search keyword and position of the search keyword.

In a specific embodiment, the column retrieval data structure includes: column name, column content, and column content text location.
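The two data structures listed above map naturally onto record types. The sketch below uses Python dataclasses; the field names follow the lists in the text, while the class names, the default values, the `"unmatched"` line state and the `(line number, offset)` form of the text position are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LineText:
    """Line text data structure: one record per line of the document."""
    line_number: int
    content: str
    state: str = "unmatched"                # assumed initial line state
    keyword: Optional[str] = None           # search keyword that matched, if any
    keyword_position: Optional[int] = None  # offset of the keyword in the line


@dataclass
class ColumnRecord:
    """Column retrieval data structure: one record per extracted column."""
    name: str
    content: str = ""
    text_position: Optional[Tuple[int, int]] = None  # assumed (line number, offset)


def split_lines(text_data: str) -> List[LineText]:
    """Build the line text module data from preprocessed text data."""
    return [LineText(number, line)
            for number, line in enumerate(text_data.splitlines(), start=1)]
```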

An information processing apparatus using the structured processing method of any one of the above embodiments, comprising:

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for structured processing of official documents, comprising:

2. A structured processing method for official documents according to claim 1, characterized by further comprising:

3. The information structuring processing method for referee document according to claim 1, wherein the step S4 comprises:

4. The information structuring processing method for referee document according to claim 3, wherein the step S4 further comprises:

5. An information structuring method for official document according to any one of claims 1 to 4, characterized in that said line text data structure comprises: line number, line text content, line state, search keyword and position of the search keyword.

6. The information structuring processing method for referee document according to any one of claims 1 to 4, wherein the column retrieval data structure comprises: column name, column content, and text position of column content.

7. An information processing apparatus using the structuring processing method according to any one of claims 1 to 6, comprising:

the first detection unit is used for traversing the line text module data and performing line-by-line matching; and the storage unit is used for storing the at least one target keyword to a database.