CN113239681B - Court case file identification method - Google Patents

Publication number
CN113239681B
Authority
CN
China
Prior art keywords
case
rule
criminal
paragraph
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110543832.8A
Other languages
Chinese (zh)
Other versions
CN113239681A (en)
Inventor
姜森
谢绍韫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Black Cloud Intelligent Technology Co ltd
Original Assignee
Suzhou Black Cloud Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Black Cloud Intelligent Technology Co ltd
Priority to CN202110543832.8A
Publication of CN113239681A
Application granted
Publication of CN113239681B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Technology Law (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a court case file identification method and belongs to the technical field of natural language processing. The method performs charge (crime name) analysis and feature element extraction on the full text of a case, extracting case elements to assist court staff in analyzing cases. It comprises the following steps: S1: identifying the charge of the case using a rule-based method and a similarity model method; S2: constructing a corpus and a rule base; S3: dividing the document into paragraphs based on semantic and sentence pattern rules; S4: extracting the key feature elements of the case using a rule-based method and an entity recognition method; S5: standardizing the data format; S6: displaying the analysis result. The method analyzes and displays in detail the information that users care about in cases with complex judgment documents, significantly improves the granularity and accuracy of the analysis result, and improves the efficiency with which court staff analyze cases.

Description

Court case file identification method
Technical Field
The invention belongs to the technical field of natural language processing, and relates to a court case file identification method.
Background
With the rapid development of the information age and growing public legal awareness, both the number of criminal judgment documents that courts must process and the quality expected of that processing have risen rapidly, and improving the efficiency with which courts analyze judgment documents has become an urgent problem. In the past, judgment documents were generally studied and analyzed by legal experts. This made document processing slow, made it difficult to quickly build a complete and standardized structure of case elements, and consumed a great deal of manpower and effort, which seriously affected the courts' case-handling efficiency. A technology is therefore needed that helps court staff analyze judgment documents quickly and automatically, intelligently extracts the feature elements of criminal cases, and clearly displays the analysis result to the user.
At present, two types of technology are mainly applied to judicial judgment data in the field of intelligent courts: rule-based information extraction from judgment documents, and similar-case retrieval based on search engines. These applications focus on simple retrieval of judgment documents and on accurate extraction of only part of their data. They do not fully account for the information redundancy and varied wording of judgment documents, cannot accurately extract some of the case information, and therefore struggle to meet the needs of subsequent analysis of judgment document data.
Disclosure of Invention
In view of the above, the object of the present invention is to provide a court case file identification method. The method helps court staff quickly analyze criminal judgment documents. By combining natural language processing with Web development technology, it accurately extracts and analyzes the case information in a judgment document and feeds it back to the user through a clear page display. This meets the need to analyze and assess case information quickly in a variety of judicial scenarios and improves the working efficiency of the court.
In order to achieve the purpose, the invention provides the following technical scheme:
A court case file identification method, which performs charge analysis and feature element extraction on the full text of a case, extracts case elements and assists court staff in case analysis, and which specifically comprises the following steps:
S1: identifying the charge of the case using a rule-based method and a similarity model method;
the similarity model method judges how similar two documents or sentences are from the cosine similarity of their word2vec vectors. The texts are represented as vectors in space and the cosine of the angle between the vectors is computed. The closer the cosine value is to 1, the smaller the angle, i.e., the more similar the two vectors are.
S2: constructing a corpus and a rule base;
S3: dividing the document into paragraphs based on semantic and sentence pattern rules;
S4: extracting the key feature elements of the case using a rule-based method and an entity recognition method;
S5: standardizing the data format;
S6: displaying the analysis result.
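The similarity computation described for S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy embedding table and the choice to represent a text by the mean of its word vectors are assumptions made here, and a real system would load vectors from a trained word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_vector(tokens, embeddings):
    """Represent a text as the average of its known word vectors (an assumption)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical two-dimensional embeddings, for illustration only.
toy_embeddings = {"伤害": [1.0, 0.0], "殴打": [0.9, 0.1], "毒品": [0.0, 1.0]}
doc_a = mean_vector(["伤害", "殴打"], toy_embeddings)
doc_b = mean_vector(["毒品"], toy_embeddings)
similarity = cosine_similarity(doc_a, doc_b)  # small value: the texts differ
```

A value close to 1 indicates that the two texts point in nearly the same direction in the embedding space.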
Optionally, in S1, the rule-based method constructs a rule base of charge sentence patterns and extracts the charge data that match the rule base using regular expressions;
if this extraction fails and no charge data are extracted from the judgment document, a method based on a word2vec similarity model is used;
this method first trains a corpus model for each charge on a large number of judgment documents for that charge, and then analyzes the charge of a new document to be processed using the trained model.
Optionally, in S2, similar paragraphs and case facts are induced and analyzed from a plurality of cases with similar charges. Because the format of a judgment document depends to some extent on the court and the time, sentence pattern rules and keyword lexicons are summarized from a certain number of judgment documents, and different regular expressions and lexicons are specified for different charges: for the crime of intentional injury a lexicon of case-related articles is constructed, and for the crime of drug trafficking a lexicon of drugs is constructed. The corpus and the rule base are completed iteratively using case data and are then used for the subsequent paragraph division and structured data extraction. Regular expressions are used to retrieve and replace text that conforms to a certain pattern or rule.
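The two-stage flow of S1, rules first and similarity model as fallback, can be sketched as below. The two sentence patterns and the `similarity_fallback` callable are hypothetical placeholders; the source does not list the actual rules, and a real rule base would be induced from judgment documents.

```python
import re

# Hypothetical sentence-pattern rules for charges, e.g. "...犯故意伤害罪...".
CHARGE_PATTERNS = [
    re.compile(r"犯([\u4e00-\u9fa5、]{2,20}?罪)"),
    re.compile(r"构成([\u4e00-\u9fa5、]{2,20}?罪)"),
]


def extract_charge_by_rules(text):
    """Return the first charge matched by the rule base, or None."""
    for pattern in CHARGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


def identify_charge(text, similarity_fallback):
    """Try the rules first; call the similarity model only if they fail."""
    charge = extract_charge_by_rules(text)
    if charge is not None:
        return charge, "rule"
    return similarity_fallback(text), "similarity"


result = identify_charge(
    "被告人张某犯故意伤害罪,判处有期徒刑三年。",
    similarity_fallback=lambda text: "unknown",  # stand-in for a trained model
)
```

Returning the source of the decision alongside the charge makes it easy to audit how often the fallback is actually needed.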
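One possible shape for such a rule base is a mapping from charge to its own regular expressions and lexicons, which can be extended as more case data are reviewed. The structure, the entries and the helper names below are illustrative assumptions, not the contents of the actual rule base.

```python
import re

# Hypothetical rule base keyed by charge; entries are examples only.
RULE_BASE = {
    "故意伤害罪": {
        "patterns": [re.compile(r"持(?P<tool>[\u4e00-\u9fa5]{1,6}?)(?:将|朝|向|对)")],
        "lexicons": {"case_article": {"菜刀", "木棍", "啤酒瓶"}},
    },
    "贩卖毒品罪": {
        "patterns": [],
        "lexicons": {"drug": {"海洛因", "冰毒", "甲基苯丙胺"}},
    },
}


def find_lexicon_terms(text, charge, lexicon_name):
    """Return the lexicon terms for this charge that occur in the text."""
    rules = RULE_BASE.get(charge)
    if rules is None:
        return []
    terms = rules["lexicons"].get(lexicon_name, set())
    return sorted(term for term in terms if term in text)


def add_term(charge, lexicon_name, term):
    """Iteratively complete the rule base with a newly observed term."""
    rules = RULE_BASE.setdefault(charge, {"patterns": [], "lexicons": {}})
    rules["lexicons"].setdefault(lexicon_name, set()).add(term)


add_term("贩卖毒品罪", "drug", "大麻")
found = find_lexicon_terms("现场查获大麻十克", "贩卖毒品罪", "drug")
```

Keeping rules per charge means a new charge can be supported by adding an entry rather than changing the extraction code.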
Optionally, in S3, the whole judgment document is divided into a defendant personal information section, a case facts section, a court's opinion section (the "this court holds" paragraph) and a sentencing result section;
wherein the defendant personal information section is extracted based on semantics and comprises the name, year and month of birth, place of birth, ethnicity, education level, occupation and address of each defendant;
the case facts section is divided and extracted based on sentence pattern rules: the opening sentence of the case facts section follows a certain sentence pattern, and by continuously iterating and refining the sentence pattern rules the case facts section can be divided out for all judgment documents;
the court's opinion section is divided by semantic and sentence pattern rules and contains the summary of the defendant's crime and the basis for the judgment;
the sentencing result section contains the sentencing result for each defendant in the trial.
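The division into four sections (defendant information, case facts, court's opinion, sentencing result) can be sketched with sentence-head rules. The head phrases used here ("被告人", "经审理查明", "公诉机关指控", "本院认为", "判决如下") are common in Chinese judgment documents but are assumptions for this example; the source only says such rules are refined iteratively.

```python
import re

SECTION_ORDER = ["defendant_info", "case_facts", "court_opinion", "sentencing_result"]

# Hypothetical sentence-head rules; the sentencing section has no head rule
# because it is entered after a paragraph ending in "判决如下".
HEAD_RULES = {
    "defendant_info": re.compile(r"^被告人"),
    "case_facts": re.compile(r"^(经审理查明|公诉机关指控)"),
    "court_opinion": re.compile(r"^本院认为"),
}


def split_sections(document):
    """Assign each non-empty line to a section; sections only move forward."""
    sections = {name: [] for name in SECTION_ORDER}
    current = -1  # index into SECTION_ORDER; -1 means header text before any section
    for raw in document.splitlines():
        para = raw.strip()
        if not para:
            continue
        # Only later sections may start, so "被告人..." inside the facts
        # or the sentencing result does not jump back to defendant_info.
        for idx in range(current + 1, len(SECTION_ORDER)):
            rule = HEAD_RULES.get(SECTION_ORDER[idx])
            if rule is not None and rule.match(para):
                current = idx
                break
        if current >= 0:
            sections[SECTION_ORDER[current]].append(para)
        if para.rstrip("::").endswith("判决如下"):
            current = SECTION_ORDER.index("sentencing_result")
    return sections
```

Because sections only advance, a paragraph is never reassigned to an earlier section, which matches the fixed ordering of a judgment document.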
Optionally, in S4, numerical features are extracted using sentence pattern rules, and the correct numerical items are extracted through regular expressions and sentence semantics; numerical features are features that contain numbers, for example: 31 years old, 1985;
for enumerated features, a complete lexicon is constructed, and the feature values in the case are selected through regular expressions and sentence semantics based on this lexicon; enumerated features are features that take specific values, for example, occupation: farming, freelance;
entity-type features such as the victim and the location involved are extracted by an entity recognition method, for which the BERT algorithm is selected. These entity-type features include the case-related articles, the location of the case and the people involved.
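The two rule-based extractors of S4 can be sketched as follows: regular expressions for numerical features and lexicon matching for enumerated features. The patterns, the feature names and the occupation lexicon are illustrative assumptions.

```python
import re

# Hypothetical patterns for two numerical features.
AGE_PATTERN = re.compile(r"(\d{1,3})岁")
BIRTH_PATTERN = re.compile(r"(\d{4})年(?:(\d{1,2})月)?(?:(\d{1,2})日)?出生")

# Hypothetical (and certainly incomplete) lexicon for an enumerated feature.
OCCUPATION_LEXICON = ["务农", "无业", "自由职业", "工人", "个体经营"]


def extract_numeric_features(text):
    """Extract numbers whose sentence context identifies what they mean."""
    features = {}
    age = AGE_PATTERN.search(text)
    if age:
        features["age"] = int(age.group(1))
    birth = BIRTH_PATTERN.search(text)
    if birth:
        features["birth_year"] = int(birth.group(1))
    return features


def extract_enumerated(text, lexicon):
    """Return the lexicon values found in the text, longest terms tried first."""
    if not lexicon:
        return []
    ordered = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.findall(text)


sample = "被告人张某,男,1985年3月出生,现年31岁,汉族,务农。"
numeric = extract_numeric_features(sample)
occupations = extract_enumerated(sample, OCCUPATION_LEXICON)
```

Anchoring each number to a unit or keyword ("岁", "出生") is what lets the rules tell an age from a year.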
Optionally, the extraction of entity-type features such as the victim and the location involved by the entity recognition method using the BERT algorithm is as follows:
the first step: data set selection, in which the People's Daily annotated corpus is used for the part-of-speech tagging task and is divided into a training set and a test set at a ratio of 7:3;
the second step: data preprocessing, in which the Chinese text is split into a sequence of individual Chinese characters and each character is labeled with its part of speech;
the labels use the BIO scheme, wherein "B" indicates that the character is the first character of a word, and is also used for a single-character word; "I" indicates that the character is a non-initial character of a word; "O" indicates that the character is not part of any word of interest; the maximum sequence length is set according to the requirements of the BERT model, and each sequence is padded to that length;
the third step: model training, in which parameters such as the model save path, the vocabulary, the pre-trained model configuration, the checkpoint, the maximum sequence length, num_epochs and the learning rate are configured and the model is trained, ensuring when the data are split that every part-of-speech label appears in the training data;
the fourth step: entity recognition and extraction, in which the sentence to be predicted is split into a sequence of single characters and input into the trained model, the model outputs the predicted label for each character, and each character labeled "B" is joined with the characters labeled "I" that follow it until the next character labeled "B" is encountered, thereby separating out the labeled words one by one, from which the victim and the location involved are taken.
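The splicing rule in the fourth step can be sketched as a small decoder over per-character labels. The label names `VICTIM` and `LOC` are hypothetical, the predictions are hard-coded rather than produced by a BERT model, and this version also closes a span when an "O" label is met, which is a common refinement and an assumption beyond the wording above.

```python
def decode_bio(chars, tags):
    """Join each 'B-X' character with the 'I-X' characters that follow it."""
    entities = []
    current_chars, current_type = [], None

    def flush():
        if current_chars:
            entities.append(("".join(current_chars), current_type))

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()  # a new 'B' ends the previous span
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_chars and tag[2:] == current_type:
            current_chars.append(ch)
        else:
            flush()  # 'O' or an inconsistent 'I' ends the span
            current_chars, current_type = [], None
    flush()
    return entities


chars = list("张某在东风路将李某打伤")
tags = ["O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O",
        "B-VICTIM", "I-VICTIM", "O", "O"]
entities = decode_bio(chars, tags)
victims = [text for text, label in entities if label == "VICTIM"]
```

Filtering the decoded spans by label then yields the victim and location items.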
Optionally, in S5, the data obtained in the structured extraction are converted into a standard form of expression: a corresponding mapping mechanism is constructed, and the extracted key feature elements are unified into a standard data format.
The beneficial effects of the invention are as follows: the information that users care about in cases with complex judgment documents is analyzed and displayed in detail, the granularity and accuracy of the analysis result are significantly improved, and the efficiency with which court staff analyze cases is improved.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a flow chart of case file identification.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustration only and are not intended to limit the invention. To better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and they do not represent the size of an actual product; those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
To identify court case files, the key technology is the accurate extraction of the key elements of a case. The overall judgment document analysis process is shown in FIG. 2: the input is a judgment document and the output is the analysis result. The process, in order, identifies the charge, constructs the corpus and rule base, divides the document into paragraphs, performs structured extraction, standardizes the data format and displays the analysis result.
In the invention, for different types of criminal cases, the key case elements mainly comprise the personal information of the defendant (name, year and month of birth, place of birth, ethnicity, education level, occupation and address), the key case information (for example, for the crime of intentional injury: the victim, the tool used, the means used, the place of the crime and the severity; for theft: the stolen case-related articles and the severity as determined by the amount involved), and the charge and sentence (type of punishment, prison term, suspended sentence and fine).
FIG. 1 is the system architecture diagram of the invention, which mainly concerns the design and implementation of the file identification system; the key technology of the system is the accurate extraction of the relevant feature elements.
FIG. 2 is the flow chart of case file identification. The input is a case file (both text input and file input are supported), the output is the case element analysis result, and the steps in between are the intermediate processes and methods of file identification.
A court case file identification method performs charge analysis and feature element extraction on the full text of a case, accurately extracts case elements, assists court staff in analyzing cases and improves working efficiency. It includes the following:
The charge of the case is identified using a rule-based method and a similarity model method. The rule-based method constructs a rule base of charge sentence patterns and extracts the charge data that match the rule base using regular expressions; for judgment documents, the extraction success rate of this method is above 90%. If this extraction fails and no charge data are extracted from the judgment document, a method based on a word2vec similarity model is used. This method first trains a corpus model for each charge on a large number of judgment documents for that charge, and can then analyze the charge of a new document to be processed using the trained model.
The corpus and the rule base are constructed by inducing and analyzing similar paragraphs and case facts from a large number of cases with similar charges and summarizing sentence pattern rules and keyword lexicons. The corpus and the rule base are completed iteratively using a large amount of case data and are used for the subsequent paragraph division and structured data extraction.
Paragraph division is based on semantic and sentence pattern rules and divides the whole judgment document into a defendant personal information section, a case facts section, a court's opinion section (the "this court holds" paragraph) and a sentencing result section. The defendant personal information section is extracted based on semantics and comprises the name, year and month of birth, place of birth, ethnicity, education level, occupation, address and other information of each defendant. The case facts section is divided and extracted based on sentence pattern rules: its opening sentence follows a certain sentence pattern, and by continuously iterating and refining the sentence pattern rules the case facts section can be divided out for all judgment documents. The court's opinion section is divided by semantic and sentence pattern rules and contains the summary of the defendant's crime and the basis for the judgment. Finally, the closing part of the document gives the sentencing result for each defendant in the trial and also follows certain semantic and sentence pattern rules.
Structured extraction uses a rule-based method and an entity recognition method to extract the key feature elements of the case. The information to be extracted differs from charge to charge. For example, the features to be extracted for the crime of intentional injury are the tool used, the means used, the severity of the victim's injury, aggravating factors and mitigating factors, whereas those for the crime of illegal business operations are the articles involved, the amount of money involved, aggravating factors and mitigating factors. Numerical features are extracted using sentence pattern rules, and the correct numerical items are extracted through regular expressions and sentence semantics. For enumerated features, a complete lexicon is constructed, and the feature values in the case are selected through regular expressions and sentence semantics based on this lexicon. Entity-type features such as the victim and the location involved are extracted by an entity recognition method, for which the BERT algorithm is selected; the specific steps are as follows:
the first step: data set selection, in which the People's Daily annotated corpus is used for the part-of-speech tagging task and is divided into a training set and a test set at a ratio of 7:3;
the second step: data preprocessing, in which the Chinese text is split into a sequence of individual Chinese characters and each character is labeled with its part of speech. The labels use the BIO scheme, wherein "B" indicates that the character is the first character of a word, and is also used for a single-character word; "I" indicates that the character is a non-initial character of a word; "O" indicates that the character is not part of any word of interest. The maximum sequence length is set according to the requirements of the BERT model, and each sequence is padded to that length;
the third step: model training, in which parameters such as the model save path, the vocabulary, the pre-trained model configuration, the checkpoint, the maximum sequence length, num_epochs and the learning rate are configured and the model is trained, ensuring when the data are split that every part-of-speech label appears in the training data;
the fourth step: entity recognition and extraction, in which, using the model trained in the previous three steps, the sentence to be predicted is split into a sequence of single characters and input into the trained model, the model outputs the predicted label for each character, and each character labeled "B" is joined with the characters labeled "I" that follow it until the next character labeled "B" is encountered, thereby separating out the labeled words one by one, from which the victim and the location involved are taken.
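The data preparation in the first two steps (7:3 split, padding to a maximum length, and checking that every label appears in the training data) can be sketched in plain Python. The padding token, the padding label and the helper names are assumptions; a real BERT pipeline would also add special tokens and an attention mask, which this sketch omits.

```python
import random

PAD_CHAR = "[PAD]"  # assumed padding token
PAD_TAG = "O"       # assumed label for padded positions


def pad_or_truncate(chars, tags, max_len):
    """Bring one labeled character sequence to exactly max_len positions."""
    chars, tags = chars[:max_len], tags[:max_len]
    missing = max_len - len(chars)
    return chars + [PAD_CHAR] * missing, tags + [PAD_TAG] * missing


def split_train_test(examples, train_parts=7, test_parts=3, seed=0):
    """Shuffle and split the examples at a train_parts:test_parts ratio."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


def labels_missing_from_train(train, all_examples):
    """Labels that occur in the data but not in the training split."""
    seen = {tag for _, tags in train for tag in tags}
    every = {tag for _, tags in all_examples for tag in tags}
    return every - seen


examples = [(list("李某"), ["B-VICTIM", "I-VICTIM"])] * 10
train, test = split_train_test(examples)
padded = pad_or_truncate(list("李某"), ["B-VICTIM", "I-VICTIM"], max_len=4)
```

If `labels_missing_from_train` returns a non-empty set, the split can be redone with a different seed until every label is covered.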
Data format standardization uses a mapping mechanism to convert the data obtained in the structured extraction into a standard form of expression. Because people in different regions express things very differently, and the same piece of information can be written in very different ways, a corresponding mapping mechanism must be constructed to unify the extracted key feature elements into a standard data format. Taking the judgment date as an example, the date may be written in several ways, such as "二〇二〇年三月十六日" or "二零二零年三月十六日", and must be converted by the mapping mechanism into the standard format "2020-03-16".
Once the analysis is complete, the result display outputs the extracted key element data on a page through a clear, intuitive user interface, so that court staff can readily understand the detailed elements of a case. The user interface supports both text input and file input. For the input case data, all the information users care about is extracted through corpus and rule base construction, paragraph division and structured extraction; the extraction result is converted into a standard data format by the mapping mechanism; and the processed result is displayed on the user page for the relevant court staff to consult when analyzing the case, which improves the efficiency of case analysis.
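For the judgment date, one possible mapping from Chinese-numeral dates to the standard "YYYY-MM-DD" form is sketched below. It is an illustrative assumption about how such a mapping could be built: it handles only full year-month-day dates and does not validate calendar ranges.

```python
import re

DIGITS = {"〇": 0, "零": 0, "○": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}

DATE_PATTERN = re.compile(
    r"([〇零○一二三四五六七八九]{4}|\d{4})年"
    r"([一二三四五六七八九十]{1,3}|\d{1,2})月"
    r"([一二三四五六七八九十]{1,3}|\d{1,2})日"
)


def parse_year(text):
    """Years are written digit by digit, e.g. 二〇二〇 means 2020."""
    if text.isdecimal():
        return int(text)
    return int("".join(str(DIGITS[ch]) for ch in text))


def parse_small_number(text):
    """Months and days use 十 for tens, e.g. 十六 is 16 and 二十一 is 21."""
    if text.isdecimal():
        return int(text)
    if "十" in text:
        tens, _, ones = text.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    return DIGITS[text]


def normalize_date(text):
    """Return the first date in the text as YYYY-MM-DD, or None if absent."""
    match = DATE_PATTERN.search(text)
    if not match:
        return None
    year, month, day = match.groups()
    return f"{parse_year(year):04d}-{parse_small_number(month):02d}-{parse_small_number(day):02d}"


standard = normalize_date("二〇二〇年三月十六日")  # "2020-03-16"
```

Other fields, such as amounts of money or prison terms, would need their own mapping functions of the same kind.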
Example:
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. As shown in FIG. 2, an embodiment of the present invention provides a case file identification method, taking the crime of intentional injury as an example. The method comprises the following steps: first, the judgment document is input into the system; to construct the corpus and rule base for each charge, the user needs to download a large number of judgment document files from the China Judgments Online website and classify them by charge;
Further, the charge of the case is identified and analyzed using a rule-based method and a similarity model method. A charge rule base is constructed in advance; when a user inputs a new case, the charge data that match the rule base are extracted using regular expressions and taken as the charge extraction result. If no charge data are extracted from the judgment document, the word2vec similarity model method is used: a corpus model for the crime of intentional injury is trained on a large number of intentional injury documents, and the trained model is used to judge whether the charge is intentional injury.
Further, the corpus and rule base for the crime of intentional injury are constructed: a general keyword lexicon and sentence pattern rules are induced from a large amount of intentional injury document data, and the corpus and rule base are optimized and refined through continuous iteration.
Further, paragraph division is based on the semantics and sentence patterns of each paragraph of the judgment document, and divides the whole document into a defendant personal information section, a case facts section, a court's opinion section and a sentencing result section. Each section contains specific key elements, and more specific element items are then refined from the sections through structured extraction.
Further, structured extraction uses a rule-based method and an entity recognition method to extract the key feature elements of the case. The information to be extracted differs from charge to charge. The features to be extracted for the crime of intentional injury are the tool used, the means used, the severity of the victim's injury, aggravating factors and mitigating factors. Numerical features are extracted using sentence pattern rules, and the correct numerical items are extracted through regular expressions and sentence semantics. For enumerated features, a complete lexicon is constructed, and the feature values in the case are selected through regular expressions and sentence semantics based on this lexicon.
Further, the data format is standardized: numerical data and date data in the extraction result are converted into a standard data format by the mapping mechanism so that users can compare and analyze them.
Further, the analysis result is displayed. The extracted key element data are output on a page through a clear, intuitive user interface, so that court staff can readily understand the detailed elements of a case. The user interface supports both text input and file input. For the input case data, all the information users care about is extracted through corpus and rule base construction, paragraph division and structured extraction; the extraction result is converted into a standard data format by the mapping mechanism; and the processed result is displayed on the user page for the relevant court staff to consult when analyzing the case, which improves the efficiency of case analysis.
A data flow after input of a decision document:
hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. As shown in fig. 2, the data flow in the decision-making portfolio identification method is described by taking the intentional injury crime as an example. The method comprises the following steps: specifically, the judgment document is input into the whole system, and when a user submits a judgment document in doc, docx and txt formats, the background receives the document and reads all the text contents in the document;
Next, the judgment document is identified as an intentional injury case according to the description of the judgment in the document.
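A minimal sketch of the rule-based charge identification follows. The pattern `CHARGE_RULE` and the function name `identify_charge` are hypothetical; the pattern assumes the common judgment wording "被告人…犯…罪" (the defendant ... committed the crime of ...). Returning `None` marks the case in which, according to the claims, the word2vec and similarity-model method would take over; that fallback is not implemented here.

```python
import re
from typing import Optional

# Hypothetical sentence-pattern rule: "被告人<name>犯<charge>罪".
CHARGE_RULE = re.compile(r"被告人[^，。；]*?犯([\u4e00-\u9fa5、]+?罪)")


def identify_charge(text: str) -> Optional[str]:
    """Return the first charge matched by the rule, or None if the rule does not apply."""
    m = CHARGE_RULE.search(text)
    return m.group(1) if m else None


charge = identify_charge("被告人张某犯故意伤害罪，判处有期徒刑一年。")
```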
Next, the full text is divided into paragraphs, and the defendant's personal information, the victim's information, the type of judgment, the place of the crime, the crime tool, the time of judgment, and other information are located in each paragraph in turn. Key feature elements of the case are then extracted by the rule-based method and the entity-recognition-based method.
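The paragraph division can be sketched as a forward-only scan driven by sentence-head rules. The section names, the cue phrases in `SECTION_RULES`, and the function name `divide_paragraphs` are assumptions for illustration, based on wording that commonly opens each part of a Chinese criminal judgment; the patent's actual rule base is not disclosed.

```python
import re
from typing import Dict, List

# Hypothetical sentence-head rules, listed in the order the sections normally appear.
SECTION_RULES = [
    ("defendant_info", re.compile(r"^被告人")),
    ("case_facts", re.compile(r"^(经审理查明|公诉机关指控)")),
    ("court_opinion", re.compile(r"^本院认为")),
    ("sentencing_result", re.compile(r"判决如下")),
]


def divide_paragraphs(full_text: str) -> Dict[str, List[str]]:
    """Assign each non-empty line to a section; sections can only advance, never go back."""
    sections: Dict[str, List[str]] = {"header": []}
    sections.update({name: [] for name, _ in SECTION_RULES})
    stage = -1  # index of the section currently being filled; -1 means header
    for line in full_text.splitlines():
        para = line.strip()
        if not para:
            continue
        # Only rules for later sections are tried, so a sentencing line that
        # starts with "被告人" is not mistaken for defendant information.
        for i, (_, rule) in enumerate(SECTION_RULES):
            if i > stage and rule.search(para):
                stage = i
                break
        key = SECTION_RULES[stage][0] if stage >= 0 else "header"
        sections[key].append(para)
    return sections


judgment = """XX市XX区人民法院刑事判决书
被告人张某，男，1990年1月1日出生，汉族，初中文化，无业，住XX市XX区。
经审理查明，2020年3月5日，被告人张某持水果刀将被害人李某刺伤。
本院认为，被告人张某故意伤害他人身体，致一人轻伤，其行为已构成故意伤害罪。
依照《中华人民共和国刑法》第二百三十四条之规定，判决如下：
被告人张某犯故意伤害罪，判处有期徒刑一年。"""
sections = divide_paragraphs(judgment)
```

This sketch does not handle a court opinion paragraph that itself ends with "判决如下"; in that case the following lines would stay in the court opinion section, which is the kind of gap the iterative rule refinement is meant to close.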
Finally, the extracted information is organized, combined, and displayed to the user.

Claims (1)

1. A court case file identification method, characterized in that the method performs charge analysis and feature element extraction based on the full text of a case, extracts case elements, and assists court staff in analyzing cases, and specifically comprises the following steps:
S1: analyzing the charge of the case by a rule-based method and a similarity-model method;
S2: constructing a corpus and a rule base;
S3: dividing paragraphs based on semantic and sentence-pattern rules;
S4: extracting key feature elements of the case by a rule-based method and an entity-recognition-based method;
S5: standardizing the data format;
S6: displaying the analysis result;
in step S1, the rule-based method constructs a sentence-pattern rule base for charges and extracts the charge data matching the rule base through regular expressions;
if the extraction fails and no charge data are extracted from the judgment, a method based on the word-vector model word2vec and a similarity model is adopted;
this method trains a corpus model for each charge on a large number of judgment documents with the same charge, and then analyzes the charge of a new document to be processed using the trained model;
in step S2, similar paragraphs and cases are summarized and analyzed on the basis of cases with similar charges; since the format of a judgment depends to some extent on the location of the court and the time, sentence-pattern rules and a keyword lexicon are derived by summarizing a set of judgments, and different regular expressions and lexicons are specified for different charges, wherein a lexicon of articles involved in the case is constructed for the crime of intentional injury and a drug lexicon is constructed for the crime of drug trafficking; the corpus and the rule base are iteratively refined with case data and are used for the subsequent paragraph extraction and structured data extraction; regular expressions are used to retrieve and replace text that conforms to a given pattern or rule;
in step S3, the whole judgment is divided into a defendant personal information paragraph, a case facts paragraph, a court's opinion paragraph, and a sentencing result paragraph;
wherein the defendant personal information paragraph is extracted based on semantics and comprises each defendant's name, year and month of birth, place of birth, ethnicity, education level, occupation, and address;
the case facts paragraph is divided and extracted based on sentence-pattern rules: the sentence head and sentence pattern of the case facts paragraph conform to the sentence-pattern rules, and the rules are continuously iterated and refined so that the case facts paragraph can be divided out for all judgment documents;
the sentence-pattern rules are: a paragraph beginning with "this court holds" is the sentence increase/reduction paragraph, and a paragraph whose beginning contains "criminal" and "judgment" is the appraisal paragraph;
the court's opinion paragraph is divided through semantic and sentence-pattern rules and contains a summary of the defendant's crime and the basis of the judgment;
the sentencing result paragraph contains the sentencing result for each defendant in the judgment;
in step S4, key feature elements in numerical form are extracted with sentence-pattern rules, and the correct numerical items are obtained through regular expressions and sentence semantics;
for enumerated key feature elements, a complete lexicon is constructed, and feature values in the case are screened against the lexicon through regular expressions and sentence semantics;
key feature elements that are entity items, namely the victim and the places involved in the case, are extracted by an entity recognition method using the pre-trained text model BERT;
the extraction of the victim and case-location entity items by the entity recognition method using the BERT model proceeds as follows:
the first step: a data set is selected, using as the corpus the People's Daily annotated corpus from the part-of-speech tagging task, and the corpus is divided into a training set and a test set at a ratio of 7:3;
the second step: data preprocessing, in which the Chinese text is split into a sequence of Chinese characters and each character is labeled with its part of speech;
the labels follow the BIO scheme, wherein "B" indicates that the character is the first character of a word, "I" indicates that the character is inside a word, and "O" indicates that the character is not part of any word; the maximum sequence length is set as required by the BERT model, and sequences are padded to that length;
the third step: model training, in which the model is trained after configuring its storage path, vocabulary, pre-trained model configuration, maximum sequence length, number of training epochs (num_epochs), and learning rate, and it is ensured when splitting the data that all part-of-speech labels appear in the training data;
the fourth step: entity recognition and extraction, in which the sentence to be predicted is split into a sequence of single characters and input into the trained model, the model outputs the predicted label for each character, each character labeled B is concatenated with the I-labeled characters that follow it until the next B-labeled character is encountered, the resulting words are separated out with their labels, and the victim and case-location items are taken out;
wherein the sentence is split into single characters character by character.
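The splicing of B and I labels in the fourth step can be sketched as follows. The label names `PER` and `LOC`, the function name `decode_bio`, and the hand-written label sequence are assumptions for illustration; in the claimed method the labels would be predicted by the trained BERT model, which is not reproduced here.

```python
from typing import List, Tuple


def decode_bio(chars: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Join each B-labeled character with the I-labeled characters that follow it."""
    entities: List[Tuple[str, str]] = []
    word, tag = "", ""
    for ch, label in zip(chars, labels):
        if label.startswith("B-"):
            # A new B label closes any open entity and starts the next one.
            if word:
                entities.append((word, tag))
            word, tag = ch, label[2:]
        elif label.startswith("I-") and word and label[2:] == tag:
            word += ch
        else:
            # "O" (or an inconsistent I label) closes the open entity.
            if word:
                entities.append((word, tag))
            word, tag = "", ""
    if word:
        entities.append((word, tag))
    return entities


sentence = "张某在苏州市刺伤李某"
chars = list(sentence)  # split the sentence character by character
# Hypothetical model output, one label per character.
labels = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O", "B-PER", "I-PER"]
entities = decode_bio(chars, labels)
```

Deciding which of the recognized persons is the victim is a separate step that this sketch leaves out.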
CN202110543832.8A 2021-05-19 2021-05-19 Court case file identification method Active CN113239681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110543832.8A CN113239681B (en) 2021-05-19 2021-05-19 Court case file identification method

Publications (2)

Publication Number Publication Date
CN113239681A CN113239681A (en) 2021-08-10
CN113239681B true CN113239681B (en) 2021-10-12

Family

ID=77137468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110543832.8A Active CN113239681B (en) 2021-05-19 2021-05-19 Court case file identification method

Country Status (1)

Country Link
CN (1) CN113239681B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761928A (en) * 2021-09-09 2021-12-07 深圳市大数据研究院 Method for obtaining location of legal document case based on word frequency scoring algorithm
CN116127977B (en) * 2023-02-08 2023-10-03 中国司法大数据研究院有限公司 Casualties extraction method for referee document
CN117852522B (en) * 2024-03-08 2024-06-04 中国科学院空间应用工程与技术中心 Chinese standard multidimensional similarity calculation method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160147737A1 (en) * 2014-11-20 2016-05-26 Electronics And Telecommunications Research Institute Question answering system and method for structured knowledgebase using deep natual language question analysis
CN111145052A (en) * 2019-12-26 2020-05-12 北京法意科技有限公司 Structured analysis method and system of judicial documents

Also Published As

Publication number Publication date
CN113239681A (en) 2021-08-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant