CN113918706A - Information extraction method for administrative punishment decision book - Google Patents

Information extraction method for administrative punishment decision book

Info

Publication number
CN113918706A
CN113918706A
Authority
CN
China
Prior art keywords
text
text information
information
vector
administrative penalty
Prior art date
Legal status
Granted
Application number
CN202111201811.4A
Other languages
Chinese (zh)
Other versions
CN113918706B (en)
Inventor
李玉军
赵思文
贲晛烨
胡伟凤
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202111201811.4A priority Critical patent/CN113918706B/en
Publication of CN113918706A publication Critical patent/CN113918706A/en
Application granted granted Critical
Publication of CN113918706B publication Critical patent/CN113918706B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an information extraction method for administrative penalty decision documents, comprising: step one, crawling administrative penalty decision documents of each province from an administrative penalty document website; step two, extracting the text content within the HTML tags of the decisions obtained in step one and constructing an original data set; step three, preprocessing the decisions to be processed with regular expressions according to the normative drafting rules of administrative penalty decisions to construct a data set; and step four, inputting the data set constructed in step three into an information extraction module trained with the original data set constructed in step two and outputting the administrative penalty document information extraction result. The method accurately obtains the structured information of the decisions, which facilitates understanding administrative penalty decisions and carrying out downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction.

Description

Information extraction method for administrative punishment decision book
Technical Field
The invention relates to the fields of natural language processing and legal artificial intelligence, in particular to an information extraction method of an administrative penalty decision.
Background
Administrative penalty decision documents are important carriers of administrative penalty legal practice; their huge number and complex text content increase the workload and difficulty for practitioners. Information extraction from administrative penalty decisions helps practitioners quickly obtain the required text information, provides a foundation for downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction, and improves the quality and efficiency of administrative penalty adjudication work.
Traditional information extraction relies on manual entry or on manually summarized extraction rules; such rules are not portable, have a narrow application range, consume substantial manpower, are costly to maintain, and offer low accuracy. As research on statistical learning has deepened, classical models such as the hidden Markov model, the maximum entropy Markov model, and conditional random fields have been applied to information extraction in the legal field; although portability and processing speed have improved, accuracy still needs improvement.
In recent years, natural language processing has been widely used in the judicial field, and legal artificial intelligence is receiving much attention. Artificial intelligence can greatly improve the efficiency and accuracy of information extraction and brings convenience to practitioners. However, methods based purely on deep learning or machine learning are affected by text length, context information, and similar factors, and their performance still needs improvement.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides an information extraction method for administrative penalty decision documents.
The invention aims to solve the problems of low efficiency and low accuracy in information extraction from administrative penalty decisions in the judicial field, and provides an information extraction method for administrative penalty decisions that segments and extracts the long text of a document and performs text feature extraction with an information extraction module.
Interpretation of terms:
1. Greedy Fast Causal Inference (GFCI): a hybrid algorithm combining constraint-based and score-based approaches, used to infer and mine causal relationships with high accuracy.
2. Average Treatment Effect (ATE): a measure of the strength of the effect caused by a factor from an overall perspective, used to estimate causal effects and quantify the strength of causal relationships.
The technical scheme of the invention is as follows:
an information extraction method for an administrative penalty decision, comprising:
step one: crawling to obtain administrative penalty decision documents;
step two: extracting the text content of the administrative penalty decisions acquired in step one, and constructing an original data set;
step three: according to the normative drafting rules of administrative penalty decisions, performing data preprocessing on the decisions to be processed by using regular expressions to construct a data set;
step four: inputting the data set constructed in step three into an information extraction module trained by using the original data set constructed in step two, and outputting an administrative penalty document information extraction result.
Preferably, in step two, the text content of the administrative penalty decision includes the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence, the legally responsible person, the identity document number, the case source and investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit.
Preferably, in step three, regular expressions are used to match character strings that satisfy syntactic rules, and both the structured information in the administrative penalty decision and the body content of the unstructured case description are extracted. The structured information includes the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence, the legally responsible person, and the identity document number; the body content of the unstructured case description includes the case source and investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit.
Preferably, the information extraction module comprises, in sequence, a pre-training language module, a context information acquisition module, a weight distribution module, and a feature classification module;
the pre-training language module comprises a sliding-window self-attention mechanism and a global attention mechanism; the sliding-window self-attention mechanism applies windowed self-attention of fixed size ω × ω to the text information sequence, obtains text information sequences at different positions by sliding the window, and combines the multiple windowed attentions obtained by sliding to produce a large receptive field and construct a local context information sequence; the global attention mechanism takes in the complete text information sequence and constructs a representation covering the entire input text information sequence;
the context information acquisition module comprises a forward neural network, a backward neural network and a hidden layer;
the input text information sequence is fed through the forward neural network into the hidden layer to compute the future text information sequence; the backward neural network propagates the input text information sequence backward through time, computes the output text information sequence first and then feeds it back to the hidden layer, yielding the historical text information sequence;
the weight distribution module computes the contextual causal relationships with the GFCI algorithm, which combines constraint-based and score-based approaches: taking the future text information sequence and the historical text information sequence as input, it searches for causal relationships between the text information sequences with a greedy algorithm, computes the causal relationships with fast causal inference, measures the causal strength of the text information sequences with the average treatment effect (ATE), and assigns weights according to the causal relationships;
the feature classification module computes conditional probabilities between the text information sequence and the output sequence, checks and corrects the extraction result for the text information sequence, feeds it into a Viterbi decoder to decode the text information sequence, and outputs the text information as the information extraction result.
Preferably, the fourth step of the present invention comprises the following steps:
inputting the data set constructed in step three into the pre-training language module, and, according to the text characteristics of administrative penalty decisions, obtaining short text information sequences through the sliding-window self-attention mechanism, comprising the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence, the legally responsible person, and the identity card number; obtaining long text information sequences, comprising the case source and investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit, through the combination of the sliding-window self-attention mechanism and the global attention mechanism; after the text content in the data set passes through the pre-training language module, constructing a word-level text vector matrix X = {x1, x2, ..., xN} as output, where xi denotes a feature vector extracted from the administrative penalty decision and i ∈ N;
inputting the word-level text vector matrix X = {x1, x2, ..., xN} into the context information acquisition module; the forward neural network feeds the input text vectors into the forward hidden layer, and the hidden layer obtains the future text information vector Y = {y1, y2, ..., yN} of the text vectors, where yi denotes a feature vector of future text information; the backward neural network propagates the input text vectors backward through time, computes the output vectors, and feeds them back to the hidden layer to obtain the historical text information vector Z = {z1, z2, ..., zN} of the text vectors, where zi denotes a feature vector of historical text information;
the weight distribution module is based onText causal information is future text information vector Y ═ Y1,y2,…,yNZ and a history text information vector Z ═ Z1,z2,...,zNCalculating and distributing different weights; after weight normalization, weight calculation is performed to obtain an output vector W ═ ω { ω ═ c12,…,ωN};
the feature classification module computes conditional probabilities between the input vector W = {ω1, ω2, ..., ωN} and the output vector β = {β1, β2, ..., βN} produced by the feature classification module, feeds the text information sequence into a Viterbi decoder for decoding, and outputs the text information to obtain the information extraction result.
Preferably, the sliding-window self-attention mechanism reduces or divides the context information sequence into smaller sequences, slides a fixed-size window to obtain text information sequences at different positions, and combines the multiple windowed attentions obtained by sliding to produce a large receptive field and construct a local context information sequence; the window size is set to ω × ω, the length of the text information sequence is assumed to be n, and ω < n.
Preferably, the weight distribution module analyzes the causal relationships of the context information with the greedy fast causal inference algorithm and measures the causal strength with the average treatment effect. The contextual causal relationship is treated as a binary variable; assuming the causal relationship Y → Z, the causal strength is given by formula (I):

ATE(Y → Z) = E[Z | do(Y = 1)] − E[Z | do(Y = 0)]    (I)

In formula (I), E denotes expectation and the do operator denotes an intervention on Y; a large value of ATE(Y → Z) indicates a strong causal relationship, and a small value indicates a weak causal relationship.

The causal strengths are assigned as a weight vector: a strong causal relationship receives a high weight and a weak causal relationship receives a low weight. After weight normalization, the weight calculation yields the output vector W = {ω1, ω2, ..., ωN}.
A computer device comprising a memory storing a computer program and a processor implementing the steps of an information extraction method for an administrative penalty decision when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of an information extraction method for an administrative penalty decision.
The invention has the beneficial effects that:
1. The invention provides an information extraction method for administrative penalty decisions that accurately obtains the structured information of the decision, which facilitates understanding administrative penalty decisions and carrying out downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction.
2. The information extraction module preprocesses text by combining a sliding-window self-attention mechanism with a global attention mechanism, which resolves the long-range dependencies of the text sequence; it fully considers the context information of the document through a bidirectional neural network, prevents information loss by computing causal-strength weight assignments, effectively handles the dependencies between similar texts before and after the output, and improves the accuracy and efficiency of information extraction.
Drawings
FIG. 1 is a schematic view of an information extraction process for an administrative penalty decision according to the present invention;
FIG. 2 is a flow chart of an information extraction module according to the present invention.
FIG. 3(a) is a schematic diagram of the sliding window self-attention mechanism of the present invention.
FIG. 3(b) is a schematic diagram of the global attention mechanism of the present invention.
FIG. 4 is a diagram of a context information obtaining module according to the present invention.
Detailed Description
In order to facilitate understanding of the invention, the invention will be described in further detail below with reference to the drawings and examples. It should be understood that the described embodiments are only for explaining the invention, and are not used for limiting the invention.
Example 1:
an information extraction method for an administrative penalty decision, as shown in fig. 1, includes:
Step one: crawl the administrative penalty decision documents of each province from an administrative penalty document website, for later construction of the data set.
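A minimal crawling sketch for step one is given below; the site URL, the "province"/"page" query parameters, and the pagination scheme are placeholders (the patent names only an administrative penalty document website), and the use of the requests library is an illustrative assumption rather than the patent's implementation.

```python
# Minimal sketch of step one (assumptions: "https://example.org/penalty-decisions"
# is a placeholder for the administrative penalty document website named in the
# patent, and the "province"/"page" query parameters are illustrative, not a real API).
import pathlib
import time

import requests

BASE_URL = "https://example.org/penalty-decisions"  # placeholder, not the real site

def crawl_decisions(province: str, pages: int, out_dir: str = "raw_html") -> None:
    """Download the listing pages for one province and save the raw HTML to disk."""
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for page in range(1, pages + 1):
        resp = requests.get(BASE_URL, params={"province": province, "page": page}, timeout=30)
        resp.raise_for_status()
        (out / f"{province}_{page}.html").write_text(resp.text, encoding="utf-8")
        time.sleep(1)  # be polite to the server

if __name__ == "__main__":
    crawl_decisions("shandong", pages=3)
```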
Step two: extract the text content within the HTML tags of the administrative penalty decisions obtained in step one, construct the original data set, and obtain a csv file.
Step three: according to the normative drafting rules of administrative penalty decisions, preprocess the decisions to be processed with regular expressions, construct the data set, and obtain a csv file.
Step four: input the data set constructed in step three into the information extraction module trained with the original data set constructed in step two, extract information from the long passages found upon investigation by the authority, and output the administrative penalty document information extraction result.
Example 2:
An information extraction method for an administrative penalty decision according to Embodiment 1, differing in that:
in the second step, the < div > tag and the < p > tag of html are removed by using a strip () function in python, and the text content of the administrative penalty decision is obtained, wherein the text content of the administrative penalty decision comprises a decision text number, a party, a subject qualification license name, a unified social credit code, a residence (address), a legal responsible person (a principal and an operator), an identity document number, a case source and survey pass, a case fact, evidence proof (an administrative penalty informing situation, a party statement, a declaration, a listening evidence, a review and an adoption situation and reason), illegal behavior nature qualification, a penalty basis, a free cutting amount fact and reason, an administrative penalty fulfillment mode and term, a relief way and term.
In step three, regular expressions are used to match character strings that satisfy syntactic rules, and both the structured information in the administrative penalty decision and the body content of the unstructured case description are extracted. For example, a syntactic rule matching case-fact passages introduced by phrases such as "upon inquiry" or "upon investigation" extracts the case facts from the document, and a rule such as "party: \w+?(。|，)" extracts the parties' names, and so on. The structured information includes the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence (address), the legally responsible person (person in charge, operator), and the identity document number; the body content of the unstructured case description includes the case source and investigation process, the case facts, the evidence (the administrative penalty notification, the party's statements and defenses, the hearing opinions and their review and adoption with reasons), the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit.
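A hedged sketch of the regular-expression preprocessing in step three; the patterns below are illustrative stand-ins, since the patent's exact expressions are only partially legible in the published text.

```python
# Minimal sketch of step three (assumption: the field patterns below are
# illustrative stand-ins for the patent's regular expressions, which are only
# partially legible in the published text).
import re

FIELD_PATTERNS = {
    # "party: ..." up to the next Chinese comma/period
    "party": re.compile(r"当事人[：:](?P<value>\w+?)[。，,]"),
    # unified social credit code: 18 alphanumeric characters
    "credit_code": re.compile(r"统一社会信用代码[：:]?(?P<value>[0-9A-Z]{18})"),
    # case facts introduced by "经查" / "经调查" ("upon inquiry/investigation")
    "case_facts": re.compile(r"经(?:调)?查[，,](?P<value>.+?)(?=上述事实|以上事实|$)", re.S),
}

def extract_fields(decision_text: str) -> dict:
    """Apply each pattern and keep the first match per field (None if absent)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(decision_text)
        result[name] = m.group("value").strip() if m else None
    return result
```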
As shown in fig. 2, the information extraction module includes a pre-training language module, a context information acquisition module, a weight distribution module, and a feature classification module in sequence;
The pre-training language module comprises a sliding-window self-attention mechanism and a global attention mechanism. The sliding-window self-attention mechanism applies windowed self-attention of fixed size ω × ω to the text information sequence, obtains text information sequences at different positions by sliding the window, and combines the multiple windowed attentions obtained by sliding to produce a large receptive field and construct a local context information sequence. The global attention mechanism takes in the complete text information sequence and constructs a representation covering the entire input text information sequence. As shown in fig. 3(a) and fig. 3(b), a short text information sequence contains little text information, so the sliding-window self-attention mechanism alone can capture its content while saving memory and computation time; a long text information sequence contains more text information, and using the sliding-window self-attention mechanism alone may truncate and lose important text information, so the combination of the sliding-window self-attention mechanism and the global attention mechanism is adopted to guarantee the completeness of the text information sequence while reducing memory and computation cost.
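The combination of windowed and global attention can be sketched as a sparse attention mask, in the spirit of Longformer-style models; the NumPy implementation, window handling, and choice of global positions below are illustrative assumptions, not the patent's exact pre-training language module.

```python
# Minimal sketch of the combined attention pattern (assumption: this mimics a
# Longformer-style mask; the window size, global positions, and use of NumPy
# are illustrative choices, not the patent's exact implementation).
import numpy as np

def build_attention_mask(seq_len: int, window: int, global_positions) -> np.ndarray:
    """Return a boolean mask where mask[i, j] == True means token i may attend to token j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    # sliding-window self-attention: each token sees a fixed-size local neighborhood
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True
    # global attention: selected tokens attend to, and are attended by, every token
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

def masked_self_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted to the allowed positions."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)          # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With a window of size ω on a sequence of length n, each row of the mask keeps only on the order of ω positions (plus the global tokens), which is where the O(n × ω) cost discussed later comes from.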
As shown in fig. 4, the context information obtaining module includes a forward neural network, a backward neural network and a hidden layer;
The input text information sequence is fed through the forward neural network into the hidden layer to compute the future text information sequence; the backward neural network propagates the input text information sequence backward through time, computes the output text information sequence first and then feeds it back to the hidden layer, yielding the historical text information sequence;
The weight distribution module computes the contextual causal relationships with the GFCI algorithm, which combines constraint-based and score-based approaches: taking the future text information sequence and the historical text information sequence as input, it searches for causal relationships between the text information sequences with a greedy algorithm, computes the causal relationships with fast causal inference, measures the causal strength of the text information sequences with the average treatment effect (ATE), and assigns weights according to the causal relationships;
The feature classification module computes conditional probabilities between the text information sequence and the output sequence, checks and corrects the extraction result for the text information sequence, feeds it into a Viterbi decoder to decode the text information sequence, and outputs the text information as the information extraction result.
Example 3:
An information extraction method for an administrative penalty decision according to Embodiment 2, differing in that:
the fourth step comprises the following steps:
Input the data set constructed in step three into the pre-training language module. According to the text characteristics of administrative penalty decisions, short text information sequences, comprising the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence (address), the legally responsible person (person in charge, operator), and the identity document number, are obtained through the sliding-window self-attention mechanism; long text information sequences, comprising the case source and investigation process, the case facts, the evidence (the administrative penalty notification, the party's statements and defenses, the hearing opinions and their review and adoption with reasons), the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit, are obtained through the combination of the sliding-window self-attention mechanism and the global attention mechanism. After the text content in the data set passes through the pre-training language module, a word-level text vector matrix X = {x1, x2, ..., xN} is constructed as output, where xi denotes a feature vector extracted from the administrative penalty decision and i ∈ N;
the sliding window self-attention mechanism can obtain a local context information sequence, and the global attention mechanism can obtain a complete sequence, so that the integrity of the text sequence is ensured.
The word-level text vector matrix X = {x1, x2, ..., xN} is input into the context information acquisition module; the forward neural network feeds the input text vectors into the forward hidden layer, and the hidden layer obtains the future text information vector Y = {y1, y2, ..., yN} of the text vectors, where yi denotes a feature vector of future text information; the backward neural network propagates the input text vectors backward through time, computes the output vectors, and feeds them back to the hidden layer to obtain the historical text information vector Z = {z1, z2, ..., zN} of the text vectors, where zi denotes a feature vector of historical text information;
context information and causal relationship are fully extracted through the bidirectional neural network, and ambiguity and information loss are prevented. Compared with a Bi-LSTM neural network of a common method, a future text information vector and a historical text vector acquired by the context information module are not automatically spliced after being processed to serve as output vectors, but the text relation needs to be analyzed, weight distribution and weight calculation are carried out again, and the accuracy rate is increased for text information extraction.
The administrative penalty decision contains contextual information with causal relationships, such as the case source, the investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, and the administrative penalty itself. The bidirectional neural network can fully exploit the contextual text information and retain the complete causal relationships among the pieces of information, which helps the next module judge the strength of those causal relationships. For example, the module takes as context information the passages in a document stating that the disposable protective masks produced by the party failed inspection on the item "particle filtration efficiency (salt medium)", that this violated Article 32 of the Product Quality Law of the People's Republic of China, and that it constituted the violation of producing products by passing off substandard products as qualified products; the next module then assigns the weights.
The weight distribution module assigns different weights to the future text information vector Y = {y1, y2, ..., yN} and the historical text information vector Z = {z1, z2, ..., zN} according to the textual causal information; after weight normalization, the weight calculation yields the output vector W = {ω1, ω2, ..., ωN};
The feature classification module computes conditional probabilities between the input vector W = {ω1, ω2, ..., ωN} and the output vector β = {β1, β2, ..., βN} produced by the feature classification module, feeds the text information sequence into a Viterbi decoder for decoding, and outputs the text information to obtain the information extraction result. The output extraction result comprises the party, the case facts, the violations, the penalty bases, and the like, and is used for downstream tasks.
The feature classification module fully considers the dependency between context texts and the property difference between similar texts, and can improve the accuracy of feature classification.
Specifically, in an administrative penalty decision, the passage "the party's act of continuing to use an elevator that failed inspection violates the third paragraph of Article 40 of the Special Equipment Safety Law of the People's Republic of China" describes the violation, while the passage "the third paragraph of Article 40 of the Special Equipment Safety Law of the People's Republic of China provides that special equipment that has not undergone regular inspection or that has failed inspection shall not continue to be used" is the legal characterization of the violation. The feature classification module processes and judges such similar texts again, which improves the accuracy of information extraction. The output extraction result comprises the party, the case facts, the violations, the penalty bases, and the like, and is used for downstream tasks.
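A minimal Viterbi decoding sketch for the feature classification module; the emission/transition score inputs and the NumPy implementation are illustrative assumptions (the patent specifies conditional-probability computation followed by Viterbi decoding, in the manner of a CRF output layer, without giving the exact formulation).

```python
# Minimal sketch of Viterbi decoding over per-token label scores (assumption:
# the emission and transition scores come from the feature classification
# module; the label set and NumPy implementation are illustrative).
import numpy as np

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """emissions: [N, L] score of label l at position t; transitions: [L, L]
    score of moving from label i to label j. Returns the best label sequence."""
    n_steps, n_labels = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n_steps, n_labels), dtype=int)
    for t in range(1, n_steps):
        # candidate[i, j] = best score ending in label i, then transitioning to j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # trace back the highest-scoring path
    best = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

# Example: 5 tokens, 3 labels (e.g. B, I, O), random scores for illustration.
rng = np.random.default_rng(0)
print(viterbi_decode(rng.normal(size=(5, 3)), rng.normal(size=(3, 3))))
```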
The sliding-window self-attention mechanism reduces or divides the context information sequence into smaller sequences, slides a fixed-size window to obtain text information sequences at different positions, and combines the multiple windowed attentions obtained by sliding to produce a large receptive field and construct a local context information sequence; the window size is set to ω × ω, the length of the text information sequence is assumed to be n, and ω < n.
Computational complexity measures how memory and time cost grow with the amount of data. For a text information sequence of length n, the computational complexity of a conventional Transformer is O(n²), while the complexity of a single window of the sliding-window self-attention mechanism is O(ω²); after n/ω window slides, the total complexity of the sliding-window self-attention mechanism is O(n × ω), which is far less than O(n²). This saves the running time and memory required for computation and improves the efficiency of information acquisition.
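A quick arithmetic check of the complexity comparison; the concrete values of n and ω are illustrative, since the patent does not fix them.

```python
# Quick arithmetic check of the complexity comparison (illustrative values of n
# and ω; the patent does not fix them).
n, w = 4096, 512
full_attention = n * n      # O(n^2) score entries for standard self-attention
sliding_window = n * w      # O(n * ω) entries after n/ω window slides
print(full_attention, sliding_window, full_attention // sliding_window)  # 16777216 2097152 8
```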
Compared with the traditional self-attention mechanism, this saves memory and computation; however, for long texts the reduction and division performed by the sliding-window self-attention mechanism alone may lose important information, so the global attention mechanism is combined with it to build a complete sequence for prediction, preventing the loss of important information and increasing the accuracy and efficiency of text information extraction.
The weight distribution module analyzes the causal relationships of the context information with the greedy fast causal inference algorithm and measures the causal strength with the average treatment effect. The contextual causal relationship is treated as a binary variable; assuming the causal relationship Y → Z, the causal strength is given by formula (I):

ATE(Y → Z) = E[Z | do(Y = 1)] − E[Z | do(Y = 0)]    (I)

In formula (I), E denotes expectation and the do operator denotes an intervention on Y; a large value of ATE(Y → Z) indicates a strong causal relationship, and a small value indicates a weak causal relationship.

The causal strengths are assigned as a weight vector: a strong causal relationship receives a high weight and a weak causal relationship receives a low weight. After weight normalization, the weight calculation yields the output vector W = {ω1, ω2, ..., ωN}. This increases the accuracy of text information extraction.
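A hedged sketch of the causal-strength weighting; the binary encoding of the text variables, the difference-of-means ATE estimate, and the normalization scheme are illustrative assumptions, and the causal graph itself is assumed to come from a GFCI implementation (for example the one in the TETRAD toolbox), which is not reproduced here.

```python
# Minimal sketch of the causal-strength weighting (assumption: the causal graph
# is taken as given, e.g. from a GFCI implementation such as the one in TETRAD;
# the binary encoding of the text variables and the normalization are illustrative).
import numpy as np

def average_treatment_effect(y: np.ndarray, z: np.ndarray) -> float:
    """ATE(Y -> Z) = E[Z | do(Y=1)] - E[Z | do(Y=0)], estimated here, under an
    assumed unconfounded binary setting, by the difference of conditional means."""
    return float(z[y == 1].mean() - z[y == 0].mean())

def causal_weights(pairs) -> np.ndarray:
    """pairs: list of (y, z) binary arrays, one per text-position pair with a
    detected causal edge. Strong edges get high weight, weak edges low weight."""
    strengths = np.array([abs(average_treatment_effect(y, z)) for y, z in pairs])
    return strengths / strengths.sum()          # weight normalization

# Example: a strong edge (z almost always follows y) vs. a weak one (independent z).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
strong_z = (y ^ (rng.random(1000) < 0.05)).astype(int)   # mostly copies y
weak_z = rng.integers(0, 2, 1000)                         # independent of y
print(causal_weights([(y, strong_z), (y, weak_z)]))
```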
Specifically, in an administrative penalty decision, the "act of producing products by passing off substandard products as qualified products" violates Article 32 of the Product Quality Law of the People's Republic of China but has no causal relationship with item 4 of the first paragraph of Article 27 of the same law; therefore, when the output vector is constructed from the future text information vector and the historical text information vector, different weights are assigned according to the strength of the textual relationships.
Compared with traditional methods, this method accurately obtains the structured information of the decision, which facilitates understanding administrative penalty decisions and carrying out downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction, and the provided information extraction module improves the accuracy and efficiency of information extraction.
Example 4:
a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method for extracting information for an administrative penalty decision according to any one of embodiments 1 to 3 when executing the computer program.
Example 5:
a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for extracting information for an administrative penalty decision described in any one of embodiments 1-3.

Claims (9)

1. An information extraction method for an administrative penalty decision, comprising:
step one: crawling to obtain administrative penalty decision documents;
step two: extracting the text content of the administrative penalty decisions acquired in step one, and constructing an original data set;
step three: according to the normative drafting rules of administrative penalty decisions, performing data preprocessing on the decisions to be processed by using regular expressions to construct a data set;
step four: inputting the data set constructed in step three into an information extraction module trained by using the original data set constructed in step two, and outputting an administrative penalty document information extraction result.
2. The information extraction method of the administrative penalty decision according to claim 1, wherein the information extraction module comprises a pre-training language module, a context information acquisition module, a weight distribution module and a feature classification module in sequence;
the pre-training language module comprises a sliding-window self-attention mechanism and a global attention mechanism; the sliding-window self-attention mechanism applies windowed self-attention of fixed size ω × ω to the text information sequence, obtains text information sequences at different positions by sliding the window, and combines the multiple windowed attentions obtained by sliding to produce a large receptive field and construct a local context information sequence; the global attention mechanism takes in the complete text information sequence and constructs a representation covering the entire input text information sequence;
the context information acquisition module comprises a forward neural network, a backward neural network and a hidden layer;
the input text information sequence is fed through the forward neural network into the hidden layer to compute the future text information sequence; the backward neural network propagates the input text information sequence backward through time, computes the output text information sequence first and then feeds it back to the hidden layer, yielding the historical text information sequence;
the weight distribution module computes the contextual causal relationships with the GFCI algorithm, which combines constraint-based and score-based approaches: taking the future text information sequence and the historical text information sequence as input, it searches for causal relationships between the text information sequences with a greedy algorithm, computes the causal relationships with fast causal inference, measures the causal strength of the text information sequences with the average treatment effect (ATE), and assigns weights according to the causal relationships;
the feature classification module computes conditional probabilities between the text information sequence and the output sequence, checks and corrects the extraction result for the text information sequence, feeds it into a Viterbi decoder to decode the text information sequence, and outputs the text information as the information extraction result.
3. The method for extracting information of administrative penalty decisions according to claim 2, wherein the fourth step comprises the steps of:
inputting the data set constructed in step three into the pre-training language module, and, according to the text characteristics of administrative penalty decisions, obtaining short text information sequences through the sliding-window self-attention mechanism, comprising the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence, the legally responsible person, and the identity card number; obtaining long text information sequences, comprising the case source and investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit, through the combination of the sliding-window self-attention mechanism and the global attention mechanism; after the text content in the data set passes through the pre-training language module, constructing a word-level text vector matrix X = {x1, x2, ..., xN} as output, where xi denotes a feature vector extracted from the administrative penalty decision and i ∈ N;
inputting the word-level text vector matrix X = {x1, x2, ..., xN} into the context information acquisition module; the forward neural network feeds the input text vectors into the forward hidden layer, and the hidden layer obtains the future text information vector Y = {y1, y2, ..., yN} of the text vectors, where yi denotes a feature vector of future text information; the backward neural network propagates the input text vectors backward through time, computes the output vectors, and feeds them back to the hidden layer to obtain the historical text information vector Z = {z1, z2, ..., zN} of the text vectors, where zi denotes a feature vector of historical text information;
the weight distribution module is used for distributing a future text information vector Y to { Y ═ Y according to the text cause and effect information1,y2,...,yNZ and a history text information vector Z ═ Z1,z2,...,zNCalculating and distributing different weights; after weight normalization, weight calculation is performed to obtain an output vector W ═ ω { ω ═ c1,ω2,…,ωN};
the feature classification module computes conditional probabilities between the input vector W = {ω1, ω2, ..., ωN} and the output vector β = {β1, β2, ..., βN} produced by the feature classification module, feeds the text information sequence into a Viterbi decoder for decoding, and outputs the text information to obtain the information extraction result.
4. The method according to claim 2, wherein the sliding-window self-attention mechanism reduces or divides the context information sequence into smaller sequences, slides a fixed-size window to obtain text information sequences at different positions, and combines the multiple windowed attentions obtained by sliding to produce a large receptive field and construct a local context information sequence; the window size is set to ω × ω, the length of the text information sequence is assumed to be n, and ω < n.
5. The information extraction method for an administrative penalty decision according to claim 2, wherein the weight distribution module implements the causal relationship analysis of the context information with a greedy fast causal inference algorithm and measures the causal strength with the average treatment effect; the contextual causal relationship is set as a binary variable, and assuming the causal relationship Y → Z, the causal strength is given by formula (I):

ATE(Y → Z) = E[Z | do(Y = 1)] − E[Z | do(Y = 0)]    (I)

in formula (I), E denotes expectation and the do operator denotes an intervention on Y; a large value of ATE(Y → Z) indicates a strong causal relationship and a small value indicates a weak causal relationship;

the causal strengths are assigned as a weight vector: a strong causal relationship receives a high weight and a weak causal relationship receives a low weight; after weight normalization, the weight calculation yields the output vector W = {ω1, ω2, ..., ωN}.
6. The information extraction method for an administrative penalty decision according to claim 1, wherein in step two, the text content of the administrative penalty decision comprises the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence, the legally responsible person, the identity card number, the case source and investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit.
7. The information extraction method for an administrative penalty decision according to claim 1, wherein in step three, regular expressions are used to match character strings that satisfy syntactic rules, and the structured information in the administrative penalty decision and the body content of the unstructured case description are extracted; the structured information comprises the decision document number, the party, the name of the subject qualification license, the unified social credit code, the residence, the legally responsible person, and the identity document number, and the body content of the unstructured case description comprises the case source and investigation process, the case facts, the evidence, the legal characterization of the violation, the penalty basis, the facts and reasons for the discretionary penalty, the manner and time limit for fulfilling the administrative penalty, and the remedy channel and time limit.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of a method of information extraction for an administrative penalty decision according to any one of claims 1 to 7.
9. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the method for extracting information for an administrative penalty decision according to any of claims 1 to 7.
CN202111201811.4A 2021-10-15 2021-10-15 Information extraction method of administrative punishment decision book Active CN113918706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111201811.4A CN113918706B (en) 2021-10-15 2021-10-15 Information extraction method of administrative punishment decision book

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111201811.4A CN113918706B (en) 2021-10-15 2021-10-15 Information extraction method of administrative punishment decision book

Publications (2)

Publication Number Publication Date
CN113918706A true CN113918706A (en) 2022-01-11
CN113918706B CN113918706B (en) 2024-05-28

Family

ID=79240643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111201811.4A Active CN113918706B (en) 2021-10-15 2021-10-15 Information extraction method of administrative punishment decision book

Country Status (1)

Country Link
CN (1) CN113918706B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117609487A (en) * 2024-01-19 2024-02-27 武汉百智诚远科技有限公司 Legal provision quick retrieval method and system based on artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180300400A1 (en) * 2017-04-14 2018-10-18 Salesforce.Com, Inc. Deep Reinforced Model for Abstractive Summarization
CN109086869A (en) * 2018-07-16 2018-12-25 北京理工大学 A kind of human action prediction technique based on attention mechanism
CN111461932A (en) * 2020-04-09 2020-07-28 北京北大软件工程股份有限公司 Administrative punishment discretion rationality assessment method and device based on big data
CN111708875A (en) * 2020-06-02 2020-09-25 北京北大软件工程股份有限公司 Administrative law enforcement class recommendation method based on punishment characteristics
CN113434664A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Text abstract generation method, device, medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180300400A1 (en) * 2017-04-14 2018-10-18 Salesforce.Com, Inc. Deep Reinforced Model for Abstractive Summarization
CN109086869A (en) * 2018-07-16 2018-12-25 北京理工大学 A kind of human action prediction technique based on attention mechanism
CN111461932A (en) * 2020-04-09 2020-07-28 北京北大软件工程股份有限公司 Administrative punishment discretion rationality assessment method and device based on big data
CN111708875A (en) * 2020-06-02 2020-09-25 北京北大软件工程股份有限公司 Administrative law enforcement class recommendation method based on punishment characteristics
CN113434664A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Text abstract generation method, device, medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
商齐; 曾碧卿; 王盛玉; 周才东; 曾锋: "ACMF: Rating Prediction Based on a Convolutional Attention Model" (ACMF:基于卷积注意力模型的评分预测研究), 中文信息学报 (Journal of Chinese Information Processing), no. 11, 15 November 2018 (2018-11-15) *
李玉军; 汤晓君; 刘君华: "Application of Particle Swarm Optimization in Quantitative Infrared Spectroscopic Analysis of Gas Mixtures" (粒子群优化算法在混合气体红外光谱定量分析中的应用), 光谱学与光谱分析 (Spectroscopy and Spectral Analysis), no. 05, 15 May 2009 (2009-05-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117609487A (en) * 2024-01-19 2024-02-27 武汉百智诚远科技有限公司 Legal provision quick retrieval method and system based on artificial intelligence
CN117609487B (en) * 2024-01-19 2024-04-09 武汉百智诚远科技有限公司 Legal provision quick retrieval method and system based on artificial intelligence

Also Published As

Publication number Publication date
CN113918706B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
Desai et al. Techniques for sentiment analysis of Twitter data: A comprehensive survey
CN110990564B (en) Negative news identification method based on emotion calculation and multi-head attention mechanism
CN110427616B (en) Text emotion analysis method based on deep learning
CN108563703A (en) A kind of determination method of charge, device and computer equipment, storage medium
CN109933792B (en) Viewpoint type problem reading and understanding method based on multilayer bidirectional LSTM and verification model
Chakraborty et al. Predicting stock movement using sentiment analysis of Twitter feed
CN109472462B (en) Project risk rating method and device based on multi-model stack fusion
CN109344399A (en) A kind of Text similarity computing method based on the two-way lstm neural network of stacking
CN110543547B (en) Automobile public praise semantic emotion analysis system
CN106294324A (en) A kind of machine learning sentiment analysis device based on natural language parsing tree
Li et al. A method for resume information extraction using bert-bilstm-crf
CN114662477B (en) Method, device and storage medium for generating deactivated word list based on Chinese medicine dialogue
Kuchlous et al. Short text intent classification for conversational agents
CN115329207A (en) Intelligent sales information recommendation method and system
CN113918706A (en) Information extraction method for administrative punishment decision book
Kundana Data Driven Analysis of Borobudur Ticket Sentiment Using Naïve Bayes.
CN113220964A (en) Opinion mining method based on short text in network communication field
Liu et al. Sgat: A self-supervised graph attention network for biomedical relation extraction
CN117077682A (en) Document analysis method and system based on semantic recognition
Putri et al. Content-based filtering model for recommendation of Indonesian legal article study case of klinik hukumonline
CN114490925A (en) Emotion mining method and equipment under public event
Kanclerz et al. Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems
CN112905790A (en) Method, device and system for extracting qualitative indexes of supervision events
Wen et al. Blockchain-based reviewer selection
Akbar et al. The implementation of Naïve Bayes algorithm for classifying tweets containing hate speech with political motive

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant