CN113918706A - Information extraction method for administrative punishment decision book - Google Patents
- Publication number
- CN113918706A (application number CN202111201811.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- text information
- information
- vector
- administrative penalty
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F16/335 — Information retrieval from unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/355 — Information retrieval from unstructured textual data; clustering/classification; class or cluster creation or modification
- G06F16/951 — Retrieval from the web; indexing; web crawling techniques
- G06F40/205 — Handling natural language data; natural language analysis; parsing
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/08 — Neural networks; learning methods
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
- G06N5/04 — Knowledge-based models; inference or reasoning models
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to an information extraction method for administrative penalty decision documents, comprising the following steps: step one, crawling administrative penalty decision documents of each province from an administrative penalty document website; step two, extracting the text content of the documents acquired in step one from their html tags and constructing an original data set; step three, preprocessing the documents to be processed with regular expressions, according to the normative rules by which such documents are drafted, to construct a data set; step four, inputting the data set constructed in step three into an information extraction module trained on the original data set constructed in step two, and outputting the extraction result for the administrative penalty documents. The method accurately obtains the structured information of a decision document, which makes the document easier to understand and supports downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction.
Description
Technical Field
The invention relates to the fields of natural language processing and legal artificial intelligence, and in particular to an information extraction method for administrative penalty decision documents.
Background
Administrative penalty decision documents are important carriers of administrative penalty practice, and their huge number and complex text content increase the workload and difficulty faced by practitioners. Extracting information from these documents helps practitioners quickly obtain the text they need, provides a foundation for downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction, and improves the quality and efficiency of administrative penalty adjudication.
Traditional information extraction relies on manual entry or on manually summarized extraction rules; such hand-crafted rules do not transfer to new document types, have a narrow range of application, consume substantial manpower, are costly to maintain, and achieve low accuracy. As research on statistical learning deepened, classical models such as the hidden Markov model, the maximum-entropy Markov model, and conditional random fields were applied to information extraction in the legal field; although portability improved and processing accelerated, accuracy still needs improvement.
In recent years, natural language processing has been widely applied in the judicial field, and legal artificial intelligence has attracted much attention. Artificial intelligence can greatly improve the efficiency and accuracy of information extraction, bringing convenience to practitioners. However, methods based purely on deep learning or machine learning are affected by text length, context information, and similar factors, and their effectiveness still has room to improve.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides an information extraction method for administrative penalty decision documents.
The invention aims to solve the problems of low efficiency, low accuracy, and the like in information extraction from administrative penalty decision documents in the existing judicial field; the proposed method segments and extracts the long text of a document and performs text feature extraction with an information extraction module.
Interpretation of terms:
1. The Greedy Fast Causal Inference (GFCI) algorithm is a hybrid of a constraint-based algorithm and a score-based algorithm; it is used to infer and mine causal relationships and has high accuracy.
2. The Average Treatment Effect (ATE) measures, from an overall perspective, the strength of the effect caused by a factor; it is used to estimate causal effects and to quantify the strength of causal links between pieces of information.
The technical scheme of the invention is as follows:
an information extraction method for an administrative penalty decision, comprising:
the method comprises the following steps: crawling to obtain an administrative penalty decision;
step two: extracting the text content of the administrative penalty decision acquired in the first step, and constructing an original data set;
step three: according to the normative rules by which administrative penalty decision documents are drafted, perform data preprocessing on the documents to be processed using regular expressions, and construct a data set;
step four: and inputting the data set constructed in the step three into an information extraction module trained by using the original data set constructed in the step two, and outputting an administrative penalty document information extraction result.
Preferably, in step two, the text content of the administrative penalty decision document includes the decision document number, the parties, the subject qualification license name, the unified social credit code, the residence, the legally responsible person, the identity document number, the case source and investigation process, the case facts, the evidence, the qualitative finding on the illegal conduct, the penalty basis, the facts and reasons for the discretionary penalty amount, the manner and term of fulfilling the administrative penalty, and the avenue and term of relief.
Preferably, in step three, regular expressions are used to match character strings that conform to the syntactic rules, extracting the structured information in the administrative penalty decision document and the body content of the unstructured case description. The structured information comprises the decision document number, the parties, the subject qualification certificate name, the unified social credit code, the residence, the legally responsible person, and the identity card number; the body content of the unstructured case description comprises the case source and investigation process, the case facts, the evidence, the qualitative finding on the illegal conduct, the penalty basis, the facts and reasons for the discretionary penalty amount, the manner and term of fulfilling the administrative penalty, and the avenue and term of relief.
Preferably, according to the invention, the information extraction module comprises, in order, a pre-training language module, a context information acquisition module, a weight distribution module, and a feature classification module;
the pre-training language module comprises a sliding-window self-attention mechanism and a global attention mechanism; the sliding-window self-attention mechanism applies a self-attention window of fixed size ω×ω to the text information sequence, obtains text information sequences at different positions by sliding the window, combines the several windowed self-attention results obtained by sliding to produce a large receptive field, and constructs a local context information sequence; the global attention mechanism takes in the complete text information sequence and constructs a representation covering the whole input text information sequence;
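The windowed-plus-global pattern described above can be illustrated with a small mask-building sketch (a Longformer-style assumption; the function name, window size, and global positions are illustrative, not taken from the patent):

```python
# Sketch of a sparse attention pattern: each position attends only to a
# fixed-size window around itself, while designated global positions
# attend to (and are attended by) every position.

def attention_mask(seq_len, window, global_positions=()):
    """Boolean seq_len x seq_len mask; True means attention is allowed."""
    half = window // 2
    mask = [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
    for g in global_positions:
        for j in range(seq_len):
            mask[g][j] = True   # global token attends everywhere
            mask[j][g] = True   # every token attends to the global token
    return mask

mask = attention_mask(seq_len=8, window=3, global_positions=(0,))
```

Stacking several such layers lets the effective receptive field grow with depth, matching the idea of a large receptive field built from combined windows, while the global positions preserve a view of the whole input sequence.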
the context information acquisition module comprises a forward neural network, a backward neural network and a hidden layer;
the input text information sequence is fed through the forward neural network into the hidden layer, which computes the future text information sequence; the backward neural network propagates the input text information sequence backward through time, computes the output text information sequence first, and then returns it to the hidden layer to obtain the historical text information sequence;
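The two directional passes can be sketched as follows (a toy recurrence stands in for a trained RNN cell; the names and the step function are illustrative). Following the patent's terminology, the forward pass produces the "future" sequence and the backward pass the "historical" one:

```python
def bidirectional_context(xs, step):
    """Run a recurrence over xs in both directions.

    Returns (future, history): the per-position hidden states of the
    forward (left-to-right) and backward (right-to-left) passes."""
    h, future = 0.0, []
    for x in xs:                      # forward pass
        h = step(h, x)
        future.append(h)
    h, history = 0.0, []
    for x in reversed(xs):            # backward pass
        h = step(h, x)
        history.append(h)
    history.reverse()                 # realign with the input order
    return future, history

# exponential moving average as a stand-in recurrence
future, history = bidirectional_context([1.0, 2.0, 3.0],
                                        lambda h, x: 0.5 * h + x)
```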
the weight distribution module computes the contextual causal relationships with the GFCI algorithm, which combines constraint-based and score-based methods; taking the future text information sequence and the historical text information sequence as input, it searches for causal relationships between the text information sequences with a greedy algorithm, computes the causal relationships with the fast causal inference algorithm, measures the causal strength of the text information sequences with the average treatment effect (ATE), and assigns weights to the causal relationships;
the feature classification module performs a conditional-probability calculation over the text information sequence and the output sequence, checks and corrects the extraction result for the text information sequence, feeds it into a Viterbi decoder to decode the text information sequence, and outputs the text information as the result of information extraction.
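The final decoding step can be sketched with a generic Viterbi implementation (the score dictionaries and the two-label example are assumptions for illustration, not the patent's exact decoder):

```python
def viterbi(emissions, transitions, states):
    """Most likely state path under per-step emission scores plus
    pairwise transition scores (higher is better)."""
    best = {s: emissions[0][s] for s in states}
    back = []
    for em in emissions[1:]:
        ptr, nxt = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[p] + transitions[(p, cur)])
            ptr[cur] = prev
            nxt[cur] = best[prev] + transitions[(prev, cur)] + em[cur]
        back.append(ptr)
        best = nxt
    last = max(states, key=best.get)
    path = [last]
    for ptr in reversed(back):        # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# toy example: two labels with flat transitions -> per-step argmax
states = ("A", "B")
flat = {(p, c): 0.0 for p in states for c in states}
path = viterbi([{"A": 1, "B": 0}, {"A": 0, "B": 2}, {"A": 3, "B": 0}],
               flat, states)
```

In sequence labeling the transition scores are what prevent inconsistent label pairs; with non-zero transitions the best path need not be the per-step argmax.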
Preferably, the fourth step of the present invention comprises the following steps:
Input the data set constructed in step three into the pre-training language module. According to the text characteristics of administrative penalty decision documents, obtain the short-text information sequences through the sliding-window self-attention mechanism: the decision document number, the parties, the subject qualification license name, the unified social credit code, the residence, the legally responsible person, and the identity card number. Obtain the long-text information sequences through the combination of the sliding-window self-attention mechanism and the global attention mechanism: the case source and investigation process, the case facts, the evidence, the qualitative finding on the illegal conduct, the penalty basis, the facts and reasons for the discretionary penalty amount, the manner and term of fulfilling the administrative penalty, and the avenue and term of relief. After the text content in the data set passes through the pre-training language module, construct the word-level text vector matrix X = {x1, x2, …, xN} as output, where xi (i ∈ N) denotes a feature vector extracted from the administrative penalty decision document;
and (3) setting a word-level text vector matrix X as { X ═ X1,x2,…,xNAn input context information acquisition module, which inputs the input text vector to a forward hidden layer through a forward neural network, and the hidden layer acquires a future text information vector Y of the text vector as { Y ═ Y }1,y2,…,yN},yiA feature vector representing future text information; the backward neural network reversely propagates the input text vector through time, calculates the output vector and then returns the output vector to the hidden layer to obtain the historical text information vector Z of the text vector1,z2,...,zN},ziA feature vector representing future text information;
the weight distribution module is based onText causal information is future text information vector Y ═ Y1,y2,…,yNZ and a history text information vector Z ═ Z1,z2,...,zNCalculating and distributing different weights; after weight normalization, weight calculation is performed to obtain an output vector W ═ ω { ω ═ c1,ω2,…,ωN};
The feature classification module performs a conditional-probability calculation on the input vector W = {ω1, ω2, …, ωN}, outputs the vector β = {β1, β2, …, βN}, feeds it into a Viterbi decoder to decode the text information sequence, and outputs the text information to obtain the result of information extraction.
Preferably, the sliding-window self-attention mechanism reduces or divides the context information sequence into smaller sequences; a window of fixed size slides to obtain text information sequences at different positions, and combining the windowed self-attention results obtained by sliding produces a large receptive field and constructs the local context information sequence. The window size is set to ω×ω, and the length n of the text information sequence is assumed to satisfy ω < n.
Preferably, the weight distribution module analyzes the causal relationships in the context information with the greedy fast causal inference algorithm and measures causal strength with the average treatment effect; the contextual causal variables are treated as binary, and assuming the causal relationship Y → Z, the causal strength is given by formula (I):

ATE = E[Z | do(Y = 1)] − E[Z | do(Y = 0)]    (I)

In formula (I), E denotes expectation and the do operator denotes an intervention on Y; a high value of the causal strength corresponds to a strong causal relationship and a low value to a weak causal relationship;
the causal strengths are distributed as a weight vector: a strong causal relationship is assigned the strong-relationship weight (a high weight) and a weak causal relationship the weak-relationship weight (a low weight); after weight normalization, weight calculation yields the output vector W = {ω1, ω2, …, ωN}.
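The strong/weak weight assignment plus normalization can be sketched as follows (the threshold and the high/low weight values are illustrative assumptions):

```python
def causal_weights(strengths, threshold, strong=1.0, weak=0.2):
    """Assign a high weight to strong causal links and a low weight to
    weak ones, then normalize so the weights sum to 1."""
    raw = [strong if s >= threshold else weak for s in strengths]
    total = sum(raw)
    return [r / total for r in raw]

w = causal_weights([0.9, 0.1, 0.6], threshold=0.5)
```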
A computer device comprising a memory storing a computer program and a processor implementing the steps of an information extraction method for an administrative penalty decision when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of an information extraction method for an administrative penalty decision.
The invention has the beneficial effects that:
1. The invention provides an information extraction method for administrative penalty decision documents that accurately obtains the structured information of a decision document, making the document easier to understand and supporting downstream tasks such as similar-case retrieval, similar-case recommendation, and judgment prediction.
2. The information extraction module preprocesses text by combining a sliding-window attention mechanism with a global attention mechanism, resolving long-distance dependencies in the text sequence. It fully considers the document's context information through a bidirectional neural network, prevents information loss by computing causal-strength weight distributions, effectively handles the dependency between similar text segments before and after the output, and improves the accuracy and efficiency of information extraction.
Drawings
FIG. 1 is a schematic view of an information extraction process for an administrative penalty decision according to the present invention;
FIG. 2 is a flow chart of an information extraction module according to the present invention.
FIG. 3(a) is a schematic diagram of the sliding window self-attention mechanism of the present invention.
FIG. 3(b) is a schematic diagram of the global attention mechanism of the present invention.
FIG. 4 is a diagram of a context information obtaining module according to the present invention.
Detailed Description
To facilitate understanding of the invention, it is described in further detail below with reference to the drawings and embodiments. It should be understood that the described embodiments serve only to explain the invention and are not intended to limit it.
Example 1:
an information extraction method for an administrative penalty decision, as shown in fig. 1, includes:
the method comprises the following steps: crawling and obtaining an administrative penalty decision book of each province from an administrative penalty document network; for later construction of the data set.
Step two: extract the text content of the administrative penalty decision documents acquired in step one from their html tags, construct the original data set, and obtain a csv file.
Step three: according to the normative rules by which administrative penalty decision documents are drafted, preprocess the documents to be processed with regular expressions, construct a data set, and obtain a csv file.
Step four: input the data set constructed in step three into the information extraction module trained on the original data set constructed in step two, and output the extraction result for the administrative penalty documents. Information is extracted from the long passages recording what the authority examined and found, and the result is output.
Example 2:
An information extraction method for an administrative penalty decision document according to embodiment 1, differing in the following:
in the second step, the < div > tag and the < p > tag of html are removed by using a strip () function in python, and the text content of the administrative penalty decision is obtained, wherein the text content of the administrative penalty decision comprises a decision text number, a party, a subject qualification license name, a unified social credit code, a residence (address), a legal responsible person (a principal and an operator), an identity document number, a case source and survey pass, a case fact, evidence proof (an administrative penalty informing situation, a party statement, a declaration, a listening evidence, a review and an adoption situation and reason), illegal behavior nature qualification, a penalty basis, a free cutting amount fact and reason, an administrative penalty fulfillment mode and term, a relief way and term.
In step three, regular expressions are used to match character strings that conform to the syntactic rules, extracting the structured information in the administrative penalty decision document and the body content of the unstructured case description. For example, a syntactic rule anchored on the markers translated as "upon inquiry" or "upon investigation" extracts the case facts stated after those markers in the document, and a rule of the form "party: \w+?(.|,)" extracts the names of the parties in the document, and so on. The structured information comprises the decision document number, the parties, the subject qualification license name, the unified social credit code, the residence (address), the legally responsible person (principal, operator), and the identity document number; the body content of the unstructured case description comprises the case source and investigation process, the case facts, the evidence (the administrative penalty notification, the party's statements and explanations, hearing evidence and opinions, and the review and adoption thereof with reasons), the qualitative finding on the illegal conduct, the penalty basis, the facts and reasons for the discretionary penalty amount, the manner and term of fulfilling the administrative penalty, and the avenue and term of relief.
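Such rules can be sketched with two illustrative patterns (the markers 当事人 "party" and 经查/经调查 "upon investigation", and the patterns themselves, are assumptions for illustration; the patent's exact expressions are language-specific and not fully recoverable from the translation):

```python
import re

# party name after "当事人：", up to the next punctuation mark
PARTY = re.compile(r"当事人[:：](\S+?)[,，。]")
# case facts introduced by "经查" / "经调查", up to the sentence end
FACTS = re.compile(r"经(?:调)?查[,，]?(.+?)(?:。|$)")

def extract_fields(text):
    party = PARTY.search(text)
    facts = FACTS.search(text)
    return {
        "party": party.group(1) if party else None,
        "facts": facts.group(1) if facts else None,
    }

doc = "当事人：某某公司，经查，该公司销售不合格产品。"
fields = extract_fields(doc)
```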
As shown in fig. 2, the information extraction module includes a pre-training language module, a context information acquisition module, a weight distribution module, and a feature classification module in sequence;
the pre-training language module comprises a sliding-window self-attention mechanism and a global attention mechanism; the sliding-window self-attention mechanism applies a self-attention window of fixed size ω×ω to the text information sequence, obtains text information sequences at different positions by sliding the window, combines the windowed self-attention results obtained by sliding to produce a large receptive field, and constructs a local context information sequence; the global attention mechanism takes in the complete text information sequence and constructs a representation covering the whole input text information sequence. As shown in fig. 3(a) and 3(b), a short text information sequence contains little text, so the sliding-window self-attention mechanism alone can capture its content while saving memory space and computation time; a long text information sequence contains much more text, and the sliding-window self-attention mechanism alone would truncate and lose important text information, so a combination of the sliding-window self-attention mechanism and the global attention mechanism is adopted to guarantee the completeness of the text information sequence while reducing memory and computation cost.
As shown in fig. 4, the context information obtaining module includes a forward neural network, a backward neural network and a hidden layer;
the input text information sequence is fed through the forward neural network into the hidden layer, which computes the future text information sequence; the backward neural network propagates the input text information sequence backward through time, computes the output text information sequence first, and then returns it to the hidden layer to obtain the historical text information sequence;
the weight distribution module computes the contextual causal relationships with the GFCI algorithm, which combines constraint-based and score-based methods; taking the future text information sequence and the historical text information sequence as input, it searches for causal relationships between the text information sequences with a greedy algorithm, computes the causal relationships with the fast causal inference algorithm, measures the causal strength of the text information sequences with the average treatment effect (ATE), and assigns weights to the causal relationships;
the feature classification module performs a conditional-probability calculation over the text information sequence and the output sequence, checks and corrects the extraction result for the text information sequence, feeds it into a Viterbi decoder to decode the text information sequence, and outputs the text information as the result of information extraction.
Example 3:
An information extraction method for an administrative penalty decision document according to embodiment 2, differing in the following:
the fourth step comprises the following steps:
Input the data set constructed in step three into the pre-training language module. According to the text characteristics of administrative penalty decision documents, obtain the short-text information sequences through the sliding-window self-attention mechanism: the decision document number, the parties, the subject qualification certificate name, the unified social credit code, the residence (address), the legally responsible person (principal, operator), and the identity document number. Obtain the long-text information sequences through the combination of the sliding-window self-attention mechanism and the global attention mechanism: the case source and investigation process, the case facts, the evidence (the administrative penalty notification, the party's statements and explanations, hearing evidence and opinions, and the review and adoption thereof with reasons), the qualitative finding on the illegal conduct, the penalty basis, the facts and reasons for the discretionary penalty amount, the manner and term of fulfilling the administrative penalty, and the avenue and term of relief. After the text content in the data set passes through the pre-training language module, construct the word-level text vector matrix X = {x1, x2, …, xN} as output, where xi (i ∈ N) denotes a feature vector extracted from the administrative penalty decision document;
The sliding-window self-attention mechanism obtains the local context information sequence, and the global attention mechanism obtains the complete sequence, ensuring the integrity of the text sequence.
The word-level text vector matrix X = {x1, x2, ..., xN} is input into the context information acquisition module; the forward neural network feeds the input text vectors into a forward hidden layer, which obtains the future text information vectors Y = {y1, y2, ..., yN}, where yi is a feature vector of future text information; the backward neural network propagates the input text vectors backward through time, calculates the output vectors, and then returns them to the hidden layer to obtain the historical text information vectors Z = {z1, z2, ..., zN}, where zi is a feature vector of historical text information;
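A minimal numpy sketch (an assumption for illustration, not the patented network) of the bidirectional pass: the forward and backward recurrent states Y and Z are kept as two separate sequences rather than concatenated, so that a later module can re-weight them.

```python
# Two recurrent passes over the word-level vectors, one per direction.
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(X, Wx, Wh, reverse=False):
    """Simple tanh RNN over sequence X (N x d); returns N hidden states."""
    steps = range(len(X) - 1, -1, -1) if reverse else range(len(X))
    h = np.zeros(Wh.shape[0])
    out = [None] * len(X)
    for t in steps:
        h = np.tanh(X[t] @ Wx + h @ Wh)
        out[t] = h
    return np.stack(out)

N, d, hdim = 5, 4, 3                         # hypothetical sizes
X = rng.normal(size=(N, d))                  # word-level text vectors x1..xN
Wx, Wh = rng.normal(size=(d, hdim)), rng.normal(size=(hdim, hdim))
Y = rnn_pass(X, Wx, Wh)                      # forward-direction states
Z = rnn_pass(X, Wx, Wh, reverse=True)        # backward-direction states
print(Y.shape, Z.shape)                      # two separate (N, hdim) sequences
```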
Context information and causal relationships are fully extracted through the bidirectional neural network, preventing ambiguity and information loss. Unlike the Bi-LSTM used in common methods, the future text information vectors and historical text information vectors acquired by the context information module are not automatically concatenated into the output vectors after processing; instead, the text relationships are analyzed and the weights are redistributed and recalculated, which increases the accuracy of text information extraction.
An administrative penalty decision contains contextual information with causal relationships, such as the case source, investigation history, case facts, evidence, the qualitative nature of the penalty, the penalty basis, and the administrative penalty itself. The bidirectional neural network can make full use of the contextual text information and retain the complete causal relationships among the pieces of information, which helps the next module judge the strength of those causal relationships. For example, the module extracts from a document, as context information, the fact that the party produced disposable protective masks whose "particle filtration efficiency (salt medium)" test result was unqualified, that this violated Article 32 of the Product Quality Law of the People's Republic of China, and that the party's conduct constitutes the violation of producing products by passing off unqualified products as qualified products; the next module then assigns the weights.
The weight distribution module calculates and assigns different weights to the future text information vectors Y = {y1, y2, ..., yN} and the historical text information vectors Z = {z1, z2, ..., zN} according to the textual causal information; after weight normalization, the weight calculation yields the output vectors W = {ω1, ω2, ..., ωN};
The feature classification module computes, according to the conditional probability, the output vectors β = {β1, β2, ..., βN} from the input vectors W = {ω1, ω2, ..., ωN}, inputs the text information sequence into a Viterbi decoder for decoding, and outputs the text information to obtain the information extraction result. The extracted information, such as the party, case facts, illegal behaviors, and penalty bases, is used for downstream tasks.
The feature classification module fully considers the dependencies between context texts and the differences in nature between similar texts, which improves the accuracy of feature classification.
Specifically, for example, in an administrative penalty decision, "the party's behavior of using an elevator that failed inspection violates Article 40, Paragraph 3 of the Special Equipment Safety Law of the People's Republic of China" is an illegal behavior, while "violating Article 40, Paragraph 3 of the Special Equipment Safety Law of the People's Republic of China, which provides that special equipment that has not undergone regular inspection or has failed inspection shall not continue to be used" is a qualitative determination of the violation. By reprocessing and re-judging such similar texts, the feature classification module can improve the accuracy of information extraction.
The sliding window self-attention mechanism reduces or divides the context information sequence into smaller sequences: a fixed-size window slides over the sequence to obtain text information sequences at different positions, and the window self-attention results obtained by sliding are combined to generate a large receptive field and construct a local context information sequence. The window size is set to ω × ω; assuming the text information sequence has length n, then ω < n.
Computational complexity measures how memory and time costs grow with the amount of data. Compared with the conventional approach, i.e., a Transformer, for a text information sequence of length n the Transformer's computational complexity is O(n²), while the computational complexity of a single window of the sliding window self-attention mechanism is O(ω²). After n/ω window slides, the total computational complexity of the sliding window self-attention mechanism is O(n × ω), which is far less than O(n²); this saves the running time and memory required for the calculation and improves the efficiency of information acquisition.
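The complexity claim can be checked with a simple count of scored attention pairs (the sequence length and window size below are arbitrary illustrative values):

```python
# Full self-attention scores n*n token pairs; n/w sliding windows of
# size w score about n*w pairs in total.
def full_attention_pairs(n):
    return n * n                      # O(n^2)

def sliding_window_pairs(n, w):
    return (n // w) * (w * w)         # n/w windows, w^2 pairs each -> O(n*w)

n, w = 4096, 64
print(full_attention_pairs(n))        # grows quadratically in n
print(sliding_window_pairs(n, w))     # equals n * w, far smaller for w << n
```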
Compared with the traditional self-attention mechanism, this saves memory and computation. For long texts, however, the reduction and division performed by the sliding window self-attention mechanism alone could lose important information; therefore, the sliding window self-attention mechanism is combined with a global attention mechanism that establishes a complete sequence for prediction, preventing the loss of important information and increasing the accuracy and efficiency of text information extraction.
The weight distribution module uses a greedy fast causal inference (GFCI) algorithm to analyze the causal relationships in the context information and measures causal strength by the average treatment effect (ATE). The context causal relationship is modeled as a binary variable; assuming the causal relationship is Y → Z, the causal strength is given by formula (I):

ATE = E[Z | do(Y = 1)] − E[Z | do(Y = 0)]   (I)

in formula (I), E denotes expectation and the do operator denotes an intervention on Y; a larger ATE indicates a strong causal relationship, and a smaller ATE indicates a weak causal relationship;
the causal strengths are distributed as a weight vector: strong causal relationships are assigned high weights and weak causal relationships are assigned low weights; after weight normalization, the weight calculation yields the output vectors W = {ω1, ω2, ..., ωN}, which increases the accuracy of text information extraction.
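A hedged sketch of this weighting step: the ATE values here are hard-coded hypothetical numbers (a real system would estimate them with a GFCI-style causal discovery step); they are used as raw weights, normalized with a softmax, and applied to the context vectors.

```python
# ATE-based causal strengths -> normalized weights -> weighted vectors.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))         # shift for numerical stability
    return e / e.sum()

# hypothetical causal strengths for N = 4 context vectors
ate = np.array([0.9, 0.1, 0.7, 0.2])  # strong links get larger values
weights = softmax(ate)                 # weight normalization

vectors = np.ones((4, 3))              # stand-in future/history vectors
W = weights[:, None] * vectors         # weighted output vectors
print(weights.round(3), W.shape)
```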
Specifically, for example, in an administrative penalty decision, the "behavior of producing products by passing off unqualified products as qualified products" violates Article 32 of the Product Quality Law of the People's Republic of China but has no causal relationship with Items 1 and 4 of Article 27 of that law; therefore, when the output vectors are constructed from the future text information vectors and the historical text information vectors, different weights are assigned according to the strength of the textual relationships.
Compared with traditional methods, the method accurately obtains the structured information of the decisions, which facilitates understanding administrative penalty decisions and implementing downstream tasks such as similar-case retrieval, similar-case recommendation, and decision prediction; the proposed information extraction module improves the accuracy and efficiency of information extraction.
Embodiment 4:
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the information extraction method for an administrative penalty decision according to any one of Embodiments 1 to 3.
Embodiment 5:
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the information extraction method for an administrative penalty decision according to any one of Embodiments 1 to 3.
Claims (9)
1. An information extraction method for an administrative penalty decision, comprising:
step one: crawling to obtain an administrative penalty decision;
step two: extracting the text content of the administrative penalty decision acquired in the first step, and constructing an original data set;
step three: according to a normative rule written by the administrative penalty determinants, performing data preprocessing on the administrative penalty determinants to be processed by using a regular expression to construct a data set;
step four: and inputting the data set constructed in the step three into an information extraction module trained by using the original data set constructed in the step two, and outputting an administrative penalty document information extraction result.
2. The information extraction method of the administrative penalty decision according to claim 1, wherein the information extraction module comprises a pre-training language module, a context information acquisition module, a weight distribution module and a feature classification module in sequence;
the pre-training language module comprises a sliding window self-attention mechanism and a global attention mechanism; the sliding window self-attention mechanism applies a window self-attention mechanism with a fixed size of ω × ω to the text information sequence, obtains text information sequences at different positions through window sliding, and combines the window self-attention results obtained by sliding to generate a large receptive field and construct a local context information sequence; the global attention mechanism acquires the complete text information sequence and constructs a representation covering the whole input text information sequence;
the context information acquisition module comprises a forward neural network, a backward neural network and a hidden layer;
the input text information sequence is fed into the hidden layer through the forward neural network to calculate the future text information sequence; the backward neural network propagates the input text information sequence backward through time, first calculates the output text information sequence, and then returns it to the hidden layer to obtain the historical text information sequence;
the weight distribution module calculates the contextual causal relationships by using the GFCI algorithm, which combines constraint-based and score-based algorithms; taking the future text information sequence and the historical text information sequence as input, it searches for causal relationships between text information sequences with a greedy algorithm, computes the causal relationships with a fast causal inference algorithm, measures the causal strength of the text information sequences with the average treatment effect (ATE), and assigns weights to the causal relationships;
the feature classification module performs a conditional-probability calculation on the text information sequence and the output sequence, checks and corrects the result of the text information sequence extraction, inputs the result into a Viterbi decoder to decode the text information sequence, and outputs the text information to obtain the information extraction result.
3. The method for extracting information of administrative penalty decisions according to claim 2, wherein the fourth step comprises the steps of:
inputting the data set constructed in the third step into the pre-training language module, and acquiring short text information sequences through the sliding window self-attention mechanism according to the text characteristics of administrative penalty decisions, including the decision document number, the party, the subject qualification certificate name, the unified social credit code, the residence, the legal representative, and the identity card number; obtaining long text information sequences of the case source and investigation history, case facts, evidence, the qualitative nature of the illegal behavior, the penalty basis, the facts and reasons for the discretionary amount, the fulfillment mode and term of the administrative penalty, and the relief way and term through the combination of the sliding window self-attention mechanism and the global attention mechanism; after the text content in the data set passes through the pre-training language module, a word-level text vector matrix X = {x1, x2, ..., xN} is constructed as output, where xi represents a feature vector extracted from the administrative penalty decision and i ∈ N;
the word-level text vector matrix X = {x1, x2, ..., xN} is input into the context information acquisition module; the forward neural network feeds the input text vectors into a forward hidden layer, which obtains the future text information vectors Y = {y1, y2, ..., yN}, where yi is a feature vector of future text information; the backward neural network propagates the input text vectors backward through time, calculates the output vectors, and then returns them to the hidden layer to obtain the historical text information vectors Z = {z1, z2, ..., zN}, where zi is a feature vector of historical text information;
the weight distribution module calculates and assigns different weights to the future text information vectors Y = {y1, y2, ..., yN} and the historical text information vectors Z = {z1, z2, ..., zN} according to the textual causal information; after weight normalization, the weight calculation yields the output vectors W = {ω1, ω2, ..., ωN};
the feature classification module computes, according to the conditional probability, the output vectors β = {β1, β2, ..., βN} from the input vectors W = {ω1, ω2, ..., ωN}, inputs the text information sequence into a Viterbi decoder for decoding, and outputs the text information to obtain the information extraction result.
4. The information extraction method for an administrative penalty decision according to claim 2, wherein the sliding window self-attention mechanism reduces or divides the context information sequence into smaller sequences, uses a fixed-size sliding window to obtain text information sequences at different positions, and combines the window self-attention results obtained by sliding to generate a large receptive field and construct a local context information sequence; the window size is set to ω × ω, and assuming the text information sequence has length n, ω < n.
5. The information extraction method for an administrative penalty decision according to claim 2, wherein the weight distribution module implements the causal relationship analysis of the context information by a greedy fast causal inference algorithm and measures causal strength by the average treatment effect; the context causal relationship is modeled as a binary variable and assumed to be Y → Z, and the causal strength is given by formula (I):

ATE = E[Z | do(Y = 1)] − E[Z | do(Y = 0)]   (I)

in formula (I), E denotes expectation and the do operator denotes an intervention on Y; a larger ATE indicates a strong causal relationship and a smaller ATE indicates a weak causal relationship;
distributing the causal strengths as a weight vector, wherein strong causal relationships are assigned high weights and weak causal relationships are assigned low weights; after weight normalization, the weight calculation yields the output vectors W = {ω1, ω2, ..., ωN}.
6. The information extraction method for an administrative penalty decision according to claim 1, wherein in step two the text content of the administrative penalty decision comprises the decision document number, the party, the subject qualification certificate name, the unified social credit code, the residence, the legal representative, the identity card number, the case source and investigation history, the case facts, the evidence, the qualitative nature of the illegal behavior, the penalty basis, the facts and reasons for the discretionary amount, the fulfillment mode and term of the administrative penalty, and the relief way and term.
7. The information extraction method for an administrative penalty decision according to claim 1, wherein in step three regular expressions are used to match character strings conforming to syntactic rules, extracting the structured information and the subject matter of the unstructured case description in the administrative penalty decision; the structured information comprises the decision document number, the party, the subject qualification certificate name, the unified social credit code, the residence, the legal representative, and the identity document number, and the subject matter of the unstructured case description comprises the case source and investigation history, case facts, evidence, the qualitative nature of the illegal behavior, the penalty basis, the facts and reasons for the discretionary amount, the fulfillment mode and term of the administrative penalty, and the relief way and term.
8. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program performs the steps of a method of information extraction for an administrative penalty decision according to any one of claims 1 to 7.
9. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the method for extracting information for an administrative penalty decision according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111201811.4A CN113918706B (en) | 2021-10-15 | 2021-10-15 | Information extraction method of administrative punishment decision book |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111201811.4A CN113918706B (en) | 2021-10-15 | 2021-10-15 | Information extraction method of administrative punishment decision book |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113918706A true CN113918706A (en) | 2022-01-11 |
CN113918706B CN113918706B (en) | 2024-05-28 |
Family
ID=79240643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111201811.4A Active CN113918706B (en) | 2021-10-15 | 2021-10-15 | Information extraction method of administrative punishment decision book |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113918706B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117609487A (en) * | 2024-01-19 | 2024-02-27 | 武汉百智诚远科技有限公司 | Legal provision quick retrieval method and system based on artificial intelligence |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180300400A1 (en) * | 2017-04-14 | 2018-10-18 | Salesforce.Com, Inc. | Deep Reinforced Model for Abstractive Summarization |
CN109086869A (en) * | 2018-07-16 | 2018-12-25 | 北京理工大学 | A kind of human action prediction technique based on attention mechanism |
CN111461932A (en) * | 2020-04-09 | 2020-07-28 | 北京北大软件工程股份有限公司 | Administrative punishment discretion rationality assessment method and device based on big data |
CN111708875A (en) * | 2020-06-02 | 2020-09-25 | 北京北大软件工程股份有限公司 | Administrative law enforcement class recommendation method based on punishment characteristics |
CN113434664A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text abstract generation method, device, medium and electronic equipment |
- 2021-10-15 CN CN202111201811.4A patent/CN113918706B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180300400A1 (en) * | 2017-04-14 | 2018-10-18 | Salesforce.Com, Inc. | Deep Reinforced Model for Abstractive Summarization |
CN109086869A (en) * | 2018-07-16 | 2018-12-25 | 北京理工大学 | A kind of human action prediction technique based on attention mechanism |
CN111461932A (en) * | 2020-04-09 | 2020-07-28 | 北京北大软件工程股份有限公司 | Administrative punishment discretion rationality assessment method and device based on big data |
CN111708875A (en) * | 2020-06-02 | 2020-09-25 | 北京北大软件工程股份有限公司 | Administrative law enforcement class recommendation method based on punishment characteristics |
CN113434664A (en) * | 2021-06-30 | 2021-09-24 | 平安科技(深圳)有限公司 | Text abstract generation method, device, medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
商齐; 曾碧卿; 王盛玉; 周才东; 曾锋: "ACMF: Rating Prediction Based on a Convolutional Attention Model", Journal of Chinese Information Processing, no. 11, 15 November 2018 (2018-11-15) *
李玉军; 汤晓君; 刘君华: "Application of Particle Swarm Optimization in Quantitative Infrared Spectral Analysis of Gas Mixtures", Spectroscopy and Spectral Analysis, no. 05, 15 May 2009 (2009-05-15) *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117609487A (en) * | 2024-01-19 | 2024-02-27 | 武汉百智诚远科技有限公司 | Legal provision quick retrieval method and system based on artificial intelligence |
CN117609487B (en) * | 2024-01-19 | 2024-04-09 | 武汉百智诚远科技有限公司 | Legal provision quick retrieval method and system based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN113918706B (en) | 2024-05-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Desai et al. | Techniques for sentiment analysis of Twitter data: A comprehensive survey | |
CN110990564B (en) | Negative news identification method based on emotion calculation and multi-head attention mechanism | |
CN110427616B (en) | Text emotion analysis method based on deep learning | |
CN108563703A (en) | A kind of determination method of charge, device and computer equipment, storage medium | |
CN109933792B (en) | Viewpoint type problem reading and understanding method based on multilayer bidirectional LSTM and verification model | |
Chakraborty et al. | Predicting stock movement using sentiment analysis of Twitter feed | |
CN109472462B (en) | Project risk rating method and device based on multi-model stack fusion | |
CN109344399A (en) | A kind of Text similarity computing method based on the two-way lstm neural network of stacking | |
CN110543547B (en) | Automobile public praise semantic emotion analysis system | |
CN106294324A (en) | A kind of machine learning sentiment analysis device based on natural language parsing tree | |
Li et al. | A method for resume information extraction using bert-bilstm-crf | |
CN114662477B (en) | Method, device and storage medium for generating deactivated word list based on Chinese medicine dialogue | |
Kuchlous et al. | Short text intent classification for conversational agents | |
CN115329207A (en) | Intelligent sales information recommendation method and system | |
CN113918706A (en) | Information extraction method for administrative punishment decision book | |
Kundana | Data Driven Analysis of Borobudur Ticket Sentiment Using Naïve Bayes. | |
CN113220964A (en) | Opinion mining method based on short text in network communication field | |
Liu et al. | Sgat: A self-supervised graph attention network for biomedical relation extraction | |
CN117077682A (en) | Document analysis method and system based on semantic recognition | |
Putri et al. | Content-based filtering model for recommendation of Indonesian legal article study case of klinik hukumonline | |
CN114490925A (en) | Emotion mining method and equipment under public event | |
Kanclerz et al. | Towards Model-Based Data Acquisition for Subjective Multi-Task NLP Problems | |
CN112905790A (en) | Method, device and system for extracting qualitative indexes of supervision events | |
Wen et al. | Blockchain-based reviewer selection | |
Akbar et al. | The implementation of Naïve Bayes algorithm for classifying tweets containing hate speech with political motive |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||