CN115470354A - Method and system for identifying nested and overlapped risk points based on multi-label classification - Google Patents
- Publication number
- CN115470354A CN115470354A CN202211366277.7A CN202211366277A CN115470354A CN 115470354 A CN115470354 A CN 115470354A CN 202211366277 A CN202211366277 A CN 202211366277A CN 115470354 A CN115470354 A CN 115470354A
- Authority
- CN
- China
- Prior art keywords
- entity
- sentence
- contract
- label
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention belongs to the technical field of contract document identification, and in particular relates to a method and system for identifying nested and overlapping risk points based on multi-label classification. The method comprises the following steps: S1, segmenting contract documents annotated with risk points and keywords into sentences and feeding them into a BERT model for pre-training, obtaining a contract pre-trained BERT model; S2, segmenting the contract document to be identified into sentences and feeding them into the contract pre-trained BERT model to extract word representations and low-level features; S3, fusing the position information and label information of the sentence sequence into a BiLSTM to obtain features that merge position and label information, then compressing the features; S4, feeding the compressed features into a biaffine network for span enumeration and entity-label classification, and performing parameter learning. The method automatically identifies risk points in contract documents and supports auditing of those risk points.
Description
Technical Field
The invention belongs to the technical field of contract document identification, and particularly relates to a method and a system for identifying nested and overlapped risk points based on multi-label classification.
Background
Risk-point identification in contract documents means identifying, from a contract document, the various types of key information defined by industry experts. It is a process of extracting structured information from the unstructured text of the document, and it is now widely applied to risk-point examination of contracts of the same type across industries, such as labor contracts, sales contracts, and procurement contracts. Many approval workflows require contract documents to be audited and approved, and the focus of that audit is usually the key information and risk points in the document. Traditional contract auditing relies on manually locating and reviewing each risk point. With the rapid growth of Internet technology and digital office work, the auditing of large numbers of contract documents has become electronic, which creates a convenient setting for improving audit efficiency through artificial-intelligence techniques; however, risk points in electronic contract documents still have to be located manually, which is inefficient and prone to omissions. Using artificial-intelligence technology to help practitioners audit risk points in electronic contract documents, by identifying the risk points automatically, improves contract-audit efficiency across industries and avoids missed risk points.
Common risk point identification scenes in contracts are generally divided into common scenes, nested scenes and overlapped scenes.
The ordinary scenario means that the risk points in the contract text are not associated with each other. For example, in text such as [Party A name: XX Co., Ltd.; signing date: January 1, 2022], the risk point "Party A name" [XX Co., Ltd.] is not associated with the risk point "signing date" [January 1, 2022].
The nested scenario means that entity texts are nested within each other. For example, in [from the date of signing, Party A pays Party B by bank transfer], the whole span belongs to the risk point "payment method", while the inner text [bank transfer] also belongs to the risk point "payment tool".
The overlapping scenario means that one span carries several risk points at once. For example, in the text [signing time: January 1, 2022], the span [January 1, 2022] is both the risk point "Party A signing time" and the risk point "Party B signing time" of the contracting parties; it belongs to both risk points simultaneously.
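The three scenarios above can be made concrete with a small span-based sketch. A token-level BIO tag sequence stores only one label per token, whereas a set of (label, start, end) triples, the pointer-style representation used later in this document, expresses ordinary, nested, and fully overlapping risk points alike. All labels and character offsets below are invented for illustration; they are not taken from the patent's data.

```python
# Sketch: BIO tags allow one label per token, so they cannot express nested
# or fully overlapping risk points; (label, start, end) triples can.
# All labels and offsets here are illustrative, not real contract data.

spans = {
    ("payment method", 23, 61),       # e.g. "Party A pays Party B by bank transfer"
    ("payment tool", 47, 60),         # e.g. "bank transfer", nested inside the above
    ("party_a_signing_time", 9, 21),  # the same span carries two labels:
    ("party_b_signing_time", 9, 21),  # full overlap, impossible for BIO tags
}

def is_nested(outer, inner):
    """inner lies within outer without being identical (nested scenario)."""
    return outer[1] <= inner[1] and inner[2] <= outer[2] and outer != inner

def is_overlapping(a, b):
    """identical boundaries but different labels (overlapping scenario)."""
    return a[1] == b[1] and a[2] == b[2] and a[0] != b[0]
```

Because the annotation is a set of triples rather than a tag sequence, adding a second label to an existing span is just adding another triple, which is exactly what the label tensor described later encodes.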
Named-entity-recognition methods can generally be divided into sequence-labeling-based and pointer-based methods. Sequence labeling assigns a tag to every word of the text, and entities are extracted through the correspondence between tags and categories. Pointer-based methods typically predict head and tail pointers and then assign the text enclosed by a head-tail pointer pair to the corresponding entity class.
In contracts, entity recognition in the ordinary scenario is handled well by both sequence-labeling-based and pointer-based methods. The common approach is to model the text with a machine-learning model and then extract entity information by sequence labeling. Typical models include the Long Short-Term Memory network (LSTM), such as the bidirectional-LSTM named-entity-recognition method with predicted-position attention described in patent application CN201910225622.7, or models that add a Convolutional Neural Network (CNN) to capture features, such as the Bi-LSTM-CNN mixed-corpus named-entity-recognition method described in patent application CN201710946532.8.
In the nested scenario, an ordinary sequence-labeling-based method fails, because it can assign only one category to each word, while the overlapping part of nested entities corresponds to multiple tags. Sequence labeling can be adapted to the nesting problem: combining labels can turn the multi-label classification problem into a multi-class one, or hierarchical recognition can be used, identifying nested entities layer by layer from inner to outer or from outer to inner. As described in patent publication CN114281937A, each sub-entity of a nested entity can be identified by first predicting one nested entity and then identifying the next based on the prediction of the first. For nested-entity recognition based on a start index, an end index, and an entity-class label, a non-overlapping nested entity can be uniquely identified as long as its start and end indices are unique, as described in patent application CN202011522097.4. The document with patent publication number CN114386417A proposes a named-entity-recognition method that incorporates word-boundary information: it uses word-level information from an external vocabulary, extracts vector representations of semantic information with a pre-trained model, and judges head-tail spans of the input sequence with a biaffine network.
In the overlapping scenario, the special property is that the entity spans are completely identical, while current methods generally assume that different entities have different heads and tails. In contract risk-point extraction, one span formed by a head and a tail may belong to two or more risk points at the same time, so the methods for the ordinary and nested scenarios fail. The usual workaround is to model each entity type with a separate model, or to extract entities first and then assign them to categories with a multi-class classification model.
However, the above prior art has the following disadvantages:
1. Nested and overlapping scenarios of risk points exist in large numbers in contract documents. For the nested scenario, general models such as LSTM and CNN struggle to capture nested-entity information in contract text: there is no interaction between the head and tail of an entity, so recognition quality is poor. For the overlapping scenario, using a separate model per risk point avoids the overlap problem but incurs heavy, largely wasted resource overhead. Models that first identify risk points and then assign labels through classification inevitably accumulate errors: the two learning stages are dependent, so the first-stage entity-recognition result affects the second-stage classification; the weight of each stage's loss function is introduced as a hyperparameter; and the network structure and loss function of each stage must be designed by hand, including to improve the quality of the enumerated candidate spans. Two-stage learning therefore increases both model-design difficulty and parameter-tuning cost.
2. Contract documents also exhibit severe category imbalance. Whether sequence labeling or pointers are used, most candidates are not entities: the number of entities is far smaller than the number of non-entity candidates, so a sample-imbalance problem exists. Current approaches based on Binary Cross-Entropy loss (BCE loss) or cross-entropy loss do not account for category imbalance in multi-class or multi-label classification. For example, the nested named-entity-recognition method in patent publication CN112989835A trains with cross-entropy loss to minimize the difference between the predicted and reference distributions, and thus ignores the category-imbalance problem.
Therefore, it is important to design a method and system for identifying nested and overlapped risk points based on multi-label classification that can automatically identify the risk points in a contract document and help practitioners in various industries carry out risk-point review.
Disclosure of Invention
To address the problems in the prior art that risk points in contract documents occur in many nested and overlapping scenarios and that contract documents exhibit severe category imbalance, the invention provides a method and system for identifying nested and overlapped risk points based on multi-label classification, which automatically identifies risk points in a contract document and helps practitioners across industries audit them.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for identifying nested and overlapping risk points based on multi-label classification, comprising the following steps:
s1, segmenting a contract document annotated with risk points and keywords into sentences, then feeding them into a BERT model for pre-training to obtain a contract pre-trained BERT model;
s2, segmenting the contract document to be identified into sentences, feeding them into the contract pre-trained BERT model to extract word representations and low-level features, and obtaining the position information and label information of the sentence sequence;
s3, fusing the position information and label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM) to obtain features that merge position and label information, then compressing the features;
and S4, feeding the features compressed in step S3 into a biaffine network, performing span enumeration and entity-label classification, and performing parameter learning through the label matrix and an introduced ASL loss function.
Preferably, the pre-training in step S1 includes the following steps:
s11, masking risk points and keywords in the contract at random, and using the BERT model to predict the masked risk points and keywords from the unmasked context;
and S12, putting two sentences together at random, and judging with the BERT model whether the two sentences belong to the same paragraph.
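The two pre-training objectives can be sketched minimally as follows. The function names and the whole-span masking policy are assumptions for illustration, since the patent fixes no implementation: S11 masks whole annotated risk-point spans so the model must reconstruct them from the unmasked context, and S12 builds sentence pairs labeled by same-paragraph membership.

```python
import random

MASK = "[MASK]"

def mask_risk_points(tokens, risk_spans, mask_prob=0.8, seed=0):
    """Whole-span masking of annotated risk points/keywords (S11 sketch).

    tokens: list of characters or word pieces; risk_spans: list of
    (start, end) index pairs (end exclusive) marking annotated risk points.
    Unlike vanilla BERT's random per-token masking, each annotated span is
    masked as a unit, forcing prediction from the surrounding context.
    Returns the masked token list and a {position: original_token} map.
    """
    rng = random.Random(seed)
    out = list(tokens)
    targets = {}
    for start, end in risk_spans:
        if rng.random() < mask_prob:
            for i in range(start, end):
                targets[i] = out[i]
                out[i] = MASK
    return out, targets

def paragraph_pair(sent_a, sent_b, same_paragraph):
    """Sentence-pair sample for the same-paragraph objective (S12 sketch)."""
    return {"pair": (sent_a, sent_b), "label": int(same_paragraph)}
```

The same-paragraph label replaces the usual next-sentence objective, matching the contract-specific pre-training change described in the beneficial effects.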
Preferably, step S2 includes the steps of:
s21, representing the entity labels of the input sentence with a three-dimensional tensor $L$: for any label type $c$ whose entity starts at position $i$ and ends at position $j$, the matrix entry $L[c][i][j]$ is set to 1;
s22, representing the input sentence as a character sequence $S = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ is a character of the sentence;
s23, feeding the character sequence of the sentence into the pre-trained contract BERT model to obtain the vector representation of the sentence, as in Equation (1): $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the output of the last hidden layer of the contract pre-trained BERT model and $n$ is the sentence length;
preferably, step S3 includes the steps of:
s31, using the BiLSTM for feature extraction and fusing the position information and label information of the sentence sequence into it. Let the position vectors of the contract document in the contract pre-trained BERT model be $P$, and initialize the vector matrix of class information as $C$, where $m$ is the number of categories. Projecting them with two weight matrices gives $P' = W_p P$ and $C' = W_c C$, and the fused class-and-position information $F$ is obtained with Equation (2), $F = P' + C'$;
The BiLSTM feature extraction follows Equation (3), $H = \mathrm{BiLSTM}(X)$, where $X$ is the sentence vector output by the contract pre-trained BERT model and $h_t$ is the feature corresponding to each token. The fused class-and-position information $F$ is used to weight $h_t$, as in Equation (4), $\tilde{h}_t = F \odot h_t$; the resulting $\tilde{h}_t$ are the token features fused with class and position information;
s32, inputting the token features fused with class and position information into two feed-forward neural networks (FFNN) for feature compression, as in Equations (5) and (6): $h_s(i) = \mathrm{FFNN}_s(\tilde{h}_{s_i})$, $h_e(i) = \mathrm{FFNN}_e(\tilde{h}_{e_i})$,
where $s_i$ and $e_i$ are the start and end positions of candidate entity span $i$, $\tilde{h}_{s_i}$ and $\tilde{h}_{e_i}$ are the token features at positions $s_i$ and $e_i$ that fuse class and position information, and $h_s(i)$ and $h_e(i)$ are the compressed head and tail features of entity $i$.
Preferably, step S4 includes the steps of:
s41, inputting the compressed features into the biaffine classifier, as in Equation (7): $r(i) = h_s(i)^{\top} U\, h_e(i)$, where $h_s(i)$ and $h_e(i)$ are the head and tail character features of the $i$-th candidate entity;
where $U$ is a tensor of shape (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION), NUM_DIMENSION is the dimension after feature compression, and NUM_LABEL is the number of entity classes, so that $r(i)$ is a vector of NUM_LABEL classification scores;
s42, performing the loss calculation between the classification result $r(i)$ and the input labels of step S21. For any entity $i$, when its entity type takes value $c$, $p_c$ denotes the probability that the named-entity type is $c$, obtained from the classification score $r_c(i)$ of entity $i$ for type $c$ through Equation (8), $p = \sigma(r(i))$, where $c$ ranges over $[1, C]$ and $C$ is the number of categories NUM_LABEL. When a candidate is not an entity, the shifted probability is computed with the probability-shifting Equation (9), $p_m = \max(p - m, 0)$, where $p$ is the result of Equation (8) and $m$ is a set hyperparameter, the probability-shift margin; $p_m$ is then substituted into the negative-sample term of Equation (11), $L_{-} = p_m^{\gamma_{-}} \log(1 - p_m)$. If the candidate is an entity, $p$ from Equation (8) is substituted directly into the positive-sample term of Equation (10), $L_{+} = (1 - p)^{\gamma_{+}} \log(p)$. Here $\gamma_{+}$ and $\gamma_{-}$ are the positive and negative attention parameters, set hyperparameters that control the width of the probability deviation; the ASL loss value of the training stage is finally computed from Equations (10) and (11);
and carrying out iterative tuning on the weights of all layers in the neural network model by utilizing a back propagation algorithm according to the calculated ASL loss value.
The invention also provides a system for identifying nested and overlapping risk points based on multi-label classification, comprising:
the pre-training module, which segments the contract document annotated with risk points and keywords into sentences and feeds them into a BERT model for pre-training, obtaining a contract pre-trained BERT model;
the feature-extraction module, which segments the contract document to be identified into sentences, feeds them into the contract pre-trained BERT model to extract word representations and low-level features, and obtains the position information and label information of the sentence sequence;
the feature-fusion module, which fuses the position information and label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtains features that merge position and label information, and compresses them;
and the classification module, which feeds the compressed features into the biaffine network, performs span enumeration and entity-label classification, and performs parameter learning through the label matrix and an introduced ASL loss function.
Preferably, the pre-training module is specifically as follows:
masking risk points and keywords in the contract at random, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
putting two sentences together at random, and judging with the BERT model whether the two sentences belong to the same paragraph.
Compared with the prior art, the invention has the following beneficial effects: (1) based on BERT (Bidirectional Encoder Representations from Transformers), a new pre-training method is adopted in which the model is pre-trained by predicting risk points in the contract from context, and, in line with the characteristics of contract text, the usual adjacent-sentence pre-training objective is replaced by a same-paragraph objective; (2) for the entity nesting and overlapping problems in the named-entity-recognition task on contract documents, a pointer-based scheme is adopted with a biaffine network as the module for span enumeration and entity classification; the label types of the training data are represented by a matrix containing the label type and the entity start/end indices rather than by a tag sequence, which avoids the situations where a BIO sequence cannot represent nested entities and two head/tail pointer sequences cannot represent overlapping entities; the representation is simple and intuitive and can directly express the labels of nested and overlapping entities, in particular of different entities whose head and tail positions are exactly the same; feeding token features fused with position and label information into the biaffine network captures the interaction between entity heads and tails well, solving the nesting and overlapping problems; (3) for the category-imbalance problem, asymmetric loss (ASL loss) is used, an improvement of focal loss for multi-label classification that addresses the imbalance between positive and negative samples in multi-label classification tasks; (4) for overly long entities, the position and label information of the sequence is fused into the BiLSTM, giving it stronger sequence-modeling capability and thereby reducing entity fragmentation.
Drawings
FIG. 1 is a schematic diagram of a method for identifying nested and overlapping risk points based on multi-label classification in accordance with the present invention;
fig. 2 is a schematic diagram of an actual application of the method for identifying nested and overlapped risk points based on multi-label classification according to the embodiment of the present invention.
Detailed Description
In order to illustrate the embodiments of the present invention more clearly, they are described below with reference to the accompanying drawings. The drawings in the following description are clearly only some examples of the invention; a person skilled in the art can derive other drawings and embodiments from them without inventive effort.
Example 1:
the method for identifying nested and overlapping risk points based on multi-label classification shown in fig. 1 comprises the following steps:
s1, segmenting a contract document annotated with risk points and keywords into sentences and feeding them into a BERT model for pre-training to obtain a contract pre-trained BERT model;
s2, segmenting the contract document to be identified into sentences, feeding them into the contract pre-trained BERT model to extract word representations and low-level features, and obtaining the position information and label information of the sentence sequence;
s3, fusing the position information and label information of the sentence sequence into a bidirectional long short-term memory network BiLSTM to obtain features that merge position and label information, then compressing the features;
and S4, feeding the features compressed in step S3 into a biaffine network, performing span enumeration and entity-label classification, and performing parameter learning through the label matrix and an introduced ASL loss function.
Further, the pre-training in step S1 includes the following steps:
s11, masking risk points and keywords in the contract at random, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
and S12, putting two sentences together at random, and judging with the BERT model whether the two sentences belong to the same paragraph.
Through this pre-training scheme, the pre-trained BERT captures the characteristics of contract text more effectively and yields better results on downstream tasks.
Further, step S2 includes the steps of:
s21, representing the entity labels of the input sentence with a three-dimensional tensor $L$: for any label type $c$ whose entity starts at position $i$ and ends at position $j$, the matrix entry $L[c][i][j]$ is set to 1;
the input of the model-training stage consists of two parts: the text from which named entities are to be extracted, and the pointer-style annotation of each named entity's position and type in that text. In the data-loading phase the entity labels are converted into a tensor L of shape (NUM_LABEL, NUM_SEQ_LEN, NUM_SEQ_LEN), where NUM_LABEL is the number of entity categories and NUM_SEQ_LEN is the maximum length of the input text. L is a three-dimensional tensor: the first dimension is the classification category, and the second and third dimensions cover all possible combinations of an entity's start and end positions.
s22, representing the input sentence as a character sequence $S = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ is a character of the sentence;
s23, feeding the character sequence of the sentence into the pre-trained contract BERT model to obtain the vector representation of the sentence, as in Equation (1): $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the output of the last hidden layer of the contract pre-trained BERT model and $n$ is the sentence length;
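The label-tensor construction of S21 can be sketched directly; the (NUM_LABEL, NUM_SEQ_LEN, NUM_SEQ_LEN) layout follows the description above, while the concrete sizes and entity triples are toy values for illustration:

```python
import numpy as np

NUM_LABEL, SEQ_LEN = 4, 16  # toy sizes, not the patent's actual values

def build_label_tensor(entities, num_label=NUM_LABEL, seq_len=SEQ_LEN):
    """Build the tensor L of shape (NUM_LABEL, SEQ_LEN, SEQ_LEN) from S21:
    L[c, i, j] = 1 iff an entity of class c starts at position i and ends
    at position j. `entities` is a list of (class_id, start, end) triples.
    """
    L = np.zeros((num_label, seq_len, seq_len), dtype=np.int64)
    for c, i, j in entities:
        L[c, i, j] = 1
    return L

# Nested span (5, 8) inside (2, 9), plus overlap: span (2, 9) under two
# different classes. Both situations are representable in one tensor.
L = build_label_tensor([(0, 2, 9), (1, 5, 8), (2, 2, 9)])
```

Because the first dimension indexes the class, the same (start, end) cell can be set under several classes, which is exactly how overlapping risk points with identical boundaries are encoded.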
further, step S3 includes the steps of:
s31, using the BilSTM to extract features, and fusing position information and label information of the sentence sequence into the BilSTM; setting the position vector of the contract document in the contract pre-training BERT model asThe vector matrix of the initialized category information isWherein m is the number of categories; derived from two weight matricesAndis marked asAndand obtaining the fusion information of the category and the position by using the formula (2);
The formula of the BilSTM feature extraction is formula (3), wherein X is a sentence vector output by the contract pre-training BERT model,a feature corresponding to each token; using fused information pairs of derived categories and locationsWeighting is carried out, as shown in formula (4), and the obtained product isA token feature fusing category and position information;
s32, inputting token features fused with the category and the position information into two feed-forward neural networks FFNN for feature compression, wherein the specific process is shown as a formula (5) and a formula (6):
wherein s_i and e_i are respectively the start and end positions of the candidate entity span, h_(s_i) and h_(e_i) are the token features fusing category and position information at positions s_i and e_i of the sentence, and g_i^start and g_i^end are respectively the head and tail features of the entity after feature compression.
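The two feed-forward compressions described in equations (5)-(6) can be sketched like this (a minimal numpy illustration, not the patent's code; the single dense layer, the ReLU activation, and all sizes and names are assumptions):

```python
# Every fused token feature is projected twice: once as a potential entity
# head and once as a potential entity tail, compressing it to a smaller
# dimension before biaffine scoring.
import numpy as np

def ffnn(X, W, b):
    # one dense layer followed by ReLU
    return np.maximum(X @ W + b, 0.0)

rng = np.random.default_rng(0)
n, hidden, compressed = 6, 8, 4
H = rng.normal(size=(n, hidden))        # fused token features from the BiLSTM
W_s, b_s = rng.normal(size=(hidden, compressed)), np.zeros(compressed)
W_e, b_e = rng.normal(size=(hidden, compressed)), np.zeros(compressed)
H_start = ffnn(H, W_s, b_s)             # per-token entity-head features
H_end = ffnn(H, W_e, b_e)               # per-token entity-tail features
```

Using two separate networks lets the same token play different roles as a span start and a span end.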
Further, step S4 includes the steps of:
S41, the compressed features are input into a biaffine network classifier for classification, as shown in equation (7), where g_i^start and g_i^end are respectively the features of the characters at the beginning and end of the i-th entity;
wherein U is a tensor of shape (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION), NUM_DIMENSION being the feature dimension after compression and NUM_LABEL being the number of entity classes;
S42, the classification result and the input labels from step S21 are used for loss calculation. For any entity i, when its entity-type label y takes the value c, p_c denotes the probability that the named entity type is c, z_c is the classification score of entity i for type c, and c ranges over [1, C], where C is the number of categories NUM_LABEL. When y indicates that the span is not an entity, the shifted probability p_m = max(p - m, 0) is calculated with equation (10) from the probability p given by equation (8), where m is a set hyper-parameter, the probability shift margin; p_m is then substituted into equation (11). If the span is an entity, the probability p from equation (8) is substituted into equation (11) directly. γ+ and γ- are respectively the positive and negative focusing parameters, set hyper-parameters controlling the width of the probability weighting; finally, the ASL loss value of the training stage is calculated according to equations (10) and (11);
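The biaffine scoring with the (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION) tensor U can be sketched as below (a hedged numpy illustration; omitting bias terms is a simplifying assumption):

```python
# For each candidate span (i, j), the head feature at i and the tail feature
# at j are combined through the class-indexed tensor U, giving one score per
# entity class: score[i, j, c] = H_start[i] @ U[:, c, :] @ H_end[j].
import numpy as np

def biaffine_scores(H_start, H_end, U):
    """H_start, H_end: (n, D) compressed head/tail features; U: (D, C, D).
    Returns scores of shape (n, n, C)."""
    # einsum contracts both feature dimensions, leaving (start, end, class).
    return np.einsum('id,dce,je->ijc', H_start, U, H_end)

H_s = np.array([[1.0, 0.0], [0.0, 1.0]])
H_e = np.array([[1.0, 0.0], [0.0, 1.0]])
U = np.zeros((2, 1, 2))
U[:, 0, :] = [[1.0, 2.0], [3.0, 4.0]]
S = biaffine_scores(H_s, H_e, U)
```

Scoring every (start, end) pair at once is what lets the model enumerate all spans, including nested ones, in a single pass.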
and carrying out iterative tuning on the weights of all layers in the neural network model by utilizing a back propagation algorithm according to the calculated ASL loss value.
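Assuming the loss follows the standard asymmetric-loss (ASL) formulation that the description above suggests (positives weighted by (1-p)^γ+, negatives by the margin-shifted p_m^γ-), a minimal sketch is:

```python
# Asymmetric loss for multi-label span classification: negatives first shift
# the probability by margin m (p_m = max(p - m, 0)) so that easy negatives
# contribute nothing, which counters the extreme class imbalance of span
# enumeration. Hyper-parameter defaults are illustrative assumptions.
import numpy as np

def asl_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    """p: predicted probabilities, y: 0/1 targets, same shape."""
    p_m = np.clip(p - m, 0.0, 1.0)                        # shifted probability
    loss_pos = y * (1 - p) ** gamma_pos * np.log(p + eps)
    loss_neg = (1 - y) * p_m ** gamma_neg * np.log(1 - p_m + eps)
    return -(loss_pos + loss_neg).mean()
```

Note that a negative span with p below the margin m incurs exactly zero loss, so the gradient concentrates on hard negatives.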
The invention also provides a system for identifying nested and overlapping risk points based on multi-label classification, which comprises the following modules:
the pre-training module is used for segmenting the contract document labeled with the risk points and the keywords into sentences and inputting the sentence into the BERT model for pre-training to obtain a contract pre-training BERT model;
the feature extraction module is used for segmenting the contract document to be identified into sentences, inputting each sentence into the contract pre-training BERT model to extract word representations and low-level features, and simultaneously obtaining the position information and label information of the sentence sequence;
the feature fusion module is used for fusing the position information and the label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining features that fuse position and label information, and performing feature compression;
and the classification module is used for inputting the compressed features into the biaffine network, performing span enumeration and entity label classification, and performing parameter learning through a label matrix and an introduced ASL loss function.
The pre-training module is specifically as follows:
masking the risk points and keywords in the contract in a random-masking manner, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
two sentences are randomly put together, and whether the two sentences are in the same paragraph is judged through a BERT model.
Based on the technical scheme of the invention, the steps of the specific implementation and operation process are as follows:
pre-training:
1. Mask the risk points and keywords in the contract text and feed the masked text into the BERT model; BERT predicts the masked risk points and keywords from the unmasked context.
2. Randomly combine sentences in the contract into sentence pairs and feed them into the BERT model; BERT predicts whether the two sentences are in the same paragraph.
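The keyword-masking step of the pre-training can be illustrated roughly as follows (the character-level tokenisation, the masking probability, and the helper name are assumptions, not the patent's code):

```python
# Replace occurrences of risk-point keywords in a tokenised contract sentence
# with [MASK] tokens before masked-language-model pre-training, so BERT must
# reconstruct them from the unmasked context.
import random

def mask_keywords(tokens, keywords, mask_token="[MASK]", p=0.5, rng=None):
    rng = rng or random.Random(0)
    out = list(tokens)
    for kw in keywords:
        n = len(kw)
        for i in range(len(out) - n + 1):
            # mask a keyword occurrence with probability p
            if out[i:i + n] == list(kw) and rng.random() < p:
                out[i:i + n] = [mask_token] * n
    return out

toks = mask_keywords(list("liability"), ["lia"], p=1.0)
```

In practice the keywords would be the annotated risk points of the contract corpus rather than a fixed list.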
And (4) supervision training:
the training phase is shown in fig. 2 and has the following steps:
3. The documents input to the system for entity extraction are divided into a training document set and a validation set.
4. Each document contains two parts: the text of the document, and the entity positions and labels corresponding to the text. The document's entity set is re-partitioned according to sentence segmentation positions, producing a list of sentences paired with their entities.
5. Each sentence of the training set, together with its entity annotations, is processed: the sentence is input into BERT to obtain vector representations of its tokens;
6. The token vectors finally output by BERT are input into the BiLSTM to extract the sentence's context information; the position information and the context information are then fused according to equation (4) to obtain a context representation that incorporates position information;
7. The sentence context features extracted by the BiLSTM are input into two 1-layer FFNN feed-forward networks, and equations (5) and (6) are applied to obtain the entity head and tail features of the sentence;
8. The entity head and tail features of the sentence are input into the biaffine network, and equations (7) and (8) are applied to output the span enumeration result and the probability corresponding to each span;
9. The ASL loss value of the biaffine network's output probabilities is calculated with equations (10) and (11) according to the ASL loss function;
10. The weights of each layer of the neural network model are adjusted according to the ASL loss value;
11. After the training-set sentences have been trained for one epoch, the prediction accuracy of the current model parameters on the validation set is calculated and the model weights are saved;
12. The above training steps are repeated until the preset epoch value is reached;
13. Among the saved weights, the set with the highest validation-set accuracy constitutes the learned optimal model.
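Step 4's re-partitioning of a document's entities by sentence boundaries can be sketched roughly as follows (the (start, end, label) annotation format with inclusive document-level end offsets, the separator, and the helper name are illustrative assumptions):

```python
# Split a document into sentences at the separator and convert each entity's
# document-level offsets into sentence-relative offsets, keeping only entities
# fully contained in the sentence.
def split_entities_by_sentence(text, entities, sep="."):
    """entities: list of (start, end, label), end inclusive, document offsets."""
    result, offset = [], 0
    for part in text.split(sep):
        # re-attach the separator except at end of text
        sent = part + sep if offset + len(part) < len(text) else part
        end_pos = offset + len(sent)
        if sent:
            sent_entities = [(s - offset, e - offset, lab)
                             for (s, e, lab) in entities
                             if s >= offset and e < end_pos]
            result.append((sent, sent_entities))
        offset = end_pos
    return result

res = split_entities_by_sentence("ab.cd.", [(0, 1, "X"), (3, 4, "Y")], sep=".")
```

For Chinese contracts the separator would be the full-width period "。", as the prediction stage describes.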
In addition, the prediction phase of the model comprises the following steps:
1. The document to be predicted does not need to contain entity positions or labels; its text is segmented by periods to obtain a sentence list.
2. Inputting each sentence to be predicted into BERT to obtain vector representation of the sentence token;
3. The token vectors finally output by BERT are input into the BiLSTM to extract the context information of the sentence;
4. then fusing the position information and the context information according to a formula (4) to obtain a context expression of the fused position information;
5. The context representations with fused position information are respectively input into two 1-layer FFNN feed-forward networks to obtain the entity head and tail features of the sentence;
6. The entity head and tail features of the sentence are input into the biaffine network, which outputs the span enumeration result and the probability corresponding to each span;
7. Spans in the enumeration result whose probability exceeds a set threshold are output; these are the entities extracted from the sentence;
8. combining the extraction results of all sentences in the document to serve as an entity extraction result of the document;
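Step 7's threshold decoding over the enumerated spans can be sketched as follows (an illustrative numpy version; restricting to spans with end >= start is an assumption):

```python
# Decode the (n, n, C) span-probability tensor from the biaffine network:
# keep every span whose best class probability exceeds the threshold.
import numpy as np

def decode_spans(probs, threshold=0.5):
    """probs: (n, n, C) probabilities; returns list of (start, end, class_id)."""
    n = probs.shape[0]
    spans = []
    for i in range(n):
        for j in range(i, n):                 # only spans with end >= start
            c = int(np.argmax(probs[i, j]))
            if probs[i, j, c] > threshold:
                spans.append((i, j, c))
    return spans

probs = np.zeros((3, 3, 2))
probs[0, 2, 1] = 0.9
probs[1, 1, 0] = 0.6
probs[2, 0, 1] = 0.99   # below the diagonal: never considered
spans = decode_spans(probs, threshold=0.5)
```

Because every cell is decoded independently, nested spans such as (0, 2) and (1, 1) are both emitted.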
An entity extraction model is thus trained from the training data set in the training stage, and after a document to be processed is input, the entities it contains are predicted, completing the training and prediction flow of a complete entity extraction model.
The invention identifies entities in contract documents with a biaffine-network-based method, solving the entity-overlap and entity-nesting problems in the contract-document entity extraction task. Splitting documents into sentences by periods improves the generality of the algorithm, so the method can easily be extended to extraction tasks in other scenarios. BERT provides token representations and a BiLSTM performs feature extraction; a label matrix and the biaffine network perform entity span enumeration and entity category prediction, which solves entity nesting in contract-document extraction; and ASL loss is used to optimize the network parameters, which solves the problem of entities whose start positions completely overlap.
The invention further discloses a pre-training method based on the characteristics of contract text: the model is pre-trained to predict masked risk points and keywords and to predict whether two sentences belong to the same paragraph. A BERT model pre-trained in this way carries more contract-text information, captures contract-text information better during training and prediction, and greatly reduces the training resources required.
The invention creatively provides a method for jointly enumerating and classifying entity spans in contract documents based on a biaffine network, unlike methods that first enumerate entity spans and then classify the entities separately.
The invention creatively fuses low-level position information and category information into the intermediate BiLSTM network, which strengthens the intermediate BiLSTM network's ability to capture long-text information, extracts the information features of risk points better, and greatly reduces the possibility of entity fragmentation.
The invention innovatively uses a label matrix to represent entity spans in contract documents and combines it with the multi-label classification loss function ASL. This suits network-parameter optimization under entity nesting and overlapping in contract documents, solves the class-imbalance problem in span-based named-entity extraction, and can be extended to other scenarios.
In conclusion, the method has the characteristics of reducing manual design, having strong universality and solving the problems of entity nesting and overlapping in entity extraction in the contract document.
The foregoing describes the preferred embodiments and principles of the present invention in some detail so that those skilled in the art may better understand it; various modifications may be made without departing from its broader scope.
Claims (7)
1. A method for identifying nested and overlapping risk points based on multi-label classification, comprising the following steps:
s1, after segmenting a contract document labeled with risk points and keywords into sentences, inputting the segmented contract document into a BERT model for pre-training to obtain a contract pre-training BERT model;
s2, segmenting a sentence of the contract document to be identified, inputting the sentence into a contract pre-training BERT model to extract word representation and bottom layer characteristics, and simultaneously obtaining position information and label information of a sentence sequence;
S3, fusing the position information and the label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining features that fuse position and label information, and performing feature compression;
and S4, inputting the features compressed in step S3 into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning through a label matrix and an introduced ASL loss function.
2. The method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 1, wherein the pre-training in step S1 comprises the steps of:
S11, masking risk points and keywords in the contract in a random-masking manner, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
and S12, randomly putting the two sentences together, and judging whether the two sentences are in the same paragraph or not through a BERT model.
3. The method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 2, wherein step S2 comprises the steps of:
S21, representing the entity labels of the input sentence with a three-dimensional tensor; the three-dimensional tensor is denoted L, and for any entity of label category c with start position i and end position j, the matrix element L[c][i][j] is set to 1;
S22, representing the input sentence as a character sequence; the input sentence is s = {w_1, w_2, ..., w_n}, where w_i is a character of the sentence;
S23, inputting the character sequence of the sentence into the contract pre-training BERT model obtained by pre-training, and obtaining the vector representation of the sentence as shown in equation (1): X = {x_1, x_2, ..., x_n}, where x_i is the output of the last hidden layer of the contract pre-training BERT model and n is the length of the sentence;
4. the method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 3, wherein step S3 comprises the steps of:
S31, using BiLSTM to extract features, and fusing the position information and label information of the sentence sequence into the BiLSTM; let the position vector of the contract document in the contract pre-training BERT model be P, and the initialized category-information vector matrix be C, where m is the number of categories; the projections obtained through two weight matrices W_p and W_c are denoted P' and C', and the fused category-and-position information F is obtained with equation (2);
the BiLSTM feature extraction is given by equation (3), where X is the sentence vector output by the contract pre-training BERT model and h_i is the feature corresponding to each token; the fused category-and-position information F is used to weight h_i, as shown in equation (4), finally yielding h'_i, a token feature that fuses category and position information;
S32, inputting the token features fusing category and position information into two feed-forward neural networks (FFNN) for feature compression, the specific process being shown in equations (5) and (6):
5. The method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 4, wherein step S4 comprises the steps of:
S41, inputting the compressed features into a biaffine network classifier for classification, as shown in equation (7), where g_i^start and g_i^end are respectively the features of the characters at the beginning and end of the i-th entity;
wherein U is a tensor of shape (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION), NUM_DIMENSION being the feature dimension after compression and NUM_LABEL being the number of entity classes;
S42, performing loss calculation with the classification result and the input labels from step S21; for any entity i, when its entity-type label y takes the value c, p_c denotes the probability that the named entity type is c, z_c is the classification score of entity i for type c, and c ranges over [1, C], C being the number of categories NUM_LABEL; when y indicates that the span is not an entity, the shifted probability p_m = max(p - m, 0) is calculated with equation (10) from the probability p given by equation (8), m being a set hyper-parameter, the probability shift margin, and p_m is substituted into equation (11); if the span is an entity, the probability p from equation (8) is substituted into equation (11) directly; γ+ and γ- are respectively the positive and negative focusing parameters, set hyper-parameters controlling the width of the probability weighting; finally, the ASL loss value of the training stage is calculated according to equations (10) and (11);
and carrying out iterative tuning on the weights of all layers in the neural network model by utilizing a back propagation algorithm according to the calculated ASL loss value.
6. A system for identifying nested and overlapping risk points based on multi-label classification, comprising:
the pre-training module is used for segmenting the contract document labeled with the risk points and the keywords into sentences and inputting the sentence into the BERT model for pre-training to obtain a contract pre-training BERT model;
the feature extraction module is used for segmenting the contract document to be identified into sentences, inputting each sentence into the contract pre-training BERT model to extract word representations and low-level features, and simultaneously obtaining the position information and label information of the sentence sequence;
the feature fusion module is used for fusing the position information and the label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining features that fuse position and label information, and performing feature compression;
and the classification module is used for inputting the compressed features into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning through a label matrix and an introduced ASL loss function.
7. The system for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 6, wherein the pre-training module is specifically as follows:
masking the risk points and keywords in the contract in a random-masking manner, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
two sentences are randomly put together, and whether the two sentences are in the same paragraph is judged through a BERT model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211366277.7A CN115470354B (en) | 2022-11-03 | 2022-11-03 | Method and system for identifying nested and overlapped risk points based on multi-label classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115470354A true CN115470354A (en) | 2022-12-13 |
CN115470354B CN115470354B (en) | 2023-08-22 |
Family
ID=84338111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211366277.7A Active CN115470354B (en) | 2022-11-03 | 2022-11-03 | Method and system for identifying nested and overlapped risk points based on multi-label classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115470354B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115995087A (en) * | 2023-03-23 | 2023-04-21 | 杭州实在智能科技有限公司 | Document catalog intelligent generation method and system based on fusion visual information |
CN116092493A (en) * | 2023-04-07 | 2023-05-09 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
CN116306657A (en) * | 2023-05-19 | 2023-06-23 | 之江实验室 | Entity extraction method and system based on square matrix labeling and double affine layers attention |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110413785A (en) * | 2019-07-25 | 2019-11-05 | 淮阴工学院 | A kind of Automatic document classification method based on BERT and Fusion Features |
US10528866B1 (en) * | 2015-09-04 | 2020-01-07 | Google Llc | Training a document classification neural network |
CN112101027A (en) * | 2020-07-24 | 2020-12-18 | 昆明理工大学 | Chinese named entity recognition method based on reading understanding |
US20210012199A1 (en) * | 2019-07-04 | 2021-01-14 | Zhejiang University | Address information feature extraction method based on deep neural network model |
US20210034812A1 (en) * | 2019-07-30 | 2021-02-04 | Imrsv Data Labs Inc. | Methods and systems for multi-label classification of text data |
WO2021051516A1 (en) * | 2019-09-18 | 2021-03-25 | 平安科技(深圳)有限公司 | Ancient poem generation method and apparatus based on artificial intelligence, and device and storage medium |
CN112860889A (en) * | 2021-01-29 | 2021-05-28 | 太原理工大学 | BERT-based multi-label classification method |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115995087A (en) * | 2023-03-23 | 2023-04-21 | 杭州实在智能科技有限公司 | Document catalog intelligent generation method and system based on fusion visual information |
CN116092493A (en) * | 2023-04-07 | 2023-05-09 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
CN116092493B (en) * | 2023-04-07 | 2023-08-25 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer readable storage medium |
CN116306657A (en) * | 2023-05-19 | 2023-06-23 | 之江实验室 | Entity extraction method and system based on square matrix labeling and double affine layers attention |
CN116306657B (en) * | 2023-05-19 | 2023-08-22 | 之江实验室 | Entity extraction method and system based on square matrix labeling and double affine layers attention |
Also Published As
Publication number | Publication date |
---|---|
CN115470354B (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115470354A (en) | Method and system for identifying nested and overlapped risk points based on multi-label classification | |
CN109597493B (en) | Expression recommendation method and device | |
CN115952291B (en) | Financial public opinion classification method and system based on multi-head self-attention and LSTM | |
CN111339260A (en) | BERT and QA thought-based fine-grained emotion analysis method | |
CN112528031A (en) | Work order intelligent distribution method and system | |
CN115952292B (en) | Multi-label classification method, apparatus and computer readable medium | |
CN116956929B (en) | Multi-feature fusion named entity recognition method and device for bridge management text data | |
CN112434164A (en) | Network public opinion analysis method and system considering topic discovery and emotion analysis | |
CN112988970A (en) | Text matching algorithm serving intelligent question-answering system | |
CN115455189A (en) | Policy text classification method based on prompt learning | |
CN117197569A (en) | Image auditing method, image auditing model training method, device and equipment | |
CN111651597A (en) | Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network | |
Kiyak et al. | Comparison of image-based and text-based source code classification using deep learning | |
CN111274494A (en) | Composite label recommendation method combining deep learning and collaborative filtering technology | |
CN113761184A (en) | Text data classification method, equipment and storage medium | |
Jayashree et al. | Sentimental analysis on voice based reviews using fuzzy logic | |
Léon | Extracting information from PDF invoices using deep learning | |
CN107729509A (en) | The chapter similarity decision method represented based on recessive higher-dimension distributed nature | |
CN114510569A (en) | Chemical emergency news classification method based on Chinesebert model and attention mechanism | |
CN113177121A (en) | Text topic classification method and device, electronic equipment and storage medium | |
CN116562284B (en) | Government affair text automatic allocation model training method and device | |
CN118210926B (en) | Text label prediction method and device, electronic equipment and storage medium | |
Seerangan et al. | Ensemble Based Temporal Weighting and Pareto Ranking (ETP) Model for Effective Root Cause Analysis. | |
Liu | A Big Data Analysis of Job Position Status Based on Natural Language Processing | |
Dogra | Aspect-Based Approaches for Measuring Customer Feedback in the E-Commerce Industry |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
GR01 | Patent grant | | |