CN115470354A - Method and system for identifying nested and overlapped risk points based on multi-label classification - Google Patents

Method and system for identifying nested and overlapped risk points based on multi-label classification

Info

Publication number
CN115470354A
CN115470354A
Authority
CN
China
Prior art keywords
entity
sentence
contract
label
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211366277.7A
Other languages
Chinese (zh)
Other versions
CN115470354B (en)
Inventor
郭建威
孙林君
高扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Real Intelligence Technology Co ltd
Original Assignee
Hangzhou Real Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Real Intelligence Technology Co ltd filed Critical Hangzhou Real Intelligence Technology Co ltd
Priority to CN202211366277.7A priority Critical patent/CN115470354B/en
Publication of CN115470354A publication Critical patent/CN115470354A/en
Application granted granted Critical
Publication of CN115470354B publication Critical patent/CN115470354B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of contract document identification, and in particular relates to a method and system for identifying nested and overlapping risk points based on multi-label classification. The method comprises: S1, segmenting contract documents annotated with risk points and keywords into sentences and inputting them into a BERT model for pre-training, obtaining a contract pre-trained BERT model; S2, segmenting the contract document to be identified into sentences and inputting them into the contract pre-trained BERT model to extract word representations and low-level features; S3, fusing the position information and label information of the sentence sequence into a BiLSTM to obtain features that fuse position and label information, and compressing these features; S4, inputting the compressed features into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning. The method and system can automatically identify risk points in contract documents and support risk-point review.

Description

Method and system for identifying nested and overlapped risk points based on multi-label classification
Technical Field
The invention belongs to the technical field of contract document identification, and particularly relates to a method and a system for identifying nested and overlapped risk points based on multi-label classification.
Background
Risk point identification in contract documents is a method for identifying the types of key information defined by industry experts from contract documents; it is a process of extracting structured information from the unstructured text of a contract and is now widely applied to risk-point review of labor contracts, sales contracts, procurement contracts and the like across industries. Many approval workflows require contract documents to be reviewed and approved, and the focus of review is usually certain key information and risk points in the document. Traditional contract review relies on manual identification and checking of each risk point. With the rapid development of Internet technology and paperless office work, the review of large volumes of contract documents has been digitized, providing a convenient setting for improving review efficiency through artificial intelligence. However, risk points in electronic contract documents still need to be located manually, which is inefficient and prone to omissions. Using artificial intelligence to help practitioners in various industries review risk points in electronic contract documents allows risk points to be identified automatically, improves the efficiency of contract review, and avoids missed risk points.
Risk point identification scenarios in contracts generally fall into the ordinary scenario, the nested scenario and the overlapping scenario.
The ordinary scenario means that risk points in the contract text are not associated with each other. For example, in the text [Party A name: XX Company Limited; signing time: 2022-01-01], the risk point "Party A name" [XX Company Limited] is not associated with the risk point "signing time" [2022-01-01].
The nested scenario means that entity texts are nested within each other. For example, the text [from the date the contract takes effect, Party A pays Party B by bank transfer] belongs to the risk point "payment mode", while the inner text [bank transfer] also belongs to the risk point "payment tool".
Risk point overlap means that, in a text such as [contract time: 2022-01-01] within the contracting parties' clauses, the span [2022-01-01] is both the risk point "Party A contract time" and the risk point "Party B contract time", i.e. the same span belongs to two risk points simultaneously.
Methods for named entity recognition can generally be divided into sequence-labeling-based and pointer-based methods. Sequence labeling assigns a tag to each word of the text, and entities are extracted through the correspondence between tags and categories. Pointer-based methods typically predict head and tail pointers and then assign the text enclosed by the head and tail pointers to the corresponding entity class.
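The two schemes can be contrasted on a toy example (illustrative only; the token indices and label names below are not taken from the patent):

```python
# Toy sentence with a nested pair of risk points, encoded both ways.
tokens = ["Party", "A", "pays", "Party", "B", "by", "bank", "transfer"]

# 1) Sequence labeling (BIO): one tag per token. Only ONE label fits each
#    token, so the nested "payment mode" span covering the same words as
#    "payment tool" cannot be encoded at the same time.
bio = ["O"] * len(tokens)
bio[6] = "B-payment_tool"   # "bank"
bio[7] = "I-payment_tool"   # "transfer"

# 2) Pointer/span style: (head, tail, label) triples. Several triples may
#    overlap or even share identical boundaries, so nesting and overlap
#    are representable.
spans = [
    (5, 7, "payment_mode"),   # "by bank transfer"
    (6, 7, "payment_tool"),   # "bank transfer", nested inside the above
]
```

Note that the BIO list can hold only one of the two labels, while the span list holds both without conflict.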
In contracts, entity recognition for the ordinary scenario can be handled well by either sequence-labeling-based or pointer-based methods. The common approach is to model the text with a machine learning model and then extract entity information by sequence labeling. Typical models include the Long Short-Term Memory network (LSTM), such as the bidirectional LSTM named entity recognition method based on predicted-position attention described in patent application No. CN201910225622.7, or an LSTM mixed with a Convolutional Neural Network (CNN) to capture features, such as the Bi-LSTM-CNN mixed-corpus named entity recognition method described in patent application No. CN201710946532.8.
For the nested scenario, the general sequence-labeling-based method fails, because it can assign only one category to each word, while the overlapping part of nested entities corresponds to multiple tags. The sequence-labeling approach can be improved to handle entity nesting: for example, the multi-label classification problem can be converted into a multi-class problem by combining labels, or hierarchical recognition can be used, identifying nested entities layer by layer from inner to outer or from outer to inner. As described in patent publication No. CN114281937A, each sub-entity of a nested entity can be identified by predicting a first nested entity and then identifying a second nested entity based on the prediction information of the first. For non-overlapping nested entities, determining the start index, the end index and the entity class label suffices: as long as the start and end indices of an entity are unique, the entity can be uniquely identified, as described in patent application CN202011522097.4. The document with patent publication No. CN114386417A proposes a named entity recognition method that incorporates word boundary information: it uses word-level information from an external vocabulary, extracts vector representations of semantic information with a pre-trained model, and performs head-tail span judgment on the input sequence with a biaffine network.
For the overlapping scenario, the entities are special in that their extents are completely identical. Current methods generally assume that different entities have different heads and tails, but in contract risk-point extraction a span formed by one head and one tail may belong to two or more risk points simultaneously, so methods designed for the ordinary and nested scenarios fail. The usual workaround is to model or extract each entity type separately, or to extract entities first and then assign them to different categories with a multi-class model.
However, the above prior art has the following disadvantages:
1. Nested and overlapping scenarios of risk points exist in large numbers in contract documents. For the nested scenario, general models such as LSTM and CNN have difficulty capturing nested entity information in contract text: because there is no interaction of information between entity heads and tails, the recognition effect is poor. For the overlapping scenario, identifying different risk points with separate models avoids the overlap problem, but incurs a large resource overhead and considerable waste. Models that first identify risk points and then assign labels by classification inevitably accumulate errors: first, the two learning stages are dependent, and the entity recognition result of the first stage affects the classification result of the second; second, the weights of the per-stage loss functions are introduced as extra hyper-parameters; finally, the network structure and loss function of each stage must be designed manually, including designing them so as to improve the quality of the enumerated candidate spans. Two-stage learning increases the difficulty of model design and the cost of parameter tuning.
2. Contract documents also exhibit severe category imbalance. Whether the method is sequence-labeling-based or pointer-based, most candidate results are not entities: the number of entities is far smaller than the number of non-entity candidates, so a sample-imbalance problem exists. Current approaches based on Binary Cross-Entropy Loss (BCE Loss) or cross-entropy loss do not account for category imbalance in the multi-class or multi-label classification problem. For example, the existing nested named entity recognition method described in the document with patent publication No. CN112989835A uses cross-entropy loss during training to minimize the difference between the predicted distribution and the reference distribution, and thus does not address category imbalance.
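A quick back-of-the-envelope calculation (illustrative numbers, not from the patent) shows why span-based candidates are so imbalanced: a sentence of n tokens yields n·(n+1)/2 candidate spans, while a typical contract sentence contains only a handful of entities.

```python
# Candidate spans vs. actual entities for one sentence of n_tokens tokens.
n_tokens = 128
candidates = n_tokens * (n_tokens + 1) // 2   # all (start, end) pairs, start <= end
entities = 3                                   # hypothetical entity count
neg_pos_ratio = (candidates - entities) / entities
```

With these numbers the negative candidates outnumber the positives by a factor in the thousands, which is the imbalance the ASL loss introduced later is meant to counteract.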
Therefore, it is important to design a method and system for identifying nested and overlapping risk points based on multi-label classification that can automatically identify risk points in contract documents and help practitioners in various industries perform risk-point review.
Disclosure of Invention
To solve the problems in the prior art, namely that risk points in contract documents occur in large numbers of nested and overlapping scenarios and that contract documents exhibit severe category imbalance, the invention provides a method and system for identifying nested and overlapping risk points based on multi-label classification, which can automatically identify risk points in contract documents and help practitioners in various industries perform risk-point review.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for identifying nested and overlapping risk points based on multi-label classification comprises the following steps:
S1, segmenting contract documents annotated with risk points and keywords into sentences, then inputting them into a BERT model for pre-training to obtain a contract pre-trained BERT model;
S2, segmenting the contract document to be identified into sentences, inputting the sentences into the contract pre-trained BERT model to extract word representations and low-level features, and at the same time obtaining the position information and label information of the sentence sequence;
S3, fusing the position information and label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining features that fuse position and label information, and compressing these features;
S4, inputting the features compressed in step S3 into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning through the label matrix and an introduced ASL loss function.
Preferably, the pre-training in step S1 comprises the following steps:
S11, masking risk points and keywords in the contract in a random-masking manner, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
S12, randomly putting two sentences together and judging with the BERT model whether the two sentences belong to the same paragraph.
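Step S11 can be sketched in plain Python as follows. This is a hypothetical helper (names and the masking rate are illustrative, not the patent's code): whole annotated risk-point/keyword spans are replaced by mask tokens so the model must reconstruct them from the surrounding contract context.

```python
import random

def mask_risk_points(tokens, risk_spans, mask_token="[MASK]", rate=0.5, rng=None):
    # Replace each annotated (start, end) span with mask tokens at the given
    # rate; the original tokens become the prediction targets, and all other
    # positions (None) are ignored by the masked-LM loss.
    rng = rng or random.Random(0)
    masked = list(tokens)
    targets = [None] * len(tokens)
    for start, end in risk_spans:
        if rng.random() <= rate:
            for i in range(start, end + 1):
                targets[i] = tokens[i]
                masked[i] = mask_token
    return masked, targets

tokens = ["Party", "A", "pays", "by", "bank", "transfer"]
masked, targets = mask_risk_points(tokens, [(4, 5)], rate=1.0)
```

Masking whole risk-point spans, rather than random single characters, is what ties the pre-training objective to the downstream extraction task.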
Preferably, step S2 includes the steps of:
s21, representing the entity labels of the input sentences by adopting a three-dimensional tensor; the three-dimensional tensor expression is set as any label type
Figure 409670DEST_PATH_IMAGE001
The initial position is
Figure 533484DEST_PATH_IMAGE002
Of a matrix
Figure 129419DEST_PATH_IMAGE003
Setting as 1;
s22, the sequence of the input sentence is represented by characters; input sentence
Figure 604263DEST_PATH_IMAGE004
In which
Figure 445311DEST_PATH_IMAGE005
Is a word in a sentence;
s23, inputting the character sequence of the sentence into a contract pre-training BERT model obtained by pre-training; obtaining a vector representation of a sentence as shown in equation (1)
Figure 423631DEST_PATH_IMAGE006
Wherein
Figure 675621DEST_PATH_IMAGE007
Pre-training the output of the last hidden layer in the BERT model for the contract, wherein n is the length of a sentence;
Figure 887028DEST_PATH_IMAGE008
(1)。
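The three-dimensional label tensor of step S21 can be sketched with numpy as follows (a minimal illustration; the (NUM_LABEL, n, n) layout follows the description given later in the embodiment):

```python
import numpy as np

def build_label_tensor(n, entities, num_labels):
    # L[c, i, j] = 1 marks an entity of class c spanning tokens i..j.
    # Two classes on the same (i, j) simply give two ones, which is the
    # overlapping case a single tag sequence cannot encode.
    L = np.zeros((num_labels, n, n), dtype=np.int64)
    for c, i, j in entities:
        L[c, i, j] = 1
    return L

# "2022-01-01" at tokens 3..5 as both class 0 ("Party A contract time")
# and class 1 ("Party B contract time"): the overlap scenario.
L = build_label_tensor(8, [(0, 3, 5), (1, 3, 5)], num_labels=2)
```

Because each class has its own (start, end) plane, identical spans with different labels coexist without any label-combination tricks.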
Preferably, step S3 comprises the following steps:
S31, extracting features with the BiLSTM and fusing the position information and label information of the sentence sequence into it. Let the position vectors of the contract document in the contract pre-trained BERT model be P, and initialize the vector matrix of the category information as C (one vector per category, m categories in total). Applying two weight matrices gives the weighted representations P' = W_p·P and C' = W_c·C, and the fused category-and-position information F is obtained with formula (2):
F = P' + C' (2);
BiLSTM feature extraction is formula (3), where X is the sentence vector output by the contract pre-trained BERT model and Q is the feature corresponding to each token. The fused category-and-position information is used to weight Q, as shown in formula (4), and the final result V is the token features fusing category and position information:
Q = BiLSTM(X) (3)
V = F ⊙ Q (4);
S32, inputting the token features fused with category and position information into two feed-forward neural networks (FFNN) for feature compression, as shown in formulas (5) and (6):
h_s(i) = FFNN_start(v_{s_i}) (5)
h_e(i) = FFNN_end(v_{e_i}) (6)
where s_i and e_i are respectively the start and end positions of the candidate entity span i; v_{s_i} and v_{e_i} are the token features fusing category and position information at positions s_i and e_i of the sentence; and h_s(i) and h_e(i) are respectively the feature-compressed head and tail features of entity span i.
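Formulas (2) through (6) can be sketched numerically as follows. This is a shape-level illustration only: the BiLSTM is replaced by a stand-in projection, and the exact fusion and weighting forms (sum fusion, sigmoid gate) are plausible readings of the description rather than the patent's specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 6, 8, 4                      # sentence length, hidden dim, num classes
X = rng.normal(size=(n, d))            # BERT sentence vectors (stand-in)
P = rng.normal(size=(n, d))            # position vectors
C = rng.normal(size=(m, d))            # class-information matrix

# Formula (2), one plausible reading: project P and a pooled C, then add.
W_p, W_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
F = P @ W_p + C.mean(axis=0) @ W_c     # (n, d) fused position+label info

# Formula (3): BiLSTM features, replaced here by a toy projection.
Q = np.tanh(X @ rng.normal(size=(d, d)))

# Formula (4): weight each token feature by the fused information (gate form).
V = Q * (1.0 / (1.0 + np.exp(-F)))     # sigmoid(F) as the weighting

# Formulas (5)-(6): two FFNNs compress head/tail features of a span (s, e).
def ffnn(w, b):
    return lambda v: np.maximum(v @ w + b, 0.0)

d_c = 4                                 # compressed dimension (NUM_DIMENSION)
ffnn_start = ffnn(rng.normal(size=(d, d_c)), np.zeros(d_c))
ffnn_end   = ffnn(rng.normal(size=(d, d_c)), np.zeros(d_c))
s, e = 1, 3                             # a candidate entity range
h_s, h_e = ffnn_start(V[s]), ffnn_end(V[e])
```

Using two separate FFNNs for head and tail is what lets the later biaffine layer model head-tail interaction explicitly.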
Preferably, step S4 comprises the following steps:
S41, inputting the compressed features into a biaffine network classifier for classification, as shown in formula (7), where h_s(i) and h_e(i) are respectively the head and tail character features of the i-th entity:
r_i = h_s(i)^T · U · h_e(i) + W · [h_s(i); h_e(i)] + b (7)
where U is a tensor of shape (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION), NUM_DIMENSION being the dimension after feature compression and NUM_LABEL the number of entity classes; W is a tensor of shape (NUM_LABEL, 2·NUM_DIMENSION); b is the bias, a tensor of shape (NUM_LABEL, 1); and r_i is the classification score of entity i;
S42, performing loss calculation between the classification result r_i and the labels input in step S21. For any entity i whose entity type c takes the value y_i, p_{y_i} denotes the probability that the named-entity type is y_i, r_i^c is the classification score of entity i for type c, and c ranges over [1, C], where C is the number of categories NUM_LABEL; the class probabilities are obtained with the softmax of formula (8). When the candidate span is not an entity, the shifted probability is computed with the probability-shift formula (10), where p is the result of formula (8) and m is a set hyper-parameter, the shift margin of the probability; the shifted probability p_m is substituted into formula (11) to obtain L-. If the span is an entity, p is substituted directly into formula (11) to obtain L+; in the L+ calculation, p is likewise the result of formula (8). γ+ and γ- are respectively the positive and negative attention parameters, set hyper-parameters controlling the width of the probability deviation. Finally, the ASL loss value of the training stage is calculated according to formulas (10) and (11):
p = softmax(r_i) (8)
p_m = max(p - m, 0) (10)
L+ = (1 - p)^{γ+}·log(p), L- = (p_m)^{γ-}·log(1 - p_m) (11)
The weights of all layers of the neural network model are iteratively tuned with the back-propagation algorithm according to the calculated ASL loss value.
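A numerical sketch of the biaffine score (7), the softmax (8), and the ASL terms (10) and (11) follows. The hyper-parameter values (gamma_pos = 0, gamma_neg = 4, m = 0.05) are illustrative defaults from the ASL literature, not values fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_c, num_label = 4, 3

# Formula (7): biaffine scoring of one candidate span.
U = rng.normal(size=(d_c, num_label, d_c))   # (NUM_DIM, NUM_LABEL, NUM_DIM)
W = rng.normal(size=(num_label, 2 * d_c))
b = rng.normal(size=num_label)
h_s, h_e = rng.normal(size=d_c), rng.normal(size=d_c)
r = np.einsum("i,ikj,j->k", h_s, U, h_e) + W @ np.concatenate([h_s, h_e]) + b

# Formula (8): softmax over entity classes.
p = np.exp(r - r.max())
p /= p.sum()

def asl_term(p_c, is_entity, gamma_pos=0.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    # Formulas (10)-(11) for a single class probability p_c.
    if is_entity:                              # L+ term
        return -((1.0 - p_c) ** gamma_pos) * np.log(p_c + eps)
    p_m = max(p_c - m, 0.0)                    # formula (10): probability shift
    return -(p_m ** gamma_neg) * np.log(1.0 - p_m + eps)   # L- term
```

Note how the shift m makes easy negatives (p_c < m) contribute exactly zero loss, which is how ASL keeps the huge majority of non-entity spans from dominating training.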
The invention also provides a system for identifying nested and overlapping risk points based on multi-label classification, comprising:
a pre-training module, used to segment contract documents annotated with risk points and keywords into sentences and input them into a BERT model for pre-training, obtaining a contract pre-trained BERT model;
a feature extraction module, used to segment the contract document to be identified into sentences, input the sentences into the contract pre-trained BERT model to extract word representations and low-level features, and at the same time obtain the position information and label information of the sentence sequence;
a feature fusion module, used to fuse the position information and label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtain features that fuse position and label information, and compress these features;
a classification module, used to input the compressed features into a biaffine network, perform span enumeration and entity label classification, and perform parameter learning through the label matrix and an introduced ASL loss function.
Preferably, the pre-training module works as follows:
masking risk points and keywords in the contract in a random-masking manner, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
randomly putting two sentences together and judging with the BERT model whether the two sentences belong to the same paragraph.
Compared with the prior art, the invention has the following beneficial effects. (1) The invention is based on BERT (Bidirectional Encoder Representations from Transformers) and adopts a new pre-training method: the model is pre-trained by predicting risk points in the contract from context, and, in line with the characteristics of contract text, the commonly used adjacent-sentence pre-training task is changed into a same-paragraph training task. (2) For the entity nesting and overlapping problems in the named entity recognition task on contract documents, a pointer-based scheme is adopted, with a biaffine network as the module for entity span enumeration and entity classification. Instead of a tag sequence, a matrix containing the label type and the entity start-position subscripts represents the label types of the training data, avoiding the situations where a BIO sequence cannot represent a nested entity and two head-tail pointer sequences cannot represent an overlapping entity. This representation is simple and intuitive and can directly represent the labels of nested and overlapping entities, in particular the labels of different entities whose head and tail positions are completely identical. Token features fused with position and label information are input into the biaffine network, which captures the interaction between entity heads and tails well and solves the nesting and overlapping problems. (3) For the category-imbalance problem, the invention uses the asymmetric loss (ASL), an improvement of the focal loss for multi-label classification, which addresses the imbalance of positive and negative samples in the multi-label classification task. (4) For overly long entities, the invention fuses the position information and label information of the sequence into the BiLSTM, giving the BiLSTM stronger sequence modeling capability and thereby reducing entity fragmentation.
Drawings
FIG. 1 is a schematic diagram of a method for identifying nested and overlapping risk points based on multi-label classification in accordance with the present invention;
FIG. 2 is a schematic diagram of an actual application of the method for identifying nested and overlapped risk points based on multi-label classification according to the embodiment of the present invention.
Detailed Description
To illustrate the embodiments of the present invention more clearly, the following description explains the embodiments with reference to the accompanying drawings. Obviously, the drawings described below are only some examples of the invention; for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
The method for identifying nested and overlapping risk points based on multi-label classification shown in FIG. 1 comprises the following steps:
S1, segmenting contract documents annotated with risk points and keywords into sentences, and inputting the sentences into a BERT model for pre-training to obtain a contract pre-trained BERT model;
S2, segmenting the contract document to be identified into sentences, inputting them into the contract pre-trained BERT model to extract word representations and low-level features, and at the same time obtaining the position information and label information of the sentence sequence;
S3, fusing the position information and label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining features that fuse position and label information, and compressing these features;
S4, inputting the features compressed in step S3 into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning through the label matrix and an introduced ASL loss function.
Further, the pre-training in step S1 comprises the following steps:
S11, masking risk points and keywords in the contract in a random-masking manner, and predicting the masked risk points and keywords from the unmasked context with the BERT model;
S12, randomly putting two sentences together and judging with the BERT model whether the two sentences belong to the same paragraph.
Through this pre-training approach, the pre-trained BERT can capture the characteristics of contract text more effectively and produces better results on downstream tasks.
Further, step S2 comprises the following steps:
S21, representing the entity labels of the input sentence with a three-dimensional tensor: for any label type c whose entity starts at position i and ends at position j, the corresponding matrix entry L[c][i][j] is set to 1.
The input of the model training stage comprises two parts: the text from which document named entities are to be extracted, and the pointer-style annotations giving the position and entity type of each named entity in the text. In the data loading phase of the model, the entity labels of the data are converted into a tensor L of shape (NUM_LABEL, NUM_SEQ_LEN, NUM_SEQ_LEN), where NUM_LABEL is the number of entity categories and NUM_SEQ_LEN is the maximum length of the input text. L is a three-dimensional tensor: the first dimension is the classification category, and the second and third dimensions cover all possible combinations of an entity's start position and end position.
S22, representing the input sentence as a character sequence X = {x_1, x_2, ..., x_n}, where x_i is a character of the sentence;
S23, inputting the character sequence of the sentence into the contract pre-trained BERT model obtained by pre-training, and obtaining the vector representation of the sentence as shown in formula (1), H = {h_1, h_2, ..., h_n}, where h_i is the output of the last hidden layer of the contract pre-trained BERT model and n is the length of the sentence:
H = BERT(X) (1).
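The span enumeration over the second and third dimensions of L, and the inverse read-out of gold entities from it, can be sketched as follows (the max_len cap is a hypothetical efficiency knob, not part of the patent):

```python
import numpy as np

def enumerate_spans(n, max_len=None):
    # All (start, end) combinations with start <= end: exactly the index
    # pairs covered by the second and third dimensions of the tensor L.
    spans = []
    for i in range(n):
        for j in range(i, n):
            if max_len is None or (j - i + 1) <= max_len:
                spans.append((i, j))
    return spans

def decode_entities(L):
    # Read gold entities back out of L: every (c, i, j) with L[c, i, j] = 1.
    return [tuple(int(v) for v in idx) for idx in np.argwhere(L == 1)]
```

Enumerating all start-end pairs is what makes every nested or overlapping candidate visible to the classifier, at the cost of the quadratic candidate count discussed in the background section.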
further, step S3 includes the steps of:
s31, using the BiLSTM to extract features, and fusing the position information and label information of the sentence sequence into the BiLSTM; let the position vector of the contract document in the contract pre-training BERT model be P, and the initialized vector matrix of the category information be C = {c_1, …, c_m}, where m is the number of categories; the projections obtained from two weight matrices W_p and W_c are denoted W_p·P and W_c·C, and the fusion information F of category and position is obtained with formula (2);
F = W_p·P + W_c·C (2);
The formula of the BiLSTM feature extraction is formula (3), where X is the sentence vector output by the contract pre-training BERT model and h_i is the feature corresponding to each token; the obtained fusion information F of category and position is used to weight h_i, as shown in formula (4), and the finally obtained h'_i is the token feature fusing category and position information;
H = {h_1, …, h_n} = BiLSTM(X) (3)
h'_i = F ⊙ h_i (4);
s32, inputting the token features fused with category and position information into two feed-forward neural networks FFNN for feature compression, the specific process being shown in formula (5) and formula (6):
h_s = FFNN_s(h'_i) (5)
h_e = FFNN_e(h'_j) (6)
where i and j are respectively the start and end positions of the candidate entity span (i, j); h'_i and h'_j are the token features fusing category and position information at positions i and j of the sentence; h_s and h_e are respectively the compressed head and tail features of the entity span (i, j).
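An illustrative one-layer feed-forward compression, as in formulas (5) and (6); the weight values and dimensions are hypothetical toy choices, not the embodiment's parameters:

```python
def ffnn(x, W, b):
    # One-layer feed-forward network: ReLU(W·x + b);
    # compresses a vector of len(x) down to len(b) dimensions.
    return [max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Separate (hypothetical) parameters for the start-position and
# end-position networks of formulas (5) and (6)
token_i = [1.0, 2.0, 3.0]                        # fused feature at start i
token_j = [0.0, 1.0, 0.0]                        # fused feature at end j
W_start = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # 3 -> 2 compression
W_end   = [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
h_s = ffnn(token_i, W_start, [0.0, 0.0])         # head feature of span (i, j)
h_e = ffnn(token_j, W_end,   [0.0, 0.0])         # tail feature of span (i, j)
```

Using two separate networks lets the start and end representations specialize before they meet in the biaffine classifier.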
Further, step S4 includes the steps of:
s41, inputting the compressed features into a biaffine network classifier for classification, as shown in the following formula (7), where h_s(i) and h_e(i) are respectively the character features of the beginning and end of the i-th entity;
r(i) = h_s(i)^T · U · h_e(i) + W · (h_s(i) ⊕ h_e(i)) + b (7)
where U is a tensor of shape (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION), NUM_DIMENSION being the dimension after feature compression and NUM_LABEL the number of entity classes; W is a tensor of shape (NUM_LABEL, 2 × NUM_DIMENSION); b is the offset, a tensor of shape (NUM_LABEL, 1); r(i) is the classification score of entity i;
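Formula (7) can be sketched in plain Python; the shapes and values below are hypothetical toy choices (d = 2, two classes), not the embodiment's dimensions:

```python
def biaffine_score(h_s, h_e, U, W, b):
    """r_c = h_s^T · U_c · h_e + W_c · (h_s ++ h_e) + b_c for each class c.

    U: (num_label, d, d), W: (num_label, 2d), b: (num_label,).
    Every span (start, end) receives a full vector of class scores,
    so overlapping and nested spans never compete for one label.
    """
    concat = h_s + h_e                       # concatenation h_s ++ h_e
    scores = []
    for U_c, W_c, b_c in zip(U, W, b):
        bilinear = sum(h_s[a] * U_c[a][k] * h_e[k]
                       for a in range(len(h_s)) for k in range(len(h_e)))
        linear = sum(w * x for w, x in zip(W_c, concat))
        scores.append(bilinear + linear + b_c)
    return scores

# Identity-like U for class 0, pure linear term for class 1
U = [[[1, 0], [0, 1]], [[0, 0], [0, 0]]]
W = [[0, 0, 0, 0], [1, 1, 1, 1]]
b = [0, 2]
r = biaffine_score([1, 2], [3, 4], U, W, b)
```

The bilinear term captures interactions between the head and tail features; the linear term and offset score each feature and each class directly.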
s42, performing loss calculation between the classification result r(i) and the labels input in step S21; for any entity i, when the entity type c takes the value y, p_y represents the probability that the named entity type is y, r_c(i) is the classification score of entity i on type c, and the value range of c is [1, C], where C is the number of categories NUM_LABEL; when the span (i, j) is not an entity, the shifted probability p_m is obtained with formula (10) from the probability p calculated with formula (8), where m is a set hyperparameter, the shift parameter of the probability, and p_m is substituted into the negative-sample term L− of formula (11); if the span is an entity, the probability p calculated with formula (8) is substituted directly into the positive-sample term L+ of formula (11); γ+ and γ− are respectively the positive and negative attention parameters, set hyperparameters controlling the width of the probability deviation; finally the ASL loss value of the training stage is calculated according to formula (10) and formula (11);
p = Sigmoid(r(i)) (8)
p_m = max(p − m, 0) (10)
L+ = (1 − p)^γ+ · log(p); L− = (p_m)^γ− · log(1 − p_m) (11)
and carrying out iterative tuning on the weights of all layers in the neural network model by using a back-propagation algorithm according to the calculated ASL loss value.
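A sketch of the asymmetric loss of formulas (8), (10) and (11), following the published ASL formulation; the sigmoid in (8) and the hyperparameter values are assumptions for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def asl_loss(score, is_positive, gamma_pos=1.0, gamma_neg=4.0, m=0.05):
    """Asymmetric loss for one class of one candidate span.

    score: raw classification score r_c(i); p = sigmoid(score) -- formula (8).
    Negatives use the shifted probability p_m = max(p - m, 0) -- formula (10),
    so easy negatives below the margin m contribute exactly zero loss.
    """
    p = sigmoid(score)
    if is_positive:
        return -((1 - p) ** gamma_pos) * math.log(p)        # L+ of (11)
    p_m = max(p - m, 0.0)
    return -(p_m ** gamma_neg) * math.log(1 - p_m)          # L- of (11)

# An easy negative (score far below zero) is fully silenced by the shift,
# which is how ASL keeps the huge number of non-entity spans from
# swamping the rare entity spans during training.
hard_positive = asl_loss(0.0, is_positive=True)
easy_negative = asl_loss(-4.0, is_positive=False)
```

The asymmetry (γ− > γ+, plus the shift m) is what addresses the class imbalance of span-based labeling, where almost all enumerated spans are non-entities.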
The invention also provides a system for identifying nested and overlapped risk points based on multi-label classification, comprising:
the pre-training module is used for segmenting the contract document labeled with risk points and keywords into sentences and inputting the sentences into the BERT model for pre-training to obtain a contract pre-training BERT model;
the feature extraction module is used for segmenting the contract document to be identified into sentences, inputting the sentences into the contract pre-training BERT model to extract word representations and bottom-layer features, and simultaneously obtaining the position information and label information of the sentence sequence;
the feature fusion module is used for fusing the position information and the label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining the features fused with position and label information, and performing feature compression;
and the classification module is used for inputting the compressed features into the biaffine network, performing span enumeration and entity label classification, and performing parameter learning through the label matrix and the introduced ASL loss function.
The pre-training module is specifically as follows:
masking the risk points and keywords in the contract in a random masking mode, and predicting the masked risk points and keywords from the unmasked context through the BERT model;
randomly putting two sentences together, and judging through the BERT model whether the two sentences are in the same paragraph.
Based on the technical scheme of the invention, the steps of the specific implementation and operation process are as follows:
pre-training:
1. Mask the risk points and keywords in the contract text and feed the masked text into the BERT model; BERT predicts the masked risk points and keywords according to the unmasked context.
2. Randomly combine the sentences in the contract into sentence pairs and feed them into the BERT model; BERT predicts whether the two sentences belong to the same paragraph.
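Step 1 above can be sketched as follows in plain Python; the keyword list, mask token, and whole-word matching over characters are illustrative assumptions — a real implementation would operate on BERT token ids:

```python
import random

def mask_risk_keywords(tokens, risk_keywords, mask_token="[MASK]",
                       rate=0.5, seed=42):
    """Randomly mask whole risk-point keywords so that BERT must recover
    them from the unmasked contract context (whole-word masking of the
    keyword span, rather than masking random single characters)."""
    rng = random.Random(seed)
    masked, targets = [], []
    i = 0
    while i < len(tokens):
        hit = next((kw for kw in risk_keywords
                    if tokens[i:i + len(kw)] == list(kw)), None)
        if hit is not None and rng.random() < rate:
            masked.extend([mask_token] * len(hit))
            targets.append((i, hit))      # position and original keyword
            i += len(hit)
        else:
            masked.append(tokens[i])
            i += 1
    return masked, targets

sent = list("合同违约金过高")
masked, targets = mask_risk_keywords(sent, ["违约金"], rate=1.0)
```

Masking whole keywords rather than arbitrary characters forces the model to learn contract-specific vocabulary, which is the point of this domain pre-training step.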
And (4) supervision training:
the training phase is shown in fig. 2 and has the following steps:
3. Divide the documents input into the system for entity extraction into a training document set and a validation document set.
4. Each document needs to contain two parts, one part is the text of the document, and the other part is the entity position and the label corresponding to the text. And the entity set of the document is subdivided according to the segmentation position of the sentence to obtain a list consisting of the sentence and the entities of the sentence.
5. For each sentence s of the training set, together with its sentence entity set E, the sentence is input into BERT to obtain the vector representation X of the sentence tokens.
6. The token vectors X finally output by BERT are input into the BiLSTM to extract the context information H of the sentence; the position information and the context information are then fused according to formula (4) to obtain the context representation h' fusing the position information.
7. The sentence context information extracted by the BiLSTM is respectively input into two 1-layer FFNN feed-forward neural networks, and formulas (5) and (6) are applied to obtain the head and tail features h_s and h_e of the sentence entities.
8. The head and tail features of the sentence entities are input into the biaffine network, and formula (7) and formula (8) are applied to output the span enumeration result of the entities and the probability p corresponding to each span.
9. The ASL loss value of the probabilities output by the biaffine network is calculated with formula (10) and formula (11) according to the ASL loss function.
10. The weights of each layer of the neural network model are adjusted according to the ASL loss value.
11. After the sentences of the training set have been trained for one round, the accuracy of the current model parameters on the validation set is calculated, and the model weights are saved.
12. Steps 5 to 11 are repeated until the preset epoch value is reached.
13. The weights saved at the epoch with the maximum accuracy on the validation set constitute the learned optimal model.
In addition, the prediction phase of the model comprises the following steps:
1. The document to be predicted does not need to contain entity positions and labels; the text of the document is segmented by periods to obtain a sentence list.
2. Each sentence to be predicted is input into BERT to obtain the vector representation of the sentence tokens;
3. the token vectors finally output by BERT are input into the BiLSTM to extract the context information of the sentence;
4. the position information and the context information are then fused according to formula (4) to obtain the context representation fusing the position information;
5. the context representation fusing the position information is respectively input into two 1-layer FFNN feed-forward neural networks to obtain the head and tail features of the sentence entities;
6. the head and tail features of the sentence entities are input into the biaffine network, which outputs the span enumeration result of the entities and the probability corresponding to each span;
7. the entities whose probability in the span enumeration result is greater than a set threshold are output, namely the entities extracted from the sentence;
8. the extraction results of all sentences in the document are combined as the entity extraction result of the document.
The entity extraction model is thus trained from the training data set in the training stage, and after a document to be extracted is input, the entities contained in the document are predicted, completing the training and prediction process of a complete entity extraction model.
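Prediction steps 6 to 8 can be sketched as follows; the probability map, label names, and threshold are hypothetical:

```python
def decode_spans(span_probs, threshold=0.5):
    """span_probs: {(start, end, label): probability} for one sentence.
    Keep every span whose probability clears the threshold; nested and
    overlapping spans can therefore all survive decoding."""
    return sorted((s, e, lab) for (s, e, lab), p in span_probs.items()
                  if p > threshold)

probs = {
    (0, 4, "risk_point"): 0.91,   # outer entity
    (2, 4, "keyword"):    0.77,   # nested inside the first -- both kept
    (3, 6, "keyword"):    0.12,   # below threshold, dropped
}
doc_entities = []
for sent_offset, sent_probs in [(0, probs)]:   # one-sentence "document"
    for s, e, lab in decode_spans(sent_probs):
        # shift sentence-local positions to document positions, then merge
        doc_entities.append((s + sent_offset, e + sent_offset, lab))
```

Because decoding is a per-span threshold rather than a per-token tag sequence, no conflict resolution between overlapping entities is needed.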
The invention adopts a biaffine-network-based method to identify entities in contract documents, solving the problems of entity overlap and entity nesting in the contract-document entity extraction task. Segmenting documents by periods improves the universality of the algorithm, so the method can easily be generalized to other extraction scenarios. BERT is used for token representation and BiLSTM for feature extraction; the label matrix and the biaffine network are used for entity span enumeration and entity category prediction, which solves the entity nesting problem in contract document extraction; and the ASL loss is used to optimize the network parameters, which handles entities whose start positions completely overlap.
The invention discloses a pre-training method based on the characteristics of contract texts: the model is pre-trained to predict masked risk points and keywords and to predict whether two sentences belong to the same paragraph. BERT pre-trained in this way carries more contract text information, captures contract text information better during training and prediction, and greatly reduces the training resources required.
The invention creatively provides a biaffine-network-based method that enumerates and classifies entity spans in contract documents jointly, unlike methods that first enumerate entity spans and then classify the entities.
The invention creatively fuses the bottom-layer position information and category information into the middle-layer BiLSTM network, which strengthens the ability of the middle-layer BiLSTM network to capture long-text information, extracts the information features of risk points better, and greatly reduces the possibility of entity fracture.
The invention innovatively designs a label-matrix representation of entity spans in contract documents and combines it with the multi-label classification loss function ASL, which suits network parameter optimization under entity nesting and overlap in contract documents, solves the category imbalance problem in span-based named entity extraction, and can be generalized to other scenarios.
In conclusion, the method reduces manual design, has strong universality, and solves the entity nesting and overlap problems in contract-document entity extraction.
The foregoing has outlined the preferred embodiment and principles of the present invention so that those skilled in the art may better understand the detailed description; the description is illustrative and does not limit the invention in its broader aspects.

Claims (7)

1. A method for identifying nested and overlapping risk points based on multi-label classification, comprising the following steps:
s1, segmenting a contract document labeled with risk points and keywords into sentences, and inputting the sentences into a BERT model for pre-training to obtain a contract pre-training BERT model;
s2, segmenting the contract document to be identified into sentences, inputting the sentences into the contract pre-training BERT model to extract word representations and bottom-layer features, and simultaneously obtaining the position information and label information of the sentence sequence;
s3, fusing the position information and the label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining the features fused with position and label information, and performing feature compression;
and s4, inputting the features compressed in step S3 into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning through a label matrix and an introduced ASL loss function.
2. The method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 1, wherein the pre-training in step S1 comprises the steps of:
s11, masking risk points and keywords in the contract in a random masking mode, and predicting the masked risk points and keywords from the unmasked context through the BERT model;
and s12, randomly putting two sentences together, and judging through the BERT model whether the two sentences are in the same paragraph.
3. The method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 2, wherein step S2 comprises the steps of:
s21, representing the entity labels of the input sentence with a three-dimensional tensor; in the three-dimensional tensor, for any entity whose label category is c and whose start and end positions are i and j, the element L[c][i][j] of the matrix is set to 1;
s22, representing the sequence of the input sentence by characters; the input sentence is S = {w_1, w_2, …, w_n}, where w_i is a character in the sentence;
s23, inputting the character sequence of the sentence into the contract pre-training BERT model obtained by pre-training, and obtaining the vector representation of the sentence as shown in formula (1), where X = {x_1, x_2, …, x_n} is the output of the last hidden layer of the contract pre-training BERT model and n is the length of the sentence;
X = BERT(S) (1).
4. the method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 3, wherein step S3 comprises the steps of:
s31, using the BiLSTM to extract features, and fusing the position information and label information of the sentence sequence into the BiLSTM; let the position vector of the contract document in the contract pre-training BERT model be P, and the initialized vector matrix of the category information be C = {c_1, …, c_m}, where m is the number of categories; the projections obtained from two weight matrices W_p and W_c are denoted W_p·P and W_c·C, and the fusion information F of category and position is obtained with formula (2);
F = W_p·P + W_c·C (2);
The formula of the BiLSTM feature extraction is formula (3), where X is the sentence vector output by the contract pre-training BERT model and h_i is the feature corresponding to each token; the obtained fusion information F of category and position is used to weight h_i, as shown in formula (4), and the finally obtained h'_i is the token feature fusing category and position information;
H = {h_1, …, h_n} = BiLSTM(X) (3)
h'_i = F ⊙ h_i (4);
s32, inputting the token features fused with category and position information into two feed-forward neural networks FFNN for feature compression, the specific process being shown in formula (5) and formula (6):
h_s = FFNN_s(h'_i) (5)
h_e = FFNN_e(h'_j) (6)
where i and j are respectively the start and end positions of the candidate entity span (i, j); h'_i and h'_j are the token features fusing category and position information at positions i and j of the sentence; h_s and h_e are respectively the compressed head and tail features of the entity span (i, j).
5. The method for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 4, wherein step S4 comprises the steps of:
s41, inputting the compressed features into a biaffine network classifier for classification, as shown in the following formula (7), where h_s(i) and h_e(i) are respectively the character features of the beginning and end of the i-th entity;
r(i) = h_s(i)^T · U · h_e(i) + W · (h_s(i) ⊕ h_e(i)) + b (7)
where U is a tensor of shape (NUM_DIMENSION, NUM_LABEL, NUM_DIMENSION), NUM_DIMENSION being the dimension after feature compression and NUM_LABEL the number of entity classes; W is a tensor of shape (NUM_LABEL, 2 × NUM_DIMENSION); b is the offset, a tensor of shape (NUM_LABEL, 1); r(i) is the classification score of entity i;
s42, performing loss calculation between the classification result r(i) and the labels input in step S21; for any entity i, when the entity type c takes the value y, p_y represents the probability that the named entity type is y, r_c(i) is the classification score of entity i on type c, and the value range of c is [1, C], where C is the number of categories NUM_LABEL; when the span (i, j) is not an entity, the shifted probability p_m is obtained with formula (10) from the probability p calculated with formula (8), where m is a set hyperparameter, the shift parameter of the probability, and p_m is substituted into the negative-sample term L− of formula (11); if the span is an entity, the probability p calculated with formula (8) is substituted directly into the positive-sample term L+ of formula (11); γ+ and γ− are respectively the positive and negative attention parameters, set hyperparameters controlling the width of the probability deviation; finally the ASL loss value of the training stage is calculated according to formula (10) and formula (11);
p = Sigmoid(r(i)) (8)
p_m = max(p − m, 0) (10)
L+ = (1 − p)^γ+ · log(p); L− = (p_m)^γ− · log(1 − p_m) (11)
and carrying out iterative tuning on the weights of all layers in the neural network model by using a back-propagation algorithm according to the calculated ASL loss value.
6. A system for identifying nested and overlapping risk points based on multi-label classification, comprising:
the pre-training module is used for segmenting the contract document labeled with risk points and keywords into sentences and inputting the sentences into the BERT model for pre-training to obtain a contract pre-training BERT model;
the feature extraction module is used for segmenting the contract document to be identified into sentences, inputting the sentences into the contract pre-training BERT model to extract word representations and bottom-layer features, and simultaneously obtaining the position information and label information of the sentence sequence;
the feature fusion module is used for fusing the position information and the label information of the sentence sequence into a bidirectional long short-term memory network (BiLSTM), obtaining the features fused with position and label information, and performing feature compression;
and the classification module is used for inputting the compressed features into a biaffine network, performing span enumeration and entity label classification, and performing parameter learning through a label matrix and an introduced ASL loss function.
7. The system for identifying nested and overlapping risk points based on multi-label classification as claimed in claim 6, wherein the pre-training module is specifically as follows:
masking the risk points and keywords in the contract in a random masking mode, and predicting the masked risk points and keywords from the unmasked context through the BERT model;
randomly putting two sentences together, and judging through the BERT model whether the two sentences are in the same paragraph.
CN202211366277.7A 2022-11-03 2022-11-03 Method and system for identifying nested and overlapped risk points based on multi-label classification Active CN115470354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211366277.7A CN115470354B (en) 2022-11-03 2022-11-03 Method and system for identifying nested and overlapped risk points based on multi-label classification

Publications (2)

Publication Number Publication Date
CN115470354A true CN115470354A (en) 2022-12-13
CN115470354B CN115470354B (en) 2023-08-22

Family

ID=84338111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211366277.7A Active CN115470354B (en) 2022-11-03 2022-11-03 Method and system for identifying nested and overlapped risk points based on multi-label classification

Country Status (1)

Country Link
CN (1) CN115470354B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115995087A (en) * 2023-03-23 2023-04-21 杭州实在智能科技有限公司 Document catalog intelligent generation method and system based on fusion visual information
CN116092493A (en) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116306657A (en) * 2023-05-19 2023-06-23 之江实验室 Entity extraction method and system based on square matrix labeling and double affine layers attention

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413785A (en) * 2019-07-25 2019-11-05 淮阴工学院 A kind of Automatic document classification method based on BERT and Fusion Features
US10528866B1 (en) * 2015-09-04 2020-01-07 Google Llc Training a document classification neural network
CN112101027A (en) * 2020-07-24 2020-12-18 昆明理工大学 Chinese named entity recognition method based on reading understanding
US20210012199A1 (en) * 2019-07-04 2021-01-14 Zhejiang University Address information feature extraction method based on deep neural network model
US20210034812A1 (en) * 2019-07-30 2021-02-04 Imrsv Data Labs Inc. Methods and systems for multi-label classification of text data
WO2021051516A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Ancient poem generation method and apparatus based on artificial intelligence, and device and storage medium
CN112860889A (en) * 2021-01-29 2021-05-28 太原理工大学 BERT-based multi-label classification method

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115995087A (en) * 2023-03-23 2023-04-21 杭州实在智能科技有限公司 Document catalog intelligent generation method and system based on fusion visual information
CN116092493A (en) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116092493B (en) * 2023-04-07 2023-08-25 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116306657A (en) * 2023-05-19 2023-06-23 之江实验室 Entity extraction method and system based on square matrix labeling and double affine layers attention
CN116306657B (en) * 2023-05-19 2023-08-22 之江实验室 Entity extraction method and system based on square matrix labeling and double affine layers attention

Also Published As

Publication number Publication date
CN115470354B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN115470354A (en) Method and system for identifying nested and overlapped risk points based on multi-label classification
CN109597493B (en) Expression recommendation method and device
CN115952291B (en) Financial public opinion classification method and system based on multi-head self-attention and LSTM
CN111339260A (en) BERT and QA thought-based fine-grained emotion analysis method
CN112528031A (en) Work order intelligent distribution method and system
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN116956929B (en) Multi-feature fusion named entity recognition method and device for bridge management text data
CN112434164A (en) Network public opinion analysis method and system considering topic discovery and emotion analysis
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
CN115455189A (en) Policy text classification method based on prompt learning
CN117197569A (en) Image auditing method, image auditing model training method, device and equipment
CN111651597A (en) Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network
Kiyak et al. Comparison of image-based and text-based source code classification using deep learning
CN111274494A (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN113761184A (en) Text data classification method, equipment and storage medium
Jayashree et al. Sentimental analysis on voice based reviews using fuzzy logic
Léon Extracting information from PDF invoices using deep learning
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
CN114510569A (en) Chemical emergency news classification method based on Chinesebert model and attention mechanism
CN113177121A (en) Text topic classification method and device, electronic equipment and storage medium
CN116562284B (en) Government affair text automatic allocation model training method and device
CN118210926B (en) Text label prediction method and device, electronic equipment and storage medium
Seerangan et al. Ensemble Based Temporal Weighting and Pareto Ranking (ETP) Model for Effective Root Cause Analysis.
Liu A Big Data Analysis of Job Position Status Based on Natural Language Processing
Dogra Aspect-Based Approaches for Measuring Customer Feedback in the E-Commerce Industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant