CN113869053A - Method and system for recognizing named entities oriented to judicial texts - Google Patents
- Publication number
- CN113869053A (application CN202111157229.2A)
- Authority
- CN
- China
- Prior art keywords
- character
- data
- vector
- model
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention relates to a method and a system for recognizing named entities oriented to judicial texts. The invention can effectively improve processing speed while preserving the model's perception of long texts; it can effectively capture contextual convolutional feature associations, alleviating the problems of overlapping entity names and illegal labels; and it extracts contextual feature information effectively, with fast training and a simple structure.
Description
Technical Field
The invention relates to the field of data identification and processing, and in particular to a method and a system for recognizing named entities oriented to judicial texts.
Background
With the deepening application of informatization, big data, natural language processing and artificial intelligence in the judicial field, building intelligent judicial platforms has become a research hotspot. In recent years, domestic intelligent judicial work has achieved a series of results in informatizing case-handling processes and improving case-handling efficiency, such as the 12309 integrated service network platform and court informatization systems.
However, existing intelligent judicial technologies have obvious limitations, and the product experience leaves considerable room for improvement. To date, China Judgments Online has published nearly 60 million judgment documents. Given the large stock and wide variety of judicial documents, automatic processing of judicial text information has become key work for intelligent justice. Automatic extraction of judicial text information is of great significance for easing the "many cases, few staff" problem in the judicial system and for improving judicial efficiency, and is an important aspect of effectively maintaining social fairness and justice.
Entity recognition is an important technology in automatic information extraction and is foundational, necessary work for intelligent judicial platforms such as intelligent knowledge search in the judicial field, intelligent adjudication and sentencing, and intelligent document generation. Research on judicial text entity recognition is therefore particularly important for further advancing intelligent judicial platforms. Named Entity Recognition (NER), also called "proper name recognition", refers to recognizing entities with specific meaning in text, mainly names of people, places and organizations, and other proper nouns. Simply put, it identifies the boundaries and categories of entity mentions in natural text.
At present, named entity recognition methods mainly comprise rule- and dictionary-based methods, unsupervised learning methods that cluster contexts using lexical resources (such as WordNet), hybrid methods combining several models, and deep-learning-based methods. Deep-learning-based algorithms perform best. They treat named entity recognition as a sequence labeling task, the classical approaches being LSTM+CRF and BiLSTM+CRF. However, deep-learning-based methods still need further optimization owing to limited word representations, insufficient capture of local semantic information, and slow parallel computation during training. The BERT model achieved state-of-the-art performance on various natural language tasks when proposed, and much subsequent research has built on it. For example, the patent with publication number CN109992782A uses a BERT pre-trained model in the role that word2vec plays in the traditional BiLSTM+CRF architecture, obtaining more information with a deeper network and thereby improving the precision and recall of named entity recognition; however, that method suffers from slow parallel computation during training and poor transferability. The patent with publication number CN110807324A trains an entity recognition model with an IDCNN-CRF algorithm and verifies and outputs the model's predictions; however, that method has limited word representations and loses data features and detail information severely during encoding and decoding.
Some researchers have proposed a Chinese named entity recognition method based on BERT-IDCNN-CRF: the method obtains context-dependent word representations from a BERT pre-trained language model and then feeds the word-vector sequence into an IDCNN-CRF model for training. The whole model comprises a BERT layer, an IDCNN layer and a CRF layer, where the BERT layer produces context-dependent word vectors, the IDCNN layer extracts features, and the CRF layer prevents illegal tag sequences and outputs the maximum-probability tag sequence. The model has good timeliness, but it must combine three algorithms and its overall structure is complex. Furthermore, directly following the IDCNN layer with a CRF fails to account for correlations between different characters, which costs precision.
Disclosure of Invention
The invention aims to overcome the above defects and provides a method and a system for recognizing named entities oriented to judicial texts. The method combines pre-trained-model character embeddings with an iterated dilated convolutional neural network (IDCNN), which can effectively improve speed while preserving the model's perception of long texts; by adopting a convolutional encoder-decoder model, contextual convolutional feature associations can be effectively captured, alleviating the problems of overlapping entity names and illegal labels.
The invention achieves the aim through the following technical scheme: a method for recognizing named entities oriented to judicial texts comprises the following steps:
(1) data preprocessing: collecting judicial text data, performing cleaning and data standardization operation on the collected data, and performing text labeling on the data to obtain a training set, a cross validation set and a test set;
(2) character vector embedding: obtaining context-dependent character-level vectors of data of the preprocessed text sequence by using a BERT language model, performing character vector embedding calculation, and outputting character feature vectors;
(3) dilated convolution encoding: inputting the character feature vectors into the dilated convolutional neural network model, and extracting contextual data features with a character-level iterated dilated convolutional neural network to obtain a feature vector for each character;
(4) correlation normalization: calculating the correlation between the character feature vectors by using cosine distance and normalizing the character feature vectors to be used as a correlation weight, and performing weighting calculation to obtain a new character feature vector;
(5) dilated convolution decoding: combining the encoder output feature vectors with the sequence mask information as input, taking the corresponding entity-name prediction results as output, and performing model training to obtain the final entity recognition model.
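Taken together, the five steps form an end-to-end pipeline. The following sketch outlines that flow in Python; every stage function here is a hypothetical placeholder supplied by the caller, not the patent's actual implementation:

```python
# Illustrative pipeline skeleton for steps (1)-(5); all stage functions
# are hypothetical placeholders standing in for the patent's components.

def recognize_entities(raw_texts, bert_embed, idcnn_encode,
                       assoc_normalize, idcnn_decode):
    """Run the five-stage judicial NER pipeline over raw text."""
    cleaned = [t.strip() for t in raw_texts if t.strip()]   # (1) preprocessing
    char_vecs = [bert_embed(t) for t in cleaned]            # (2) character embedding
    features = [idcnn_encode(v) for v in char_vecs]         # (3) dilated-conv encoding
    weighted = [assoc_normalize(f) for f in features]       # (4) association normalization
    return [idcnn_decode(w) for w in weighted]              # (5) dilated-conv decoding
```

Passing identity functions for each stage shows the data simply flowing through the five steps in order.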
Preferably, the step (1) is specifically as follows:
(1.1) acquiring judicial data: acquiring a judgment-document data set as the data set;
(1.2) data cleaning: cleaning the data set, including near-duplicate text removal, low-quality text filtering and removal of texts with missing content, where deduplication uses Jaccard similarity;
(1.3) data annotation: labeling the data set with judicial-text named entities such as person names, place names, organization names, legal provisions, courts, judges, court clerks and other legal personnel; text labeling uses the annotation tool Doccano, and the data set is labeled with the BIO tag set to obtain training corpora, divided into a training set, a cross-validation set and a test set;
and (1.4) dividing the preprocessed data into a training set, a cross validation set and a test set according to proportions.
Preferably, step (1.4) divides the preprocessed data in the ratio 8:1:1.
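A minimal sketch of such an 8:1:1 split (assuming a simple shuffled partition; the patent does not specify the splitting procedure):

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and split labeled samples 8:1:1 into train/dev/test
    (a sketch of step (1.4); the exact procedure is an assumption)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)      # deterministic shuffle
    n = len(samples)
    n_train = n * 8 // 10
    n_dev = n // 10
    train = samples[:n_train]
    dev = samples[n_train:n_train + n_dev]
    test = samples[n_train + n_dev:]
    return train, dev, test
```

With 100 labeled sentences this yields 80/10/10 disjoint subsets covering the whole data set.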
Preferably, the BERT pre-trained language model in step (2) is obtained by fine-tuning Google's BERT model with a large amount of judicial text, comprising:
(2.1) obtaining a pre-trained BERT model of Google, preparing judicial text data and arranging the judicial text data into an available data format;
(2.2) using the prepared text data as input, where each input is represented by a word vector, a paragraph vector and a position vector, the three vectors being concatenated as the overall input of the BERT model;
(2.3) training the BERT model on the combined input vectors in masked-prediction mode to obtain the BERT pre-trained language model: the BERT model predicts the masked values and the prediction error is minimized;
(2.4) inputting a sentence, each character of which is converted by the BERT model into a vector representation that fuses full-text semantics.
Preferably, the masking in step (2.3) refers to masking 15% of the characters.
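The masking step can be sketched as follows; this is a simplified illustration of masking 15% of the characters, omitting BERT's additional rule of sometimes substituting random or unchanged tokens at selected positions:

```python
import random

def mask_characters(chars, mask_token="[MASK]", ratio=0.15, seed=0):
    """Randomly mask ~15% of the characters in a sequence, as in
    BERT-style masked-prediction pre-training (simplified sketch)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(chars) * ratio))       # at least one mask
    positions = rng.sample(range(len(chars)), n_mask)
    masked = [mask_token if i in positions else c
              for i, c in enumerate(chars)]
    return masked, sorted(positions)
```

The model is then trained to predict the original characters at the returned positions.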
Preferably, the dilated convolutional neural network model (IDCNN) in step (3) performs feature extraction on the input character feature vectors, with the following specific steps:
The vector sequence x_t is taken as the network input, where t (0 ≤ t < n) is the index of the vector and x_t is the word-vector representation after fusing full-text semantics; the output after dilated convolution is the dilation vector, expressed as the probability of each class label. Let D_δ^(j) denote the j-th dilated convolution layer with dilation width δ, with the dilation schedule δ = {1, 1, 2} and j ranging from 1 to 3. The first layer in the network is the dilated convolution D_1^(0), which convolves the input vector x_t to obtain the initial feature matrix i_t, with the formula:
i_t = D_1^(0) x_t    (1)
Preferably, in the iterated dilated convolution, L_c layers act on the feature matrix i_t with exponentially growing dilation widths, folding an ever-wider context into the embedded representation of each x_t; the dilation width grows exponentially with the layer number. Let r() denote the ReLU activation function. Starting from c_t^(0) = i_t, the repeatedly applied layer stack is defined as:
c_t^(j) = r(D_{δ_j}^(j) c_t^(j-1)),  j = 1, …, L_c    (2)
A final dilation-1 convolution layer is added to the stack:
c_t^(L_c+1) = r(D_1^(L_c+1) c_t^(L_c))    (3)
The stacking function is defined as B(·). To better fuse context without over-fitting and without introducing additional parameters, the dilation vectors are stacked by applying B iteratively, initialized as b_t^(1) = B(i_t); the expression is:
b_t^(k) = B(b_t^(k-1))    (4)
where k ≥ 1.
Preferably, to prevent overfitting, the resulting output of the dilation convolution module is applied to 4 stacked dilation vectors using a dropout (.) functionAnd (3) carrying out random inactivation to obtain a coding feature vector R, wherein the expression is as follows:
Preferably, in the association normalization method of step (4), the character feature vectors R = [r_1, …, r_n] output by the IDCNN are taken as input, the associations between characters are computed, and the new character feature vectors Z = [z_1, …, z_n] are obtained. The computational expression of Z is:
Z = WN · R
where WN is the association-normalized weight matrix, an n × n matrix whose entry (i, j) is the cosine similarity cos(r_i, r_j) normalized so that each row sums to 1; it captures the relevance between different characters, so that each new character vector z_i contains information from other characters at different distances.
Preferably, the dilated convolution decoding in step (5) concatenates the feature vectors with the mask-tag sequence, performs probability calculation on the concatenated feature vectors, and finally outputs the entity recognition result by maximizing the probability of a legal tag sequence; the specific steps are as follows:
(5.1) The mask-tag sequence is taken as the input of the IDCNN network in the decoder, whose output is a predicted feature sequence of the same length as the input, obtained by the same method as step (3); the decoder then combines the output sequence Z and the predicted feature sequence into the state sequence h = (h_0, h_1, …, h_{n-1});
(5.2) The hidden-state sequence (h_0, h_1, …, h_{n-1}) obtained in step (5.1) is passed through the fully connected layer, which maps the extracted features to the probability that the input character carries each entity label. The fully connected layer multiplies the weight matrix by the input vector and adds a bias, mapping n real numbers to K real numbers, one for each entity label; a ReLU nonlinear function is added between the network layers as the activation function, mapping the K real numbers to K scores in the range [0, +∞). The specific expression is as follows:
P = relu(z) = relu(W^T h + b)    (6)
where h is the input of the fully connected layer, W is the weight, b is the bias term, and P is the probability that the input character carries an entity label;
the probability of each named entity tag is then:
P_i = relu(z_i) = relu(W_i^T h + b_i)    (7)
where W_i is the feature weight under the i-th named-entity label.
Preferably, the model training in step (5) obtains the model as follows:
obtaining context-dependent character-level vectors of the data from the preprocessed text sequences using the BERT language model, and performing character vector embedding;
inputting the character-level vectors into the dilated convolution encoder-decoder model, training iteratively, and extracting data features until the loss value no longer decreases;
saving the finally obtained optimal entity recognition model, calculating its precision, recall and F-score, and verifying the model's performance through these evaluation indices; if the F-score is below the expected threshold, training is repeated after manual intervention, iterating until the F-score exceeds the expected threshold.
A system for judicial-text-oriented named entity recognition comprises: a data preprocessing module, a character vector embedding module, a dilated convolution feature encoding module, an association normalization module and a dilated convolution feature decoding module. The data preprocessing module acquires judicial text data, cleans and standardizes it into a judgment-document data set, labels the text, and feeds it to the character vector embedding module; the character vector embedding module obtains context-dependent character-level vectors of the data, performs character vector embedding calculation, and feeds the character feature vectors to the dilated convolution encoding module; the dilated convolution encoding module extracts data features with a character-level iterated dilated convolutional neural network and feeds the resulting per-character feature vectors to the association normalization module; the association normalization module computes the correlations between the IDCNN character vectors using cosine distance, derives new correlation-weighted character vectors, and feeds them to the dilated convolution feature decoding module; and the dilated convolution feature decoding module takes the feature vectors and the sequence mask information as input and the corresponding entity-name predictions as output, and performs iterative model training to obtain the final entity recognition model.
The invention has the following beneficial effects: compared with the prior art, it effectively improves speed while preserving the model's perception of long texts; it effectively captures contextual convolutional feature associations and alleviates the problems of overlapping entity names and illegal labels; and it effectively extracts contextual feature information, with fast training and a simple structure.
Drawings
FIG. 1 is a schematic structural diagram of a system for recognition of named entities oriented to judicial texts according to the present invention;
FIG. 2 is a schematic flow chart of a method for recognition of named entities oriented to judicial texts according to the present invention;
FIG. 3 is a schematic diagram of the operation of the convolutional network module of the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
Example 1: as shown in fig. 1, a system for recognizing named entities oriented to judicial texts is composed of a data preprocessing module, a character vector embedding module, a dilated convolution feature encoding module, an association normalization module and a dilated convolution feature decoding module. The data preprocessing module acquires judicial text data, cleans and standardizes it into a judgment-document data set, labels the text, and feeds it to the character vector embedding module; the character vector embedding module obtains context-dependent character-level vectors of the data, performs character vector embedding calculation, and feeds the character feature vectors to the dilated convolution encoding module; the dilated convolution encoding module extracts data features with a character-level iterated dilated convolutional neural network and feeds the resulting per-character feature vectors to the association normalization module; the association normalization module computes the correlations between the IDCNN character vectors using cosine distance, derives new correlation-weighted character vectors, and feeds them to the dilated convolution feature decoding module; and the dilated convolution feature decoding module takes the feature vectors and the sequence mask information as input and the corresponding entity-name predictions as output, and performs iterative model training to obtain the final entity recognition model.
As shown in fig. 2, judicial-text-oriented named entity recognition is implemented by the following steps:
(1) The data preprocessing module performs data preprocessing:
(1.1) acquiring judicial data: acquiring a judgment-document data set as the data set;
(1.2) data cleaning: cleaning the data set, including near-duplicate text removal, low-quality text filtering and removal of texts with missing content, where deduplication uses Jaccard similarity;
(1.3) data annotation: the data set is labeled with judicial-text named entities such as person names, place names, organization names, legal provisions, courts, judges, court clerks and other legal personnel. Text labeling uses the annotation tool Doccano, and the BIO tag set is adopted, where person names are tagged PER, place names LOC and organization names ORG. The position of each character is identified with the ternary tag set {B, I, O}, where "B" marks the first character of an entity, "I" the remaining characters of an entity, and "O" a non-entity. The data set is labeled by combining the entity category with the character position, giving the seven tag types shown in Table 1 below. After labeling, the training corpus is obtained and divided into a training set, a cross-validation set and a test set;
TABLE 1
Tag | Meaning
B-PER | first character of a person name
I-PER | subsequent character of a person name
B-LOC | first character of a place name
I-LOC | subsequent character of a place name
B-ORG | first character of an organization name
I-ORG | subsequent character of an organization name
O | non-entity character
(1.4) The preprocessed data are divided in the ratio 8:1:1 into three parts: the training set train.tsv, the validation set dev.tsv and the test set test.tsv. Each row in a file contains two elements, the character and its tag, and sentences are separated by blank lines.
For example, the sentence "The Wang Wu and Xuhui District civil affairs bureau dispute is examined as follows" is labeled as "[B-PER, I-PER, O, B-LOC, I-LOC, I-LOC, B-ORG, I-ORG, I-ORG, O, O, O, O, O]".
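The character-level BIO labeling illustrated above can be sketched as follows; `bio_tags` is a hypothetical helper that expands (text, entity-type) segments into per-character tags:

```python
def bio_tags(segments):
    """Expand (text, entity_type) segments into per-character BIO tags.
    entity_type None marks non-entity text (tag "O")."""
    tags = []
    for text, etype in segments:
        for i, _ in enumerate(text):
            if etype is None:
                tags.append("O")
            elif i == 0:
                tags.append("B-" + etype)   # entity-initial character
            else:
                tags.append("I-" + etype)   # entity-internal character
    return tags
```

Applied to the segmented example sentence, each entity's first character receives a B- tag and the rest I- tags.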
(2) The character vector embedding module performs character vector embedding: context-dependent character-level vectors of the preprocessed text sequences are obtained with the BERT language model, character vector embedding calculation is performed, and character feature vectors are output.
the BERT pre-training language model is obtained by refining the Google BERT model through a large amount of judicial texts, and comprises the following steps:
(2.1) text vectorization: the BERT pre-trained language model is obtained and the data set is arranged into a usable format; the input data are represented by word vectors, paragraph vectors and position vectors, and the three vectors are concatenated as the overall input of the BERT model, i.e., each character is converted by the BERT model into a 768-dimensional vector representation that fuses full-text semantics.
(2.2) The BERT model is trained on the combined input vectors in masked-prediction mode to obtain the BERT pre-trained language model: 15% of the characters are masked, the BERT model predicts the masked values, and the prediction error is minimized;
(2.3) character vector embedding: a sentence is input, and each character is converted by the BERT model into a vector representation that fuses full-text semantics. Each character yields a 768-dimensional vector; a batch is set to 64 sentences, the sequence length of each sentence is set to 40, and the input text sequence is [t_1, t_2, …, t_n] (0 < n ≤ 40). The final generated vector has dimensions 64 × 40 × 768.
(3) The dilated convolution feature encoding module performs dilated convolution encoding: the character feature vectors are input into the dilated convolutional neural network model, and contextual data features are extracted with a character-level iterated dilated convolutional neural network to obtain a feature vector for each character.
the expansion convolution neural network model (IDCNN) performs feature extraction on the input character feature vector, as shown in fig. 3, the lower half part of the expansion convolution neural network model is spliced together by 4 expansion convolution neural network blocks with the same size and structure, and each expansion convolution block has 3 layers of expansion convolution layers with expansion step length of {1, 1, 2 }. Wherein the maximum superposition step size in each block is 4 and the maximum filtering step size is 3. The lower half of the graph shows the convolution operations of 3 layers of superposition, respectively, wherein the first one is normal convolution, the expansion step size is 1, and the convolution kernel size is 3; continuously superposing the expansion convolution vectors, wherein the expansion step length is 2, and the perception visual field is increased to 7; the superposition of the dilated convolution vectors continues on top of the convolution operation of the previous step, with a dilation step of 4, where the convolution perception field is correspondingly enlarged to 15.
The method comprises the following specific steps:
The vector sequence x_t is taken as the network input, where t (0 ≤ t < n) is the index of the vector and x_t is the word-vector representation after fusing full-text semantics; the output after dilated convolution is the dilation vector, expressed as the probability of each class label. Let D_δ^(j) denote the j-th dilated convolution layer with dilation width δ, with the dilation schedule δ = {1, 1, 2} and j ranging from 1 to 3. The first layer in the network is the dilated convolution D_1^(0), which convolves the input vector x_t to obtain the initial feature matrix i_t, with the formula:
i_t = D_1^(0) x_t    (1)
In the iterated dilated convolution, L_c layers act on the feature matrix i_t with exponentially growing dilation widths, folding an ever-wider context into the embedded representation of each x_t; the dilation width grows exponentially with the layer number. Let r() denote the ReLU activation function. Starting from c_t^(0) = i_t, the repeatedly applied layer stack is defined as:
c_t^(j) = r(D_{δ_j}^(j) c_t^(j-1)),  j = 1, …, L_c    (2)
A final dilation-1 convolution layer is added to the stack:
c_t^(L_c+1) = r(D_1^(L_c+1) c_t^(L_c))    (3)
The stacking function is defined as B(·). To better fuse context without over-fitting and without introducing additional parameters, the dilation vectors are stacked by applying B iteratively, initialized as b_t^(1) = B(i_t); the expression is:
b_t^(k) = B(b_t^(k-1))    (4)
where k ≥ 1.
To prevent over-fitting, the dropout(·) function is applied to the output of the 4 stacked dilation vectors b_t^(4) of the dilated convolution module, randomly deactivating units to obtain the coding feature vector R, with the expression:
R = dropout(b_t^(4))    (5)
(4) The association normalization module performs association normalization: the correlations between character feature vectors are computed with cosine distance and normalized as association weights, and a weighted calculation gives new character feature vectors. The module computes new character-vector features with the vector dimensions unchanged. For example, in "the Wang Wu and Xuhui District civil affairs bureau dispute", the feature vector of a single character has 256 dimensions. The weight sequence for the character "王" (Wang) is [0.6, 0.3, 0.02, 0.02, 0.02, 0.02, 0.02, 0, 0, 0, 0], and the weights sum to 1. The weighted feature vector of "王" is then
0.6·r_1 + 0.3·r_2 + 0.02·r_3 + 0.02·r_4 + 0.02·r_5 + 0.02·r_6 + 0.02·r_7.
The association normalization method takes the character feature vectors R = [r_1, …, r_n] output by the IDCNN as input, computes the associations between characters, and obtains the new character feature vectors Z = [z_1, …, z_n]. The computational expression of Z is:
Z = WN · R
where WN is the association-normalized weight matrix, an n × n matrix whose entry (i, j) is the cosine similarity cos(r_i, r_j) normalized so that each row sums to 1; it captures the relevance between different characters, so that each new character vector z_i contains information from other characters at different distances.
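A minimal sketch of this association normalization, assuming the cosine-similarity weights are row-normalized to sum to 1 (consistent with the worked example, though the patent does not spell out the normalization) and that similarities are non-negative:

```python
import math

def associate_normalize(R):
    """Association normalization sketch for step (4): each new vector z_i
    is the cosine-similarity-weighted mixture of all character vectors,
    with each weight row normalized to sum to 1."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0
    n, dim = len(R), len(R[0])
    W = [[cos(R[i], R[j]) for j in range(n)] for i in range(n)]
    W = [[w / sum(row) for w in row] for row in W]          # rows sum to 1
    return [[sum(W[i][j] * R[j][k] for j in range(n)) for k in range(dim)]
            for i in range(n)]
```

The output keeps the input dimensions, as the embodiment requires; for identical input vectors the mixture leaves them unchanged.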
(5) The dilated convolution feature decoding module performs dilated convolution decoding: the encoder output feature vectors are combined with the sequence mask information as input, the corresponding entity-name prediction results are the output, and model training yields the final entity recognition model.
Dilated convolution decoding concatenates the feature vectors with the mask-tag sequence, performs probability calculation on the concatenated feature vectors, and finally outputs the entity recognition result by maximizing the probability of a legal tag sequence. For example, when predicting y_t, the mask-tag sequence is [pad, y_1, y_2, …, y_{t-1}, mask, …]: the entity-name types of the characters before position t are known, and the following positions are filled with mask. The mask-tag sequence is processed by the IDCNN to obtain the decoding feature corresponding to y_t, whose dimension is 256; after concatenation, a 2 × 256-dimensional decoding feature vector is obtained. The probability of the corresponding entity-name type is then predicted with the fully connected layer and the softmax layer.
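Constructing the mask-tag sequence for position t can be sketched as follows; the PAD/MASK token names are illustrative placeholders:

```python
def mask_label_sequence(known_labels, t, length, pad="PAD", mask="MASK"):
    """Build the decoder's mask-tag input for predicting position t:
    a PAD slot, the already-known labels y_1..y_{t-1}, then MASK fill
    (sketch of the [pad, y1, ..., y_{t-1}, mask, ...] sequence)."""
    seq = [pad] + list(known_labels[:t - 1])   # labels before position t
    seq += [mask] * (length - len(seq))        # unknown tail filled with MASK
    return seq
```

Predicting position 3 of a length-6 sentence thus exposes only the first two labels to the decoder.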
The method comprises the following specific steps:
(5.1) The mask-tag sequence is taken as the input of the IDCNN network in the decoder, whose output is a predicted feature sequence of the same length as the input, obtained by the same method as step (3); the decoder then combines the output sequence Z and the predicted feature sequence into the state sequence h = (h_0, h_1, …, h_{n-1});
(5.2) The hidden-state sequence (h_0, h_1, …, h_{n-1}) obtained in step (5.1) is passed through the fully connected layer, which maps the extracted features to the probability that the input character carries each entity label. The fully connected layer multiplies the weight matrix by the input vector and adds a bias, mapping n real numbers to K real numbers, one for each entity label; a ReLU nonlinear function is added between the network layers as the activation function, mapping the K real numbers to K scores in the range [0, +∞). The specific expression is as follows:
P = relu(z) = relu(W^T h + b) (6)
where h is the input of the fully connected layer, W is the weight, b is the bias term, and P is the probability that the input character carries an entity label;
the probability of each named entity tag is then:
P_i = relu(z_i) = relu(W_i^T h + b_i) (7)
where W_i is the feature weight under the i-th named-entity label.
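A minimal pure-Python sketch of equations (6) and (7): the fully connected layer computes z = W^T h + b and ReLU maps each score to a non-negative value, one per entity tag. All dimensions and numeric values here are illustrative assumptions, not taken from the patent.

```python
# Sketch of the fully connected + ReLU mapping in step (5.2).
# Dimensions and values are illustrative, not from the patent.
def relu(z):
    return [max(0.0, v) for v in z]

def fully_connected(h, W, b):
    # W: K x n weight matrix (one row per entity tag), b: K biases.
    z = [sum(w * x for w, x in zip(row, h)) + bias
         for row, bias in zip(W, b)]
    return relu(z)

h = [0.5, -1.0, 2.0]                      # decoder state for one character
W = [[0.2, 0.1, 0.3], [0.4, -0.2, 0.1]]   # 2 hypothetical tags x 3 features
b = [0.05, -0.1]
scores = fully_connected(h, W, b)
print(scores)  # ≈ [0.65, 0.5], two non-negative tag scores
```

In the full model these K scores would then be normalized by the softmax layer to obtain the entity-type probabilities.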
The method for carrying out model training to obtain the model comprises the following steps:
A BERT language model is used to obtain context-dependent character-level vectors from the preprocessed text sequences of the data set, and character-vector embedding is performed;
The character-level vectors are input into the expansion convolution encoding-decoding model and trained iteratively, extracting data features, until the loss value fails to decrease 3 consecutive times; the named entity recognition model is then saved, the accuracy, recall and F-value of the optimal entity model are calculated, and the performance of the model is verified through these evaluation indexes; if the F-value is below the expected threshold, the model is retrained after manual intervention, and the iteration is repeated until the F-value exceeds the threshold.
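The stopping rule above can be sketched as a patience-based loop: keep the best model seen so far and stop once the loss has failed to improve 3 consecutive times. This is a hedged reading of the patent's "loss value is not reduced for 3 times continuously"; the per-epoch losses below are simulated, not real training results.

```python
# Sketch of the early-stopping rule: stop when the loss has not
# improved for `patience` consecutive epochs; keep the best epoch.
def train_with_patience(losses, patience=3):
    """Return (index of the epoch whose model would be saved, its loss)."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # loss not reduced 3 consecutive times
                break
    return best_epoch, best_loss

# Simulated per-epoch losses: improvement stalls after epoch 3.
epochs = [2.0, 1.4, 1.1, 0.9, 0.95, 0.93, 0.91]
print(train_with_patience(epochs))  # -> (3, 0.9)
```

After stopping, the saved model would be evaluated (accuracy, recall, F-value) and, per the patent, retrained after manual intervention if the F-value falls below the expected threshold.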
TABLE 2

| Character | 赵 | 六 | 在 | 上 | 海 | 市 | 被 | 捕 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B-PER | 1.8 | 0.8 | 0.1 | 0.07 | 0.7 | 0.3 | 0.42 | 0.2 |
| I-PER | 1.3 | 1.7 | 0.7 | 0.6 | 0.8 | 1.1 | 0.15 | 0.5 |
| B-LOC | 0.13 | 0.5 | 0.2 | 1.5 | 1.3 | 0.6 | 0.08 | 0.15 |
| I-LOC | 0.02 | 0.6 | 0.03 | 0.4 | 1.6 | 1.9 | 0.1 | 0.02 |
| O | 0.06 | 0.8 | 1.8 | 0.5 | 0.07 | 0.8 | 1.3 | 1.6 |
| Prediction | B-PER | I-PER | O | B-LOC | I-LOC | I-LOC | O | O |

Table 2 scores each character of the example sentence 赵六在上海市被捕 ("Zhao Liu was arrested in Shanghai") against every tag; the prediction row takes the highest-scoring tag for each character.
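The prediction row of Table 2 follows from a simple greedy decode: for every character, pick the tag with the highest score. The sketch below copies the scores from the table and reproduces its prediction row.

```python
# Greedy (argmax) decoding of the Table 2 scores: for each character,
# select the entity tag with the highest score.
tags = ["B-PER", "I-PER", "B-LOC", "I-LOC", "O"]
scores = [
    [1.8, 0.8, 0.1, 0.07, 0.7, 0.3, 0.42, 0.2],   # B-PER
    [1.3, 1.7, 0.7, 0.6, 0.8, 1.1, 0.15, 0.5],    # I-PER
    [0.13, 0.5, 0.2, 1.5, 1.3, 0.6, 0.08, 0.15],  # B-LOC
    [0.02, 0.6, 0.03, 0.4, 1.6, 1.9, 0.1, 0.02],  # I-LOC
    [0.06, 0.8, 1.8, 0.5, 0.07, 0.8, 1.3, 1.6],   # O
]

# zip(*scores) yields one column (all tag scores) per character.
predicted = [tags[max(range(len(tags)), key=lambda i: column[i])]
             for column in zip(*scores)]
print(predicted)
# -> ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', 'O']
```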
In this embodiment, the size of the character set is 5412, the Adam gradient descent algorithm is used for optimization training, and the learning rate lr is 0.01. To prevent overfitting, the Dropout technique is introduced; repeated verification shows that the effect is best when Dropout is 0.5.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (12)
1. A method for recognizing named entities oriented to judicial texts is characterized by comprising the following steps:
(1) data preprocessing: collecting judicial text data, performing cleaning and data standardization operation on the collected data, and performing text labeling on the data to obtain a training set, a cross validation set and a test set;
(2) character vector embedding: obtaining context-dependent character-level vectors of data of the preprocessed text sequence by using a BERT language model, performing character vector embedding calculation, and outputting character feature vectors;
(3) expansion convolution encoding: inputting the character feature vectors into an expansion convolutional neural network model, and extracting context data features with a character-level iterated expansion convolutional neural network to obtain the feature vector of each character;
(4) correlation normalization: calculating the correlation between the character feature vectors by using cosine distance and normalizing the character feature vectors to be used as a correlation weight, and performing weighting calculation to obtain a new character feature vector;
(5) expansion convolution decoding: combining the encoder output feature vector with the sequence mask information as input, taking the corresponding entity-name prediction result as output, and performing model training to obtain the final entity recognition model.
2. The method for recognition of named entities oriented to judicial texts according to claim 1, wherein: the step (1) is specifically as follows:
(1.1) acquiring judicial data: acquiring a referee document data set as a data set;
(1.2) data cleaning: cleaning the data set, including near-duplicate text removal, low-quality text filtering and missing-text removal, wherein the text deduplication adopts Jaccard similarity calculation;
(1.3) data annotation: labeling the data set with judicial-text named entities, including person names, place names and organization names, legal provisions, courts, judges, court clerks and other legal personnel; the text labeling adopts the annotation tool Doccano, and the data set is labeled with the BIO tag set to obtain the training corpora, which are divided into a training set, a cross-validation set and a test set;
and (1.4) dividing the preprocessed data into a training set, a cross validation set and a test set according to proportions.
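The Jaccard deduplication of step (1.2) can be sketched as follows: two documents whose character-set Jaccard similarity exceeds a threshold are treated as near-duplicates and only one is kept. The 0.8 threshold and the character-level sets are assumptions for illustration; the patent does not fix them.

```python
# Sketch of Jaccard-similarity near-duplicate removal (step (1.2)).
# Threshold and character-level comparison are illustrative assumptions.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def deduplicate(docs, threshold=0.8):
    kept = []
    for doc in docs:
        # Keep a document only if it is not too similar to anything kept.
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = ["the court ruled", "the court ruled.", "a different judgment"]
kept_docs = deduplicate(docs)
print(kept_docs)  # -> ['the court ruled', 'a different judgment']
```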
3. The method for recognition of named entities oriented to judicial texts according to claim 2, wherein: in the step (1.4), the preprocessed data are divided according to a ratio of 8:1.
4. The method for recognition of named entities oriented to judicial texts according to claim 1, wherein: the BERT pre-training language model in the step (2) is obtained by finely tuning a Google BERT model through a large amount of judicial texts, and comprises the following steps:
(2.1) obtaining a pre-trained BERT model of Google, preparing judicial text data and arranging the judicial text data into an available data format;
(2.2) using the prepared text data as input, wherein the input data is characterized by word vectors, paragraph vectors and position vectors, and the three vectors are spliced to be used as the integral input of a BERT model;
(2.3) training a BERT model by adopting a mask prediction mode to the spliced input vector to obtain a BERT pre-training language model, then predicting a covering value by using the BERT model, and minimizing a prediction error;
and (2.4) inputting a sentence, wherein each character in the sentence is converted into a vector representation corresponding to each word after full text semantics are fused through a BERT model.
5. The method for recognition of named entities oriented to judicial texts according to claim 4, wherein: the masking in said step (2.3) means that 15% of the characters are masked.
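The masked pretraining of steps (2.3) and claim 5 can be sketched as follows: roughly 15% of the characters in a sequence are replaced by a mask token and the model is trained to predict the original values. The `[MASK]` token name and the rounding rule are assumptions borrowed from common BERT practice, not taken from the patent.

```python
# Sketch of masking ~15% of the characters for masked-LM pretraining.
# Token name and rounding rule are assumptions.
import random

def mask_characters(chars, mask_rate=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)                      # deterministic for the demo
    n_mask = max(1, round(len(chars) * mask_rate))
    positions = rng.sample(range(len(chars)), n_mask)
    masked = list(chars)                           # copy; input is untouched
    for p in positions:
        masked[p] = mask_token
    return masked, sorted(positions)

chars = list("plaintiff filed a claim")            # 23 characters
masked, positions = mask_characters(chars)
print(len(positions))  # 15% of 23 characters -> 3 masked positions
```

During pretraining, the model's prediction error on the original characters at the masked positions would be minimized.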
6. The method for recognition of named entities oriented to judicial texts according to claim 1, wherein: in the step (3), the expansion convolutional neural network model (IDCNN) performs feature extraction on the input character feature vectors; the specific steps are as follows:
The vector sequence x_t is taken as the network input, where the index t of the representation vector satisfies 0 ≤ t < n and x_t is defined as the word-vector representation after full-text semantics are fused; after the expansion convolution, expansion vectors are output, representing the probability of each class label. Let D_θ^(j) denote the convolution of the j-th layer with iterated expansion step size θ, where the iterated expansion steps are θ = {1, 1, 2} and j ranges from 1 to 3. The first layer in the network is the iterated convolution D_1^(0), which convolves the input vector x_t to obtain the initial feature matrix i_t; the formula is as follows:

i_t = D_1^(0) · x_t
7. The method for recognition of named entities oriented to judicial texts according to claim 6, wherein: the L_c layers in the iterated expansion convolution act on the feature matrix i_t with exponentially growing expansion step sizes; the L_c layers perform expansion convolution on the feature matrix i_t, folding a wider and wider range of context into the embedded representation of each x_t, wherein the expansion step size grows exponentially with the number of convolution layers. Let r(·) denote the ReLU activation function and let c_t^(0) = i_t initially; the repeatedly appearing layer stack is defined by the following expression:

c_t^(j) = r( D_(2^(j-1))^(j-1) · c_t^(j-1) )
A final iterated convolution layer is added to the stack, with the following expression:

c_t^(L_c+1) = r( D_1^(L_c) · c_t^(L_c) )
The stacking function is defined as B(·). To better fuse context without overfitting and without introducing additional parameters, the expansion vectors b_t^(k) are stacked, initialized as b_t^(1) = B(x_t); the expression is as follows:

b_t^(k) = B( b_t^(k-1) )
wherein k is greater than or equal to 1.
8. The method for recognition of named entities oriented to judicial texts according to claim 7, wherein: to prevent overfitting, the dropout(·) function is applied to the output of the expansion convolution module, randomly deactivating the 4 stacked expansion vectors b_t^(4) to obtain the coding feature vector R; the expression is as follows:

R = dropout( b_t^(4) )
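The iterated dilated convolutions of claims 6-8 can be sketched in pure Python under simplifying assumptions: scalar features per position, a single shared width-3 kernel, dilations {1, 1, 2} per block, and the block stacked 4 times. The real model operates on 256-dimensional feature vectors with learned per-layer filters; this sketch only shows the dilation/stacking structure.

```python
# Sketch of an iterated dilated CNN block (dilations 1, 1, 2) stacked
# 4 times, as in claims 6-8. Scalar features and a shared kernel are
# simplifying assumptions for illustration.
def dilated_conv(seq, kernel, dilation):
    """Width-3 dilated convolution with zero padding; keeps sequence length."""
    n = len(seq)
    out = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):       # taps at t-d, t, t+d
            idx = t + (k - 1) * dilation
            if 0 <= idx < n:
                acc += w * seq[idx]
        out.append(max(0.0, acc))            # ReLU activation r(.)
    return out

def idcnn_block(seq, kernel, dilations=(1, 1, 2)):
    for d in dilations:
        seq = dilated_conv(seq, kernel, d)
    return seq

x = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0]
kernel = [0.5, 1.0, 0.5]
out = x
for _ in range(4):                           # 4 stacked blocks, as in claim 8
    out = idcnn_block(out, kernel)
print(len(out))  # same length as the input sequence
```

Because the dilation grows inside each block, every output position aggregates an increasingly wide context without adding parameters per block, which is the point of the stacking function B(·).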
9. The method for recognition of named entities oriented to judicial texts according to claim 1, wherein: in the association normalization method in step (4), the character feature vectors R = [r_1, …, r_n] obtained from the IDCNN are input, the association between characters is calculated, and new character feature vectors Z = [z_1, …, z_n] are obtained; the computational expression of Z is:
Z=WN·R
in the formula, cos denotes the cosine distance and WN is the association normalization weight matrix, an n × n matrix computed from the cosine distances between character vectors; it captures the relevance between different characters, so that each new character vector z_i contains information from other characters at different distances.
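One plausible reading of claim 9, sketched in pure Python: pairwise cosine similarities between character vectors are arranged into a matrix, each row is normalized to sum to 1 (the normalization scheme is an assumption, as the claim does not spell it out), and Z = WN · R mixes each character's vector with those of related characters.

```python
# Sketch of association normalization (claim 9): Z = WN . R, where WN
# holds row-normalized pairwise cosine similarities. Row normalization
# by the similarity sum is an assumption.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def associate(R):
    n = len(R)
    sims = [[cosine(R[i], R[j]) for j in range(n)] for i in range(n)]
    WN = [[s / sum(row) for s in row] for row in sims]  # normalize each row
    # Z = WN . R: each new vector is a similarity-weighted mixture.
    return [[sum(WN[i][k] * R[k][d] for k in range(n))
             for d in range(len(R[0]))] for i in range(n)]

R = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # 3 toy character vectors
Z = associate(R)
print(len(Z), len(Z[0]))  # same shape as R: 3 2
```

Because each z_i is a weighted mixture of all r_k, characters with similar feature vectors contribute more to each other's new representation, which matches the claim's statement that z contains information from other characters at different distances.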
10. The method for recognition of named entities oriented to judicial texts according to claim 1, wherein: the expansion convolution decoding in the step (5) means that the feature vectors and the mask label sequence are concatenated, probability calculation is performed on the concatenated feature vector, and the entity recognition result is finally output by maximizing the probability of the label sequence; the specific steps are as follows:
(5.1) The mask label sequence is taken as the input of the IDCNN network in the decoder, whose output is a predicted feature sequence of the same length as the input sequence, obtained in the same way as in step (3); the decoder then concatenates the output sequence Z and the predicted feature sequence to compose the state sequence h = (h_0, h_1, …, h_(n-1));
(5.2) The hidden state sequence (h_0, h_1, …, h_(n-1)) obtained in step (5.1) is mapped through the fully connected layer to the probability that each input character carries an entity label; the fully connected layer multiplies the weight matrix by the input vector and adds a bias, mapping n real numbers to K real numbers, K being the number of entity labels; a ReLU nonlinear function is added between the layers of the network as the activation function, mapping the K real numbers to K probabilities in the range (0, +∞); the specific expression is as follows:
P = relu(z) = relu(W^T h + b) (6)
where h is the input of the fully connected layer, W is the weight, b is the bias term, and P is the probability that the input character carries an entity label;
the probability of each named entity tag is then:
P_i = relu(z_i) = relu(W_i^T h + b_i) (7)
where W_i is the feature weight under the i-th named-entity label.
11. The method for recognition of named entities oriented to judicial texts according to claim 1, wherein: the model training in step (5) to obtain the model comprises the following steps: using the BERT language model to obtain context-dependent character-level vectors from the text sequences preprocessed from the data set, and performing character-vector embedding;
inputting the character-level vectors into the expansion convolution encoding-decoding model, performing iterative training, and extracting data features until the loss value no longer decreases;
saving the finally obtained optimal entity recognition model, calculating the accuracy, recall and F-value of the optimal entity model, and verifying the performance of the model through these evaluation indexes; if the F-value is lower than the expected threshold, retraining after manual intervention, and repeating the iteration until the F-value is higher than the expected threshold.
12. A system for judicial-text-oriented named entity recognition applying the method of claim 1, comprising: a data preprocessing module, a character vector embedding module, an expansion convolution feature encoding module, an association normalization module and an expansion convolution feature decoding module; the data preprocessing module is used for acquiring judicial text data, cleaning and standardizing the acquired data to form a judgment document data set, and inputting the data set into the character vector embedding module after text labeling; the character vector embedding module is used for acquiring context-dependent character-level vectors of the data, performing character-vector embedding calculation, and inputting the character feature vectors into the expansion convolution encoding module; the expansion convolution encoding module extracts data features with a character-level iterated expansion convolutional neural network and inputs the obtained feature vector of each character into the association normalization module; the association normalization module calculates the association between the IDCNN character vectors using the cosine distance, computes new association-weighted character vectors, and inputs the weighted character feature vectors into the expansion convolution feature decoding module; and the expansion convolution feature decoding module takes the feature vectors and the sequence mask information as input and the corresponding entity-name prediction results as output, and performs iterative model training to obtain the final entity recognition model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111157229.2A CN113869053A (en) | 2021-09-30 | 2021-09-30 | Method and system for recognizing named entities oriented to judicial texts |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113869053A true CN113869053A (en) | 2021-12-31 |
Family
ID=79000815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111157229.2A Pending CN113869053A (en) | 2021-09-30 | 2021-09-30 | Method and system for recognizing named entities oriented to judicial texts |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113869053A (en) |
- 2021-09-30 CN CN202111157229.2A patent/CN113869053A/en active Pending
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114757184A (en) * | 2022-04-11 | 2022-07-15 | 中国航空综合技术研究所 | Method and system for realizing knowledge question answering in aviation field |
CN114757184B (en) * | 2022-04-11 | 2023-11-10 | 中国航空综合技术研究所 | Method and system for realizing knowledge question and answer in aviation field |
CN115859983A (en) * | 2022-12-14 | 2023-03-28 | 成都信息工程大学 | Fine-grained Chinese named entity recognition method |
CN115859983B (en) * | 2022-12-14 | 2023-08-25 | 成都信息工程大学 | Fine-granularity Chinese named entity recognition method |
CN116756596A (en) * | 2023-08-17 | 2023-09-15 | 智慧眼科技股份有限公司 | Text clustering model training method, text clustering device and related equipment |
CN116756596B (en) * | 2023-08-17 | 2023-11-14 | 智慧眼科技股份有限公司 | Text clustering model training method, text clustering device and related equipment |
CN116821286A (en) * | 2023-08-23 | 2023-09-29 | 北京宝隆泓瑞科技有限公司 | Correlation rule analysis method and system for gas pipeline accidents |
CN116842957A (en) * | 2023-08-28 | 2023-10-03 | 佰墨思(成都)数字技术有限公司 | Dual-channel neural network for extracting biological math entity and feature recognition method |
CN117111540A (en) * | 2023-10-25 | 2023-11-24 | 南京德克威尔自动化有限公司 | Environment monitoring and early warning method and system for IO remote control bus module |
CN117111540B (en) * | 2023-10-25 | 2023-12-29 | 南京德克威尔自动化有限公司 | Environment monitoring and early warning method and system for IO remote control bus module |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||