CN113869053A - Method and system for recognizing named entities oriented to judicial texts - Google Patents

Method and system for recognizing named entities oriented to judicial texts

Info

Publication number
CN113869053A
Authority
CN
China
Prior art keywords
character
data
vector
model
text
Prior art date
Legal status
Pending
Application number
CN202111157229.2A
Other languages
Chinese (zh)
Inventor
陈晓亮
武敏
张绩晨
Current Assignee
Shanghai Enjoyor Smart Intelligent Technology Co ltd
Original Assignee
Shanghai Enjoyor Smart Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Enjoyor Smart Intelligent Technology Co ltd
Priority to CN202111157229.2A
Publication of CN113869053A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention relates to a method and a system for recognizing named entities oriented to judicial texts. The invention effectively improves processing speed while preserving the model's perception of long texts; it effectively captures the associations among contextual convolution features, resolving the problems of overlapping entity names and illegal label sequences; and it effectively extracts contextual feature information, with fast training and a simple structure.

Description

Method and system for recognizing named entities oriented to judicial texts
Technical Field
The invention relates to the field of data recognition and processing, and in particular to a method and a system for recognizing named entities oriented to judicial texts.
Background
As informatization, big data, natural language processing, artificial intelligence, and related technologies are applied ever more deeply in the judicial field, building intelligent justice and intelligent judicial platforms has become a research hotspot. In recent years, domestic intelligent-justice work has produced a series of achievements in streamlining case-handling workflows and improving case-handling efficiency, such as the 12309 integrated service platform and court informatization systems.
However, existing intelligent-justice technologies have obvious limitations, and the product experience leaves considerable room for improvement. To date, China Judgments Online has published nearly 60 million judgment documents. Given the large volume and wide variety of judicial documents, automatic processing of judicial text information has become a key task for intelligent justice. Automatic extraction of judicial text information matters for solving or alleviating the problem of many cases and few personnel in the judicial system and for improving judicial efficiency, and it is an important means of upholding social fairness and justice.
Entity recognition is an important technology in automatic information extraction, and it is foundational, necessary work for intelligent-justice applications such as knowledge search in the judicial field, intelligent adjudication and sentencing, and automatic document generation. Research on judicial-text entity recognition is therefore particularly important for further advancing intelligent judicial platforms. Named Entity Recognition (NER), also called "proper name recognition", identifies entities with specific meaning in text, mainly person names, place names, organization names, and proper nouns. In short, it identifies the boundaries and categories of entity mentions in natural text.
At present, named entity recognition methods fall into rule- and dictionary-based methods, unsupervised methods that cluster contexts using lexical resources (such as WordNet), hybrid methods that combine several models, and deep-learning-based methods. Deep-learning-based methods perform best. They treat named entity recognition as a sequence labeling task; the classical architectures are LSTM+CRF and BiLSTM+CRF. However, these methods still need optimization: their word representations are limited, they capture local semantic information insufficiently, and they parallelize poorly during training. The BERT model achieved state-of-the-art performance across a range of natural language tasks when it was proposed, and much subsequent research improves on it. For example, the patent with publication number CN109992782A uses a BERT pre-trained model in the role that word2vec plays in the traditional BiLSTM+CRF architecture, obtaining more information from a deeper network and thereby improving the precision and recall of named entity recognition; however, that method parallelizes slowly during training and transfers poorly. The patent with publication number CN110807324A trains an entity recognition model with an IDCNN-CRF algorithm and verifies and outputs the model's predictions; however, its word representations are limited, and encoding and decoding lose substantial feature and detail information. Some researchers have proposed a Chinese named entity recognition method based on BERT-IDCNN-CRF: a BERT pre-trained language model produces contextual word representations, and the word-vector sequence is fed into an IDCNN-CRF model for training. The model comprises a BERT layer, an IDCNN layer, and a CRF layer: the BERT layer produces context-dependent word vectors, the IDCNN layer extracts features, and the CRF layer prevents illegal tag sequences and selects the most probable tag sequence. The model is efficient, but it must combine three algorithms and its overall structure is complex. Furthermore, following the IDCNN layer directly with a CRF ignores correlations between different characters, which costs precision.
Disclosure of Invention
The invention aims to overcome these defects by providing a method and a system for recognizing named entities oriented to judicial texts. The method combines pre-trained-model character embeddings with an iterated dilated convolutional neural network (IDCNN), which effectively improves speed while preserving the model's perception of long texts; a convolutional encoder-decoder model effectively captures the associations among contextual convolution features, resolving the problems of overlapping entity names and illegal labels.
The invention achieves this aim through the following technical scheme: a method for recognizing named entities oriented to judicial texts, comprising the following steps:
(1) Data preprocessing: collect judicial text data, clean and standardize the collected data, and label the text to obtain a training set, a cross-validation set, and a test set;
(2) Character vector embedding: use a BERT language model to obtain context-dependent character-level vectors for the preprocessed text sequences, perform character-embedding computation, and output character feature vectors;
(3) Dilated convolution encoding: feed the character feature vectors into a dilated convolutional neural network model, and extract contextual data features with a character-level iterated dilated convolutional neural network to obtain a feature vector for each character;
(4) Association normalization: compute the correlation between character feature vectors using cosine distance, normalize it into association weights, and compute weighted sums to obtain new character feature vectors;
(5) Dilated convolution decoding: combine the encoder output feature vectors with the sequence mask information as input, take the corresponding entity-name prediction results as output, and train the model to obtain the final entity recognition model.
Preferably, step (1) is specifically as follows (a sketch of this preprocessing appears after the list):
(1.1) Acquire judicial data: obtain a judgment document data set as the data set;
(1.2) Data cleaning: clean the data set, including similar-text deduplication, low-quality-text filtering, and removal of texts with missing content, where text deduplication uses Jaccard similarity;
(1.3) Data annotation: label the data set with judicial named entities: person names, place names, organization names, legal provisions, courts, judges, court clerks, and other legal personnel. Text labeling uses the annotation tool Doccano, and the data set uses the BIO tagging scheme; the resulting training corpus is divided into a training set, a cross-validation set, and a test set;
(1.4) Divide the preprocessed data proportionally into a training set, a cross-validation set, and a test set.
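As an illustration only, the deduplication of step (1.2) and the split of step (1.4) might look as follows in Python; the 0.8 similarity threshold and the function names are assumptions, not part of the patent:

```python
import random

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the character sets of two texts."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def deduplicate(texts, threshold=0.8):
    """Drop a text when it is too similar to one already kept.
    Quadratic pairwise scan: fine for a sketch, not for millions of documents."""
    kept = []
    for t in texts:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

def split_8_1_1(samples, seed=42):
    """Shuffle and split into train / cross-validation / test at 8:1:1."""
    random.Random(seed).shuffle(samples)
    n_train, n_dev = int(0.8 * len(samples)), int(0.1 * len(samples))
    return (samples[:n_train],
            samples[n_train:n_train + n_dev],
            samples[n_train + n_dev:])
```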
Preferably, in step (1.4) the preprocessed data are divided in the ratio 8:1:1.
Preferably, the BERT pre-trained language model in step (2) is obtained by fine-tuning the Google BERT model on a large amount of judicial text, comprising:
(2.1) Obtain Google's pre-trained BERT model, and prepare the judicial text data, arranging it into a usable data format;
(2.2) Use the prepared text data as input: the input data are characterized by word vectors, paragraph vectors, and position vectors, and the three vectors are spliced together as the overall input of the BERT model;
(2.3) Train the BERT model on the spliced input vectors in masked-prediction mode to obtain the BERT pre-trained language model; the BERT model then predicts the masked values, and the prediction error is minimized;
(2.4) Input a sentence; through the BERT model, each character in the sentence is converted into a vector representation that fuses the semantics of the full text.
Preferably, the masking in step (2.3) refers to masking 15% of the characters.
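For orientation, a minimal sketch of this masked-prediction fine-tuning, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (both are assumptions; the patent does not name a toolkit), with the 15% masking rate of step (2.3):

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# Masks 15% of tokens, the rate quoted in step (2.3), and builds the MLM labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

corpus = ["某判决书正文的一句话", "另一份裁判文书的一句话"]   # placeholder judicial sentences
encodings = [tokenizer(s, truncation=True, max_length=40) for s in corpus]
batch = collator(encodings)                # inserts [MASK]s, pads, returns tensors

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # illustrative learning rate
loss = model(**batch).loss                 # cross-entropy on masked positions only
loss.backward()
optimizer.step()
```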
Preferably, the dilated convolutional neural network model (IDCNN) in step (3) extracts features from the input character feature vectors, with the following specific steps:
Take the vector sequence $x_t$ as the network input, where the sequence index satisfies $0 \le t < n$ and $x_t$ is defined as the word-vector representation after fusing full-text semantics; the output is the dilation vector produced by dilated convolution, expressed as the probability of each class label. Let $D_\theta^{(j)}$ denote the $j$-th convolutional layer with dilation width $\theta$, where the iterated dilation schedule is $\theta = \{1, 1, 2\}$ and $j$ ranges from 1 to 3. The first layer in the network is the convolution $D_1^{(0)}$, which convolves the input vector $x_t$ to obtain the initial feature matrix $i_t$, with the formula:
$i_t = D_1^{(0)} x_t$    (1)
Preferably, the iterated dilated convolution contains $L_c$ layers that act on the feature matrix $i_t$ with exponentially growing dilation widths: within the $L_c$ layers, the feature matrix $i_t$ undergoes dilated convolutions that fold an ever-wider context into each embedded representation $x_t$, the dilation width growing exponentially with the layer number. Let $r(\cdot)$ denote the ReLU activation function. Starting from $c_t^{(0)} = i_t$, the repeatedly applied layer stack is defined as:
$c_t^{(j)} = r(D_{2^{j-1}}^{(j-1)} c_t^{(j-1)})$    (2)
and a final iterated convolution layer is added to the stack:
$c_t^{(L_c+1)} = r(D_1^{(L_c)} c_t^{(L_c)})$    (3)
Define the stacking function as $B(\cdot)$. To fuse broader context without overfitting and without introducing additional parameters, the dilation vectors $b_t^{(k)}$ are stacked iteratively, initialized with $b_t^{(0)} = i_t$:
$b_t^{(k)} = B(b_t^{(k-1)})$, where $k \ge 1$    (4)
Preferably, to prevent overfitting, the resulting output of the dilation convolution module is applied to 4 stacked dilation vectors using a dropout (.) function
Figure BDA0003289089050000039
And (3) carrying out random inactivation to obtain a coding feature vector R, wherein the expression is as follows:
Figure BDA00032890890500000310
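A minimal PyTorch sketch of this encoder, assuming the {1, 1, 2} dilation schedule, kernel size 3, and 4 iterations with shared block weights (the layer width and the weight-sharing choice are assumptions where the text is silent):

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """One stack B(.): three 1-D convolutions with dilation widths {1, 1, 2},
    kernel size 3 and ReLU, padded so the sequence length is preserved."""
    def __init__(self, dim: int):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)
            for d in (1, 1, 2)
        ])

    def forward(self, x):                 # x: (batch, dim, seq_len)
        for conv in self.convs:
            x = torch.relu(conv(x))
        return x

class IDCNNEncoder(nn.Module):
    """Iterate the block 4 times, b^(k) = B(b^(k-1)), sharing its weights
    across iterations, then apply dropout as in equation (5)."""
    def __init__(self, dim: int, n_iter: int = 4, p_drop: float = 0.5):
        super().__init__()
        self.block, self.n_iter = DilatedBlock(dim), n_iter
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x):                 # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)             # Conv1d expects (batch, dim, seq_len)
        for _ in range(self.n_iter):
            h = self.block(h)
        return self.dropout(h.transpose(1, 2))
```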
Preferably, the association normalization in step (4) takes as input the character feature vectors $R = [r_1, \ldots, r_n]$ produced by the IDCNN, computes the association between characters, and obtains new character feature vectors $Z = [z_1, \ldots, z_n]$; $Z$ is computed as:
$Z = W_N \cdot R$
$W_N[i][j] = \cos(r_i, r_j) / \sum_{k=1}^{n} \cos(r_i, r_k)$
$\cos(r_i, r_j) = (r_i \cdot r_j) / (\lVert r_i \rVert \, \lVert r_j \rVert)$
where $\cos$ is the cosine similarity and $W_N$ is the association-normalized weight, an $n \times n$ matrix; it captures the relevance between different characters, so that each new character vector in $Z$ incorporates information from other characters at various distances.
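As a sketch, the step can be written in NumPy; the row-sum normalization of non-negative cosine similarities is an assumption that matches the weights-sum-to-1 example given later in the description:

```python
import numpy as np

def associate_normalize(R: np.ndarray) -> np.ndarray:
    """R: (n, d) matrix of character features from the IDCNN encoder.
    Builds the n x n weight matrix W_N from pairwise cosine similarity,
    row-normalizes it, and returns Z = W_N @ R. Assumes non-negative
    similarities (true for post-ReLU features), so each row sums to 1."""
    norms = np.linalg.norm(R, axis=1, keepdims=True) + 1e-12
    cos = (R @ R.T) / (norms * norms.T)          # cos[i, j] = cos(r_i, r_j)
    W_N = cos / cos.sum(axis=1, keepdims=True)   # row-wise normalization
    return W_N @ R
```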
Preferably, the expanding convolution decoding in the step (5) is to splice the feature vectors and the covering and labeling sequences, perform probability calculation on the spliced feature vectors, and finally output an entity identification result by utilizing the probability maximization of the canonical sequence; the method comprises the following specific steps:
(5.1) taking the covering and labeling sequence as the input of the IDCNN network in the decoder, and outputting the covering and labeling sequence as a prediction characteristic sequence with the same length as the input sequence, wherein the method is the same as the step (3); then, the decoder outputs the sequence Z and the predicted feature sequence to form a state sequence h ═ (h)0,h1,…,hn-1);
(5.2) the sequence of hidden states (h) resulting from step (5.1)0,h1,…,hn-1) Mapping the extracted features to determine whether the input characters are entity labels or not through the full connection layerRate; the full-connection layer multiplies the weight matrix and the input vector, adds offset, maps n real numbers into K real numbers with the same number of corresponding entity labels, simultaneously adds a Relu nonlinear function between each layer of the network as an excitation function, and maps the K real numbers into K probabilities with the range of (0, infinity); the specific expression is as follows:
P=relu(z)=relu(WTh+b) (6)
wherein h is input of a full connection layer, W is weight, b is a bias item, and P is the probability of whether an input character is an entity label;
the probability of each named entity tag is then:
Pi=relu(zi)=relu(Wi Th+bi) (7)
wherein, WiIdentifying feature weights under the label for the ith named entity.
Preferably, the model training in step (5) proceeds as follows:
use the BERT language model to obtain context-dependent character-level vectors from the text sequences of the preprocessed data set, and embed the character vectors;
feed the character-level vectors into the dilated convolutional encoder-decoder model, train iteratively, and extract the data features until the loss value no longer decreases;
save the best entity recognition model finally obtained, compute the precision, recall, and F value of the best entity model, and verify the model's performance through these evaluation metrics; if the F value falls below the expected threshold, retrain after manual intervention, and iterate until the F value exceeds the threshold. (A sketch of this evaluation follows.)
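The precision/recall/F check could be sketched as follows, assuming the seqeval library for entity-level scoring (the library choice and the 0.9 threshold are illustrative assumptions):

```python
from seqeval.metrics import f1_score, precision_score, recall_score

def evaluate(gold, pred, threshold=0.9):
    """gold / pred: lists of BIO tag sequences, e.g. [["B-PER", "I-PER", "O"], ...].
    Returns True when the F value reaches the (assumed) expected threshold;
    otherwise the data and labels are corrected manually and training repeats."""
    p = precision_score(gold, pred)
    r = recall_score(gold, pred)
    f = f1_score(gold, pred)
    print(f"P={p:.3f}  R={r:.3f}  F={f:.3f}")
    return f >= threshold
```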
A system for judicial-text-oriented named entity recognition, comprising: a data preprocessing module, a character vector embedding module, a dilated convolution feature encoding module, an association normalization module, and a dilated convolution feature decoding module. The data preprocessing module acquires judicial text data, cleans and standardizes the acquired data into a judgment document data set, labels the text, and feeds the data set into the character vector embedding module. The character vector embedding module obtains context-dependent character-level vectors of the data, performs character-embedding computation, and feeds the character feature vectors into the dilated convolution encoding module. The dilated convolution encoding module extracts data features with a character-level iterated dilated convolutional neural network and feeds the resulting feature vector of each character into the association normalization module. The association normalization module computes the correlations between the IDCNN character vectors using cosine distance, computes new association-weighted character vectors, and feeds the weighted character feature vectors into the dilated convolution feature decoding module. The dilated convolution feature decoding module takes the feature vectors and the sequence mask information as input and the corresponding entity-name prediction results as output, and trains the model to obtain the final entity recognition model.
The invention has the following beneficial effects: compared with the prior art, the method effectively improves speed while preserving the model's perception of long texts; it effectively captures the associations among contextual convolution features and resolves the problems of overlapping entity names and illegal labels; and it effectively extracts contextual feature information, trains quickly, and has a simple structure.
Drawings
FIG. 1 is a schematic structural diagram of a system for recognition of named entities oriented to judicial texts according to the present invention;
FIG. 2 is a schematic flow chart of a method for recognition of named entities oriented to judicial texts according to the present invention;
FIG. 3 is a schematic diagram of the operation of the convolutional network module of the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
Example 1: as shown in FIG. 1, a system for judicial-text-oriented named entity recognition is composed of a data preprocessing module, a character vector embedding module, a dilated convolution feature encoding module, an association normalization module, and a dilated convolution feature decoding module. The data preprocessing module acquires judicial text data, cleans and standardizes the acquired data into a judgment document data set, labels the text, and feeds the data set into the character vector embedding module. The character vector embedding module obtains context-dependent character-level vectors of the data, performs character-embedding computation, and feeds the character feature vectors into the dilated convolution encoding module. The dilated convolution encoding module extracts data features with a character-level iterated dilated convolutional neural network and feeds the resulting feature vector of each character into the association normalization module. The association normalization module computes the correlations between the IDCNN character vectors using cosine distance, computes new association-weighted character vectors, and feeds them into the dilated convolution feature decoding module. The dilated convolution feature decoding module takes the feature vectors and the sequence mask information as input and the corresponding entity-name prediction results as output, and trains the model to obtain the final entity recognition model.
As shown in FIG. 2, judicial-text-oriented named entity recognition is implemented through the following steps:
(1) Data preprocessing by the data preprocessing module:
(1.1) Acquire judicial data: obtain a judgment document data set as the data set;
(1.2) Data cleaning: clean the data set, including similar-text deduplication, low-quality-text filtering, and removal of texts with missing content, where text deduplication uses Jaccard similarity;
(1.3) Data annotation: label the data set with judicial named entities: person names, place names, organization names, legal provisions, courts, judges, court clerks, and other legal personnel. Text labeling uses the annotation tool Doccano, and the data set uses the BIO tagging scheme, in which person names are tagged PER, place names LOC, and organization names ORG. The position of each character is identified with the ternary tag set {B, I, O}, where "B" marks the first character of an entity, "I" the remaining characters of an entity, and "O" a non-entity character. The data set is labeled by combining the entity category with the character position, giving the seven label types shown in Table 1 below. After labeling, the training corpus is obtained and divided into a training set, a cross-validation set, and a test set;
Table 1

Label    Meaning
B-PER    first character of a person name
I-PER    non-initial character of a person name
B-LOC    first character of a place name
I-LOC    non-initial character of a place name
B-ORG    first character of an organization name
I-ORG    non-initial character of an organization name
O        non-entity character
(1.4) Divide the preprocessed data in the ratio 8:1:1 into three parts: the training set train.tsv, the validation set dev.tsv, and the test set test.tsv. Each row in a file contains two elements, a character and its label, and sentences are separated from one another by a blank line.
For example, the sentence "Wang Wu and the Xuhui District Civil Affairs Bureau dispute review reads as follows" (14 Chinese characters) is labeled "[B-PER, I-PER, O, B-LOC, I-LOC, I-LOC, B-ORG, I-ORG, I-ORG, O, O, O, O, O]". A conversion sketch follows.
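A minimal helper for producing such BIO sequences from character spans; the span boundaries below are hypothetical, chosen to match the example's tag layout:

```python
def to_bio(n_chars: int, entities):
    """entities: list of (start, end, type) character spans, end exclusive.
    Returns one BIO tag per character position."""
    tags = ["O"] * n_chars
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

# Hypothetical spans for the 14-character example above:
# person at chars 0-1, place at 3-5, organization at 6-8.
print(to_bio(14, [(0, 2, "PER"), (3, 6, "LOC"), (6, 9, "ORG")]))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'I-LOC',
#  'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O']
```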
(2) Character vector embedding by the character vector embedding module: use the BERT language model to obtain context-dependent character-level vectors for the preprocessed text sequences, perform character-embedding computation, and output character feature vectors.
The BERT pre-trained language model is obtained by fine-tuning the Google BERT model on a large amount of judicial text, as follows:
(2.1) Text vectorization: obtain the BERT pre-trained language model, and arrange the data set into a usable data format. The input data are represented by word vectors, paragraph vectors, and position vectors, and the three vectors are spliced together as the overall input of the BERT model; that is, through the BERT model each character is converted into a vector representation that fuses full-text semantics, with a character-vector length of 768 dimensions.
(2.2) Train the BERT model on the spliced input vectors in masked-prediction mode to obtain the BERT pre-trained language model: mask 15% of the characters, predict the masked values with the BERT model, and minimize the prediction error;
(2.3) Character vector embedding: input a sentence; through the BERT model, each character in the sentence is converted into a vector representation that fuses full-text semantics. Each sentence yields 768-dimensional vectors; a batch is set to 64 sentences, and the sequence length of each sentence is set to 40, so the input text sequence is $[t_1, t_2, \ldots, t_n]$ ($0 < n \le 40$). The final generated tensor has dimensions 64 × 40 × 768.
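These shapes can be reproduced with a quick sketch, again assuming the transformers library and the bert-base-chinese checkpoint (whose hidden size is 768):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

batch = ["本院经审理查明的一句话"] * 64      # illustrative batch of 64 sentences
enc = tokenizer(batch, padding="max_length", truncation=True,
                max_length=40, return_tensors="pt")
with torch.no_grad():
    char_vectors = bert(**enc).last_hidden_state   # context-fused character vectors
print(char_vectors.shape)                          # torch.Size([64, 40, 768])
```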
(3) Dilated convolution encoding by the dilated convolution feature encoding module: feed the character feature vectors into the dilated convolutional neural network model, and extract contextual data features with a character-level iterated dilated convolutional neural network to obtain a feature vector for each character.
The dilated convolutional neural network model (IDCNN) extracts features from the input character feature vectors. As shown in FIG. 3, its lower half is built from 4 dilated-convolution blocks of identical size and structure spliced together, and each block has 3 dilated convolution layers with dilation widths {1, 1, 2}; within each block the maximum superposed dilation width is 4 and the maximum filter width is 3. The lower half of the figure shows the three superposed convolution operations: the first is an ordinary convolution with dilation width 1 and kernel size 3; the next dilated convolution is superposed with dilation width 2, enlarging the receptive field to 7; superposition then continues on the previous convolution with dilation width 4, enlarging the receptive field correspondingly to 15. (A receptive-field check appears below.)
The specific steps are as follows:
Take the vector sequence $x_t$ as the network input, where the sequence index satisfies $0 \le t < n$ and $x_t$ is defined as the word-vector representation after fusing full-text semantics; the output is the dilation vector produced by dilated convolution, expressed as the probability of each class label. Let $D_\theta^{(j)}$ denote the $j$-th convolutional layer with dilation width $\theta$, where the iterated dilation schedule is $\theta = \{1, 1, 2\}$ and $j$ ranges from 1 to 3. The first layer in the network is the convolution $D_1^{(0)}$, which convolves the input vector $x_t$ to obtain the initial feature matrix $i_t$, with the formula:
$i_t = D_1^{(0)} x_t$    (1)
Preferably, the iterated dilated convolution contains $L_c$ layers that act on the feature matrix $i_t$ with exponentially growing dilation widths: within the $L_c$ layers, the feature matrix $i_t$ undergoes dilated convolutions that fold an ever-wider context into each embedded representation $x_t$, the dilation width growing exponentially with the layer number. Let $r(\cdot)$ denote the ReLU activation function. Starting from $c_t^{(0)} = i_t$, the repeatedly applied layer stack is defined as:
$c_t^{(j)} = r(D_{2^{j-1}}^{(j-1)} c_t^{(j-1)})$    (2)
and a final iterated convolution layer is added to the stack:
$c_t^{(L_c+1)} = r(D_1^{(L_c)} c_t^{(L_c)})$    (3)
Define the stacking function as $B(\cdot)$. To fuse broader context without overfitting and without introducing additional parameters, the dilation vectors $b_t^{(k)}$ are stacked iteratively, initialized with $b_t^{(0)} = i_t$:
$b_t^{(k)} = B(b_t^{(k-1)})$, where $k \ge 1$    (4)
To prevent overfitting, the final output of the dilated convolution module applies the dropout$(\cdot)$ function to the dilation vector obtained after 4 stacked iterations, randomly deactivating units to obtain the encoded feature vector $R$:
$R = \mathrm{dropout}(b_t^{(4)})$    (5)
(4) Association normalization by the association normalization module: compute the correlation between character feature vectors using cosine distance, normalize it into association weights, and compute weighted sums to obtain new character feature vectors. This component computes new character-vector features while leaving the vector dimensions unchanged. For example, in "the dispute between Wang Wu and the Xuhui District Civil Affairs Bureau", the feature vector of a single character has 256 dimensions. The weight sequence for the character "Wang" is [0.6, 0.3, 0.02, 0.02, 0.02, 0.02, 0.02, 0, 0, 0, 0], and the weights sum to 1. The weighted feature vector of the character "Wang" is then
$z_1 = 0.6 r_1 + 0.3 r_2 + 0.02 r_3 + 0.02 r_4 + 0.02 r_5 + 0.02 r_6 + 0.02 r_7$
The association normalization takes as input the character feature vectors $R = [r_1, \ldots, r_n]$ produced by the IDCNN, computes the association between characters, and obtains new character feature vectors $Z = [z_1, \ldots, z_n]$; $Z$ is computed as:
$Z = W_N \cdot R$
$W_N[i][j] = \cos(r_i, r_j) / \sum_{k=1}^{n} \cos(r_i, r_k)$
$\cos(r_i, r_j) = (r_i \cdot r_j) / (\lVert r_i \rVert \, \lVert r_j \rVert)$
where $\cos$ is the cosine similarity and $W_N$ is the association-normalized weight, an $n \times n$ matrix; it captures the relevance between different characters, so that each new character vector in $Z$ incorporates information from other characters at various distances.
(5) Dilated convolution decoding by the dilated convolution feature decoding module: combine the encoder output feature vectors with the sequence mask information as input, take the corresponding entity-name prediction results as output, and train the model to obtain the final entity recognition model.
Dilated convolution decoding splices the feature vectors with the masked label sequence, performs a probability computation on the spliced feature vectors, and finally outputs the entity recognition result by maximizing the probability of a well-formed label sequence. For example, when predicting $y_t$, the masked label sequence is $[\mathrm{pad}, y_1, y_2, \ldots, y_{t-1}, \mathrm{mask}, \ldots]$; that is, the entity-name types of the characters before position $t$ are known, and the later positions are filled with mask. The masked label sequence is processed by the IDCNN to obtain the decoding feature corresponding to $y_t$, whose dimension is 256; after splicing, a 2 × 256 = 512-dimensional decoded feature vector is obtained. The probability of the corresponding entity-name type is then predicted using the fully connected layer and the softmax layer.
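A minimal sketch of this decoder step, reusing the IDCNNEncoder sketch from the encoding section; the label-embedding scheme and the single linear-plus-softmax head are assumptions where the text leaves details open:

```python
import torch
import torch.nn as nn

class MaskedDecoder(nn.Module):
    """Embed the masked label sequence, encode it with the same IDCNN
    structure, concatenate with the encoder feature z_t (256 + 256 = 512
    dims), and predict the label probabilities at position t."""
    def __init__(self, n_labels: int, dim: int = 256):
        super().__init__()
        # label vocabulary: the K entity tags plus special <pad> and <mask> ids
        self.label_emb = nn.Embedding(n_labels + 2, dim)
        self.idcnn = IDCNNEncoder(dim)        # sketch from the encoding section
        self.fc = nn.Linear(2 * dim, n_labels)

    def forward(self, z, masked_labels, t):
        # z: (batch, seq_len, dim) encoder output; masked_labels: (batch, seq_len)
        pred_feats = self.idcnn(self.label_emb(masked_labels))
        fused = torch.cat([z[:, t], pred_feats[:, t]], dim=-1)   # (batch, 512)
        return torch.softmax(self.fc(fused), dim=-1)             # label probabilities
```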
The specific steps are as follows:
(5.1) Take the masked label sequence as the input of the IDCNN network in the decoder, and output a predicted feature sequence of the same length as the input sequence, using the same method as step (3); then splice the decoder output sequence $Z$ with the predicted feature sequence to form the state sequence $h = (h_0, h_1, \ldots, h_{n-1})$;
(5.2) Map the hidden state sequence $(h_0, h_1, \ldots, h_{n-1})$ from step (5.1) through the fully connected layer to the probability that each input character carries an entity label. The fully connected layer multiplies the weight matrix with the input vector and adds a bias, mapping $n$ real numbers to $K$ real numbers, one per entity label; a ReLU nonlinearity is added between the network layers as the activation function, mapping the $K$ real numbers to $K$ non-negative scores in $[0, +\infty)$. The specific expression is:
$P = \mathrm{relu}(z) = \mathrm{relu}(W^T h + b)$    (6)
where $h$ is the input of the fully connected layer, $W$ the weight, $b$ the bias term, and $P$ the score for an input character carrying an entity label;
The score of each named entity label is then:
$P_i = \mathrm{relu}(z_i) = \mathrm{relu}(W_i^T h + b_i)$    (7)
where $W_i$ is the feature weight under the $i$-th named entity label.
The model is trained to obtain the final model as follows:
use the BERT language model to obtain context-dependent character-level vectors from the text sequences of the preprocessed data set, and embed the character vectors;
feed the character-level vectors into the dilated convolutional encoder-decoder model, train iteratively, and extract the data features until the loss value fails to decrease 3 times in a row; then save the named entity recognition model, compute the precision, recall, and F value of the best entity model, and verify the model's performance through these evaluation metrics; if the F value falls below the expected threshold, retrain after manual intervention, and iterate until the F value exceeds the threshold.
Table 2 shows, for an example eight-character sentence (赵六在上海市被抓, roughly "Zhao Liu was arrested in Shanghai City"), the score of each label for each character and the predicted tags:

Table 2

Label      赵      六      在      上      海      市      被      抓
B-PER      1.8     0.8     0.1     0.07    0.7     0.3     0.42    0.2
I-PER      1.3     1.7     0.7     0.6     0.8     1.1     0.15    0.5
B-LOC      0.13    0.5     0.2     1.5     1.3     0.6     0.08    0.15
I-LOC      0.02    0.6     0.03    0.4     1.6     1.9     0.1     0.02
O          0.06    0.8     1.8     0.5     0.07    0.8     1.3     1.6
Predicted  B-PER   I-PER   O       B-LOC   I-LOC   I-LOC   O       O
In this embodiment, the character-set size is 5412, the Adam gradient-descent algorithm is used for optimization training, and the learning rate is lr = 0.01. To prevent overfitting, the Dropout technique is introduced; repeated validation shows the best effect at Dropout = 0.5. (A configuration sketch follows.)
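Putting these hyperparameters together with the early-stopping rule above, a training-loop sketch; train_one_epoch is a hypothetical helper and the epoch cap is an assumption:

```python
import torch

model = MaskedDecoder(n_labels=7)        # the decoder sketch above; 7 BIO labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)   # Adam, lr = 0.01

best_loss, stale = float("inf"), 0
for epoch in range(100):                 # illustrative epoch cap
    loss = train_one_epoch(model, optimizer)   # hypothetical helper
    if loss < best_loss:
        best_loss, stale = loss, 0
        torch.save(model.state_dict(), "ner_best.pt")
    else:
        stale += 1
        if stale == 3:                   # loss failed to decrease 3 times in a row
            break
```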
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A method for recognizing named entities oriented to judicial texts, characterized by comprising the following steps:
(1) Data preprocessing: collect judicial text data, clean and standardize the collected data, and label the text to obtain a training set, a cross-validation set, and a test set;
(2) Character vector embedding: use a BERT language model to obtain context-dependent character-level vectors for the preprocessed text sequences, perform character-embedding computation, and output character feature vectors;
(3) Dilated convolution encoding: feed the character feature vectors into a dilated convolutional neural network model, and extract contextual data features with a character-level iterated dilated convolutional neural network to obtain a feature vector for each character;
(4) Association normalization: compute the correlation between character feature vectors using cosine distance, normalize it into association weights, and compute weighted sums to obtain new character feature vectors;
(5) Dilated convolution decoding: combine the encoder output feature vectors with the sequence mask information as input, take the corresponding entity-name prediction results as output, and train the model to obtain the final entity recognition model.
2. The method for recognizing named entities oriented to judicial texts according to claim 1, wherein step (1) is specifically:
(1.1) Acquire judicial data: obtain a judgment document data set as the data set;
(1.2) Data cleaning: clean the data set, including similar-text deduplication, low-quality-text filtering, and removal of texts with missing content, where text deduplication uses Jaccard similarity;
(1.3) Data annotation: label the data set with judicial named entities: person names, place names, organization names, legal provisions, courts, judges, court clerks, and other legal personnel; text labeling uses the annotation tool Doccano, and the data set uses the BIO tagging scheme, yielding a training corpus divided into a training set, a cross-validation set, and a test set;
(1.4) Divide the preprocessed data proportionally into a training set, a cross-validation set, and a test set.
3. The method for recognizing named entities oriented to judicial texts according to claim 2, wherein in step (1.4) the preprocessed data are divided in the ratio 8:1:1.
4. The method for recognizing named entities oriented to judicial texts according to claim 1, wherein the BERT pre-trained language model in step (2) is obtained by fine-tuning the Google BERT model on a large amount of judicial text, comprising:
(2.1) obtaining Google's pre-trained BERT model, and preparing the judicial text data, arranging it into a usable data format;
(2.2) using the prepared text data as input, the input data being characterized by word vectors, paragraph vectors, and position vectors, the three vectors being spliced together as the overall input of the BERT model;
(2.3) training the BERT model on the spliced input vectors in masked-prediction mode to obtain the BERT pre-trained language model, then predicting the masked values with the BERT model and minimizing the prediction error;
(2.4) inputting a sentence, each character in the sentence being converted, through the BERT model, into a vector representation that fuses the semantics of the full text.
5. The method for recognition of named entities oriented to judicial texts according to claim 4, wherein: the masking in said step (2.3) means that 15% of the characters are masked.
6. The method for recognizing named entities oriented to judicial texts according to claim 1, wherein the dilated convolutional neural network model (IDCNN) in step (3) extracts features from the input character feature vectors, specifically:
take the vector sequence $x_t$ as the network input, where the sequence index satisfies $0 \le t < n$ and $x_t$ is defined as the word-vector representation after fusing full-text semantics; the output is the dilation vector produced by dilated convolution, expressed as the probability of each class label; let $D_\theta^{(j)}$ denote the $j$-th convolutional layer with dilation width $\theta$, where the iterated dilation schedule is $\theta = \{1, 1, 2\}$ and $j$ ranges from 1 to 3; the first layer in the network is the convolution $D_1^{(0)}$, which convolves the input vector $x_t$ to obtain the initial feature matrix $i_t$, with the formula:
$i_t = D_1^{(0)} x_t$    (1)
7. The method for recognizing named entities oriented to judicial texts according to claim 6, wherein the iterated dilated convolution contains $L_c$ layers that act on the feature matrix $i_t$ with exponentially growing dilation widths: within the $L_c$ layers, the feature matrix $i_t$ undergoes dilated convolutions that fold an ever-wider context into each embedded representation $x_t$, the dilation width growing exponentially with the layer number; let $r(\cdot)$ denote the ReLU activation function, so that, starting from $c_t^{(0)} = i_t$, the repeatedly applied layer stack is defined as:
$c_t^{(j)} = r(D_{2^{j-1}}^{(j-1)} c_t^{(j-1)})$    (2)
and a final iterated convolution layer is added to the stack:
$c_t^{(L_c+1)} = r(D_1^{(L_c)} c_t^{(L_c)})$    (3)
defining the stacking function as $B(\cdot)$; to fuse broader context without overfitting and without introducing additional parameters, the dilation vectors $b_t^{(k)}$ are stacked iteratively, initialized with $b_t^{(0)} = i_t$:
$b_t^{(k)} = B(b_t^{(k-1)})$, where $k \ge 1$    (4)
8. The method for recognizing named entities oriented to judicial texts according to claim 7, wherein, to prevent overfitting, the final output of the dilated convolution module applies the dropout$(\cdot)$ function to the dilation vector obtained after 4 stacked iterations, randomly deactivating units to obtain the encoded feature vector $R$:
$R = \mathrm{dropout}(b_t^{(4)})$    (5)
9. The method for recognizing named entities oriented to judicial texts according to claim 1, wherein the association normalization in step (4) takes as input the character feature vectors $R = [r_1, \ldots, r_n]$ produced by the IDCNN, computes the association between characters, and obtains new character feature vectors $Z = [z_1, \ldots, z_n]$; $Z$ is computed as:
$Z = W_N \cdot R$
$W_N[i][j] = \cos(r_i, r_j) / \sum_{k=1}^{n} \cos(r_i, r_k)$
$\cos(r_i, r_j) = (r_i \cdot r_j) / (\lVert r_i \rVert \, \lVert r_j \rVert)$
where $\cos$ is the cosine similarity and $W_N$ is the association-normalized weight, an $n \times n$ matrix; it captures the relevance between different characters, so that each new character vector in $Z$ incorporates information from other characters at various distances.
10. The method for recognizing named entities oriented to judicial texts according to claim 1, wherein the dilated convolution decoding in step (5) splices the feature vectors with the masked label sequence, performs a probability computation on the spliced feature vectors, and finally outputs the entity recognition result by maximizing the probability of a well-formed label sequence; specifically:
(5.1) take the masked label sequence as the input of the IDCNN network in the decoder, and output a predicted feature sequence of the same length as the input sequence, using the same method as step (3); then splice the decoder output sequence $Z$ with the predicted feature sequence to form the state sequence $h = (h_0, h_1, \ldots, h_{n-1})$;
(5.2) map the hidden state sequence $(h_0, h_1, \ldots, h_{n-1})$ from step (5.1) through the fully connected layer to the probability that each input character carries an entity label; the fully connected layer multiplies the weight matrix with the input vector and adds a bias, mapping $n$ real numbers to $K$ real numbers, one per entity label; a ReLU nonlinearity is added between the network layers as the activation function, mapping the $K$ real numbers to $K$ non-negative scores in $[0, +\infty)$; the specific expression is:
$P = \mathrm{relu}(z) = \mathrm{relu}(W^T h + b)$    (6)
where $h$ is the input of the fully connected layer, $W$ the weight, $b$ the bias term, and $P$ the score for an input character carrying an entity label;
the score of each named entity label is then:
$P_i = \mathrm{relu}(z_i) = \mathrm{relu}(W_i^T h + b_i)$    (7)
where $W_i$ is the feature weight under the $i$-th named entity label.
11. The method for recognizing named entities oriented to judicial texts according to claim 1, wherein the model training in step (5) proceeds as follows: use the BERT language model to obtain context-dependent character-level vectors from the text sequences of the preprocessed data set, and embed the character vectors;
feed the character-level vectors into the dilated convolutional encoder-decoder model, train iteratively, and extract the data features until the loss value no longer decreases;
save the best entity recognition model finally obtained, compute the precision, recall, and F value of the best entity model, and verify the model's performance through these evaluation metrics; if the F value falls below the expected threshold, retrain after manual intervention, and iterate until the F value exceeds the threshold.
12. A system for judicial-text-oriented named entity recognition applying the method of claim 1, characterized by comprising: a data preprocessing module, a character vector embedding module, a dilated convolution feature encoding module, an association normalization module, and a dilated convolution feature decoding module; the data preprocessing module acquires judicial text data, cleans and standardizes the acquired data into a judgment document data set, labels the text, and feeds the data set into the character vector embedding module; the character vector embedding module obtains context-dependent character-level vectors of the data, performs character-embedding computation, and feeds the character feature vectors into the dilated convolution encoding module; the dilated convolution encoding module extracts data features with a character-level iterated dilated convolutional neural network and feeds the resulting feature vector of each character into the association normalization module; the association normalization module computes the correlations between the IDCNN character vectors using cosine distance, computes new association-weighted character vectors, and feeds the weighted character feature vectors into the dilated convolution feature decoding module; and the dilated convolution feature decoding module takes the feature vectors and the sequence mask information as input and the corresponding entity-name prediction results as output, and trains the model to obtain the final entity recognition model.
CN202111157229.2A 2021-09-30 2021-09-30 Method and system for recognizing named entities oriented to judicial texts Pending CN113869053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111157229.2A CN113869053A (en) 2021-09-30 2021-09-30 Method and system for recognizing named entities oriented to judicial texts


Publications (1)

Publication Number Publication Date
CN113869053A true CN113869053A (en) 2021-12-31

Family

ID=79000815


Country Status (1)

Country Link
CN (1) CN113869053A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757184A (en) * 2022-04-11 2022-07-15 中国航空综合技术研究所 Method and system for realizing knowledge question answering in aviation field
CN114757184B (en) * 2022-04-11 2023-11-10 中国航空综合技术研究所 Method and system for realizing knowledge question and answer in aviation field
CN115859983A (en) * 2022-12-14 2023-03-28 成都信息工程大学 Fine-grained Chinese named entity recognition method
CN115859983B (en) * 2022-12-14 2023-08-25 成都信息工程大学 Fine-granularity Chinese named entity recognition method
CN116756596A (en) * 2023-08-17 2023-09-15 智慧眼科技股份有限公司 Text clustering model training method, text clustering device and related equipment
CN116756596B (en) * 2023-08-17 2023-11-14 智慧眼科技股份有限公司 Text clustering model training method, text clustering device and related equipment
CN116821286A (en) * 2023-08-23 2023-09-29 北京宝隆泓瑞科技有限公司 Correlation rule analysis method and system for gas pipeline accidents
CN116842957A (en) * 2023-08-28 2023-10-03 佰墨思(成都)数字技术有限公司 Dual-channel neural network for extracting biological math entity and feature recognition method
CN117111540A (en) * 2023-10-25 2023-11-24 南京德克威尔自动化有限公司 Environment monitoring and early warning method and system for IO remote control bus module
CN117111540B (en) * 2023-10-25 2023-12-29 南京德克威尔自动化有限公司 Environment monitoring and early warning method and system for IO remote control bus module


Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination