CN112231472A

CN112231472A - Judicial public opinion sensitive information identification method integrated with domain term dictionary

Info

Publication number: CN112231472A
Application number: CN202010984681.5A
Authority: CN
Inventors: 余正涛; 张泽锋; 黄于欣; 郭军军; 相艳; 高盛祥
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2021-01-15
Anticipated expiration: 2040-09-18
Also published as: CN112231472B

Abstract

The invention relates to a judicial public opinion sensitive information identification method integrated into a domain term dictionary. The method comprises the steps of firstly, respectively coding public opinion texts and domain term dictionaries by using a bidirectional circulation neural network and a multi-head attention mechanism, and extracting significant features; secondly, using a domain term dictionary as classified guide knowledge, and constructing a similar matrix with public sentiment texts to obtain text representations fused into the domain term dictionary; then, global and local features are further extracted by utilizing a multi-head attention mechanism and a convolutional neural network, and finally sensitive information classification is realized. The invention fuses the domain word dictionary and the judicial public opinion context information, the skill utilizes the context information to make up the poor representation effect of the traditional method in the context information, and the domain knowledge can be utilized to enhance the semantic feature representation of the words related to the judicial information in the text, thereby improving the performance of the judicial public opinion sensitive information identification.

Description

Judicial public opinion sensitive information identification method integrated with domain term dictionary

Technical Field

The invention relates to a judicial public opinion sensitive information identification method integrated into a domain term dictionary, belonging to the technical field of natural language processing.

Background

In the social network, users can express their own opinions anytime and anywhere, wherein a great deal of misunderstandings and one-sided opinions are made for judicial department to judge related work, and the social network opinion has the characteristics of rapid propagation, high sensitivity, easy initiation of network public opinions and the like. In order to better assist the judicial department in carrying out work, it is important to quickly and accurately identify sensitive information related to judicial activities from massive public sentiment news.

For the identification of sensitive information in the judicial field, the identification cannot be regarded as a simple binary task, whether the sensitive information relates to the judicial field and whether the sensitive information is sensitive or not needs to be considered at the same time, the sensitive information and the insensitive information can appear, and some sensitive information but does not relate to the judicial field. Therefore, the method converts the judicial sensitive information identification task into a four-classification task, and needs to identify sensitivity and field.

The judicial public opinion text has the problems of irregular description, more redundant information and the like, so that effective representation of the judicial public opinion text is difficult to carry out, sensitive information relating to the judicial field comprises phrases which cause text sensitivity, the phrases belong to sensitive special vocabularies of the judicial field, and the phrases play a leading role in identifying the judicial sensitive information, but the phrases cannot appear in a sensitive term dictionary of the general field, so that the sensitive information of the judicial field cannot be effectively identified by directly carrying out word matching work. In order to obtain better representation, the model can learn the expression related to judicial sensitive information, a domain sensitive term dictionary is constructed, and the term dictionary is used as external guidance to be merged into a deep learning framework, so that effective feature enhancement can be performed.

Disclosure of Invention

In order to solve the problems, the invention constructs a domain term dictionary, utilizes the domain term dictionary to guide a model to learn domain characteristics, and provides a judicial public opinion sensitive information recognition model integrated into the domain term dictionary aiming at the text description characteristics of the judicial public opinion to classify the judicial public opinion sensitive information.

The technical scheme of the invention is as follows: a method for identifying judicial public opinion sensitive information fused into a domain term dictionary, the method comprising:

constructing a judicial sensitive information recognition model fused into a domain term dictionary to recognize sensitive information; the judicial sensitive information recognition model integrated into the domain term dictionary comprises a coding layer, a domain term dictionary integration layer, a local feature extraction layer and a classification layer;

public sentiment texts and domain term dictionaries are coded and feature attention is paid through a coding layer;

calculating similarity between the domain term dictionary and the public opinion text through a domain term dictionary integration layer and integrating the similarity into text representation;

extracting important features on the basis of a domain term dictionary integration layer through a local feature extraction layer;

and predicting the class probability of the extracted important features through a classification layer.

As a further scheme of the invention, before constructing a judicial sensitive information recognition model integrated into a domain term dictionary, the judicial public opinion data is crawled, and data preprocessing is carried out according to the classification of the judicial public opinion sensitive information, and the specific steps are as follows:

step1.1, crawling public opinion texts, and forming a plurality of public opinion texts after manual screening and labeling;

step1.2, constructing a domain term dictionary, wherein the domain term dictionary comprises judicial domain vocabularies and sensitive vocabularies, the judicial domain vocabularies are constructed by a referee document network and a Chinese court network, and the sensitive vocabularies comprise two parts: (1) manually constructing according to the characteristics of judicial public opinion data, (2) screening open Chinese sensitive vocabularies, wherein the vocabularies consist of characters, words and phrases;

step1.3, pre-training judicial sensitive word vectors by utilizing a dog searching news data set, a judicial public opinion sensitive information data set, a domain term dictionary and a word2vec algorithm to serve as judicial sensitive prior knowledge of a judicial sensitive information recognition model.

As a further scheme of the present invention, the specific steps of constructing the judicial sensitive information recognition model merged into the domain term dictionary are as follows:

step2.1, inputting word-embedded matrixes of public opinion text and domain term dictionary

And

step2.2, because the previous vector representation does not consider the context semantic features, the public opinion text vector representation

Inputting a context-aware coding mechanism; the Bi-directional long-short term memory neural network Bi-LSTM is used as an embedding mechanism for understanding context information, the characteristic interaction between words is simulated, the output in two directions is simply spliced, the output H of the network layer is obtained, and each column of vectors represents the representation of the public opinion description context;

wherein Bi-LSTM represents vector representation after passing through a bidirectional circulation neural network, and D _ H and W _ H are public opinion texts and vector representations after being coded by a domain term dictionary respectively;

step2.3, the calculation of the weights for the context tokens H here using a multi-head attention mechanism:

multiHead(Q,K,V)＝concat(head₁,…,head_h)W^O

wherehead_i＝att(QW_i ^Q,KW_i ^k,VW_i ^V)(4)

wherein softmax is a normalization operation, connect represents a splicing operation, wherein

Step2.4, in order to prevent the original text semantics from being lost, residual error connection is carried out on the output result:

A^h＝residualConnect(D_M^d,D_H) (5)

K^h＝residualConnect(W_M^d,W_H) (6)

where residulconnect denotes residual connection, D _ M^d，W_M^dRespectively representing the output results of the public sentiment text and the domain dictionary through a multi-head attention mechanism, A^h，K^hRespectively representing the results of the public opinion texts after residual connection;

step2.5, characterization K of general domain term dictionary^hRepresentation with public sentiment text A^hCalculating a similarity matrix:

wherein S is_ikRepresentation term dictionary representation K^hThe ith field word and text feature A in^hThe similarity between the kth hidden vectors of (1),

representing the corresponding dictionary ith domain word token vector,

is represented by A^hThe k column vector of (1), sim represents the calculation

And

the calculation process of the trainable function of the similarity between the two functions is as follows:

wherein

Is the weight vector to be trained and,

representing elements multiplied in sequence, (;) representing vectors spliced in lines, K and K^hCorresponds to the column vector of, a is a^hThe column vectors of (1) correspond;

step2.6, mixing S_ikAfter normalization, the words and phrases are embedded into a matrix

Multiplying to obtain a correlation matrix with weight information

Finally, the similar matrix is spliced with the original text to obtain the text representation merged into the dictionary information

Where softmax is a normalization function, [; splicing operation;

step2.7, representation of text that has been merged into dictionary information

Performing convolution operation, performing feature extraction on the public sentiment content information and dictionary information, and then performing max-posing operation, wherein the process is as follows:

wherein k represents an output channel of the CNN network;

step2.8, mixing

Performing multi-head attention operation to obtain a feature matrix with weight information;

step2.9, in order to obtain the text classification probability distribution in the classification layer, O obtained in the local feature extraction layer^kAfter using normalized softmax, it is mapped to the classification space as follows:

P(D)＝softmax(O^k) (13)

the invention has the beneficial effects that: the method fuses the domain word dictionary and the judicial public opinion context information, the skill utilizes the context information to make up the poor representation effect of the traditional method in the context information, and can also utilize the domain knowledge to enhance the semantic feature representation of words related to the judicial information in the text, thereby improving the performance of judicial public opinion sensitive information identification;

experimental results show that the method provided by the invention is superior to a baseline system in the performance of indexes such as accuracy, recall rate, macro-average F1 value and micro-average F1 value.

Drawings

FIG. 1 is a schematic diagram of model construction in the present invention;

FIG. 2 is a flow chart of the present invention.

Detailed Description

Example 1: as shown in fig. 1-2, the judicial public opinion sensitive information recognition method integrated with the domain term dictionary firstly uses a bidirectional circulation neural network and a multi-head attention mechanism to respectively encode the public opinion text and the domain term dictionary and extract the significant features; secondly, using a domain term dictionary as classified guide knowledge, and constructing a similar matrix with public sentiment texts to obtain text representations fused into the domain term dictionary; then, further extracting global and local features by utilizing a multi-head attention mechanism and a convolutional neural network, and finally realizing sensitive information classification;

the method comprises the following specific steps:

step1, crawling judicial public opinion data and carrying out data preprocessing according to the classification of judicial public opinion sensitive information;

step1.1, crawling websites such as a Xinlang microblog and a github from 3/1/2020 to 6/1/2020, and manually screening and labeling to form 2 ten thousand public opinion texts;

step1.3, pre-training judicial sensitive word vectors as models by using a dog searching news data set (about 500M), a judicial public opinion sensitive information data set, a domain term dictionary and a word2vec algorithm;

And

A context-aware encoding mechanism is entered. The method comprises the steps that Bi-LSTM (bidirectional long-short term memory neural network) is used as an embedding mechanism for understanding context information, feature interaction between words is simulated, output in two directions is simply spliced, output H of a network layer is obtained, and each column of vectors represents representation of public opinion description context;

wherein Bi-LSTM represents vector representation after passing through a bidirectional cyclic neural network, and D _ H and W _ H are public opinion text and vector representation after being coded by a domain term dictionary respectively.

multiHead(Q,K,V)＝concat(head₁,…,head_h)W^O

wherehead_i＝att(QW_i ^Q,KW_i ^k,VW_i ^V) (4)

A^h＝residualConnect(D_M^d,D_H) (5)

K^h＝residualConnect(W_M^d,W_H) (6)

where residulconnect denotes residual connection, D _ M^d，W_M^dRespectively representing the output results of the public sentiment text and the domain dictionary through a multi-head attention mechanism, A^h，K^hRespectively representing the results of the public opinion texts after residual connection.

representing the corresponding dictionary ith domain word token vector,

is represented by A^hThe k column vector of (1), sim represents the calculation

And

wherein

Is the weight vector to be trained and,

representing elements multiplied in sequence, (;) representing vectors spliced in lines, K and K^hCorresponds to the column vector of, a is a^hThe column vectors of (a) correspond.

Multiplying to obtain a correlation matrix with weight information

Where softmax is a normalization function, [; is a splicing operation.

Performing convolution operation, extracting features of the public sentiment content information and the dictionary information, and then performing max-posing (maximum pooling operation), wherein the process is as follows:

where k represents the output channel of the CNN network.

Step2.8, mixing

And performing multi-head attention operation to obtain the feature matrix with the weight information.

Step2.9, in order to obtain the text classification probability distribution in the classification layer, O obtained in the local feature extraction layer^kMapping it to the classification space using softmax (normalization) is as follows:

P(D)＝softmax(O^k) (13)

and training parameters by using a gradient descent algorithm, thereby constructing a judicial sensitive information recognition model fused into a domain term dictionary.

For better effectiveness of training and verifying models, a training set, a verifying set and a testing set are constructed according to a ratio of 8:1:1, wherein specific data information is shown in table 1:

TABLE 1 data size and data set partitioning

The construction of a domain term dictionary is very important for identifying judicial sensitive information, the method utilizes a domain knowledge enhancement model to represent domain terms, and combines the judicial domain term vocabulary and the sensitive term vocabulary into the domain term dictionary together, wherein the judicial domain terms are constructed by manually screening the content of a referee document network and a Chinese court network; the sensitive term consists of two parts: (1) manually constructing according to the characteristics of judicial public sentiment data, (2) and disclosing Chinese sensitive words after screening. The composition of the terms includes words, phrases and phrases, and the specific vocabulary number and examples are shown in table 2:

TABLE 2 dictionary size of domain terms

In the invention, the training round is designed to be 20 rounds, the learning rate of the model is 0.0001, the maximum interception length of the public opinion text is set to be 300 characters, the word embedding dimension is 512, Dropout is 0.5, the number of filters in the convolutional neural network model is 256, the size of a sliding window is (2,3 and 4), and Adam is used in the optimization algorithm.

For the invention, the effect of the judicial sensitive information classification model can be evaluated more by calculating the Macro-average and the Micro-average thereof, the Micro-average F1 value (Micro-F1), the Macro-average Precision rate (Macro _ Precision), the Macro-average Recall rate (Macro _ Recall) and the Macro-average F1 value (Macro-F1) are mainly used as evaluation indexes, wherein the calculation process is shown as the formula (13-16):

these indices are based on a "confusion matrix [18 ]]", where TP denotes true positive, FP denotes false positive, TN denotes true negative, FN denotes false negative,

respectively, the average values of the corresponding elements of the confusion matrix.

The comparison model adopted by the invention is as follows:

CNN (convolutional neural network) model: kim et al propose to apply CNN to text classification, mainly including a convolutional layer and a pooling layer, and finally classifying through a full-link layer.

Bi-LSTM Attention (Attention-based mechanism-two-way long-short term memory neural network) model: the classification is carried out by using a bidirectional cyclic neural network and an Attention layer and then a full connection layer.

RCNN (cyclic convolutional neural network) model: lai et al propose a neural network model for classification with RNN and CNN, which mainly comprises a recurrent neural network layer and a convolutional layer, and then is classified by a full-link layer.

Bert (bidirectional Transformer encoder) model: and carrying out text representation through a Bart pre-training model and then classifying through a full-connection network.

Transformer model: two encoders in the transform are used for encoding, followed by a front connection layer for classification.

FastText () model: and (3) superposing and averaging the words and the n-gram vectors of the whole document to obtain a document vector, and then normalizing the document vector to perform multi-classification.

SVM (support vector machine): the linear classifier with the maximum interval on the feature space is defined and is usually used for a text classification task, and the text feature extraction and representation method of the model is consistent with the literature.

TABLE 3 comparison of MARC-SI with baseline model test results

As can be seen from Table 3, MARC-SI has good effects on the baseline model, the pre-training model and the machine learning model, which indicates that the method for merging the domain term dictionary provided in the text is effective for the judicial domain sensitive information recognition task. From the analysis of experimental results, the RCNN and Fastext models have good effects, which shows that the model architecture selected in the text and the idea based on local feature extraction are reasonable, and the BERT pre-training model is not suitable for the task due to the fixed word segmentation structure. For most Transformer models with good task effects, the effect is not good in the task, probably because too much information is blended into public sentiment texts, and the feature extraction cannot be effectively carried out by a self-attention mechanism. As can be seen from the results, the model MARC-SI proposed herein has significant advantages in the classification of judicial opinion-sensitive information.

In order to verify that each layer of the MARC-SI model is effective for the whole classification, an ablation experiment is designed, wherein a coding layer is removed to replace a Bi-LSTM orientation layer with a full connection layer, a field term dictionary integration layer is removed to replace a field term dictionary integration layer, a local feature extraction layer is replaced to replace a CNN Self-orientation layer with a full connection layer, and the experiment result is shown in Table 4.

TABLE 4 ablation experiment

Analyzing the results in Table 4, the effect of removing the coding layer is 7% lower than the F1 value of MARC-SI, which indicates that the coding of public opinion texts and domain dictionaries is still important; the effect of the integrated model can be improved by about 1% by being merged into the domain term dictionary, which indicates that the domain term dictionary has a guiding function for the learning of the model for the task; the F1 value of the local feature extraction network for removing is lower than that of MARC-SI by 2 percent, which shows that the overall network still needs to extract the features after being merged into the term dictionary. As can be seen from ablation experiments, the network model proposed herein is effective for both judicial sensitive information identification tasks.

Because the type of the domain dictionary has a large influence on the model, in order to compare the influence of the domain term dictionary on the model, terms in different domains are respectively input into the MARC-SI model for experiment. Manually constructed judicial domain term vocabulary and public sensitive information term vocabulary are respectively merged into the composition for experiment.

TABLE 5 experiment of different vocabulary integration

Analyzing the results in table 5, the manually constructed judicial domain term vocabulary was 1% improved over the disclosed sensitive vocabulary by F1, indicating that the quality of the domain term vocabulary has some impact on the characterization of the enhanced domain term vocabulary. The integration of the whole domain term dictionary has better effect than that of a small quantity of domain terms, and the coverage of domain knowledge is shown to have great influence on enhancing the representation of the domain terms. The experimental results in tables 5 and 3 are analyzed, which shows that the method for merging domain term dictionary provided herein can improve the effect compared with the method without merging domain terms in the baseline model, and reflects that the merging of domain knowledge can enhance the characterization of the professional terms.

To verify whether the MARC-SI notices too many network opinions with forwarding, special notation, information semantic hierarchy, the present invention for this purpose gives examples 1 and 2 shown in table 6, which are both judicial sensitive information. The baseline model was set to CNN and Bi-LSTM Attn (Bi-LSTM Attention) models.

TABLE 6 case analysis

As can be seen from the results in table 6, since the Bi-LSTM Attn model with too much redundant information cannot perform effective identification, the CNN model may focus on sentences with too high sensitivity density after extracting the local information, but the MARC-SI may focus on sensitive terms specific to the judicial field, such as the sensitive terms of "double open", "public drunk driving violation", and the like in example 1. As can be seen from the results, the MARC-SI designed in the text has better representation capability for the text which is not standardized and has redundant information, and can also better classify by utilizing judicial sensitive information vocabularies.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A judicial public opinion sensitive information identification method integrated with a domain term dictionary is characterized in that: the method comprises the following steps:

2. The method for recognizing judicial public opinion sensitive information fused into a domain term dictionary according to claim 1, wherein: the method comprises the following steps of crawling judicial public opinion data before constructing a judicial sensitive information recognition model integrated into a domain term dictionary and preprocessing the data according to the classification of the judicial public opinion sensitive information, and specifically comprises the following steps:

3. The method for recognizing judicial public opinion sensitive information fused into a domain term dictionary according to claim 1, wherein: the specific steps of constructing the judicial sensitive information recognition model merged into the domain term dictionary are as follows:

And

Inputting a context-aware coding mechanism; the Bi-directional long-short term memory neural network Bi-LSTM is used as an embedding mechanism for understanding context information, the characteristic interaction between words is simulated, the outputs in two directions are simply spliced to obtain the output H of the network layer, wherein each column of vectors represents the public opinion description contextThe characterization of (1);

multiHead(Q,K,V)＝concat(head₁,…,head_h)W^O

where head_i＝att(QW_i ^Q,KW_i ^k,VW_i ^V) (4)

A^h＝residualConnect(D_M^d,D_H) (5)

K^h＝residualConnect(W_M^d,W_H) (6)

where residulconnect denotes residual connection, D _ M^d，W_M^dRespectively representThe public opinion text and domain dictionary output results through a multi-head attention mechanism, A^h，K^hRespectively representing the results of the public opinion texts after residual connection;

representing the corresponding dictionary ith domain word token vector,

is represented by A^hThe k column vector of (1), sim represents the calculation

And

wherein

Is the weight vector to be trained and,

representing elements multiplied in sequence, (;) representing vectors on a lineLine splicing, K and K^hCorresponds to the column vector of, a is a^hThe column vectors of (1) correspond;

Multiplying to obtain a correlation matrix with weight information

Where softmax is a normalization function, [; splicing operation;

wherein k represents an output channel of the CNN network;

step2.8, mixing

P(D)＝softmax(O^k) (13)。