CN112231472A - Judicial public opinion sensitive information identification method integrated with domain term dictionary - Google Patents

Judicial public opinion sensitive information identification method integrated with domain term dictionary Download PDF

Info

Publication number
CN112231472A
CN112231472A (application CN202010984681.5A)
Authority
CN
China
Prior art keywords
judicial
domain
public opinion
term dictionary
domain term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010984681.5A
Other languages
Chinese (zh)
Other versions
CN112231472B (en)
Inventor
余正涛
张泽锋
黄于欣
郭军军
相艳
高盛祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202010984681.5A priority Critical patent/CN112231472B/en
Publication of CN112231472A publication Critical patent/CN112231472A/en
Application granted granted Critical
Publication of CN112231472B publication Critical patent/CN112231472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/355 Information retrieval of unstructured textual data; clustering/classification; class or cluster creation or modification
    • G06F40/242 Handling natural language data; lexical tools; dictionaries
    • G06F40/284 Natural language analysis; lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Handling natural language data; semantic analysis
    • G06N3/044 Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/049 Neural networks; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Neural networks; learning methods


Abstract

The invention relates to a judicial public opinion sensitive information identification method that integrates a domain term dictionary. The method first encodes the public opinion texts and the domain term dictionary respectively with a bidirectional recurrent neural network and a multi-head attention mechanism, and extracts salient features; secondly, the domain term dictionary is used as classification guidance knowledge, and a similarity matrix is constructed with the public opinion texts to obtain a text representation fused with the domain term dictionary; then, global and local features are further extracted using a multi-head attention mechanism and a convolutional neural network, and finally sensitive information classification is realized. The invention fuses the domain term dictionary with the judicial public opinion context information, skillfully using context information to compensate for the poor context representation of traditional methods, and uses domain knowledge to enhance the semantic feature representation of judicially relevant words in the text, thereby improving the performance of judicial public opinion sensitive information identification.

Description

Judicial public opinion sensitive information identification method integrated with domain term dictionary
Technical Field
The invention relates to a judicial public opinion sensitive information identification method that integrates a domain term dictionary, belonging to the technical field of natural language processing.
Background
On social networks, users can express their opinions anytime and anywhere, including many misunderstandings and one-sided opinions about the adjudication work of judicial departments; social network opinion is characterized by rapid propagation, high sensitivity, easy escalation into network public opinion incidents, and the like. In order to better assist the work of judicial departments, it is important to quickly and accurately identify sensitive information related to judicial activities from massive public opinion news.
Identifying sensitive information in the judicial field cannot be treated as a simple binary classification task: whether the information relates to the judicial field and whether it is sensitive must be considered simultaneously, since judicially relevant information may be either sensitive or insensitive, and some sensitive information does not relate to the judicial field at all. Therefore, the method converts the judicial sensitive information identification task into a four-class classification task that must recognize both sensitivity and domain.
Judicial public opinion texts suffer from irregular descriptions, abundant redundant information, and similar problems, which make effective representation difficult. Sensitive information relating to the judicial field contains phrases that make the text sensitive; these phrases belong to the specialized sensitive vocabulary of the judicial field and play a leading role in identifying judicial sensitive information, but they may not appear in a general-domain sensitive term dictionary, so direct word matching cannot effectively identify sensitive information in the judicial field. In order to obtain a better representation and let the model learn expressions related to judicial sensitive information, a domain sensitive term dictionary is constructed, and the term dictionary is integrated into a deep learning framework as external guidance to perform effective feature enhancement.
Disclosure of Invention
In order to solve the problems, the invention constructs a domain term dictionary, utilizes the domain term dictionary to guide a model to learn domain characteristics, and provides a judicial public opinion sensitive information recognition model integrated into the domain term dictionary aiming at the text description characteristics of the judicial public opinion to classify the judicial public opinion sensitive information.
The technical scheme of the invention is as follows: a method for identifying judicial public opinion sensitive information that integrates a domain term dictionary, the method comprising:
constructing a judicial sensitive information recognition model fused into a domain term dictionary to recognize sensitive information; the judicial sensitive information recognition model integrated into the domain term dictionary comprises a coding layer, a domain term dictionary integration layer, a local feature extraction layer and a classification layer;
the public opinion texts and the domain term dictionary are encoded, and attention is applied to their features, by the coding layer;
calculating similarity between the domain term dictionary and the public opinion text through a domain term dictionary integration layer and integrating the similarity into text representation;
extracting important features on the basis of a domain term dictionary integration layer through a local feature extraction layer;
and predicting the class probability of the extracted important features through a classification layer.
As a further scheme of the invention, before constructing a judicial sensitive information recognition model integrated into a domain term dictionary, the judicial public opinion data is crawled, and data preprocessing is carried out according to the classification of the judicial public opinion sensitive information, and the specific steps are as follows:
step1.1, crawling public opinion texts, and forming a plurality of public opinion texts after manual screening and labeling;
step1.2, constructing a domain term dictionary comprising judicial domain vocabulary and sensitive vocabulary; the judicial domain vocabulary is constructed from China Judgements Online and the China Court website, and the sensitive vocabulary comprises two parts: (1) manually constructed according to the characteristics of judicial public opinion data, and (2) screened from publicly available Chinese sensitive vocabularies; the vocabularies consist of characters, words and phrases;
step1.3, pre-training judicial sensitive word vectors using the Sogou news dataset, the judicial public opinion sensitive information dataset, the domain term dictionary and the word2vec algorithm, to serve as judicial sensitive prior knowledge for the judicial sensitive information recognition model.
As a further scheme of the present invention, the specific steps of constructing the judicial sensitive information recognition model merged into the domain term dictionary are as follows:
step2.1, inputting the word-embedding matrices D (public opinion text) and W (domain term dictionary);
step2.2, because the previous vector representation does not consider context semantic features, the public opinion text vector representation D is input into a context-aware coding mechanism; the bidirectional long short-term memory network Bi-LSTM is used as an embedding mechanism for understanding context information, the feature interaction between words is modeled, and the outputs in the two directions are simply concatenated to obtain the output H of the network layer, where each column vector represents the representation of the public opinion description context:

D_H = Bi-LSTM(D) (1)

W_H = Bi-LSTM(W) (2)

wherein Bi-LSTM denotes the vector representation after the bidirectional recurrent neural network, and D_H and W_H are the encoded vector representations of the public opinion text and the domain term dictionary, respectively;
step2.3, the weights of the context representation H are computed here using a multi-head attention mechanism:

att(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (3)

multiHead(Q, K, V) = concat(head_1, ..., head_h) W^O, where head_i = att(Q W_i^Q, K W_i^K, V W_i^V) (4)

wherein softmax is a normalization operation, concat denotes a concatenation operation, and W_i^Q, W_i^K, W_i^V and W^O are trainable projection matrices;
step2.4, in order to prevent the original text semantics from being lost, residual connection is performed on the output results:

A_h = residualConnect(D_Md, D_H) (5)

K_h = residualConnect(W_Md, W_H) (6)

where residualConnect denotes residual connection, D_Md and W_Md respectively denote the multi-head attention outputs of the public opinion text and the domain dictionary, and A_h and K_h respectively denote the residual-connected representations of the public opinion text and the domain dictionary;
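Steps 2.3 and 2.4 can be sketched in numpy; a minimal illustration in which the Bi-LSTM output H and all projection weights are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # att(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  -- scaled dot-product
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(H, h, Wq, Wk, Wv, Wo):
    # multiHead = concat(head_1, ..., head_h) W^O
    heads = [attention(H @ Wq[i], H @ Wk[i], H @ Wv[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo

n, d, h = 6, 16, 4            # sequence length, model dim, number of heads
dk = d // h
H = rng.normal(size=(n, d))   # stand-in for the Bi-LSTM output D_H
Wq = rng.normal(size=(h, d, dk))
Wk = rng.normal(size=(h, d, dk))
Wv = rng.normal(size=(h, d, dk))
Wo = rng.normal(size=(h * dk, d))

D_Md = multi_head(H, h, Wq, Wk, Wv, Wo)  # multi-head attention output
A_h = D_Md + H                           # residual connection
print(A_h.shape)                         # (6, 16)
```

The same computation applied to the dictionary encoding W_H yields K_h.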
step2.5, a similarity matrix is computed between the domain term dictionary representation K_h and the public opinion text representation A_h:

S_ik = sim(K_h^i, A_h^k) (7)

wherein S_ik denotes the similarity between the i-th domain word of the term dictionary representation K_h and the k-th hidden vector of the text feature A_h, K_h^i denotes the representation vector of the i-th domain word of the dictionary, A_h^k denotes the k-th column vector of A_h, and sim denotes a trainable function computing the similarity between K_h^i and A_h^k, calculated as:

sim(k, a) = w^T [k; a; k ⊙ a] (8)

wherein w is the weight vector to be trained, ⊙ denotes element-wise multiplication, [;] denotes row-wise concatenation of vectors, k corresponds to a column vector of K_h, and a corresponds to a column vector of A_h;
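The similarity computation of equations (7) and (8) can be sketched as follows, with random stand-ins for the learned representations K_h and A_h and the trainable weight vector w:

```python
import numpy as np

rng = np.random.default_rng(0)

def sim(k, a, w):
    # sim(k, a) = w^T [k ; a ; k * a]  -- trainable similarity function
    return w @ np.concatenate([k, a, k * a])

d, m, n = 8, 5, 7                 # hidden dim, #dictionary terms, #text positions
K_h = rng.normal(size=(m, d))     # dictionary representations (rows = terms)
A_h = rng.normal(size=(n, d))     # text representations (rows = positions)
w = rng.normal(size=3 * d)        # trainable weight vector

# S_ik = sim(K_h^i, A_h^k): one similarity score per (term, position) pair
S = np.array([[sim(K_h[i], A_h[k], w) for k in range(n)] for i in range(m)])
print(S.shape)                    # (5, 7)
```

In training, w would be learned jointly with the rest of the network.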
step2.6, after normalizing S_ik, it is multiplied with the word-embedding matrix W to obtain a correlation matrix M_w with weight information; finally, this similarity matrix is concatenated with the original text to obtain the text representation T fused with the dictionary information:

M_w = softmax(S) W (9)

T = [A_h; M_w] (10)

where softmax is a normalization function and [;] denotes the concatenation operation;
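A minimal numpy sketch of this fusion step: the similarity scores are normalized with softmax, multiplied with the dictionary embedding matrix, and concatenated with the text representation (S, W and A_h below are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d, m, n = 8, 5, 7
S = rng.normal(size=(n, m))   # similarity of each text position to each term
W = rng.normal(size=(m, d))   # word-embedding matrix of the dictionary terms
A_h = rng.normal(size=(n, d)) # encoded public opinion text

M_w = softmax(S) @ W                     # weighted correlation matrix
T = np.concatenate([A_h, M_w], axis=-1)  # text fused with dictionary info
print(T.shape)                           # (7, 16)
```

Each row of T now carries both the contextual encoding and a dictionary-weighted summary for that position.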
step2.7, a convolution operation is performed on the text representation T fused with the dictionary information to extract features from the public opinion content information and the dictionary information, followed by a max-pooling operation, as follows:

C_k = maxPooling(CNN(T)) (11)

wherein k denotes an output channel of the CNN network;
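The convolution-plus-max-pooling step can be sketched as below; each filter spans the full embedding width with window sizes 2, 3 and 4 (the sizes used in the experimental setup), and the inputs are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_maxpool(T, filt):
    # Slide a window of filt.shape[0] rows over T, then max-pool over positions
    win = filt.shape[0]
    feats = [np.sum(T[i:i + win] * filt) for i in range(T.shape[0] - win + 1)]
    return max(feats)   # max-pooling over the sequence

n, d = 10, 16
T = rng.normal(size=(n, d))   # dictionary-fused text representation
# One filter per output channel; window sizes 2, 3, 4 as in the experiments
filters = [rng.normal(size=(w, d)) for w in (2, 3, 4)]
C = np.array([conv1d_maxpool(T, f) for f in filters])
print(C.shape)                # (3,)
```

The real model uses 256 filters per window size; three are shown here for brevity.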
step2.8, a multi-head attention operation is performed on the pooled features to obtain the feature matrix O_k with weight information:

O_k = multiHead(C, C, C) (12)
step2.9, in order to obtain the text classification probability distribution in the classification layer, the O_k obtained in the local feature extraction layer is mapped to the classification space using the softmax normalization function, as follows:
P(D)=softmax(Ok) (13)
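The classification layer reduces to a softmax over the extracted feature scores. The four-dimensional score vector below is a made-up example for the four classes (judicial/non-judicial crossed with sensitive/insensitive; the class ordering is an assumption):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

O_k = np.array([2.0, 0.5, -1.0, 0.1])  # illustrative scores for the 4 classes
P = softmax(O_k)                       # P(D) = softmax(O_k)
pred = int(P.argmax())                 # index of the predicted class
print(pred)                            # 0
```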
The invention has the following beneficial effects: the method fuses the domain term dictionary with the judicial public opinion context information, skillfully using context information to compensate for the poor context representation of traditional methods, and also uses domain knowledge to enhance the semantic feature representation of judicially relevant words in the text, thereby improving the performance of judicial public opinion sensitive information identification.
Experimental results show that the proposed method outperforms the baseline systems on indexes such as accuracy, recall, macro-averaged F1 value and micro-averaged F1 value.
Drawings
FIG. 1 is a schematic diagram of model construction in the present invention;
FIG. 2 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in figs. 1-2, the judicial public opinion sensitive information recognition method integrating the domain term dictionary first encodes the public opinion texts and the domain term dictionary respectively using a bidirectional recurrent neural network and a multi-head attention mechanism and extracts salient features; secondly, the domain term dictionary is used as classification guidance knowledge, and a similarity matrix is constructed with the public opinion texts to obtain a text representation fused with the domain term dictionary; then, global and local features are further extracted using a multi-head attention mechanism and a convolutional neural network, and finally sensitive information classification is realized;
the method comprises the following specific steps:
step1, crawling judicial public opinion data and carrying out data preprocessing according to the classification of judicial public opinion sensitive information;
step1.1, crawling websites such as Sina Weibo and GitHub for the period from March 1, 2020 to June 1, 2020, and forming 20,000 public opinion texts after manual screening and labeling;
step1.2, constructing a domain term dictionary comprising judicial domain vocabulary and sensitive vocabulary; the judicial domain vocabulary is constructed from China Judgements Online and the China Court website, and the sensitive vocabulary comprises two parts: (1) manually constructed according to the characteristics of judicial public opinion data, and (2) screened from publicly available Chinese sensitive vocabularies; the vocabularies consist of characters, words and phrases;
step1.3, pre-training judicial sensitive word vectors using the Sogou news dataset (about 500 MB), the judicial public opinion sensitive information dataset, the domain term dictionary and the word2vec algorithm, as prior knowledge for the model;
as a further scheme of the present invention, the specific steps of constructing the judicial sensitive information recognition model merged into the domain term dictionary are as follows:
step2.1, inputting the word-embedding matrices D (public opinion text) and W (domain term dictionary);
step2.2, because the previous vector representation does not consider context semantic features, the public opinion text vector representation D is input into a context-aware coding mechanism. Bi-LSTM (bidirectional long short-term memory neural network) is used as an embedding mechanism for understanding context information, the feature interaction between words is modeled, and the outputs in the two directions are simply concatenated to obtain the output H of the network layer, where each column vector represents the representation of the public opinion description context:

D_H = Bi-LSTM(D) (1)

W_H = Bi-LSTM(W) (2)

wherein Bi-LSTM denotes the vector representation after the bidirectional recurrent neural network, and D_H and W_H are the encoded vector representations of the public opinion text and the domain term dictionary, respectively.
Step2.3, the weights of the context representation H are computed here using a multi-head attention mechanism:

att(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (3)

multiHead(Q, K, V) = concat(head_1, ..., head_h) W^O, where head_i = att(Q W_i^Q, K W_i^K, V W_i^V) (4)

wherein softmax is a normalization operation, concat denotes a concatenation operation, and W_i^Q, W_i^K, W_i^V and W^O are trainable projection matrices.
Step2.4, in order to prevent the original text semantics from being lost, residual connection is performed on the output results:

A_h = residualConnect(D_Md, D_H) (5)

K_h = residualConnect(W_Md, W_H) (6)

where residualConnect denotes residual connection, D_Md and W_Md respectively denote the multi-head attention outputs of the public opinion text and the domain dictionary, and A_h and K_h respectively denote the residual-connected representations of the public opinion text and the domain dictionary.
Step2.5, a similarity matrix is computed between the domain term dictionary representation K_h and the public opinion text representation A_h:

S_ik = sim(K_h^i, A_h^k) (7)

wherein S_ik denotes the similarity between the i-th domain word of the term dictionary representation K_h and the k-th hidden vector of the text feature A_h, K_h^i denotes the representation vector of the i-th domain word of the dictionary, A_h^k denotes the k-th column vector of A_h, and sim denotes a trainable function computing the similarity between K_h^i and A_h^k, calculated as:

sim(k, a) = w^T [k; a; k ⊙ a] (8)

wherein w is the weight vector to be trained, ⊙ denotes element-wise multiplication, [;] denotes row-wise concatenation of vectors, k corresponds to a column vector of K_h, and a corresponds to a column vector of A_h.
Step2.6, after normalizing S_ik, it is multiplied with the word-embedding matrix W to obtain a correlation matrix M_w with weight information; finally, this similarity matrix is concatenated with the original text to obtain the text representation T fused with the dictionary information:

M_w = softmax(S) W (9)

T = [A_h; M_w] (10)

where softmax is a normalization function and [;] denotes the concatenation operation.
Step2.7, a convolution operation is performed on the text representation T fused with the dictionary information to extract features from the public opinion content information and the dictionary information, followed by max-pooling (maximum pooling operation), as follows:

C_k = maxPooling(CNN(T)) (11)

where k denotes an output channel of the CNN network.
Step2.8, a multi-head attention operation is performed on the pooled features to obtain the feature matrix O_k with weight information:

O_k = multiHead(C, C, C) (12)
Step2.9, in order to obtain the text classification probability distribution in the classification layer, the O_k obtained in the local feature extraction layer is mapped to the classification space using softmax (normalization), as follows:
P(D)=softmax(Ok) (13)
The parameters are trained using a gradient descent algorithm, thereby constructing the judicial sensitive information recognition model integrating the domain term dictionary.
To effectively train and validate the model, training, validation and test sets are constructed at a ratio of 8:1:1; the specific data information is shown in Table 1:
TABLE 1 data size and data set partitioning
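The 8:1:1 split described above can be sketched as follows (integer indices stand in for the 20,000 labeled texts):

```python
import random

random.seed(0)
data = list(range(20000))   # placeholder for the 20,000 labeled texts
random.shuffle(data)

n = len(data)
n_train, n_val = int(n * 0.8), int(n * 0.1)
train = data[:n_train]
val = data[n_train:n_train + n_val]
test = data[n_train + n_val:]
print(len(train), len(val), len(test))   # 16000 2000 2000
```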
The construction of the domain term dictionary is very important for identifying judicial sensitive information. The method uses domain knowledge to enhance the model's representation of domain terms, combining the judicial domain term vocabulary and the sensitive term vocabulary into one domain term dictionary. The judicial domain terms are constructed by manually screening the content of China Judgements Online and the China Court website; the sensitive terms consist of two parts: (1) manually constructed according to the characteristics of judicial public opinion data, and (2) screened from publicly available Chinese sensitive vocabularies. The terms consist of characters, words and phrases; the specific vocabulary sizes and examples are shown in Table 2:
TABLE 2 dictionary size of domain terms
In the invention, training is run for 20 epochs, the learning rate of the model is 0.0001, the maximum truncation length of the public opinion text is set to 300 characters, the word-embedding dimension is 512, Dropout is 0.5, the number of filters in the convolutional neural network is 256, the sliding-window sizes are (2, 3, 4), and Adam is used as the optimization algorithm.
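For reference, the stated hyperparameters collected into one configuration dict (the key names are illustrative, not from the patent):

```python
# Hyperparameters as stated in the text; Adam is the optimizer
config = {
    "epochs": 20,
    "learning_rate": 1e-4,
    "max_text_length": 300,   # characters, truncation length
    "embedding_dim": 512,
    "dropout": 0.5,
    "num_filters": 256,
    "window_sizes": (2, 3, 4),
    "optimizer": "Adam",
}
print(config["embedding_dim"])   # 512
```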
To evaluate the judicial sensitive information classification model more comprehensively, its macro-averages and micro-averages are calculated; the micro-averaged F1 value (Micro-F1), macro-averaged precision (Macro_Precision), macro-averaged recall (Macro_Recall) and macro-averaged F1 value (Macro-F1) are mainly used as evaluation indexes, calculated as shown in formulas (13)-(16):

Macro_Precision = (1/C) Σ_c TP_c / (TP_c + FP_c) (13)

Macro_Recall = (1/C) Σ_c TP_c / (TP_c + FN_c) (14)

Macro-F1 = 2 × Macro_Precision × Macro_Recall / (Macro_Precision + Macro_Recall) (15)

Micro-F1 = 2 × P_micro × R_micro / (P_micro + R_micro), with P_micro = Σ_c TP_c / Σ_c (TP_c + FP_c) and R_micro = Σ_c TP_c / Σ_c (TP_c + FN_c) (16)

These indices are based on a confusion matrix, where TP denotes true positives, FP false positives, TN true negatives and FN false negatives; the subscript c ranges over the C classes, and the micro-averaged quantities are computed from the corresponding confusion-matrix elements pooled over all classes.
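The macro- and micro-averaged metrics can be sketched in plain Python (a from-scratch illustration, not the authors' evaluation code):

```python
from collections import Counter

def macro_micro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    # Macro: average per-class precision and recall, then combine into F1
    precs = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels]
    recs = [tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels]
    macro_p = sum(precs) / len(labels)
    macro_r = sum(recs) / len(labels)
    macro_f1 = (2 * macro_p * macro_r / (macro_p + macro_r)
                if macro_p + macro_r else 0.0)
    # Micro: pool TP/FP/FN over all classes, then compute F1
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    micro_f1 = (2 * micro_p * micro_r / (micro_p + micro_r)
                if micro_p + micro_r else 0.0)
    return macro_p, macro_r, macro_f1, micro_f1

# Toy four-class example (class ids 0-3)
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1, micro_f1 = macro_micro_f1(y_true, y_pred)
print(macro_p, macro_r, macro_f1, micro_f1)
```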
The comparison model adopted by the invention is as follows:
CNN (convolutional neural network) model: Kim et al. proposed applying CNN to text classification; the model mainly comprises a convolutional layer and a pooling layer, and finally classifies through a fully-connected layer.
Bi-LSTM Attention (attention-mechanism-based bidirectional long short-term memory network) model: a bidirectional recurrent neural network and an attention layer are used, followed by a fully-connected layer for classification.
RCNN (recurrent convolutional neural network) model: Lai et al. proposed a neural network model combining RNN and CNN for classification, mainly comprising a recurrent neural network layer and a convolutional layer, followed by a fully-connected layer for classification.
BERT (bidirectional Transformer encoder) model: text representation is obtained through a BERT pre-training model and then classified through a fully-connected network.
Transformer model: two encoders of the Transformer are used for encoding, followed by a fully-connected layer for classification.
FastText model: the word and n-gram vectors of the whole document are superposed and averaged to obtain a document vector, which is then normalized for multi-class classification.
SVM (support vector machine): a linear classifier with the maximum margin in the feature space, commonly used for text classification tasks; the text feature extraction and representation method of this model is consistent with the literature.
TABLE 3 comparison of MARC-SI with baseline model test results
As can be seen from Table 3, MARC-SI performs well against the baseline models, the pre-training model and the machine learning model, indicating that the proposed method of integrating a domain term dictionary is effective for the judicial domain sensitive information recognition task. Analyzing the experimental results, the RCNN and FastText models perform well, showing that the model architecture selected here and the idea of local feature extraction are reasonable, while the BERT pre-training model is not well suited to this task due to its fixed word-segmentation structure. The Transformer model, which performs well on most tasks, does not perform well here, probably because too much information is blended into the public opinion texts and the self-attention mechanism cannot extract features effectively. The results show that the proposed MARC-SI model has significant advantages in the classification of judicial public opinion sensitive information.
In order to verify that each layer of the MARC-SI model contributes to the overall classification, an ablation experiment is designed: removing the coding layer replaces the Bi-LSTM + Attention layer with a fully-connected layer; removing the domain term dictionary integration layer drops the dictionary fusion; removing the local feature extraction layer replaces the CNN + Self-Attention layers with a fully-connected layer. The experimental results are shown in Table 4.
TABLE 4 ablation experiment
Analyzing the results in Table 4: removing the coding layer lowers the F1 value by 7% relative to MARC-SI, indicating that encoding the public opinion texts and the domain dictionary remains important; integrating the domain term dictionary improves the overall model by about 1%, indicating that the domain term dictionary guides the model's learning for this task; removing the local feature extraction network lowers the F1 value by 2% relative to MARC-SI, showing that the network still needs to extract features after the term dictionary is integrated. The ablation experiments show that each component of the proposed network model is effective for the judicial sensitive information recognition task.
Because the type of domain dictionary has a large influence on the model, terms from different domains are separately input into the MARC-SI model to compare their effects. The manually constructed judicial domain term vocabulary and the public sensitive information term vocabulary are each integrated for the experiments.
TABLE 5 experiment of different vocabulary integration
Analyzing the results in Table 5, the manually constructed judicial domain term vocabulary improves F1 by 1% over the publicly available sensitive vocabulary, indicating that the quality of the domain term vocabulary affects the enhanced representation of domain terms. Integrating the whole domain term dictionary works better than integrating a small number of domain terms, showing that the coverage of domain knowledge has a great influence on enhancing the representation of domain terms. Comparing the results in Tables 5 and 3 shows that the proposed method of integrating a domain term dictionary improves over the baseline models that do not integrate domain terms, reflecting that the integration of domain knowledge can enhance the representation of professional terms.
To verify whether MARC-SI attends appropriately to online opinions containing forwarding, special notation, and layered information semantics, examples 1 and 2 shown in Table 6 are given; both are judicial sensitive information. The baseline models are CNN and Bi-LSTM Attn (Bi-LSTM with attention).
TABLE 6 Case analysis [table rendered as an image in the original]
As the results in Table 6 show, the Bi-LSTM Attn model cannot effectively identify texts containing too much redundant information, and the CNN model, after extracting local information, tends to focus on sentences with a high density of sensitive words, whereas MARC-SI focuses on sensitive terms specific to the judicial field, such as "double open" (expulsion from the Party and dismissal from public office) and "drunk driving by a public official" in example 1. The results show that the MARC-SI designed here has better representation ability for non-standardized text with redundant information and makes better use of judicial sensitive vocabulary for classification.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (3)

1. A judicial public opinion sensitive information identification method integrated with a domain term dictionary is characterized in that: the method comprises the following steps:
constructing a judicial sensitive information recognition model fused into a domain term dictionary to recognize sensitive information; the judicial sensitive information recognition model integrated into the domain term dictionary comprises a coding layer, a domain term dictionary integration layer, a local feature extraction layer and a classification layer;
the public opinion text and the domain term dictionary are encoded and feature attention is applied through the coding layer;
calculating similarity between the domain term dictionary and the public opinion text through a domain term dictionary integration layer and integrating the similarity into text representation;
extracting important features on the basis of a domain term dictionary integration layer through a local feature extraction layer;
and predicting the class probability of the extracted important features through a classification layer.
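The four-layer pipeline recited in claim 1 can be sketched at shape level as follows. This is a minimal NumPy illustration with toy dimensions, not the patented implementation: the Bi-LSTM/multi-head-attention encoder is replaced by random encodings, and all sizes and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration only)
n_text, n_dict, d, n_classes = 12, 5, 8, 2

# --- Coding layer stand-in (the patent uses Bi-LSTM + multi-head attention) ---
D_H = rng.standard_normal((n_text, d))   # encoded public opinion text
K_H = rng.standard_normal((n_dict, d))   # encoded domain term dictionary

# --- Domain term dictionary integration layer ---
S = K_H @ D_H.T                          # term-position similarity (n_dict, n_text)
A = np.exp(S - S.max(axis=0))
A /= A.sum(axis=0)                       # softmax over dictionary terms
M = A.T @ K_H                            # weighted dictionary information (n_text, d)
U = np.concatenate([D_H, M], axis=1)     # fused text representation (n_text, 2d)

# --- Local feature extraction layer: 1-D convolution + global max-pooling ---
win = 3
conv_w = rng.standard_normal(win * U.shape[1])
feats = np.array([max(float(conv_w @ U[i:i + win].ravel()), 0.0)   # ReLU
                  for i in range(n_text - win + 1)])
pooled = feats.max()

# --- Classification layer: map pooled feature to class probabilities ---
W_cls = rng.standard_normal(n_classes)
logits = pooled * W_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)   # (2,)
```

The real model computes the similarity with the trainable function of equation (8) of claim 3 and uses multiple convolution channels; a single channel suffices here to show the data flow.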
2. The method for recognizing judicial public opinion sensitive information integrated with a domain term dictionary according to claim 1, wherein: before constructing the judicial sensitive information recognition model integrated with the domain term dictionary, the method further comprises crawling judicial public opinion data and preprocessing the data according to the classification of judicial public opinion sensitive information, specifically comprising the following steps:
step1.1, crawling public opinion texts, and forming a plurality of public opinion texts after manual screening and labeling;
step1.2, constructing a domain term dictionary, wherein the domain term dictionary comprises judicial domain vocabularies and sensitive vocabularies; the judicial domain vocabularies are constructed from the China Judgments Online website and the China Court website, and the sensitive vocabularies comprise two parts: (1) a part constructed manually according to the characteristics of judicial public opinion data, and (2) a part screened from publicly available Chinese sensitive vocabularies; the vocabularies consist of characters, words and phrases;
step1.3, pre-training judicial sensitive word vectors with the word2vec algorithm on the Sogou news dataset, the judicial public opinion sensitive information dataset and the domain term dictionary, to serve as judicial sensitive prior knowledge for the judicial sensitive information recognition model.
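The vocabulary merge of step 1.2 can be sketched as follows. All entries below are hypothetical examples; the real vocabularies are built from China Judgments Online, the China Court website, and screened public sensitive-word lists.

```python
def build_domain_dictionary(judicial_vocab, manual_sensitive, public_sensitive):
    """Merge the three vocabulary sources, screening out duplicates and blanks."""
    merged, seen = [], set()
    for source in (judicial_vocab, manual_sensitive, public_sensitive):
        for term in source:
            term = term.strip()
            if term and term not in seen:
                seen.add(term)
                merged.append(term)
    return merged

judicial_vocab   = ["判决", "公诉"]          # judicial-domain words (hypothetical)
manual_sensitive = ["双开", "醉驾"]          # manually constructed sensitive words
public_sensitive = ["醉驾", "  ", "敏感词"]  # public list: duplicate and blank screened

dictionary = build_domain_dictionary(judicial_vocab, manual_sensitive, public_sensitive)
print(dictionary)   # → ['判决', '公诉', '双开', '醉驾', '敏感词']
```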
3. The method for recognizing judicial public opinion sensitive information integrated with a domain term dictionary according to claim 1, wherein: the specific steps of constructing the judicial sensitive information recognition model integrated with the domain term dictionary are as follows:
Step2.1, input the word-embedding matrices D and W of the public opinion text and the domain term dictionary, respectively;
Step2.2, since the foregoing vector representations do not consider contextual semantic features, the public opinion text representation D is input into a context-aware coding mechanism; a bidirectional long short-term memory network (Bi-LSTM) is used as the embedding mechanism for understanding context information and modeling the feature interaction between words, and the outputs of the two directions are concatenated to obtain the output H of this network layer, in which each column vector is the contextual characterization of the public opinion description:
D_H = Bi-LSTM(D) (1)
W_H = Bi-LSTM(W) (2)
wherein Bi-LSTM represents the encoding by the bidirectional recurrent neural network, and D_H and W_H are the encoded representations of the public opinion text and the domain term dictionary, respectively;
Step2.3, a multi-head attention mechanism is used here to calculate the weights of the context representation H:
att(Q, K, V) = softmax(QK^T / sqrt(d_k)) V (3)
multiHead(Q, K, V) = concat(head_1, …, head_h) W^O, where head_i = att(QW_i^Q, KW_i^K, VW_i^V) (4)
wherein softmax is the normalization operation, concat represents the concatenation operation, and W_i^Q, W_i^K, W_i^V and W^O are trainable projection matrices;
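The multi-head attention of Step2.3 (equations (3) and (4)) can be sketched in NumPy as follows; the toy sizes, random weight initialization, and the self-attention usage over H are assumptions for illustration:

```python
import numpy as np

def att(Q, K, V):
    """Scaled dot-product attention, equation (3): softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """Equation (4): project per head, attend, concatenate, project with W^O."""
    heads = [att(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(1)
n, d, h, d_head = 6, 8, 2, 4          # toy sizes (assumptions)
H = rng.standard_normal((n, d))       # context representation from the Bi-LSTM
Wq = [rng.standard_normal((d, d_head)) for _ in range(h)]
Wk = [rng.standard_normal((d, d_head)) for _ in range(h)]
Wv = [rng.standard_normal((d, d_head)) for _ in range(h)]
Wo = rng.standard_normal((h * d_head, d))

out = multi_head(H, H, H, Wq, Wk, Wv, Wo)   # self-attention over H
print(out.shape)                            # (6, 8)
```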
Step2.4, in order to prevent the original text semantics from being lost, residual connections are applied to the output results:
A_h = residualConnect(D_Md, D_H) (5)
K_h = residualConnect(W_Md, W_H) (6)
wherein residualConnect represents the residual connection, D_Md and W_Md represent the outputs of the public opinion text and the domain dictionary after the multi-head attention mechanism, and A_h and K_h represent the residual-connected results of the public opinion text and the domain dictionary, respectively;
Step2.5, calculate a similarity matrix between the domain term dictionary representation K_h and the public opinion text representation A_h:
S_ik = sim(k_i, a_k) (7)
wherein S_ik represents the similarity between the i-th domain word of the term dictionary representation K_h and the k-th hidden vector of the text feature A_h, k_i represents the token vector of the i-th domain word of the dictionary, a_k represents the k-th column vector of A_h, and sim is a trainable function for calculating the similarity between k_i and a_k, computed as:
sim(k, a) = w_s^T [k; a; k⊙a] (8)
wherein w_s is the weight vector to be trained, ⊙ represents element-wise multiplication, [;] represents vector concatenation along a row, k corresponds to a column vector of K_h, and a corresponds to a column vector of A_h;
Step2.6, after normalizing S_ik, multiply it by the word-embedding matrix W to obtain a correlation matrix M carrying weight information, and finally concatenate the similarity matrix with the original text to obtain the text representation U fused with the dictionary information:
M = softmax(S)·W (9)
U = [A_h; M] (10)
wherein softmax is the normalization function and [;] represents the concatenation operation;
Step2.7, perform a convolution operation on the text representation U fused with the dictionary information to extract features from the public opinion content information and the dictionary information, followed by a max-pooling operation; the process is as follows:
C_k = maxpooling(CNN(U)) (11)
wherein k represents an output channel of the CNN network;
Step2.8, perform a multi-head attention operation on C_k to obtain a feature matrix with weight information:
O_k = multiHead(C_k, C_k, C_k) (12)
Step2.9, to obtain the text classification probability distribution in the classification layer, the O_k obtained from the local feature extraction layer is normalized with softmax and mapped to the classification space as follows:
P(D) = softmax(O_k) (13).
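Steps 2.5 and 2.6 of claim 3 (equations (7)–(10)) can be sketched as follows. This is a hedged NumPy illustration: treating rows as term/position vectors, normalizing over the dictionary axis, and all sizes are assumptions about the exact layout, not the patented implementation.

```python
import numpy as np

def trainable_sim(K_h, A_h, w_s):
    """Equation (8): sim(k, a) = w_s^T [k; a; k*a] for every (term, position) pair."""
    m, _ = K_h.shape
    n = A_h.shape[0]
    S = np.empty((m, n))
    for i in range(m):
        for k in range(n):
            feat = np.concatenate([K_h[i], A_h[k], K_h[i] * A_h[k]])
            S[i, k] = w_s @ feat                     # equation (7): S_ik
    return S

rng = np.random.default_rng(2)
m, n, d = 4, 7, 5                         # toy sizes (assumptions)
K_h = rng.standard_normal((m, d))         # dictionary representation
A_h = rng.standard_normal((n, d))         # public opinion text representation
W_emb = rng.standard_normal((m, d))       # dictionary word-embedding matrix W
w_s = rng.standard_normal(3 * d)          # trainable weight vector of eq. (8)

S = trainable_sim(K_h, A_h, w_s)                      # equation (7)
A = np.exp(S - S.max(axis=0))
A /= A.sum(axis=0)                                    # softmax over dictionary terms
M = A.T @ W_emb                                       # equation (9): weighted dict info
U = np.concatenate([A_h, M], axis=1)                  # equation (10): fuse with text
print(U.shape)                                        # (7, 10)
```

Each text position receives a convex combination of dictionary-word embeddings weighted by its learned similarity to every term, which is then concatenated onto the text representation before the convolutional layer.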
CN202010984681.5A 2020-09-18 2020-09-18 Judicial public opinion sensitive information identification method integrated with domain term dictionary Active CN112231472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010984681.5A CN112231472B (en) 2020-09-18 2020-09-18 Judicial public opinion sensitive information identification method integrated with domain term dictionary


Publications (2)

Publication Number Publication Date
CN112231472A true CN112231472A (en) 2021-01-15
CN112231472B CN112231472B (en) 2022-07-29

Family

ID=74107203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010984681.5A Active CN112231472B (en) 2020-09-18 2020-09-18 Judicial public opinion sensitive information identification method integrated with domain term dictionary

Country Status (1)

Country Link
CN (1) CN112231472B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836054A (en) * 2021-03-08 2021-05-25 重庆大学 Service classification method based on symbiotic attention representation learning
CN113177831A (en) * 2021-03-12 2021-07-27 西安理工大学 Financial early warning system and method constructed by applying public data
CN113609301A (en) * 2021-07-05 2021-11-05 上海交通大学 Dialogue method, medium and system based on knowledge graph
CN113762237A (en) * 2021-04-26 2021-12-07 腾讯科技(深圳)有限公司 Text image processing method, device and equipment and storage medium
CN116108171A (en) * 2022-12-19 2023-05-12 中国邮政速递物流股份有限公司广东省分公司 Judicial material processing system based on AI circulating neural network deep learning technology
CN117009533A (en) * 2023-09-27 2023-11-07 戎行技术有限公司 Dark language identification method based on classification extraction and word vector model
CN117453863A (en) * 2023-12-22 2024-01-26 珠海博维网络信息有限公司 Public opinion text classifying method and system
CN113177831B (en) * 2021-03-12 2024-05-17 西安理工大学 Financial early warning system constructed by application of public data and early warning method

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080208597A1 (en) * 2007-02-27 2008-08-28 Tetsuro Chino Apparatus, method, and computer program product for processing input speech
US20140329557A1 (en) * 2011-12-12 2014-11-06 Samsung Electronics Co., Ltd. Method and apparatus for reporting dual mode capabilities in a long term evolution network
CN105022725A (en) * 2015-07-10 2015-11-04 河海大学 Text emotional tendency analysis method applied to field of financial Web
WO2016190861A1 (en) * 2015-05-27 2016-12-01 Hewlett Packard Enterprise Development Lp Identifying algorithmically generated domains
CN107038249A (en) * 2017-04-28 2017-08-11 安徽博约信息科技股份有限公司 Network public sentiment information sensibility classification method based on dictionary
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
CN108984667A (en) * 2018-06-29 2018-12-11 郑州中博奥信息技术有限公司 A kind of public sentiment monitoring system
CN109543180A (en) * 2018-11-08 2019-03-29 中山大学 A kind of text emotion analysis method based on attention mechanism
CN109582875A (en) * 2018-12-17 2019-04-05 武汉泰乐奇信息科技有限公司 A kind of personalized recommendation method and system of online medical education resource
CN110083700A (en) * 2019-03-19 2019-08-02 北京中兴通网络科技股份有限公司 A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks
CN110110054A (en) * 2019-03-22 2019-08-09 北京中科汇联科技股份有限公司 A method of obtaining question and answer pair in the slave non-structured text based on deep learning
CN111209401A (en) * 2020-01-03 2020-05-29 西安电子科技大学 System and method for classifying and processing sentiment polarity of online public opinion text information
CN111597304A (en) * 2020-05-15 2020-08-28 上海财经大学 Secondary matching method for accurately identifying Chinese enterprise name entity


Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
FAN G 等: "Deep semantic feature learning with embedded static metrics for software defect prediction", 《2019 26TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE》 *
LEE J 等: "Learning probabilistic kernel feature subspace with side-information for classification", 《2004 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS》 *
WANJIN CHE 等: "Research on Chinese and Vietnamese Bilingual Event Graph Extraction Method", 《PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MECHATRONICS AND INTELLIGENT ROBOTICS》 *
ZHOU Qinqing et al.: "Application of hypergraph Laplacian sparse coding in image recognition", 《Computer Applications and Software》 *
YU Shengwei et al.: "Fine-grained opinion mining based on feature representation of domain sentiment lexicon", 《Journal of Chinese Information Processing》 *
HAN Lu et al.: "The influence of domain knowledge relations on domain text classification", 《Proceedings of the 27th Chinese Control Conference》 *
HAN Pengyu et al.: "Case-element-guided summarization method for case-related public opinion news texts", 《Journal of Chinese Information Processing》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836054B (en) * 2021-03-08 2022-07-26 重庆大学 Service classification method based on symbiotic attention representation learning
CN112836054A (en) * 2021-03-08 2021-05-25 重庆大学 Service classification method based on symbiotic attention representation learning
CN113177831A (en) * 2021-03-12 2021-07-27 西安理工大学 Financial early warning system and method constructed by applying public data
CN113177831B (en) * 2021-03-12 2024-05-17 西安理工大学 Financial early warning system constructed by application of public data and early warning method
CN113762237B (en) * 2021-04-26 2023-08-18 腾讯科技(深圳)有限公司 Text image processing method, device, equipment and storage medium
CN113762237A (en) * 2021-04-26 2021-12-07 腾讯科技(深圳)有限公司 Text image processing method, device and equipment and storage medium
CN113609301A (en) * 2021-07-05 2021-11-05 上海交通大学 Dialogue method, medium and system based on knowledge graph
CN116108171A (en) * 2022-12-19 2023-05-12 中国邮政速递物流股份有限公司广东省分公司 Judicial material processing system based on AI circulating neural network deep learning technology
CN116108171B (en) * 2022-12-19 2023-10-31 中国邮政速递物流股份有限公司广东省分公司 Judicial material processing system based on AI circulating neural network deep learning technology
CN117009533A (en) * 2023-09-27 2023-11-07 戎行技术有限公司 Dark language identification method based on classification extraction and word vector model
CN117009533B (en) * 2023-09-27 2023-12-26 戎行技术有限公司 Dark language identification method based on classification extraction and word vector model
CN117453863A (en) * 2023-12-22 2024-01-26 珠海博维网络信息有限公司 Public opinion text classifying method and system
CN117453863B (en) * 2023-12-22 2024-03-29 珠海博维网络信息有限公司 Public opinion text classifying method and system

Also Published As

Publication number Publication date
CN112231472B (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN112231472B (en) Judicial public opinion sensitive information identification method integrated with domain term dictionary
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN110990564B (en) Negative news identification method based on emotion calculation and multi-head attention mechanism
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN109284506A (en) A kind of user comment sentiment analysis system and method based on attention convolutional neural networks
CN110532557B (en) Unsupervised text similarity calculation method
CN108563638B (en) Microblog emotion analysis method based on topic identification and integrated learning
CN106557462A (en) Name entity recognition method and system
CN111160031A (en) Social media named entity identification method based on affix perception
CN110717324B (en) Judgment document answer information extraction method, device, extractor, medium and equipment
CN111046670B (en) Entity and relationship combined extraction method based on drug case legal documents
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN110717843A (en) Reusable law strip recommendation framework
CN112183064B (en) Text emotion reason recognition system based on multi-task joint learning
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN110414009A (en) The remote bilingual parallel sentence pairs abstracting method of English based on BiLSTM-CNN and device
CN112561718A (en) Case microblog evaluation object emotion tendency analysis method based on BilSTM weight sharing
CN110287298A (en) A kind of automatic question answering answer selection method based on question sentence theme
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN112270187A (en) Bert-LSTM-based rumor detection model
CN112052319B (en) Intelligent customer service method and system based on multi-feature fusion
Sun et al. Transformer based multi-grained attention network for aspect-based sentiment analysis
CN115759092A (en) Network threat information named entity identification method based on ALBERT

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant