CN112231472A - Judicial public opinion sensitive information identification method integrated with domain term dictionary - Google Patents
Classifications
- G06F16/355 — Information retrieval of unstructured textual data; clustering/classification; class or cluster creation or modification
- G06F40/242 — Handling natural language data; lexical tools; dictionaries
- G06F40/284 — Natural language analysis; lexical analysis, e.g. tokenisation or collocates
- G06F40/289 — Recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Semantic analysis
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08 — Learning methods
Abstract
The invention relates to a judicial public opinion sensitive information identification method that incorporates a domain term dictionary. The method first encodes the public opinion text and the domain term dictionary with a bidirectional recurrent neural network and a multi-head attention mechanism, respectively, and extracts salient features; second, the domain term dictionary is used as guiding knowledge for classification, and a similarity matrix is constructed with the public opinion text to obtain a text representation fused with the domain term dictionary; then global and local features are further extracted with the multi-head attention mechanism and a convolutional neural network, and finally sensitive information classification is realized. The invention fuses the domain term dictionary with judicial public opinion context information: the context information compensates for the poor representation that traditional methods give to context, and the domain knowledge enhances the semantic feature representation of words related to judicial information in the text, thereby improving the performance of judicial public opinion sensitive information identification.
Description
Technical Field
The invention relates to a judicial public opinion sensitive information identification method that incorporates a domain term dictionary, and belongs to the technical field of natural language processing.
Background
On social networks, users can express their opinions anytime and anywhere, and many of these posts contain misunderstandings and one-sided views of the adjudication work of judicial departments. Social network opinion spreads rapidly, is highly sensitive, and easily triggers network-wide public opinion events. To better assist judicial departments in their work, it is important to identify sensitive information related to judicial activities quickly and accurately from massive amounts of public opinion news.
Identifying sensitive information in the judicial field cannot be treated as a simple binary task: whether a text concerns the judicial field and whether it is sensitive must be considered at the same time. Information can concern the judicial field yet be insensitive, and some information is sensitive but does not concern the judicial field. The method therefore converts the judicial sensitive information identification task into a four-class classification task that must identify both sensitivity and field.
Judicial public opinion text suffers from irregular phrasing and much redundant information, which makes it difficult to represent effectively. Sensitive information concerning the judicial field contains the phrases that make the text sensitive; these phrases belong to the specialized sensitive vocabulary of the judicial field and play the leading role in identifying judicial sensitive information, but they do not appear in general-domain sensitive term dictionaries, so direct word matching cannot effectively identify sensitive information in the judicial field. To obtain better representations and let the model learn expressions related to judicial sensitive information, a domain sensitive term dictionary is constructed and incorporated into a deep learning framework as external guidance, which enables effective feature enhancement.
Disclosure of Invention
To solve these problems, the invention constructs a domain term dictionary, uses it to guide the model to learn domain features, and, targeting the textual characteristics of judicial public opinion, provides a judicial public opinion sensitive information recognition model incorporating the domain term dictionary to classify judicial public opinion sensitive information.
The technical scheme of the invention is as follows: a method for identifying judicial public opinion sensitive information that incorporates a domain term dictionary, the method comprising:
constructing a judicial sensitive information recognition model incorporating a domain term dictionary to recognize sensitive information; the judicial sensitive information recognition model incorporating the domain term dictionary comprises a coding layer, a domain term dictionary integration layer, a local feature extraction layer and a classification layer;
encoding the public opinion text and the domain term dictionary, and computing attention weights over the features, through the coding layer;
calculating the similarity between the domain term dictionary and the public opinion text through the domain term dictionary integration layer and integrating it into the text representation;
extracting important features on top of the domain term dictionary integration layer through the local feature extraction layer;
and predicting the class probability of the extracted important features through the classification layer.
As a further scheme of the invention, before the judicial sensitive information recognition model incorporating the domain term dictionary is constructed, judicial public opinion data is crawled and preprocessed according to the classification of judicial public opinion sensitive information; the specific steps are as follows:
step1.1, crawling public opinion texts, and forming a number of public opinion texts after manual screening and labeling;
step1.2, constructing a domain term dictionary comprising a judicial domain vocabulary and a sensitive vocabulary; the judicial domain vocabulary is built from China Judgments Online and the China Court Network, and the sensitive vocabulary comprises two parts: (1) terms manually constructed according to the characteristics of judicial public opinion data, and (2) publicly available Chinese sensitive words after screening; the vocabulary consists of characters, words, and phrases;
step1.3, pre-training judicial-sensitive word vectors with the Sogou news dataset, the judicial public opinion sensitive information dataset, the domain term dictionary, and the word2vec algorithm, to serve as judicial-sensitive prior knowledge for the judicial sensitive information recognition model.
As a further scheme of the present invention, the specific steps of constructing the judicial sensitive information recognition model incorporating the domain term dictionary are as follows:
step2.2, because the previous vector representations do not consider contextual semantic features, the public opinion text vector representation is input into a context-aware encoding mechanism; a bidirectional long short-term memory network (Bi-LSTM) is used as an embedding mechanism for understanding context information and modeling the feature interaction between words; the outputs of the two directions are simply concatenated to obtain the output H of this network layer, in which each column vector represents the representation of the public opinion description in context;
where Bi-LSTM denotes the vector representation after the bidirectional recurrent neural network, and D_H and W_H are the encoded vector representations of the public opinion text and of the domain term dictionary, respectively;
step2.3, the weights for the context representation H are calculated with a multi-head attention mechanism:

multiHead(Q, K, V) = concat(head_1, …, head_h) W^O
where head_i = att(Q W_i^Q, K W_i^K, V W_i^V) (4)
Step2.4, in order to prevent the original text semantics from being lost, residual connections are applied to the output results:

A_h = residualConnect(D_M^d, D_H) (5)
K_h = residualConnect(W_M^d, W_H) (6)

where residualConnect denotes the residual connection, D_M^d and W_M^d respectively represent the outputs of the public opinion text and the domain dictionary after the multi-head attention mechanism, and A_h and K_h respectively represent the results of the public opinion text and the domain dictionary after the residual connection;
step2.5, a similarity matrix is calculated between the representation K_h of the domain term dictionary and the representation A_h of the public opinion text:

S_ik = sim(k_i, a_k)

where S_ik is the similarity between the i-th domain word vector k_i of the term dictionary representation K_h and the k-th hidden vector a_k of the text feature A_h, and sim is a trainable function that computes the similarity between k_i and a_k, calculated as:

sim(k, a) = W_s^T [k; a; k∘a]

where W_s is the weight vector to be trained, ∘ denotes element-wise multiplication, [;] denotes row-wise concatenation of vectors, k corresponds to a column vector of K_h, and a corresponds to a column vector of A_h;
step2.6, after S_ik is normalized with softmax, it is multiplied with the word embedding matrix to obtain a correlation matrix carrying weight information; finally, this matrix is concatenated with the original text representation to obtain the text representation fused with the dictionary information;
where softmax is the normalization function and [;] denotes the concatenation operation;
step2.7, a convolution operation is performed on the text representation fused with the dictionary information to extract features from the public opinion content information and the dictionary information, followed by a max-pooling operation, as follows:
where k represents an output channel of the CNN network;
step2.8, a multi-head attention operation is performed on the pooled output to obtain a feature matrix carrying weight information;
step2.9, to obtain the text classification probability distribution in the classification layer, the output O_k obtained in the local feature extraction layer is mapped to the classification space through softmax normalization:

P(D) = softmax(O_k) (13)
The beneficial effects of the invention are: the method fuses the domain term dictionary with judicial public opinion context information; the context information compensates for the poor representation that traditional methods give to context, and the domain knowledge enhances the semantic feature representation of words related to judicial information in the text, thereby improving the performance of judicial public opinion sensitive information identification;
experimental results show that the proposed method is superior to the baseline systems on indexes such as accuracy, recall rate, macro-averaged F1 value and micro-averaged F1 value.
Drawings
FIG. 1 is a schematic diagram of model construction in the present invention;
FIG. 2 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1-2, the judicial public opinion sensitive information recognition method incorporating the domain term dictionary first encodes the public opinion text and the domain term dictionary with a bidirectional recurrent neural network and a multi-head attention mechanism, respectively, and extracts salient features; second, the domain term dictionary is used as guiding knowledge for classification, and a similarity matrix is constructed with the public opinion text to obtain a text representation fused with the domain term dictionary; then global and local features are further extracted with the multi-head attention mechanism and a convolutional neural network, and finally sensitive information classification is realized;
the method comprises the following specific steps:
step1, crawling judicial public opinion data and carrying out data preprocessing according to the classification of judicial public opinion sensitive information;
step1.1, public opinion texts were crawled from websites such as Sina Weibo and GitHub over the period from 1 March 2020 to 1 June 2020, and 20,000 public opinion texts were formed after manual screening and labeling;
step1.2, constructing a domain term dictionary comprising a judicial domain vocabulary and a sensitive vocabulary; the judicial domain vocabulary is built from China Judgments Online and the China Court Network, and the sensitive vocabulary comprises two parts: (1) terms manually constructed according to the characteristics of judicial public opinion data, and (2) publicly available Chinese sensitive words after screening; the vocabulary consists of characters, words, and phrases;
step1.3, judicial-sensitive word vectors are pre-trained as model priors using the Sogou news dataset (about 500 MB), the judicial public opinion sensitive information dataset, the domain term dictionary, and the word2vec algorithm;
the specific steps of constructing the judicial sensitive information recognition model incorporating the domain term dictionary are as follows:
step2.2, because the previous vector representations do not consider contextual semantic features, the public opinion text vector representation is input into a context-aware encoding mechanism. A bidirectional long short-term memory network (Bi-LSTM) is used as an embedding mechanism for understanding context information and modeling the feature interaction between words; the outputs of the two directions are simply concatenated to obtain the output H of this network layer, in which each column vector represents the representation of the public opinion description in context;
where Bi-LSTM denotes the vector representation after the bidirectional recurrent neural network, and D_H and W_H are the encoded vector representations of the public opinion text and of the domain term dictionary, respectively.
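The Bi-LSTM encoding of this step can be sketched as below — a minimal NumPy illustration of a bidirectional LSTM forward pass whose two directional outputs are simply concatenated. The parameters are random placeholders, not the trained weights of the invention.

```python
import numpy as np

def lstm_forward(X, Wx, Wh, b):
    """One-directional LSTM pass over X of shape (T, d_in); returns (T, d_h).
    Gate pre-activations are stacked as [input, forget, output, candidate]."""
    T = X.shape[0]
    d_h = Wh.shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)
    out = np.zeros((T, d_h))
    for t in range(T):
        z = X[t] @ Wx + h @ Wh + b                        # (4*d_h,) pre-activations
        i, f, o = 1.0 / (1.0 + np.exp(-z[:3 * d_h].reshape(3, d_h)))  # sigmoid gates
        g = np.tanh(z[3 * d_h:])                          # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return out

def bilstm_encode(X, params_fwd, params_bwd):
    """Run both directions and concatenate them, giving H of shape (T, 2*d_h)."""
    Hf = lstm_forward(X, *params_fwd)
    Hb = lstm_forward(X[::-1], *params_bwd)[::-1]         # backward pass, re-aligned
    return np.concatenate([Hf, Hb], axis=1)

# Toy sizes: 5 tokens, 3-dim embeddings, 4 hidden units per direction.
rng = np.random.default_rng(0)
d_in, d_h, T = 3, 4, 5
X = rng.normal(size=(T, d_in))
make = lambda: (0.1 * rng.normal(size=(d_in, 4 * d_h)),
                0.1 * rng.normal(size=(d_h, 4 * d_h)),
                np.zeros(4 * d_h))
H = bilstm_encode(X, make(), make())
```

Each row of H pairs the forward and backward hidden states for one token, which is what the "simple splicing of the two directions" describes.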
Step2.3, the weights for the context representation H are calculated with a multi-head attention mechanism:

multiHead(Q, K, V) = concat(head_1, …, head_h) W^O
where head_i = att(Q W_i^Q, K W_i^K, V W_i^V) (4)
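A compact NumPy sketch of equation (4), assuming the att operation is standard scaled dot-product attention; the projection matrices here are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(Q, K, V, Wq, Wk, Wv, Wo, h):
    """multiHead(Q,K,V) = concat(head_1 .. head_h) W^O,
    head_i = att(Q W_i^Q, K W_i^K, V W_i^V) with scaled dot-product attention."""
    d_k = Wq.shape[1] // h
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)                # projection columns of head i
        q, k, v = Q @ Wq[:, s], K @ Wk[:, s], V @ Wv[:, s]
        att = softmax(q @ k.T / np.sqrt(d_k))            # (Tq, Tk), rows sum to 1
        heads.append(att @ v)
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 8))                              # context representation
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(8, 8)) for _ in range(4))
M = multi_head(H, H, H, Wq, Wk, Wv, Wo, h=2)             # self-attention over H
```

Using H for Q, K and V, as here, is the self-attention configuration the coding layer applies to both the text and the dictionary.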
Step2.4, in order to prevent the original text semantics from being lost, residual connections are applied to the output results:

A_h = residualConnect(D_M^d, D_H) (5)
K_h = residualConnect(W_M^d, W_H) (6)

where residualConnect denotes the residual connection, D_M^d and W_M^d respectively represent the outputs of the public opinion text and the domain dictionary after the multi-head attention mechanism, and A_h and K_h respectively represent the results of the public opinion text and the domain dictionary after the residual connection.
Step2.5, a similarity matrix is calculated between the representation K_h of the domain term dictionary and the representation A_h of the public opinion text:

S_ik = sim(k_i, a_k)

where S_ik is the similarity between the i-th domain word vector k_i of the term dictionary representation K_h and the k-th hidden vector a_k of the text feature A_h, and sim is a trainable function that computes the similarity between k_i and a_k, calculated as:

sim(k, a) = W_s^T [k; a; k∘a]

where W_s is the weight vector to be trained, ∘ denotes element-wise multiplication, [;] denotes row-wise concatenation of vectors, k corresponds to a column vector of K_h, and a corresponds to a column vector of A_h.
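The trainable similarity can be sketched as below; the feature [k; a; k∘a] follows the description in the text, and the all-ones weight vector stands in for the learned W_s.

```python
import numpy as np

def similarity_matrix(Kh, Ah, Ws):
    """S[i, k] = Ws^T [k_i ; a_k ; k_i ∘ a_k] for dictionary vectors Kh (n, d)
    and text hidden vectors Ah (T, d); Ws has length 3*d."""
    n, d = Kh.shape
    T = Ah.shape[0]
    S = np.zeros((n, T))
    for i in range(n):
        for k in range(T):
            # concatenation of the two vectors and their element-wise product
            feat = np.concatenate([Kh[i], Ah[k], Kh[i] * Ah[k]])
            S[i, k] = Ws @ feat
    return S

# Tiny hand-checkable case: one term [1, 0], one token [2, 3], Ws = ones(6):
# feat = [1, 0, 2, 3, 2, 0], so S[0, 0] = 8.
S = similarity_matrix(np.array([[1.0, 0.0]]), np.array([[2.0, 3.0]]), np.ones(6))
```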
Step2.6, after S_ik is normalized with softmax, it is multiplied with the word embedding matrix to obtain a correlation matrix carrying weight information; finally, this matrix is concatenated with the original text representation to obtain the text representation fused with the dictionary information;
where softmax is the normalization function and [;] denotes the concatenation operation.
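Step2.6 as a sketch: normalize S over the dictionary axis, weight the term embeddings, and concatenate the result with the text representation. All shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_dictionary(S, E_terms, H_text):
    """S: (n_terms, T) similarities; E_terms: (n_terms, d) term embeddings;
    H_text: (T, d_h) text representation. Returns (T, d_h + d)."""
    W = softmax(S, axis=0)            # each token's distribution over dictionary terms
    C = W.T @ E_terms                 # weighted term context per token
    return np.concatenate([H_text, C], axis=1)

S = np.array([[1.0, 1.0], [1.0, 1.0]])          # two terms, two tokens, uniform scores
E_terms = np.array([[1.0, 0.0], [0.0, 1.0]])
fused = fuse_dictionary(S, E_terms, np.zeros((2, 3)))
```

With uniform scores each token receives the average of the term embeddings (0.5, 0.5), appended after its original representation.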
Step2.7, a convolution operation is performed on the text representation fused with the dictionary information to extract features from the public opinion content information and the dictionary information, followed by a max-pooling operation, as follows:
where k represents the output channel of the CNN network.
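The convolution-plus-max-pooling step can be sketched in plain NumPy: narrow 1-D convolutions over the token axis with window sizes such as (2, 3, 4), then max-pooling each channel, as in a text-CNN. The filter values here are placeholders.

```python
import numpy as np

def conv_maxpool(X, filters):
    """X: (T, d) fused text representation; filters: list of (win, d) kernels.
    Each filter yields one pooled feature, i.e. one CNN output channel."""
    feats = []
    for F in filters:
        win = F.shape[0]
        # narrow convolution: one tanh activation per valid window position
        conv = np.array([np.tanh(np.sum(X[t:t + win] * F))
                         for t in range(X.shape[0] - win + 1)])
        feats.append(conv.max())      # max-pooling over token positions
    return np.array(feats)

X = np.ones((5, 3))
O = conv_maxpool(X, [np.ones((2, 3)), np.zeros((3, 3))])
```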
Step2.8, a multi-head attention operation is performed on the pooled output to obtain the feature matrix carrying weight information.
Step2.9, to obtain the text classification probability distribution in the classification layer, the output O_k obtained in the local feature extraction layer is mapped to the classification space through softmax normalization:

P(D) = softmax(O_k) (13)
The parameters are then trained with a gradient descent algorithm, thereby constructing the judicial sensitive information recognition model incorporating the domain term dictionary.
To better train the models and verify their effectiveness, a training set, a validation set, and a test set are constructed at a ratio of 8:1:1; the specific data information is shown in Table 1:
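The 8:1:1 partition can be sketched as a shuffled split (the seed is arbitrary):

```python
import random

def split_811(samples, seed=42):
    """Shuffle and split into training / validation / test sets at 8:1:1."""
    data = list(samples)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_dev = int(n * 0.8), int(n * 0.1)
    return data[:n_train], data[n_train:n_train + n_dev], data[n_train + n_dev:]

# 20,000 labeled public opinion texts -> 16,000 / 2,000 / 2,000
train, dev, test = split_811(range(20000))
```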
TABLE 1 data size and data set partitioning
Constructing a domain term dictionary is very important for identifying judicial sensitive information. The method uses domain knowledge to enhance the model's representation of domain terms, and combines the judicial domain term vocabulary and the sensitive term vocabulary into a single domain term dictionary. The judicial domain terms are constructed by manually screening content from China Judgments Online and the China Court Network; the sensitive terms consist of two parts: (1) terms manually constructed according to the characteristics of judicial public opinion data, and (2) publicly available Chinese sensitive words after screening. The terms comprise characters, words, and phrases; the specific vocabulary sizes and examples are shown in Table 2:
TABLE 2 dictionary size of domain terms
In the invention, training runs for 20 epochs, the learning rate of the model is 0.0001, the maximum truncation length of the public opinion text is set to 300 characters, the word embedding dimension is 512, Dropout is 0.5, the number of filters in the convolutional neural network is 256, the sliding window sizes are (2, 3, 4), and the optimization algorithm is Adam.
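The hyperparameters above, collected as one configuration fragment (the key names are illustrative, not from the original):

```python
# Hyperparameters as stated in this embodiment; key names are illustrative.
config = {
    "epochs": 20,
    "learning_rate": 1e-4,
    "max_seq_len": 300,          # maximum truncation length in characters
    "embedding_dim": 512,
    "dropout": 0.5,
    "num_filters": 256,          # CNN filters
    "window_sizes": (2, 3, 4),   # CNN sliding windows
    "optimizer": "Adam",
}
```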
For the invention, the effect of the judicial sensitive information classification model can be evaluated more comprehensively by calculating macro-averages and micro-averages. The micro-averaged F1 value (Micro-F1), macro-averaged precision (Macro_Precision), macro-averaged recall (Macro_Recall) and macro-averaged F1 value (Macro-F1) are mainly used as evaluation indexes, calculated as in formulas (13-16):
These indexes are based on the confusion matrix [18], where TP denotes true positives, FP false positives, TN true negatives and FN false negatives, averaged over the corresponding elements of the per-class confusion matrices.
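A self-contained sketch of the macro and micro averages from per-class TP/FP/FN counts (multi-class, single-label; in that setting micro-F1 equals accuracy):

```python
import numpy as np

def macro_micro(y_true, y_pred, n_classes):
    """Macro scores average per-class precision/recall/F1; micro pools TP/FP/FN."""
    tp = np.zeros(n_classes); fp = np.zeros(n_classes); fn = np.zeros(n_classes)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1    # predicted class p wrongly
            fn[t] += 1    # missed gold class t
    prec = tp / np.maximum(tp + fp, 1)
    rec = tp / np.maximum(tp + fn, 1)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    micro_p = tp.sum() / max(tp.sum() + fp.sum(), 1)
    micro_r = tp.sum() / max(tp.sum() + fn.sum(), 1)
    micro_f1 = 2 * micro_p * micro_r / max(micro_p + micro_r, 1e-12)
    return prec.mean(), rec.mean(), f1.mean(), micro_f1

scores = macro_micro([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```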
The comparison models adopted by the invention are as follows:
CNN (convolutional neural network) model: Kim et al. proposed applying CNN to text classification; it mainly comprises a convolutional layer and a pooling layer, followed by a fully connected layer for classification.
Bi-LSTM Attention (attention-based bidirectional long short-term memory network) model: classification with a bidirectional recurrent neural network and an attention layer, followed by a fully connected layer.
RCNN (recurrent convolutional neural network) model: Lai et al. proposed a neural network model combining RNN and CNN for classification, mainly comprising a recurrent layer and a convolutional layer, followed by a fully connected layer.
BERT (bidirectional Transformer encoder) model: text representation through a BERT pre-training model, followed by a fully connected network for classification.
Transformer model: two Transformer encoders perform the encoding, followed by a fully connected layer for classification.
FastText model: the word and n-gram vectors of the whole document are summed and averaged to obtain a document vector, which is then normalized for multi-class classification.
SVM (support vector machine): a linear classifier with the maximum margin on the feature space, usually used for text classification tasks; the text feature extraction and representation of this model is consistent with the literature.
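The FastText baseline's document vector — an average over word and n-gram vectors — can be sketched as below; the tiny embedding table is made up for illustration.

```python
import numpy as np

def fasttext_doc_vector(tokens, ngrams, emb):
    """Average the vectors of words and character n-grams into one document vector."""
    vecs = [u for u in list(tokens) + list(ngrams) if u in emb]
    return np.mean([emb[u] for u in vecs], axis=0)

# Toy table: one word and one bigram, 2-dim vectors.
emb = {"a": np.array([1.0, 1.0]), "ab": np.array([3.0, 1.0])}
doc = fasttext_doc_vector(["a"], ["ab"], emb)
```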
TABLE 3 comparison of MARC-SI with baseline model test results
As can be seen from Table 3, MARC-SI outperforms the baseline models, the pre-training model, and the machine learning model, which indicates that the proposed method of incorporating the domain term dictionary is effective for the judicial domain sensitive information recognition task. From the analysis of the experimental results, the RCNN and FastText models perform well, which shows that the model architecture chosen here and the idea of local feature extraction are reasonable, while the BERT pre-training model is not well suited to this task because of its fixed tokenization. The Transformer model, which works well on most tasks, does not perform well here, probably because too much information is blended into the public opinion text and its self-attention mechanism cannot extract features effectively. The results show that the proposed MARC-SI model has significant advantages in classifying judicial public opinion sensitive information.
To verify that each layer of the MARC-SI model contributes to the overall classification, ablation experiments were designed: removing the coding layer replaces the Bi-LSTM + Attention layer with a fully connected layer; removing the domain term dictionary integration layer omits the dictionary fusion; removing the local feature extraction layer replaces the CNN + Self-Attention layers with a fully connected layer. The experimental results are shown in Table 4.
TABLE 4 ablation experiment
Analyzing the results in Table 4, removing the coding layer lowers the F1 value by 7% relative to MARC-SI, indicating that encoding the public opinion text and the domain dictionary remains important; incorporating the domain term dictionary improves the integrated model by about 1%, indicating that the dictionary guides the model's learning on this task; removing the local feature extraction network lowers the F1 value by 2% relative to MARC-SI, showing that the network still needs to extract features after the term dictionary is incorporated. The ablation experiments show that the proposed network model is effective for the judicial sensitive information identification task.
Because the type of domain dictionary has a large influence on the model, terms from different domains were input into the MARC-SI model separately to compare their influence. The manually constructed judicial domain term vocabulary and the public sensitive information term vocabulary were each incorporated for the experiment.
TABLE 5 experiment of different vocabulary integration
Analyzing the results in Table 5, the manually constructed judicial domain term vocabulary improves F1 by 1% over the public sensitive vocabulary, indicating that the quality of the domain term vocabulary affects how well domain term representations are enhanced. Incorporating the whole domain term dictionary works better than incorporating a small number of domain terms, showing that the coverage of domain knowledge has a great influence on enhancing domain term representations. Comparing the experimental results in Tables 5 and 3 shows that the proposed method of incorporating the domain term dictionary improves on the baseline models without domain terms, reflecting that incorporating domain knowledge can enhance the representation of professional terms.
To verify whether MARC-SI over-attends to network opinions that contain forwarding, special notation, and layered information semantics, examples 1 and 2 shown in Table 6 are given; both are judicial sensitive information. The baseline models are the CNN and Bi-LSTM Attn (Bi-LSTM Attention) models.
TABLE 6 case analysis
As can be seen from the results in Table 6, the Bi-LSTM Attn model cannot effectively identify texts with too much redundant information, and the CNN model, after extracting local information, may focus on sentences where the density of sensitive words is too high, whereas MARC-SI focuses on sensitive terms specific to the judicial field, such as "double open" and "public drunk driving violation" in example 1. The results show that the MARC-SI model designed here has better representation capability for non-standard text with redundant information, and makes better use of judicial sensitive vocabulary for classification.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (3)
1. A judicial public opinion sensitive information identification method integrated with a domain term dictionary is characterized in that: the method comprises the following steps:
constructing a judicial sensitive information recognition model fused into a domain term dictionary to recognize sensitive information; the judicial sensitive information recognition model integrated into the domain term dictionary comprises a coding layer, a domain term dictionary integration layer, a local feature extraction layer and a classification layer;
public sentiment texts and domain term dictionaries are coded and feature attention is paid through a coding layer;
calculating similarity between the domain term dictionary and the public opinion text through a domain term dictionary integration layer and integrating the similarity into text representation;
extracting important features on the basis of a domain term dictionary integration layer through a local feature extraction layer;
and predicting the class probability of the extracted important features through a classification layer.
2. The method for recognizing judicial public opinion sensitive information fused into a domain term dictionary according to claim 1, wherein: before constructing the judicial sensitive information recognition model merged with the domain term dictionary, judicial public opinion data are crawled and preprocessed according to the classification of judicial public opinion sensitive information, specifically comprising the following steps:
step1.1, crawling public opinion texts, and forming a plurality of public opinion texts after manual screening and labeling;
step1.2, constructing a domain term dictionary, wherein the domain term dictionary comprises judicial domain vocabularies and sensitive vocabularies, the judicial domain vocabularies are built from China Judgments Online and the China Court Network, and the sensitive vocabularies comprise two parts: (1) a vocabulary manually constructed according to the characteristics of judicial public opinion data, and (2) a vocabulary screened from open Chinese sensitive word lists, wherein the vocabularies consist of characters, words and phrases;
step1.3, pre-training judicial sensitive word vectors with the word2vec algorithm on the Sogou news dataset, the judicial public opinion sensitive information dataset and the domain term dictionary, to serve as judicial sensitive prior knowledge for the judicial sensitive information recognition model.
3. The method for recognizing judicial public opinion sensitive information fused into a domain term dictionary according to claim 1, wherein: the specific steps of constructing the judicial sensitive information recognition model merged into the domain term dictionary are as follows:
step2.2, because the previous vector representation does not consider contextual semantic features, the public opinion text vector representation is input into a context-aware coding mechanism; the bidirectional long short-term memory neural network Bi-LSTM is used as an embedding mechanism for understanding context information and modeling the feature interaction between words, and the outputs of the two directions are simply spliced to obtain the output H of this network layer, where each column vector represents the contextual characterization of the public opinion description;
wherein Bi-LSTM denotes the vector representation after passing through the bidirectional recurrent neural network, and D_H and W_H are the vector representations of the public opinion text and the domain term dictionary, respectively, after encoding;
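As a rough illustration of this encoding step, the following NumPy sketch runs an LSTM over a sequence in each direction and splices the two per-position hidden states into the output H. All dimensions and the random parameters are hypothetical, chosen only to show the shapes involved; this is a sketch of a generic Bi-LSTM, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # One LSTM cell step; gate pre-activations stacked as [i, f, o, g].
    d = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g
    return o * np.tanh(c), c

def bilstm_encode(X, fwd, bwd):
    # X: (seq_len, in_dim). Run an LSTM over the sequence in each
    # direction and splice the two hidden states per position -> (seq_len, 2d).
    d = fwd[2].shape[0] // 4
    outs = []
    for params, seq in ((fwd, X), (bwd, X[::-1])):
        h, c = np.zeros(d), np.zeros(d)
        hs = []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            hs.append(h)
        outs.append(np.array(hs))
    # Reverse the backward outputs so both directions align per position.
    return np.concatenate([outs[0], outs[1][::-1]], axis=1)

def make_params(in_dim, d):
    return (rng.normal(0, 0.1, (4 * d, in_dim)),
            rng.normal(0, 0.1, (4 * d, d)),
            np.zeros(4 * d))

seq_len, in_dim, d = 6, 8, 5
X = rng.normal(size=(seq_len, in_dim))           # word vectors of one text
H = bilstm_encode(X, make_params(in_dim, d), make_params(in_dim, d))
print(H.shape)  # (6, 10): each position carries both directions' states
```

The same encoder would be applied to both the public opinion text (giving D_H) and the domain term dictionary (giving W_H).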
step2.3, a multi-head attention mechanism is used here to calculate weights over the context characterization H:
multiHead(Q, K, V) = concat(head_1, …, head_h)W^O
where head_i = att(QW_i^Q, KW_i^K, VW_i^V) (4)
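A minimal NumPy sketch of the multi-head attention computation in formula (4): each head projects Q, K and V with its own matrices, applies scaled dot-product attention, and the concatenated heads are projected by W^O. The projection matrices are random and all dimensions are hypothetical, for shape illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: att(Q, K, V).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def multi_head(Q, K, V, heads, d_model):
    d_k = d_model // heads
    W_O = rng.normal(0, 0.1, (heads * d_k, d_model))  # output projection W^O
    outs = []
    for _ in range(heads):
        # Per-head projections W_i^Q, W_i^K, W_i^V (random here).
        W_q, W_k, W_v = (rng.normal(0, 0.1, (d_model, d_k)) for _ in range(3))
        outs.append(attention(Q @ W_q, K @ W_k, V @ W_v))
    # concat(head_1, ..., head_h) W^O
    return np.concatenate(outs, axis=-1) @ W_O

n, d_model, h = 6, 16, 4
H = rng.normal(size=(n, d_model))    # context encoding from the Bi-LSTM layer
M = multi_head(H, H, H, h, d_model)  # self-attention over the sequence
print(M.shape)  # (6, 16)
```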
step2.4, in order to prevent the original text semantics from being lost, residual connection is applied to the output results:
A_h = residualConnect(D_M_d, D_H) (5)
K_h = residualConnect(W_M_d, W_H) (6)
where residualConnect denotes residual connection, D_M_d and W_M_d respectively denote the outputs of the public opinion text and the domain dictionary after the multi-head attention mechanism, and A_h and K_h respectively denote the results of the public opinion text and the domain dictionary after residual connection;
step2.5, a similarity matrix is calculated between the characterization K_h of the domain term dictionary and the representation A_h of the public opinion text:
where S_ik denotes the similarity between the ith domain word in the term dictionary representation K_h and the kth hidden vector of the text feature A_h, k_i denotes the token vector of the ith domain word in the dictionary, a_k denotes the kth column vector of A_h, and sim denotes a trainable function that calculates the similarity between k_i and a_k, computed as follows:
where w is the weight vector to be trained, ⊙ denotes element-wise multiplication, and (;) denotes row-wise splicing of vectors; k corresponds to a column vector of K_h, and a corresponds to a column vector of A_h;
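The trainable similarity above can be sketched as follows, assuming the common BiDAF-style form sim(k, a) = w · [k; a; k ⊙ a], which matches the ingredients the claim names (a trainable weight vector, element-wise multiplication, and row-wise splicing). The exact form and all dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim(k, a, w):
    # Trainable similarity: splice k, a and their element-wise
    # product, then project with the weight vector w (length 3d).
    return w @ np.concatenate([k, a, k * a])

d, n_terms, n_tokens = 8, 4, 6
K_h = rng.normal(size=(n_terms, d))   # dictionary term representations (as rows)
A_h = rng.normal(size=(n_tokens, d))  # text token representations (as rows)
w = rng.normal(size=3 * d)            # weight vector to be trained

# S_ik: similarity of the ith domain word to the kth hidden vector.
S = np.array([[sim(K_h[i], A_h[k], w) for k in range(n_tokens)]
              for i in range(n_terms)])
print(S.shape)  # (4, 6): one row per dictionary term, one column per token
```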
step2.6, after S_ik is normalized, it is multiplied by the term embedding matrix to obtain a correlation matrix with weight information; finally, the correlation matrix is spliced with the original text representation to obtain the text representation merged with the dictionary information:
where softmax is the normalization function and [;] denotes the splicing operation;
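A NumPy sketch of step2.6: the similarity matrix is softmax-normalized, multiplied by the term embeddings to give a weighted correlation matrix, and spliced onto the text representation. The normalization axis (over dictionary terms for each token) and all dimensions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_terms, n_tokens, d = 4, 6, 8
S = rng.normal(size=(n_terms, n_tokens))  # term/token similarity matrix
K_h = rng.normal(size=(n_terms, d))       # dictionary term embedding matrix
A_h = rng.normal(size=(n_tokens, d))      # text token representations

# Normalise each token's scores over the dictionary terms (assumed axis),
# then mix term embeddings into a per-token correlation matrix.
attn = softmax(S, axis=0)          # each column sums to 1
M = attn.T @ K_h                   # (n_tokens, d) weighted correlation matrix
T = np.concatenate([A_h, M], axis=1)  # splice with the original text: [A_h ; M]
print(T.shape)  # (6, 16)
```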
step2.7, a convolution operation is performed on the text representation merged with the dictionary information to extract features from the public opinion content and the dictionary information, followed by a max-pooling operation, as follows:
wherein k represents an output channel of the CNN network;
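A NumPy sketch of the convolution-plus-max-pooling step: each filter slides over windows of the dictionary-enriched representation and max-pooling over time yields one feature per output channel k. Filter count, width and the ReLU nonlinearity are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_maxpool(T, filters, width):
    # T: (seq_len, dim). Slide each filter over token windows,
    # apply ReLU, then max-pool over time: one feature per channel k.
    seq_len, dim = T.shape
    feats = []
    for W in filters:  # each filter W: (width, dim)
        conv = [np.maximum((T[t:t + width] * W).sum(), 0.0)  # ReLU
                for t in range(seq_len - width + 1)]
        feats.append(max(conv))
    return np.array(feats)

seq_len, dim, channels, width = 10, 8, 5, 3
T = rng.normal(size=(seq_len, dim))  # dictionary-enriched text representation
filters = [rng.normal(size=(width, dim)) for _ in range(channels)]
O = conv1d_maxpool(T, filters, width)
print(O.shape)  # (5,): one pooled feature per output channel
```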
step2.8, a multi-head attention operation is performed on the extracted features to obtain a feature matrix with weight information;
step2.9, in order to obtain the text classification probability distribution in the classification layer, O_k obtained from the local feature extraction layer is normalized with softmax and mapped to the classification space as follows:
P(D)=softmax(Ok) (13)。
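The classification mapping P(D) = softmax(O_k) of formula (13) can be sketched as follows; the feature values are hypothetical, and the argmax over P(D) gives the predicted sensitive-information class.

```python
import numpy as np

def softmax(z):
    # Normalise scores into a probability distribution (formula 13).
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

O_k = np.array([2.0, -1.0, 0.5])  # hypothetical features from the extraction layer
P = softmax(O_k)                  # P(D) = softmax(O_k)
print(int(P.argmax()))  # 0: index of the predicted class
```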
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010984681.5A CN112231472B (en) | 2020-09-18 | 2020-09-18 | Judicial public opinion sensitive information identification method integrated with domain term dictionary |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112231472A true CN112231472A (en) | 2021-01-15 |
CN112231472B CN112231472B (en) | 2022-07-29 |
Family
ID=74107203
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010984681.5A Active CN112231472B (en) | 2020-09-18 | 2020-09-18 | Judicial public opinion sensitive information identification method integrated with domain term dictionary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231472B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112836054A (en) * | 2021-03-08 | 2021-05-25 | 重庆大学 | Service classification method based on symbiotic attention representation learning |
CN113177831A (en) * | 2021-03-12 | 2021-07-27 | 西安理工大学 | Financial early warning system and method constructed by applying public data |
CN113609301A (en) * | 2021-07-05 | 2021-11-05 | 上海交通大学 | Dialogue method, medium and system based on knowledge graph |
CN113762237A (en) * | 2021-04-26 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Text image processing method, device and equipment and storage medium |
CN116108171A (en) * | 2022-12-19 | 2023-05-12 | 中国邮政速递物流股份有限公司广东省分公司 | Judicial material processing system based on AI circulating neural network deep learning technology |
CN117009533A (en) * | 2023-09-27 | 2023-11-07 | 戎行技术有限公司 | Dark language identification method based on classification extraction and word vector model |
CN117453863A (en) * | 2023-12-22 | 2024-01-26 | 珠海博维网络信息有限公司 | Public opinion text classifying method and system |
CN113177831B (en) * | 2021-03-12 | 2024-05-17 | 西安理工大学 | Financial early warning system constructed by application of public data and early warning method |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080208597A1 (en) * | 2007-02-27 | 2008-08-28 | Tetsuro Chino | Apparatus, method, and computer program product for processing input speech |
US20140329557A1 (en) * | 2011-12-12 | 2014-11-06 | Samsung Electronics Co., Ltd. | Method and apparatus for reporting dual mode capabilities in a long term evolution network |
CN105022725A (en) * | 2015-07-10 | 2015-11-04 | 河海大学 | Text emotional tendency analysis method applied to field of financial Web |
WO2016190861A1 (en) * | 2015-05-27 | 2016-12-01 | Hewlett Packard Enterprise Development Lp | Identifying algorithmically generated domains |
CN107038249A (en) * | 2017-04-28 | 2017-08-11 | 安徽博约信息科技股份有限公司 | Network public sentiment information sensibility classification method based on dictionary |
CN107133220A (en) * | 2017-06-07 | 2017-09-05 | 东南大学 | Name entity recognition method in a kind of Geography field |
CN108984667A (en) * | 2018-06-29 | 2018-12-11 | 郑州中博奥信息技术有限公司 | A kind of public sentiment monitoring system |
CN109543180A (en) * | 2018-11-08 | 2019-03-29 | 中山大学 | A kind of text emotion analysis method based on attention mechanism |
CN109582875A (en) * | 2018-12-17 | 2019-04-05 | 武汉泰乐奇信息科技有限公司 | A kind of personalized recommendation method and system of online medical education resource |
CN110083700A (en) * | 2019-03-19 | 2019-08-02 | 北京中兴通网络科技股份有限公司 | A kind of enterprise's public sentiment sensibility classification method and system based on convolutional neural networks |
CN110110054A (en) * | 2019-03-22 | 2019-08-09 | 北京中科汇联科技股份有限公司 | A method of obtaining question and answer pair in the slave non-structured text based on deep learning |
CN111209401A (en) * | 2020-01-03 | 2020-05-29 | 西安电子科技大学 | System and method for classifying and processing sentiment polarity of online public opinion text information |
CN111597304A (en) * | 2020-05-15 | 2020-08-28 | 上海财经大学 | Secondary matching method for accurately identifying Chinese enterprise name entity |
Non-Patent Citations (7)
Title |
---|
FAN G et al.: "Deep semantic feature learning with embedded static metrics for software defect prediction", 《2019 26TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE》 *
LEE J et al.: "Learning probabilistic kernel feature subspace with side-information for classification", 《2004 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS》 *
WANJIN CHE et al.: "Research on Chinese and Vietnamese Bilingual Event Graph Extraction Method", 《PROCEEDINGS OF INTERNATIONAL CONFERENCE ON MECHATRONICS AND INTELLIGENT ROBOTICS》 *
ZHOU QINQING et al.: "Application of hypergraph Laplacian sparse coding in image recognition", 《COMPUTER APPLICATIONS AND SOFTWARE》 *
YU SHENGWEI et al.: "Fine-grained opinion mining based on domain sentiment lexicon feature representation", 《JOURNAL OF CHINESE INFORMATION PROCESSING》 *
HAN LU et al.: "The influence of domain knowledge relations on domain text classification", 《PROCEEDINGS OF THE 27TH CHINESE CONTROL CONFERENCE》 *
HAN PENGYU et al.: "Case-element-guided text summarization method for case-related public opinion news", 《JOURNAL OF CHINESE INFORMATION PROCESSING》 *
Also Published As
Publication number | Publication date |
---|---|
CN112231472B (en) | 2022-07-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112231472B (en) | Judicial public opinion sensitive information identification method integrated with domain term dictionary | |
CN110019839B (en) | Medical knowledge graph construction method and system based on neural network and remote supervision | |
CN110990564B (en) | Negative news identification method based on emotion calculation and multi-head attention mechanism | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN109284506A (en) | A kind of user comment sentiment analysis system and method based on attention convolutional neural networks | |
CN110532557B (en) | Unsupervised text similarity calculation method | |
CN108563638B (en) | Microblog emotion analysis method based on topic identification and integrated learning | |
CN106557462A (en) | Name entity recognition method and system | |
CN111160031A (en) | Social media named entity identification method based on affix perception | |
CN110717324B (en) | Judgment document answer information extraction method, device, extractor, medium and equipment | |
CN111046670B (en) | Entity and relationship combined extraction method based on drug case legal documents | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN110717843A (en) | Reusable law strip recommendation framework | |
CN112183064B (en) | Text emotion reason recognition system based on multi-task joint learning | |
CN112883732A (en) | Method and device for identifying Chinese fine-grained named entities based on associative memory network | |
CN115357719B (en) | Power audit text classification method and device based on improved BERT model | |
CN110414009A (en) | The remote bilingual parallel sentence pairs abstracting method of English based on BiLSTM-CNN and device | |
CN112561718A (en) | Case microblog evaluation object emotion tendency analysis method based on BilSTM weight sharing | |
CN110287298A (en) | A kind of automatic question answering answer selection method based on question sentence theme | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112069312A (en) | Text classification method based on entity recognition and electronic device | |
CN112270187A (en) | Bert-LSTM-based rumor detection model | |
CN112052319B (en) | Intelligent customer service method and system based on multi-feature fusion | |
Sun et al. | Transformer based multi-grained attention network for aspect-based sentiment analysis | |
CN115759092A (en) | Network threat information named entity identification method based on ALBERT |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||