CN113064967B - Complaint reporting credibility analysis method based on deep migration network

Info

Publication number: CN113064967B
Application number: CN202110310932.6A
Authority: CN
Inventors: 范青武; 韩华政
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-03-23
Filing date: 2021-03-23
Publication date: 2024-03-22
Anticipated expiration: 2041-03-23
Also published as: CN113064967A

Abstract

The invention discloses a complaint reporting credibility analysis method based on a deep migration network, belonging to the technical field of artificial intelligence. The method comprises the following steps: first, the microblog text, the complaint report text, and the mixture of microblog text and complaint report text are each represented as a matrix by a Word2vec text vectorization model; next, the vectorized text is fed into three groups of bidirectional LSTM networks, which extract a source domain private feature vector, a feature vector shared by the source and target domains, and a target domain private feature vector; then, a self-attention mechanism fuses the shared features with the private features of the source domain and of the target domain respectively, yielding the final source domain features and target domain features; finally, the source domain features and target domain features are fed into a multi-layer perceptron, which outputs the final classification result. The method addresses the difficulty of manual analysis and the lack of effectively labeled data in complaint reporting credibility analysis, and offers an approach to credibility analysis of environmental complaint reports.

Description

Complaint reporting credibility analysis method based on deep migration network

Technical Field

The invention relates to an environment complaint reporting credibility analysis method, in particular to an environment complaint reporting credibility analysis method based on a deep migration network.

Background

An environmental complaint report is a complaint by a citizen about an environmental pollution phenomenon or event that affects the citizen's production and daily life or violates relevant national regulations. Complaint reports are typically described in text form. Among the many complaint reporting events are non-credible ones that tamper with, exaggerate or fabricate facts. Such complaint reports directly increase the difficulty faced by the management department in handling water pollution events and reduce administrative efficiency. To improve administrative efficiency and avoid wasting management resources, the administrative department urgently needs to analyze the credibility of the complaint reporting events submitted by netizens.

At present, work on credibility analysis of complaint reporting events is rare in the field of water environment complaint reporting, and work that performs such analysis based on the complaint report text is rarer still. Other fields, however, have seen similar efforts at credibility analysis based on text content. Since the emergence of deep learning, various deep-learning-based methods have been proposed and have achieved very good results in text-based credibility analysis tasks such as false news detection and rumor detection. Machine learning and deep learning methods mostly rely on large amounts of data with credibility labels. Complaint report text data in environmental complaint reporting credibility analysis often lacks credibility labels, and manual credibility analysis of complaint reports is very difficult.

To solve these problems, microblog text is used to assist complaint reporting credibility analysis. Microblog text and complaint report text both express the emotion and attitude of the author, and microblog rumors and false complaint reports often falsify and distort facts, so the two kinds of text have a certain semantic similarity. Combined with a semi-supervised transfer learning method, transfer learning theory is used to transfer knowledge in the microblog text to the credibility analysis of complaint report text through techniques such as feature transfer and domain adaptation, improving the performance indexes of complaint reporting credibility analysis.

In summary, credibility analysis of environmental complaint reports based on a deep migration network is an innovative research problem with important research significance and application value.

Disclosure of Invention

The invention aims to solve the problem that, in credibility analysis of environmental complaint reports, manual analysis is difficult and effective credibility labels are lacking, so that an effective credibility analysis model cannot be trained. A deep migration network is proposed to solve this problem. The method takes microblog text as the source domain and complaint report text as the target domain, designs effective feature extraction, feature transfer and domain adaptation methods, and uses the microblog text to assist complaint reporting credibility analysis.

The environment complaint reporting credibility analysis method based on the deep migration network comprises the following steps:

S1, data collection;

S2, preprocessing the microblog text data (source domain) and the complaint report text data (target domain);

S3, inputting the preprocessed text into a Word2vec model for word vector training to generate word vectors;

S4, encoding the microblog text word vectors and the complaint report text word vectors: a source domain feature encoder, a domain sharing feature encoder and a target domain feature encoder are designed to extract the source domain private features, the domain sharing features and the target domain private features respectively;

S5, domain feature fusion: the source domain private features and the domain sharing features are fused by a self-attention method to obtain the source domain features, and the target domain private features and the domain sharing features are fused by a self-attention method to obtain the target domain features;

S6, calculating the MK-MMD distance between the source domain features and the target domain features, and performing feature transformation on them to complete domain adaptation;

S7, passing the source domain features and the target domain features through a multi-layer perceptron network to obtain the classification results.

Drawings

FIG. 1 is a detailed schematic diagram of the complaint reporting credibility analysis method based on a deep migration network.

FIG. 2 is a schematic diagram of the bidirectional LSTM encoding process.

FIG. 3 is a flow chart of the complaint reporting credibility analysis method based on a deep migration network.

Detailed Description

The invention provides a method for analyzing the credibility of environmental complaint reports based on a deep migration network. An embodiment of the invention is described in detail below with reference to FIG. 1, and mainly comprises the following steps:

Step S1, microblog source text is obtained from social media, and complaint report text data is extracted from a large water environment big data management platform, to construct the data sets. The microblog data set $D^S = \{(x_i^s, y_i^s)\}_{i=1}^{N^S}$ represents the source domain (microblog text), where $N^S$ is the number of samples, $x_i^s$ is a microblog text sample, and $y_i^s$ is the microblog text credibility label. The complaint report text data set $D^T = \{(x_i^t, y_i^t)\}_{i=1}^{N^T_{train}+N^T_{test}}$ represents the target domain (complaint report text), where $N^T_{train}$ is the number of training samples, $N^T_{test}$ is the number of test samples, $x_i^t$ is a complaint report text sample, and $y_i^t$ is the complaint report text credibility label.

Step S2, preprocessing the microblog text data (source domain) and the complaint report text data (target domain): preprocessing includes data cleaning and word segmentation, and does not include stop-word removal. After word segmentation, each text $x_i^o$ is expressed as a sequence of words:

$$x_i^o = \{w_{i,1}^o, w_{i,2}^o, \ldots, w_{i,T_i}^o\}$$

where $o \in \{s, t\}$, s represents the source domain and t represents the target domain; $w_{i,j}^o$ are the words contained in sentence $x_i^o$; and $T_i$ is the sentence length.
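The preprocessing of step S2 can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the source only states "data cleaning and word segmentation", so the cleaning rules (removing URLs and @mentions) and the names `clean` and `preprocess` are assumptions, and the default whitespace segmenter stands in for a Chinese word segmenter.

```python
import re

# Hypothetical cleaning rules: the source does not specify them.
_URL = re.compile(r"https?://\S+")
_MENTION = re.compile(r"@\S+")
_SPACE = re.compile(r"\s+")


def clean(text: str) -> str:
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    text = _URL.sub(" ", text)
    text = _MENTION.sub(" ", text)
    return _SPACE.sub(" ", text).strip()


def preprocess(text: str, segment=str.split) -> list:
    """Clean a text and segment it into words.

    Stop words are deliberately kept, as stated in step S2.
    `segment` is any callable mapping a string to a list of words.
    """
    return segment(clean(text))


tokens = preprocess("@user the river  is polluted http://example.com near the factory")
```

For Chinese text, a dedicated word segmenter would be passed as `segment` in place of `str.split`; the stop word "the" surviving in `tokens` reflects the choice not to remove stop words.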

Step S3, text vectorization:

The preprocessed, word-segmented text is input into a Word2vec model for word vector training, and the text is then vectorized: each text sequence $x_i^o$ is input in turn into the Word2vec model to obtain the matrix representation of $x_i^o$: $x_i^o \in \mathbb{R}^{n \times d}$, where n is the number of texts and d is the dimension of the word vector; the generated word vectors have 300 dimensions.
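How a segmented text becomes a matrix can be illustrated with a plain lookup table. The sketch below assumes the word vectors have already been trained (the Word2vec training itself is not shown), uses a toy 3-dimensional table instead of the 300 dimensions named above, and maps unknown words to a zero vector; the function name `vectorize` and all three choices are assumptions made for illustration.

```python
def vectorize(tokens, embeddings, dim):
    """Stack the word vectors of a token sequence into a len(tokens) x dim matrix.

    Words missing from the table become zero vectors (an assumption; the
    source does not say how out-of-vocabulary words are handled).
    """
    zero = [0.0] * dim
    return [list(embeddings.get(tok, zero)) for tok in tokens]


# Toy embedding table standing in for trained Word2vec vectors.
toy_embeddings = {"river": [0.1, 0.2, 0.3], "polluted": [0.4, 0.5, 0.6]}
matrix = vectorize(["river", "is", "polluted"], toy_embeddings, dim=3)
```

Each row of `matrix` is the vector of one word, so a sentence of length $T_i$ yields a $T_i \times d$ input for the encoders of step S4.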

Step S4, encoding the vectorized text: encoding refers to feeding the vectorized text into a neural network for feature extraction. Three encoders are designed: the source domain private feature encoder ($E_p^s$) extracts the private features of the source domain (microblog text), the target domain private feature encoder ($E_p^t$) extracts the private features of the target domain (complaint report text), and the domain sharing feature encoder ($E_c$) extracts the features shared by the complaint report text and the microblog text. The three encoders have identical network structures and are all based on a bidirectional LSTM network. As shown in FIG. 2, the specific encoding process is as follows:

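Step S401, for the vectorized text $x_i^o$, the outputs of the forward and backward LSTM models, $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, are concatenated as the output of the Bi-LSTM at time t:

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t+1}}), \quad h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \quad (1)$$

where $x_t$ is the input at time step t of the $T_i$ time steps; $c_t$ is the cell state of the LSTM at time t; and $h_t$, the output at time step t, is calculated by formula (2):

$$\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \quad (2)$$

where $W_f, W_i, W_o, W_c$ are weight matrices and $b_f, b_i, b_o, b_c$ are bias vectors; $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication; $f_t$ is the forget gate, $i_t$ is the input gate and $o_t$ is the output gate. In the whole process, the forget gate $f_t$ first selectively filters out some information from the previous state. The input gate $i_t$ then decides which data is updated; the LSTM cell state $c_t$ is updated by forgetting history information and adding the new information $\tilde{c}_t$, so that the old state is overwritten by the new state value. Finally, the output gate $o_t$ determines the output information: the output $h_t$ at the current time step is obtained by filtering the cell state information through $o_t$.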

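Step S402, the outputs of the last time step, $\overrightarrow{h_{T_i}}$ and $\overleftarrow{h_{T_i}}$, are taken as the encoding output of the i-th sentence:

$$e_i = [\overrightarrow{h_{T_i}}; \overleftarrow{h_{T_i}}] \quad (3)$$

where $\overrightarrow{h_{T_i}}$ is the forward hidden layer output for the text sequence $x_i^o$, $\overleftarrow{h_{T_i}}$ is the backward hidden layer output for the text sequence $x_i^o$, and $e_i$ is the LSTM network's encoding of the text $x_i^o$, i.e. the output of the encoder.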
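The encoding of steps S401 and S402 can be sketched in plain Python as follows. The sketch assumes the standard LSTM gate equations with one weight matrix per gate acting on the concatenation $[h_{t-1}, x_t]$, zero initial states, and small random untrained weights; the helper names (`lstm_step`, `run_lstm`, `bilstm_encode`, `init_params`) and the toy sizes are illustrative assumptions, not part of the disclosure.

```python
import math
import random


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def affine(W, b, z):
    """Compute W z + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, z)) + bi for row, bi in zip(W, b)]


def lstm_step(p, x_t, h_prev, c_prev):
    """One LSTM time step; the gates act on the concatenation [h_prev, x_t]."""
    z = h_prev + x_t
    f = sigmoid(affine(p["W_f"], p["b_f"], z))                      # forget gate
    i = sigmoid(affine(p["W_i"], p["b_i"], z))                      # input gate
    o = sigmoid(affine(p["W_o"], p["b_o"], z))                      # output gate
    c_new = [math.tanh(v) for v in affine(p["W_c"], p["b_c"], z)]   # candidate state
    c = [ft * cp + it * cn for ft, cp, it, cn in zip(f, c_prev, i, c_new)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(p, xs, hidden):
    """Run an LSTM over a sequence and return the hidden state of the last step."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in xs:
        h, c = lstm_step(p, x_t, h, c)
    return h


def init_params(d, hidden, rng):
    """Small random weights of shape hidden x (hidden + d), zero biases."""
    p = {}
    for g in "fioc":
        p["W_" + g] = [[rng.uniform(-0.1, 0.1) for _ in range(hidden + d)]
                       for _ in range(hidden)]
        p["b_" + g] = [0.0] * hidden
    return p


def bilstm_encode(fwd, bwd, xs, hidden):
    """Concatenate the final states of a forward pass and a backward pass."""
    return run_lstm(fwd, xs, hidden) + run_lstm(bwd, xs[::-1], hidden)


rng = random.Random(0)
d, hidden = 3, 4
sentence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]   # T_i = 2 word vectors
e_i = bilstm_encode(init_params(d, hidden, rng), init_params(d, hidden, rng),
                    sentence, hidden)
```

The encoded sentence `e_i` has `2 * hidden` components, one half from each direction. Under this reading, the three encoders of step S4 would be three independent parameter sets of this same structure.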

Step S403, the three groups of encoders extract, respectively, the domain sharing features $e_c \in \mathbb{R}^{n_1 \times m} = [e_1, e_2, \ldots, e_{n_1}]$, the source domain private features $e_p^s \in \mathbb{R}^{n_2 \times m}$ and the target domain private features $e_p^t \in \mathbb{R}^{n_3 \times m}$, where m is the dimension of the Bi-LSTM output vector, and $n_1$, $n_2 = N^S$ and $n_3$ are the numbers of texts.

Step S5, domain feature fusion: the domain sharing feature encoder extracts the features shared by the source domain and the target domain. The domain private feature encoders extract domain private features, compensating for the shared feature extractor's inability to capture domain-specific information. To obtain the information shared by the source and target domains while retaining more complete domain-specific information, the domain-specific information $e_p^o$ must be fused with the shared domain information $e_c$.

Step S501, the value matrix $W_V$, the key matrix $W_K$ and the query matrix $W_Q$ are each multiplied (dot product) with the input vector, and the result is scored:

where $b \in \{c, p\}$, c represents domain sharing and p represents domain private; the product is a scaled dot product; and d is a constant set to prevent the values after the dot product from becoming too large, usually the dimension of the input word vector;

Step S502, a Softmax normalization operation is performed on the scores to obtain the attention weights;

Step S503, the self-attention weights are multiplied (dot product) with the value vectors to obtain the final source domain feature (or target domain feature) $e^o$:

where $o \in \{s, t\}$, s represents the source domain, t represents the target domain, and $e^o$ is the fused feature.
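One way to read steps S501 to S503 is as scaled dot-product self-attention over the two-element sequence formed by a private feature vector and a shared feature vector. The sketch below follows that reading. The arrangement of the inputs, the averaging of the two attended vectors into one fused feature, and the function name `fuse` are all assumptions, since the original formulas are not recoverable from the text.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(e_private, e_shared, W_Q, W_K, W_V):
    """Self-attention over [e_private, e_shared], returning one fused vector."""
    inputs = [e_private, e_shared]
    Q = [matvec(W_Q, e) for e in inputs]
    K = [matvec(W_K, e) for e in inputs]
    V = [matvec(W_V, e) for e in inputs]
    d = len(K[0])
    attended = []
    for q in Q:
        # S501: scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        # S502: softmax turns the scores into attention weights.
        weights = softmax(scores)
        # S503: weighted sum of the value vectors.
        attended.append([sum(w * v[j] for w, v in zip(weights, V))
                         for j in range(len(V[0]))])
    # Assumption: the two attended vectors are averaged into the fused feature.
    return [(a + b) / 2.0 for a, b in zip(*attended)]


identity = [[1.0, 0.0], [0.0, 1.0]]
e_s = fuse([1.0, 0.0], [0.0, 1.0], identity, identity, identity)
```

With identity projections and these symmetric inputs, each of the two vectors attends more strongly to itself, and the averaged result weights both equally.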

Step S6, domain adaptation: the distributions of the source domain feature $e^s$ and the target domain feature $e^t$ after domain feature fusion are different, so domain adaptation is performed on $e^s$ and $e^t$. Domain adaptation aims to make the data distributions of the two domains converge. It is carried out by feature alignment, i.e. feature transformation brings the data distributions of the source domain and the target domain together. The distance between the source domain and target domain data is calculated by the MK-MMD method and added to the loss function, and the network weights are updated together with the label loss to realize domain adaptation. The MK-MMD distance between the source domain and the target domain is:

$$MMD^2(e^s, e^t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(e_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(e_j^t) \right\|_H^2 \quad (7)$$

where $\phi(\cdot)$ is a mapping that maps the original variables into the reproducing kernel Hilbert space (RKHS) $H$, $n_s$ and $n_t$ are the numbers of source domain and target domain samples, and $MMD^2(e^s, e^t)$ is the distance between the source domain features and the target domain features.
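In practice the mapping $\phi(\cdot)$ is never computed explicitly; the squared distance is expanded into kernel evaluations. The sketch below computes a biased estimate of $MMD^2$ with an equal-weight average of Gaussian kernels. The bandwidths, the equal kernel weights and the function names are assumptions for illustration; the source does not state which kernels or weights are used.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Equal-weight average of Gaussian kernels (the 'multi-kernel' part)."""
    return sum(gaussian_kernel(x, y, b) for b in bandwidths) / len(bandwidths)


def mean_kernel(A, B, bandwidths):
    """Average kernel value over all pairs (a, b) with a in A and b in B."""
    return sum(multi_kernel(a, b, bandwidths) for a in A for b in B) / (len(A) * len(B))


def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased MMD^2 estimate: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""
    return (mean_kernel(source, source, bandwidths)
            + mean_kernel(target, target, bandwidths)
            - 2.0 * mean_kernel(source, target, bandwidths))


src = [[0.0, 0.0], [0.1, 0.0]]
near = [[0.0, 0.1], [0.1, 0.1]]
far = [[3.0, 3.0], [3.1, 3.0]]
d_near = mk_mmd2(src, near)
d_far = mk_mmd2(src, far)
```

The estimate is zero for identical sample sets and grows as the two sets move apart, which is the property that makes it usable as the domain adaptation loss.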

Step S7, credibility classification: the source domain features and the target domain features are fed into the MLP network to output the classification results, and the network parameters are updated according to the classification loss and the domain adaptation loss.

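Step S701, the source domain feature $e^s$ and the target domain feature $e^t$ after domain feature fusion are fed into the MLP:

$$\hat{y}^o = \mathrm{sigmoid}(\mathrm{MLP}(e^o)), \quad o \in \{s, t\} \quad (8)$$

where $\hat{y}^o$ is the prediction vector, i.e. the prediction result; MLP represents the multi-layer perceptron; $\hat{y}^s$ and $\hat{y}^t$ represent the predicted probabilities; and sigmoid is the activation function.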
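The classification head of step S701 can be sketched as a small perceptron followed by a sigmoid. The single hidden layer, the ReLU activation inside it, the layer sizes and the name `mlp_predict` are assumptions; the source only states that an MLP with a sigmoid output produces the predicted probability.

```python
import math


def dense(W, b, x):
    """Fully connected layer: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp_predict(e, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a single sigmoid output unit."""
    hidden = [max(0.0, v) for v in dense(W1, b1, e)]
    logit = dense(W2, b2, hidden)[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy weights chosen by hand so the result is easy to follow.
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0]], [0.0]
y_hat = mlp_predict([2.0, 1.0], W1, b1, W2, b2)
```

Here the hidden layer outputs `[1.0, 1.5]`, the logit is `-0.5`, and `y_hat` is the sigmoid of that logit. The same head would be applied to both $e^s$ and $e^t$.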

Step S702, a loss function is calculated from the classification results to update the network parameters. The deep migration network learns the data difference between the source domain and the target domain to realize domain adaptation, and also learns the label loss. The final objective function (the loss function of the whole network) consists of the MK-MMD statistic, which represents the domain difference, and the label loss, so the loss function of the whole migration network is given by (9):

$$L = L_{cls} + \lambda L_{da} \quad (9)$$

where $\lambda$ is the adjustment parameter; $L_{da}$ is the domain adaptation loss, i.e. $MMD^2(e^s, e^t)$; and $L_{cls}$ is the label loss, comprising the source domain label loss $L_{cls}^s$ and the target domain label loss $L_{cls}^t$. The cross-entropy criterion is used as the loss function in this classification task:

$$L_{cls}^o(\theta) = -\sum_{i}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] \quad (10)$$

where $y \in \{0,1\}$ is the credibility label and $\theta$ denotes the parameters to be optimized.
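The combined objective can be sketched as follows. The sketch assumes binary cross-entropy averaged over each batch, a label loss equal to the sum of the source and target label losses, and a precomputed domain distance passed in as `mmd2`; the function names and the default value of `lam` are illustrative assumptions.

```python
import math


def bce(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def total_loss(y_s, p_s, y_t, p_t, mmd2, lam=1.0):
    """L = L_cls + lambda * L_da, with L_cls = source loss + target loss."""
    l_cls = bce(y_s, p_s) + bce(y_t, p_t)
    return l_cls + lam * mmd2


loss = total_loss([1, 0], [0.5, 0.5], [1], [0.5], mmd2=0.2, lam=0.5)
```

In this toy call every prediction is 0.5, so each label loss equals $\ln 2$ and the total is $2\ln 2 + 0.5 \times 0.2$. A larger `lam` puts more weight on aligning the two domains relative to fitting the labels.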

The index used to measure the accuracy of the model's credibility analysis is the standardized AUC. In the task of classifying the credibility of water environment complaint reports, more attention should be paid to avoiding cases where a pollution event is not handled in time because a credible complaint report is misjudged; that is, the true positive rate (TPR) should be raised while the false positive rate (FPR) is kept low (low-credibility text is the positive sample, and high-credibility text is the negative sample). The task should therefore focus on the partial area under the ROC curve for $FPR \le maxfpr$ ($AUC_{FPR \le maxfpr}$). When maxfpr is particularly small, the range of variation of this partial AUC is small and it does not compare model performance well, so the standardized AUC ($sPAUC_{FPR \le maxfpr}$) is used:

Wherein s is _max In the fpr experiment, fpr was taken as 0.05,so SPACC _FPR≤fpr Varying between 0.5 and 1. Experimental results show that the LSTM-based coding network can well analyze the reliability of the environmental complaints.

Microblog source texts were extracted from social media (133346 texts in total, of which 66131 have high credibility and 67215 have low credibility), and complaint report text data was extracted from a large water environment big data management platform (200K complaint reports in total). Of the complaint reports, 1482 carry credibility labels: 889 with high credibility and 593 with low credibility.

As shown in Table 1, the experiments used CNN, Transformer, GRU-2, RNN, LSTM_Attention and LSTM as feature extractors. "Attention" means that the private features of both the source domain and the target domain are fused with the shared features; "Source_Attention" means that only the source domain private features and the domain sharing features are fused; "Target_Attention" means that only the target domain private features and the domain sharing features are fused; "No_Attention" means that no feature fusion is performed and only the domain sharing features are used. The deep migration network based on the bidirectional LSTM performs best in this task, which demonstrates both the superiority of the deep migration network architecture and the feasibility of using microblog text to assist complaint reporting credibility analysis. Ablation experiments were performed according to whether feature fusion with the attention mechanism was used. As shown in Table 1, when the deep migration network is used, every feature extractor performs better after feature fusion with the attention mechanism than when only the domain sharing features are used, and fusing the source domain private features with the shared features works better than fusing the target domain private features with the shared features.

Table 1. Results of the complaint reporting credibility classification experiments

In conclusion, the method makes good use of knowledge from the microblog text domain to assist complaint reporting credibility analysis, and completes the complaint reporting credibility analysis task well.

Claims

1. An environmental complaint reporting credibility analysis method based on a deep migration network, comprising the following specific steps:

S1, data collection;

S2, preprocessing a source domain and a target domain;

S4, encoding the microblog text and the complaint report text after text vectorization, and extracting high-level features;

S5, fusing the domain private features and the domain sharing features by using a self-attention method;

S6, calculating the MK-MMD distance between the source domain features and the target domain features, performing feature transformation on the source domain features and the target domain features, and performing domain adaptation;

S7, passing the source domain features and the target domain features through the multi-layer perceptron network to obtain a classification result;

the source domain is microblog text data, and the target domain is complaint report text data.

2. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

in step S1, microblog source text extracted from social media is obtained, complaint report text data is extracted from a large water environment big data management platform, and the data sets are constructed: the microblog data set $D^S = \{(x_i^s, y_i^s)\}_{i=1}^{N^S}$ represents the source domain, where $N^S$ is the number of samples, $x_i^s$ is a microblog text sample and $y_i^s$ is the microblog text credibility label; the complaint report text data set $D^T = \{(x_i^t, y_i^t)\}_{i=1}^{N^T_{train}+N^T_{test}}$ represents the target domain, where $N^T_{train}$ is the number of training samples, $N^T_{test}$ is the number of test samples, $x_i^t$ is a complaint report text sample and $y_i^t$ is the complaint report text credibility label.

3. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

in step S2, the preprocessing includes data cleaning and word segmentation, without stop-word removal, and the word-segmented text $x_i^o$ is represented as a set of words: $x_i^o = \{w_{i,1}^o, w_{i,2}^o, \ldots, w_{i,T_i}^o\}$, where $o \in \{s, t\}$, s represents the source domain and t represents the target domain; $w_{i,j}^o$ are the words contained in sentence $x_i^o$; and $T_i$ is the sentence length.

4. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

in step S3, a Word2vec model is used to implement text vectorization, and the word-segmented text $x_i^o$ is represented as a matrix: $x_i^o \in \mathbb{R}^{n \times d}$, where n is the number of texts and d is the word vector dimension.

5. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 3, wherein: text vectorization is implemented using a Word2vec model, and the dimension d of the generated word vectors is 300.

6. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

the encoder adopted in the step S4 is a Bi-directional long-short-term memory network Bi-LSTM, and three groups of encoders with identical network structures are used for extracting private features and shared features of a source domain and a target domain;

the specific coding mode is as follows:

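step S401 uses the bidirectional LSTM as the core module of the encoder: for the text sequence $x_i^o$, the outputs of the forward and backward LSTM models, $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, are concatenated as the output of the Bi-LSTM at time t:

$$\overrightarrow{h_t} = \overrightarrow{\mathrm{LSTM}}(x_t, \overrightarrow{h_{t-1}}), \quad \overleftarrow{h_t} = \overleftarrow{\mathrm{LSTM}}(x_t, \overleftarrow{h_{t+1}}), \quad h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \quad (1)$$

where $o \in \{s, t\}$, s represents the source domain and t represents the target domain; $x_t$ is the input at time step t of the $T_i$ time steps; $c_t$ is the cell state of the LSTM at time t; and $h_t$, the output at time step t, is calculated by formula (2):

$$\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \quad (2)$$

where $W_f, W_i, W_o, W_c$ are weight matrices and $b_f, b_i, b_o, b_c$ are bias vectors; $\sigma$ is the sigmoid function and $\odot$ denotes element-wise multiplication; $f_t$ is the forget gate, $i_t$ is the input gate and $o_t$ is the output gate; in the whole process, the forget gate $f_t$ first selectively filters out some information from the previous state; the input gate $i_t$ then decides which data is updated; the LSTM cell state $c_t$ is updated by forgetting history information and adding the new information $\tilde{c}_t$, so that the old state is overwritten by the new state value; finally, the output gate $o_t$ determines the output information, and the output $h_t$ at the current time step is obtained by filtering the cell state information through $o_t$;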

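step S402 takes the outputs of the last time step, $\overrightarrow{h_{T_i}}$ and $\overleftarrow{h_{T_i}}$, as the encoding output of $x_i^o$:

$$e_i = [\overrightarrow{h_{T_i}}; \overleftarrow{h_{T_i}}] \quad (3)$$

where $\overrightarrow{h_{T_i}}$ is the forward hidden layer output for the text sequence $x_i^o$, $\overleftarrow{h_{T_i}}$ is the backward hidden layer output for the text sequence $x_i^o$, and $e_i$ is the LSTM network's encoding of the text $x_i^o$, i.e. the output of the encoder;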

step S403, the three groups of encoders extract, respectively, the domain sharing features $e_c \in \mathbb{R}^{n_1 \times m} = [e_1, e_2, \ldots, e_{n_1}]$, the source domain private features $e_p^s \in \mathbb{R}^{n_2 \times m}$ and the target domain private features $e_p^t \in \mathbb{R}^{n_3 \times m}$, where m is the dimension of the Bi-LSTM output vector, and $n_1$, $n_2 = N^S$ and $n_3$ are the numbers of texts.

7. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

in step S5, the domain feature fusion is to fuse the source domain private feature, the target domain private feature and the domain sharing feature through a self-attention mechanism, and the specific feature fusion method is as follows:

where $b \in \{c, p\}$, c represents domain sharing and p represents domain private; the product is a scaled dot product; and d is a constant set to prevent the values after the dot product from becoming too large, usually the dimension of the input word vector;

step S502 performs a Softmax normalization operation on the scores to obtain the attention weights:

step S503 multiplies (dot product) the weights with the value vectors to obtain the final source domain feature and target domain feature:

where $o \in \{s, t\}$, s represents the source domain, t represents the target domain, and $e^o$ is the fused feature.

8. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

the domain adaptation described in step S6 refers to computing the distance between the source domain feature $e^s$ and the target domain feature $e^t$ using the multi-kernel maximum mean discrepancy (MK-MMD), adding the distance to the loss function, and performing feature transformation during the iterative process to complete domain adaptation:

$$MMD^2(e^s, e^t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(e_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(e_j^t) \right\|_H^2 \quad (7)$$

where $\phi(\cdot)$ is a mapping that maps the original variables into the reproducing kernel Hilbert space (RKHS) $H$, and $n_s$ and $n_t$ are the numbers of source domain and target domain samples.

9. The method for analyzing the reliability of environmental complaint reporting based on the deep migration network according to claim 1, wherein the method is characterized by comprising the following steps:

in step S7, the source domain feature $e^s$ and the target domain feature $e^t$ after domain feature fusion are respectively fed into the MLP network to output the classification results:

$$\hat{y}^o = \mathrm{sigmoid}(\mathrm{MLP}(e^o)), \quad o \in \{s, t\} \quad (8)$$

where $\hat{y}^o$ is the prediction vector, i.e. the prediction result; MLP represents the multi-layer perceptron; $\hat{y}^s$ and $\hat{y}^t$ represent the predicted probabilities; and sigmoid is the activation function;

meanwhile, a loss function is calculated from the classification results to update the network parameters; the deep migration network learns the data difference between the source domain and the target domain to realize domain adaptation, and also learns the label loss; the final objective function, namely the loss function of the whole network, consists of the MK-MMD statistic, which represents the domain difference, and the label loss, so the loss function of the whole migration network is:

$$L = L_{cls} + \lambda L_{da} \quad (9)$$