CN113064967A

CN113064967A - Complaint reporting credibility analysis method based on deep migration network

Info

Publication number: CN113064967A
Application number: CN202110310932.6A
Authority: CN
Inventors: 范青武; 韩华政
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-03-23
Filing date: 2021-03-23
Publication date: 2021-07-02
Anticipated expiration: 2041-03-23
Also published as: CN113064967B

Abstract

The invention discloses a complaint report credibility analysis method based on a deep migration (transfer learning) network, and belongs to the technical field of artificial intelligence. The method comprises the following steps: first, microblog texts, complaint report texts, and a mixture of microblog and complaint report texts are each represented as matrices by a Word2vec text vectorization model; second, the vectorized texts are input into three groups of bidirectional LSTM networks for feature extraction, which respectively extract a source domain private feature vector, a feature vector shared by the source and target domains, and a target domain private feature vector; then, a self-attention mechanism fuses the shared features with the private features of the source domain and of the target domain, respectively, to obtain the final source domain features and target domain features; finally, the source domain features and target domain features are input into a multilayer perceptron, which outputs the final classification result. The method addresses the difficulty of manual analysis and the lack of labeled data in complaint report credibility analysis, and provides an approach for credibility analysis of environmental complaint reports.

Description

Complaint reporting credibility analysis method based on deep migration network

Technical Field

The invention relates to a method for analyzing the credibility of environmental complaint reports, in particular to such a method based on a deep migration network.

Background

An environmental complaint report is a complaint by citizens about environmental pollution phenomena or events that affect their production and daily life or violate relevant national regulations. Complainants usually describe their complaints in text form. Among the many complaint reports are untruthful ones that distort, exaggerate, or fabricate facts. Such complaints directly increase the difficulty of handling water pollution events for the authorities and reduce administrative efficiency. To improve administrative efficiency and avoid wasting management resources, the administrative departments urgently need to analyze the credibility of complaint reports submitted by netizens.

At present, in the field of water environment complaint reporting, work on credibility analysis of complaint report events is rare, and work on credibility analysis based on the complaint report texts themselves is rarer still. Similar work on text-based credibility analysis does exist in other fields. Since the emergence of deep learning, various deep learning methods have been proposed that perform very well in text-based credibility analysis tasks such as fake news detection and rumor detection. However, most machine learning and deep learning methods require a large amount of data with credibility labels. Complaint report texts in environmental complaint credibility analysis often lack such labels, and analyzing the credibility of complaint reports manually is very difficult.

To solve these problems, microblog texts are used to assist the credibility analysis of complaint reports. Both microblog texts and complaint report texts express the emotion and attitude of their authors, and both microblog rumors and false complaint reports typically falsify or distort facts, so the two kinds of text have a certain semantic similarity. Using transfer learning theory in a semi-supervised setting, knowledge in the microblog texts is transferred to the complaint report credibility analysis process through techniques such as feature transfer and domain adaptation, which improves the performance of the complaint report credibility analysis.

In conclusion, environmental complaint report credibility analysis based on a deep migration network is a novel research problem with important research significance and application value.

Disclosure of Invention

The invention aims to solve the following problems in credibility analysis of environmental complaint reports: manual analysis is difficult, effective credibility labels are lacking, and an effective credibility analysis model therefore cannot be trained. A deep migration network is proposed to solve these problems. The method takes microblog texts as the source domain and complaint report texts as the target domain, designs effective feature extraction, feature transfer and domain adaptation methods, and uses the microblog texts to assist the credibility analysis of complaint reports.

The method for analyzing the credibility of environmental complaint reports based on the deep migration network comprises the following steps:

S1, collecting data;

S2, preprocessing the microblog text data (source domain) and the complaint report text data (target domain);

S3, inputting the preprocessed text into a Word2vec model for word vector training to generate word vectors;

S4, encoding the microblog text word vectors and the complaint report text word vectors, with a source domain feature encoder, a domain shared feature encoder and a target domain feature encoder designed to extract the source domain private features, the domain shared features and the target domain private features, respectively;

S5, domain feature fusion: fusing the source domain private features with the domain shared features by a self-attention method to obtain the source domain features, and fusing the target domain private features with the domain shared features by a self-attention method to obtain the target domain features;

S6, calculating the MK-MMD distance between the source domain features and the target domain features, and performing feature transformation on both to complete the domain adaptation;

S7, passing the source domain features and the target domain features through a multilayer perceptron network to obtain the classification result.

Drawings

Fig. 1 is a detailed schematic diagram of the complaint report credibility analysis method based on a deep migration network.

Fig. 2 is a schematic diagram of the bidirectional LSTM encoding process.

Fig. 3 is a flow chart of the complaint report credibility analysis method based on a deep migration network.

Detailed Description

The invention provides a method for analyzing the credibility of environmental complaint reports based on a deep migration network. The method is described in detail below with reference to Fig. 1 and mainly comprises the following steps:

Step S1, data collection: microblog source texts are extracted from social media, and complaint report text data are extracted from a large water environment big data management platform, to construct the data sets. The microblog text data set $D^S = \{(x_i^s, y_i^s)\}_{i=1}^{N^S}$ represents the source domain (microblog texts), where $N^S$ is the number of samples, $x_i^s$ is a microblog text sample, and $y_i^s$ is its credibility label. The complaint report text data set $D^T = \{(x_i^t, y_i^t)\}_{i=1}^{N^T_{train}+N^T_{test}}$ represents the target domain (complaint report texts), where $N^T_{train}$ is the number of training samples, $N^T_{test}$ is the number of test samples, $x_i^t$ is a complaint report text sample, and $y_i^t$ is its credibility label.

Step S2, preprocessing the microblog text data (source domain) and the complaint report text data (target domain): the preprocessing comprises data cleaning and word segmentation, and does not include stop-word removal. After word segmentation, each text $x_i^o$ is expressed as a word sequence:

$$x_i^o = \{w_{i,1}^o, w_{i,2}^o, \dots, w_{i,T_i}^o\}$$

where $o \in \{s, t\}$, with $s$ denoting the source domain and $t$ the target domain; $w_{i,j}^o$ is the $j$-th word contained in sentence $x_i^o$; and $T_i$ is the sentence length.

Step S3, text vectorization: the preprocessed, segmented text is input into a Word2vec model to train word vectors. Each segmented text sequence is then passed through the trained Word2vec model in turn to obtain its matrix representation, where n is the number of texts and d is the dimension of the word vectors; the generated word vectors are 300-dimensional.
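
The lookup in step S3 can be illustrated with a minimal sketch. It assumes a trained word-vector table is already available; `build_toy_embeddings` is a hypothetical stand-in for a trained Word2vec model, the vocabulary is invented, and d is set to 4 instead of the 300 used in the source so the result stays readable.

```python
import random

def build_toy_embeddings(vocab, dim, seed=0):
    """Hypothetical stand-in for a trained Word2vec table: word -> d-dim vector."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}

def vectorize(tokens, embeddings, dim):
    """Map one segmented text (a list of words) to a T_i x d matrix.

    Out-of-vocabulary words become zero vectors; that choice is an
    assumption of this sketch, not something stated in the source.
    """
    zero = [0.0] * dim
    return [list(embeddings.get(tok, zero)) for tok in tokens]

# Toy example with d = 4 (the source uses d = 300).
tokens = ["river", "water", "turned", "black"]
embeddings = build_toy_embeddings(["river", "water", "black"], dim=4)
matrix = vectorize(tokens, embeddings, dim=4)
```

Each row of `matrix` is the vector of one word, so a sentence of length T_i becomes a T_i x d matrix that the encoders of step S4 can consume one time step at a time.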

Step S4, encoding the vectorized text. Encoding refers to feeding the vectorized text into a neural network for feature extraction. Three encoders are designed: a source domain private feature encoder $E_p^s$, which extracts the private features of the source domain (microblog texts); a target domain private feature encoder $E_p^t$, which extracts the private features of the target domain (complaint report texts); and a domain shared feature encoder $E_c$, which extracts the features shared by the complaint report texts and the microblog texts. The three encoders have the same network structure and are all based on a bidirectional LSTM network. The specific encoding process, shown in Fig. 2, is as follows:

Step S401, for the vectorized text $x_i^o$, the outputs $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ of the forward and backward LSTM models are concatenated as the Bi-LSTM output at time $t$:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $x_t$ is the input at the $t$-th of the $T_i$ time steps, $c_t$ is the LSTM cell state at time $t$, and $h_t$ is the output at time step $t$, calculated by equation (2):

$$\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \quad (2)$$

where $W_f, W_i, W_o, W_c$ are weight matrices and $b_f, b_i, b_o, b_c$ are bias vectors; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication; $f_t$ is the forget gate, $i_t$ the input gate and $o_t$ the output gate. In the overall process, the forget gate $f_t$ first selectively filters out some information of the previous state. The input gate $i_t$ then decides which data are updated; the LSTM cell state $c_t$ is updated by forgetting historical information and adding the new information $\tilde{c}_t$, so that the new state value overwrites the old one. Finally, the output gate $o_t$ determines the output information: the output $h_t$ at the current time step is obtained by filtering the cell state information through $o_t$.

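Step S402, the outputs of the last time step of the forward and backward passes, $\overrightarrow{h_{T_i}}$ and $\overleftarrow{h_{T_i}}$, are taken as the encoded output of the $i$-th sentence:

$$e_i = [\overrightarrow{h_{T_i}}; \overleftarrow{h_{T_i}}]$$

where $\overrightarrow{h_{T_i}}$ is the forward hidden layer output for the text sequence $x_i^o$, $\overleftarrow{h_{T_i}}$ is the backward hidden layer output for the text sequence $x_i^o$, and $e_i$ is the encoded text output by the LSTM network, i.e. the output of the encoder.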
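
Steps S401 and S402 can be sketched in a few lines of plain Python. The sketch assumes the standard LSTM formulation in which every gate acts on the concatenation of the previous output and the current input; the helper names (`init_params`, `lstm_step`, `encode_sentence`), the hidden size and the random weights are illustrative assumptions, not values from the source.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Return W v + b for a matrix W (list of rows), bias b and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def init_params(input_dim, hidden, seed):
    """Random weights for the four gates f, i, o, c (illustrative values only)."""
    rng = random.Random(seed)
    params = {}
    for gate in "fioc":
        params["W_" + gate] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + input_dim)]
                               for _ in range(hidden)]
        params["b_" + gate] = [0.0] * hidden
    return params

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: forget, input and output gates, then state update."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]
    c_new = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)]
    c = [f_k * c_k + i_k * n_k for f_k, c_k, i_k, n_k in zip(f, c_prev, i, c_new)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

def run_lstm(seq, p, hidden):
    """Run one direction over the sequence and return the last output."""
    h, c = [0.0] * hidden, [0.0] * hidden
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, p)
    return h

def encode_sentence(seq, fwd, bwd, hidden):
    """Concatenate the final forward and backward outputs (step S402)."""
    return run_lstm(seq, fwd, hidden) + run_lstm(seq[::-1], bwd, hidden)

# A sentence of three 2-dimensional word vectors, hidden size 3.
sentence = [[0.1, -0.2], [0.4, 0.0], [-0.3, 0.5]]
e_i = encode_sentence(sentence, init_params(2, 3, seed=1), init_params(2, 3, seed=2), hidden=3)
```

With hidden size 3 in each direction, `e_i` has 6 entries, which corresponds to the dimension m of the Bi-LSTM output vector. Instantiating this encoder three times with separate parameters would give the source-private, target-private and shared encoders.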

Step S403, the three groups of encoders extract their respective features. The domain shared features are $e_c \in R^{n_1 \times m} = [e_1, e_2, \dots, e_{n_1}]$; the outputs of the source domain private feature encoder and the target domain private feature encoder are $e_p^s \in R^{n_2 \times m}$ and $e_p^t \in R^{n_3 \times m}$, respectively, where $m$ is the dimension of the Bi-LSTM output vector, and $n_1$, $n_2 = N^S$ and $n_3$ are the numbers of texts.

Step S5, domain feature fusion: the domain shared feature encoder extracts the features shared by the source domain and the target domain. The domain private feature encoders extract domain private features, which compensates for the shared feature extractor's inability to capture domain-specific information. To obtain the information shared by the source and target domains while retaining more complete domain-specific information, the domain-specific information $e_p^o$ and the domain shared information $e_c$ need to be fused.

Step S501, the value matrix $W_V$, key matrix $W_K$ and query matrix $W_Q$ are each multiplied with the input vector, and the results are scored:

$$Q_b = W_Q e_b,\quad K_b = W_K e_b,\quad V_b = W_V e_b,\quad score_b = \frac{\langle Q_b, K_b \rangle}{\sqrt{d}}$$

where $b \in \{c, p\}$, with $c$ denoting domain shared and $p$ domain private; $\langle \cdot, \cdot \rangle$ is the dot product, which is scaled by $\sqrt{d}$; $d$ is a constant set to prevent excessively large values after the dot product, usually taken as the dimension of the input word vector.

Step S502, a Softmax normalization is applied to the scores to obtain the attention weights:

$$a_b = \mathrm{softmax}(score_b)$$

Step S503, the attention weights are multiplied with the value vectors to obtain the final source domain feature (or target domain feature) $e^o$:

$$e^o = \sum_{b \in \{c, p\}} a_b V_b$$

where $o \in \{s, t\}$, with $s$ denoting the source domain and $t$ the target domain, and $e^o$ is the fused feature.
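
One possible reading of steps S501 to S503 is sketched below. It assumes that a score is computed separately for the shared feature (b = c) and the private feature (b = p), that Softmax normalizes over these two scores, and that the fused feature is the weighted sum of the two value vectors. The function name `fuse` and the toy matrices are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(e_c, e_p, W_Q, W_K, W_V, d):
    """Fuse a shared feature e_c and a private feature e_p by scaled dot-product attention."""
    scores, values = [], []
    for e_b in (e_c, e_p):  # b = c (shared), then b = p (private)
        q, k, v = matvec(W_Q, e_b), matvec(W_K, e_b), matvec(W_V, e_b)
        scores.append(sum(q_j * k_j for q_j, k_j in zip(q, k)) / math.sqrt(d))
        values.append(v)
    weights = softmax(scores)
    fused = [sum(a * v[j] for a, v in zip(weights, values)) for j in range(len(values[0]))]
    return fused, weights

# Toy 2-dimensional features with identity projections (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse([2.0, 0.0], [0.0, 0.0], identity, identity, identity, d=4)
```

In this toy call the shared feature receives the larger score, so it dominates the fused vector. Calling `fuse` once with the source private feature and once with the target private feature would yield the fused source and target domain features, respectively.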

Step S6, domain adaptation: after domain feature fusion, the distributions of the source domain features $e^s$ and the target domain features $e^t$ differ, so domain adaptation is performed on $e^s$ and $e^t$. The purpose of domain adaptation is to make the data distributions of the two domains converge. It is carried out by feature alignment, i.e. a feature transformation brings the source and target data distributions closer together. The distance between the source domain data and the target domain data is calculated by the MK-MMD (multi-kernel maximum mean discrepancy) method, added to the loss function, and used together with the label loss to update the network weights, thereby achieving domain adaptation. The MK-MMD distance between the source domain and the target domain is:

$$\mathrm{MMD}^2(e^s, e^t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(e_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(e_j^t) \right\|_{\mathcal{H}}^2$$

where $\phi(\cdot)$ is a mapping into the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ that maps the original variables into the RKHS, $n_s$ and $n_t$ are the numbers of source and target domain features, and $\mathrm{MMD}^2(e^s, e^t)$ is the distance between the source domain features and the target domain features.
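
The MK-MMD distance of step S6 can be estimated with the kernel trick, without computing the RKHS mapping explicitly. The sketch below uses the biased estimator with an equally weighted sum of Gaussian kernels; the bandwidths, the equal kernel weights and the function names are assumptions of the sketch, since the source does not state which kernels are used.

```python
import math

def multi_kernel(x, y, bandwidths):
    """Average of Gaussian kernels with different bandwidths (assumed kernel family)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s * s)) for s in bandwidths) / len(bandwidths)

def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of MMD^2 between two sets of feature vectors."""
    def mean_kernel(A, B):
        return sum(multi_kernel(a, b, bandwidths) for a in A for b in B) / (len(A) * len(B))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Toy features: the shifted target set lies farther from the source set.
source_feats = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
near_target = [[0.1, 0.0], [0.0, 0.1], [-0.2, 0.1]]
far_target = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
d_near = mk_mmd2(source_feats, near_target)
d_far = mk_mmd2(source_feats, far_target)
```

The estimate is zero when both sets coincide and grows as the two feature distributions move apart, which is why adding it to the loss pushes the encoders toward aligned source and target features.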

Step S7, credibility classification: the source domain features and the target domain features are fed into an MLP network, which outputs the classification result, and the network parameters are updated according to the classification loss and the domain adaptation loss.

Step S701, after domain feature fusion, the source domain features $e^s$ and the target domain features $e^t$ are fed into an MLP:

$$\hat{y}^o = \mathrm{sigmoid}(\mathrm{MLP}(e^o)), \quad o \in \{s, t\}$$

where $\hat{y}^o$ is the prediction vector, i.e. the prediction result; MLP denotes a multilayer perceptron; $\hat{y}^s$ and $\hat{y}^t$ are the predicted probabilities; and sigmoid is the activation function.

Step S702, a loss function is calculated from the classification result to update the network parameters. The deep migration network on the one hand learns the data difference between the source domain and the target domain to achieve domain adaptation, and on the other hand learns from the label loss. The final objective function (the overall network loss function) therefore combines the MK-MMD statistic, which represents the domain difference, with the label loss, giving the overall migration network loss function (9):

$$L = L_{cls} + \lambda L_{da} \quad (9)$$

where $\lambda$ is a tuning parameter; $L_{da}$ is the domain adaptation loss, i.e. $\mathrm{MMD}^2(e^s, e^t)$; and $L_{cls}$ is the label loss, comprising the source domain label loss $L_{cls}^s$ and the target domain label loss $L_{cls}^t$. The cross-entropy criterion is used as the loss to be minimized in this classification task:

$$L_{cls}^o = -\frac{1}{N^o}\sum_{i=1}^{N^o}\left[ y_i^o \log \hat{y}_i^o + (1 - y_i^o)\log(1 - \hat{y}_i^o) \right], \quad o \in \{s, t\}$$

where $y \in \{0, 1\}$ is the credibility label, $N^o$ is the number of labeled samples in domain $o$, and $\theta$ denotes the network parameters to be optimized, on which the predictions $\hat{y}$ depend.
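
The overall loss in (9) can be assembled as in the following sketch. It assumes that the label loss is the plain sum of the source and target binary cross-entropy terms and that the MK-MMD value has already been computed; the value of λ and the toy probabilities are invented for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

def total_loss(y_s, p_s, y_t, p_t, mmd2, lam):
    """L = L_cls + lambda * L_da, with L_cls = source loss + target loss (assumed sum)."""
    l_cls = binary_cross_entropy(y_s, p_s) + binary_cross_entropy(y_t, p_t)
    return l_cls + lam * mmd2

# Toy batch: labels and predicted probabilities for both domains.
loss = total_loss(y_s=[1, 0], p_s=[0.9, 0.2],
                  y_t=[1], p_t=[0.6],
                  mmd2=0.2, lam=0.5)
```

A larger λ weights domain alignment more heavily relative to classification accuracy, so in practice it would be tuned on held-out target data.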

The credibility analysis accuracy index of the model is a standardized partial AUC. In the water environment complaint credibility classification task, care should be taken to avoid pollution incidents not being handled in time because credible complaint reports are misjudged; that is, the true positive rate (TPR) should be raised while the false positive rate (FPR) stays low (low-credibility texts are positive samples and high-credibility texts are negative samples). The task should therefore focus on the area of the sub-region under the ROC curve where FPR is at most maxfpr, denoted $AUC_{FPR \le maxfpr}$. When maxfpr is very small, the range of this partial AUC is very narrow and model performance cannot be compared well, so the standardized partial AUC ($SPAUC_{FPR \le maxfpr}$) is used:

$$SPAUC_{FPR \le maxfpr} = \frac{1}{2}\left(1 + \frac{AUC_{FPR \le maxfpr} - s_{min}}{s_{max} - s_{min}}\right)$$

where $s_{max} = maxfpr$ is the largest attainable partial area and $s_{min} = \frac{1}{2} maxfpr^2$ is the partial area of a non-discriminating classifier. In the experiments, maxfpr is set to 0.05, so $SPAUC_{FPR \le maxfpr}$ varies between 0.5 and 1. The experimental results show that credibility analysis of environmental complaint reports can be performed well with the LSTM-based encoding network.
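
The standardized partial AUC can be computed directly from labels and scores, as in the sketch below. It assumes the McClish-style standardization in which a non-discriminating classifier maps to 0.5 and a perfect one to 1; the function names and the toy data are illustrative, and both classes must be present in the labels.

```python
def roc_points(y_true, scores):
    """ROC curve points (fpr, tpr), sweeping the threshold from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    k = 0
    while k < len(ranked):
        threshold = ranked[k][0]
        while k < len(ranked) and ranked[k][0] == threshold:  # handle tied scores together
            if ranked[k][1] == 1:
                tp += 1
            else:
                fp += 1
            k += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def standardized_partial_auc(y_true, scores, max_fpr=0.05):
    """Area under the ROC curve for FPR <= max_fpr, rescaled to [0.5, 1]."""
    fpr, tpr = roc_points(y_true, scores)
    area = 0.0
    for k in range(1, len(fpr)):
        x0, y0, x1, y1 = fpr[k - 1], tpr[k - 1], fpr[k], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:  # clip the last segment at max_fpr by linear interpolation
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    s_max = max_fpr
    s_min = 0.5 * max_fpr ** 2
    return 0.5 * (1.0 + (area - s_min) / (s_max - s_min))

# Label 1 marks a low-credibility (positive) text, as in the source.
labels = [1, 1, 0, 0]
perfect = standardized_partial_auc(labels, [0.9, 0.8, 0.2, 0.1])
uninformative = standardized_partial_auc(labels, [0.5, 0.5, 0.5, 0.5])
```

A perfect ranking reaches the upper bound and constant scores land at the chance level, which matches the 0.5 to 1 range stated above for useful classifiers.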

The experiments use microblog source texts extracted from social media (133,346 texts in total, of which 66,131 are high-credibility and 67,215 are low-credibility) and complaint report text data extracted from a large water environment big data management platform (about 200K complaint report texts in total, of which 1,482 carry credibility labels: 889 high-credibility and 593 low-credibility complaints).

As shown in Table 1, the experiments used CNN, Transformer, GRU-2, RNN, LSTM_Attention and LSTM as feature extractors, respectively. "Attention" means that the private features of both the source domain and the target domain are fused with the shared features; "Source_Attention" means that only the source domain private features are fused with the domain shared features; "Target_Attention" means that only the target domain private features are fused with the domain shared features; "No_Attention" means that no feature fusion is performed and only the domain shared features are used. The deep migration network based on the bidirectional LSTM performs best in this task, which also demonstrates the advantage of the deep migration network architecture and the feasibility of using microblog texts to assist complaint report credibility analysis. Ablation experiments were carried out according to whether the attention mechanism was used for feature fusion. As the ablation results in Table 1 show, with the deep migration network, each feature extractor performs better with attention-based feature fusion than with the domain shared features alone, and fusing the source domain private features with the shared features works better than fusing the target domain private features with the shared features.

TABLE 1 complaint reporting credibility classification experimental results

In conclusion, the method can well utilize knowledge in the microblog text field to assist the analysis of the complaint reporting credibility, and can well complete the complaint reporting credibility analysis task.

Claims

1. A method for analyzing the credibility of environmental complaint reports based on a deep migration network, comprising the following specific steps:

S1, collecting data;

S2, preprocessing the source domain and the target domain;

S4, encoding the microblog text and the complaint report text after text vectorization, and extracting high-level features;

S5, fusing the domain private features and the domain shared features by a self-attention method;

S6, calculating the MK-MMD distance between the source domain features and the target domain features, performing feature transformation on the source domain features and the target domain features, and performing domain adaptation;

S7, passing the source domain features and the target domain features through a multilayer perceptron network to obtain the classification result;

wherein the source domain is microblog text data and the target domain is complaint report text data.

2. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein:

in step S1, microblog source texts are extracted from social media, and complaint report text data are extracted from a large water environment big data management platform, to construct the data sets: the microblog text data set $D^S = \{(x_i^s, y_i^s)\}_{i=1}^{N^S}$ represents the source domain, where $N^S$ is the number of samples, $x_i^s$ is a microblog text sample, and $y_i^s$ is its credibility label; the complaint report text data set $D^T = \{(x_i^t, y_i^t)\}_{i=1}^{N^T_{train}+N^T_{test}}$ represents the target domain, where $N^T_{train}$ is the number of training samples, $N^T_{test}$ is the number of test samples, $x_i^t$ is a complaint report text sample, and $y_i^t$ is its credibility label.

3. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein the method comprises:

in step S2, the preprocessing comprises data cleaning and word segmentation, and does not include stop-word removal; each segmented text $x_i^o$ is represented as a set of words:

$$x_i^o = \{w_{i,1}^o, w_{i,2}^o, \dots, w_{i,T_i}^o\}$$

where $o \in \{s, t\}$, with $s$ denoting the source domain and $t$ the target domain; $w_{i,j}^o$ is the $j$-th word contained in sentence $x_i^o$; and $T_i$ is the sentence length.

4. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein the method comprises:

in step S3, text vectorization is realized by a Word2vec model, and each segmented text is represented as a matrix, where n is the number of texts and d is the word vector dimension.

5. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 3, wherein: text vectorization is realized by using a Word2vec model, and the dimension of the generated Word vector d is 300 dimensions.

6. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein the method comprises:

the encoder in step S4 is a Bi-directional long short term memory network (Bi-LSTM), and extracts the private features and the shared features of the source domain and the target domain using three sets of encoders with identical network structures;

the specific coding mode is as follows:

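step S401 adopts a bidirectional LSTM as the core module of the encoder; for the text sequence $x_i^o$, the outputs $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ of the forward and backward LSTM models are concatenated as the Bi-LSTM output at time $t$:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $o \in \{s, t\}$, with $s$ denoting the source domain and $t$ the target domain, and the LSTM output is calculated as:

$$\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $W_f, W_i, W_o, W_c$ are weight matrices and $b_f, b_i, b_o, b_c$ are bias vectors; $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication; $f_t$ is the forget gate, $i_t$ the input gate and $o_t$ the output gate. In the overall process, the forget gate $f_t$ first selectively filters out some information of the previous state. The input gate $i_t$ then decides which data are updated; the LSTM cell state $c_t$ is updated by forgetting historical information and adding the new information $\tilde{c}_t$, so that the new state value overwrites the old one. Finally, the output gate $o_t$ determines the output information: the output $h_t$ at the current time step is obtained by filtering the cell state information through $o_t$;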

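step S402 takes the outputs of the last time step of the forward and backward passes, $\overrightarrow{h_{T_i}}$ and $\overleftarrow{h_{T_i}}$, as the encoded output of $x_i^o$:

$$e_i = [\overrightarrow{h_{T_i}}; \overleftarrow{h_{T_i}}]$$

where $\overrightarrow{h_{T_i}}$ is the forward hidden layer output for the text sequence $x_i^o$, $\overleftarrow{h_{T_i}}$ is the backward hidden layer output for the text sequence $x_i^o$, and $e_i$ is the encoded text output by the LSTM network, i.e. the output of the encoder;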

where $m$ is the dimension of the Bi-LSTM output vector, and $n_1$, $n_2 = N^S$ and $n_3$ are the numbers of texts encoded by the respective encoders.

7. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein the method comprises:

the domain feature fusion described in step S5 fuses the source domain private features and the target domain private features, respectively, with the domain shared features through a self-attention mechanism, and the specific feature fusion method is as follows:

step S502, a Softmax normalization is applied to the scores to obtain the attention weights;

step S503, the attention weights are multiplied with the value vectors to obtain the final source domain feature (or target domain feature) $e^o$, where $o \in \{s, t\}$, with $s$ denoting the source domain and $t$ the target domain, and $e^o$ is the fused feature.

8. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein the method comprises:

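the domain adaptation described in step S6 uses the multi-kernel maximum mean discrepancy (MK-MMD) to calculate the distance between the source domain features $e^s$ and the target domain features $e^t$, adds the distance to the loss function, and performs feature transformation over the training iterations to complete the domain adaptation:

$$\mathrm{MMD}^2(e^s, e^t) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(e_i^s) - \frac{1}{n_t}\sum_{j=1}^{n_t}\phi(e_j^t) \right\|_{\mathcal{H}}^2$$

where $\phi(\cdot)$ is a mapping into the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ that maps the original variables into the RKHS, and $n_s$ and $n_t$ are the numbers of source and target domain features.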

9. The method for analyzing the credibility of the environmental complaint report based on the deep migration network as claimed in claim 1, wherein the method comprises:

in step S7, the source domain features $e^s$ and the target domain features $e^t$ obtained by domain feature fusion are each fed into an MLP network, which outputs the classification result:

$$\hat{y}^o = \mathrm{sigmoid}(\mathrm{MLP}(e^o)), \quad o \in \{s, t\}$$

where $\hat{y}^s$ and $\hat{y}^t$ are the predicted probabilities and sigmoid is the activation function.

Meanwhile, a loss function is calculated from the classification result to update the network parameters. The deep migration network on the one hand learns the data difference between the source domain and the target domain to achieve domain adaptation, and on the other hand learns from the label loss. The final objective function (the overall network loss function) therefore combines the MK-MMD statistic, which represents the domain difference, with the label loss, so the overall migration network loss function is:

L＝L_cls+λL_da (9)

And loss of target domain label