CN115017299A - Unsupervised social media summarization method based on a denoising graph autoencoder - Google Patents
Unsupervised social media summarization method based on a denoising graph autoencoder
- Publication number: CN115017299A
- Application number: CN202210393787.7A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06F16/34—Information retrieval of unstructured textual data; browsing; visualisation therefor
- G06F16/2228—Indexing structures
- G06F16/332—Query formulation
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06N3/082—Neural-network learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention discloses an unsupervised social media summarization method based on a denoising graph autoencoder. The method constructs a post-level social relationship network according to sociological theory and uses a pre-trained BERT model to obtain the content encoding of each post as its initial content representation. Two types of noisy relationships are defined, and corresponding noise functions are set to construct a pseudo social relationship network containing noisy relationships. A sampled pseudo social relationship network instance and the initial content representations of the posts are fed together into a residual graph attention network encoder, which encodes each post according to its initial content representation and its social relationships to obtain a vector representation of the post. A decoder is constructed; the residual graph attention network encoder and the decoder together form a denoising graph autoencoder, which can learn to remove the noisy relationships in the post-level social relationship network and finally yield accurate post representations. A summary extractor based on sparse reconstruction then selects the final summary.
Description
Technical Field
The invention relates to the technical fields of natural language processing and social media data mining, and in particular to an unsupervised social media summarization method based on a denoising graph autoencoder.
Background
With the development and popularization of Internet technology, social media platforms have gradually become a new medium for producing and transmitting information, occupying an increasingly important position in social production, daily life, and other areas. However, the amount of content on social media has grown sharply, causing a serious information overload problem and posing a severe challenge to efficient information retrieval: ordinary users often find it difficult to locate useful and interesting information in a mass of noisy content, which seriously reduces the efficiency of information retrieval.
Automatic text summarization technology can effectively alleviate the information overload problem on social media and help users retrieve useful information more efficiently. Mainstream summarization methods generally fall into two categories: extractive summarization and abstractive summarization. Extractive summarization selects the most representative text units (words, sentences, or segments) with high information content, low redundancy, and wide coverage from the input text to form the final summary. Abstractive summarization involves a text generation process: it understands the semantics of the original input text and generates a corresponding summary description using text generation techniques. In recent years, both extractive and abstractive automatic text summarization have advanced significantly thanks to numerous new techniques such as the sequence-to-sequence framework (Seq2Seq), the Transformer model, contrastive learning, and large-scale pre-trained models.
However, existing methods usually rely on large-scale labeled paired training data (i.e., text-summary pairs), and such labeled data must generally be annotated manually, which is prohibitively expensive and cannot scale to large training scenarios. In the social media domain, constructing annotated data is even harder. On the one hand, to write a reference summary for the content of a specific topic, an annotator must read all posts related to that topic; the sheer volume of posts on social media makes this manual reading unaffordably labor-intensive. On the other hand, because social media content is highly time-sensitive and topic-specific, annotations produced for one topic cannot be applied to other topic domains, so annotation work has to be repeated for each topic, consuming substantial human and material resources. Furthermore, when traditional text summarization methods are migrated to social media data, they usually struggle to achieve satisfactory results, because social media text differs greatly from traditional long documents: it is shorter, more diverse in expression, and informal.
Existing social media summarization research mainly extracts features from each post independently based on its content, and then selects the most important posts as the summary using algorithms such as graph ranking or clustering. These methods have two notable shortcomings: (1) because posts on social media are usually short, a single post often contains incomplete or ambiguous information and cannot provide sufficient signal, so post features suffer from sparsity and inaccuracy; (2) social media relies on users actively spreading and receiving information through social interaction, a process that effectively promotes information diffusion; posts on social media are therefore embedded in a social network structure and are not independent of each other, yet previous methods focus only on the textual content features of posts and ignore their social structure features, losing the social relationship information between posts.
Some work has attempted to aid the analysis of social media content by exploiting simple social signals available on the platform, such as an author's number of followers and a post's numbers of reposts and likes. Further work has verified, from the perspective of sociological theory, the influence of social relationships on content relevance in social networks, proposing that posts connected by social relationships tend to contain similar content and opinions within a short time window. Sociological theory describes the association between social relationships and textual content at the macro level, but at the micro level there are often noisy relationships that do not conform to the theory. These fall into two cases: (1) two posts have a social relationship but low content relevance; such a noisy relationship is defined as a false relationship; (2) two posts have no directly connected social relationship but high content relevance; such a noisy relationship is defined as a potential relationship. The existence of these two kinds of relationships poses a further challenge to the effective use of social relationships.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provide an unsupervised extractive social media summarization method that is more robust to noisy relationships.
The object of the invention is achieved by the following technical solution:
An unsupervised social media summarization method based on a denoising graph autoencoder comprises the following steps:
S1. Construct a post-level social relationship network according to sociological theory, define the network without noisy relationships as the real social relationship network, and obtain the content encoding of each post with a pre-trained BERT model as the post's initial content representation.
S2. Define two types of noisy relationships, false relationships and potential relationships, according to users' social behaviors and habits. By setting corresponding noise functions, add instances of false and potential relationships to the original post-level social relationship network, constructing a post-level social relationship network with noisy relationships, i.e., a pseudo social relationship network. Sample several generated pseudo social relationship networks, and feed a sampled pseudo social relationship network instance together with the posts' initial content representations into a residual graph attention network encoder, which contains a multi-head attention mechanism and encodes each post according to its initial content representation and social relationships, yielding a vector representation of the post.
S3. Construct a decoder; the decoder and the residual graph attention network encoder together form a denoising graph autoencoder. The decoder reconstructs the real social relationship network from the post vector representations to capture the social relationship information between posts, and simultaneously reconstructs the semantic relationship between each post and the words it contains to capture the posts' textual content information. Because the reconstruction target is the real social relationship network without noisy relationships, the residual graph attention network encoder and the decoder learn to exclude the noisy relationships in the post-level social relationship network, finally yielding accurate post representations.
S4. From the post representations obtained in step S3, select the final summary with a summary extractor based on sparse reconstruction: iteratively select the post with the highest reconstruction score, add it to the final summary set, and repeat this process until the summary length limit is reached.
Further, step S1 is specifically as follows: the post-level social relationship network consists of a node set and an edge set, where each node in the node set represents a post and each edge in the edge set represents a social relationship between the corresponding posts. Posts have two kinds of social relationships: expression-consistency relationships and expression-contagion relationships. An expression-consistency relationship is the relationship between posts published by the same user; when the post-level social relationship network is built, an edge is created between post nodes with an expression-consistency relationship. An expression-contagion relationship is the relationship between posts published by users who have a direct interactive relationship, where a direct interactive relationship means following, reposting, or commenting between users; when the network is built, an edge is created between post nodes with an expression-contagion relationship.
(101) Formally, the post-level social relationship network is described as follows. Let $\mathcal{S}=\{s_1,\dots,s_N\}$ denote the set of posts, where $N$ is the number of posts and $s_i$ ($1\le i\le N$) is the $i$-th post. Let $\mathcal{U}=\{u_1,\dots,u_M\}$ denote the set of users, containing $M$ users in total, where $u_i$ ($1\le i\le M$) is the $i$-th user. For a user $u_i$, let $\mathcal{N}(u_i)$ denote the set of $u_i$'s neighbor users, i.e., the users who have a direct social relationship with $u_i$, and let $\mathcal{S}(u_i)$ denote the set of all posts published by $u_i$. The post-level social relationship network $\mathcal{G}=(\mathcal{V},\varepsilon)$ is built according to the following rules, where $\mathcal{V}$ is the node set (each node corresponds to one post) and $\varepsilon$ is the set of edges between nodes (each edge corresponds to a social relationship between posts). Expression-consistency relationship: if posts $s_i,s_j\in\mathcal{S}(u_k)$, where $u_k$ is the $k$-th user, then an edge $e_{ij}\in\varepsilon$ is created between $s_i$ and $s_j$. Expression-contagion relationship: if $s_i\in\mathcal{S}(u_k)$ and $s_j\in\mathcal{S}(u_m)$ with $u_m\in\mathcal{N}(u_k)$ or $u_k\in\mathcal{N}(u_m)$, then an edge $e_{ij}\in\varepsilon$ is created between $s_i$ and $s_j$. The post-level social relationship network $\mathcal{G}$ built from these two rules contains only the post node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\varepsilon=\{e_{11},e_{12},\dots,e_{NN}\}$. The adjacency matrix of the constructed network is denoted $A\in\mathbb{R}^{N\times N}$, where $A_{ij}>0$ indicates that post nodes $s_i$ and $s_j$ are connected by a social relationship, and otherwise $A_{ij}=0$.
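For concreteness, the following is a minimal sketch of the two construction rules above, assuming posts are indexed 0..N-1 and that `posts_by_user` and `neighbors` stand in for the sets $\mathcal{S}(u)$ and $\mathcal{N}(u)$; the names and data layout are illustrative assumptions, not specified by the patent.

```python
# Illustrative sketch of the post-level social relationship network (step S1).
import numpy as np

def build_post_network(posts_by_user: dict, neighbors: dict, num_posts: int) -> np.ndarray:
    """Return the adjacency matrix A of the post-level social relationship network."""
    A = np.zeros((num_posts, num_posts))
    # Expression-consistency rule: connect posts published by the same user.
    for user, posts in posts_by_user.items():
        for i in posts:
            for j in posts:
                if i != j:
                    A[i, j] = 1.0
    # Expression-contagion rule: connect posts of users with a direct
    # interactive relationship (following, reposting, commenting).
    for user, nbrs in neighbors.items():
        for other in nbrs:
            for i in posts_by_user.get(user, []):
                for j in posts_by_user.get(other, []):
                    A[i, j] = A[j, i] = 1.0
    return A
```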
(102) The content encoding of each post is obtained with the pre-trained BERT model and used as the post's initial content representation, as follows:
Each post $s_i$ is fed into the pre-trained BERT model, and the last-layer representation of the sentence-head symbol ([CLS]) is taken as the post's initial content representation, as shown in equation (1):
$$x_i=\mathrm{BERT}(s_i) \tag{1}$$
where $x_i$ is the initial content representation of post $s_i$. Encoding all $N$ posts finally yields the initial content representation matrix $X=[x_1,\dots,x_N]$.
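A minimal sketch of formula (1) using the Hugging Face `transformers` library; the checkpoint name `bert-base-uncased` and the batching are illustrative assumptions (the patent does not specify them).

```python
# Obtain initial content representations x_i from the [CLS] position of BERT.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_posts(posts: list[str]) -> torch.Tensor:
    inputs = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    outputs = bert(**inputs)
    # Last-layer representation of the sentence-head ([CLS]) symbol, D = 768.
    return outputs.last_hidden_state[:, 0, :]

X = encode_posts(["example post one", "example post two"])  # X = [x_1, ..., x_N]
```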
Further, step S2 specifically includes:
(201) The two types of noisy relationships, false relationships and potential relationships, are defined as follows:
In social networks, posts connected by social relationships usually share more similar content or opinions. In practice, however, most social networks on social media are pseudo social relationship networks that contain noisy relationships. Based on observation of real social media data, two types of noisy relationships are defined:
(a) False relationship: if two posts have a social relationship but their content relevance is below a set threshold, the social relationship between them is defined as a false relationship.
(b) Potential relationship: if two posts have no social relationship but their content relevance is above a set threshold, a potential relationship is defined between them.
The noise function corresponding to false relationships is set to relationship insertion, and the noise function corresponding to potential relationships is set to relationship loss, as follows:
(c) Relationship insertion: randomly add an edge between any two unconnected post nodes in the post-level social relationship network, connecting the two nodes.
(d) Relationship loss: randomly remove the edge between any two connected post nodes in the post-level social relationship network.
A pseudo social relationship network is constructed as training data by adding instances of noisy relationships to the real social relationship network; a sketch of the two noise functions follows.
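A hedged sketch of the two noise functions: relationship insertion adds edges between unconnected post nodes (simulating false relationships) and relationship loss removes existing edges (simulating potential relationships). The per-pair probabilities `p_insert` and `p_drop` are free parameters (the embodiment below uses 0.4/0.1 on Twitter and 0.3/0.3 on the microblog data); a full implementation would likely calibrate the number of inserted edges to the graph's density.

```python
# Build a pseudo social relationship network by injecting noisy relations.
import numpy as np

def corrupt_network(A: np.ndarray, p_insert: float, p_drop: float,
                    rng: np.random.Generator) -> np.ndarray:
    A_noisy = A.copy()
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)              # all upper-triangle node pairs
    connected = A[iu] > 0
    # Relationship insertion: randomly connect some unconnected pairs.
    insert = (~connected) & (rng.random(iu[0].size) < p_insert)
    # Relationship loss: randomly remove some existing edges.
    drop = connected & (rng.random(iu[0].size) < p_drop)
    vals = A_noisy[iu]
    vals[insert] = 1.0
    vals[drop] = 0.0
    A_noisy[iu] = vals
    A_noisy[(iu[1], iu[0])] = vals            # keep the matrix symmetric
    return A_noisy
```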
(202) Encode the posts with a residual graph attention network encoder.
After the pseudo social relationship network and the initial content representations of the posts are obtained, in order to model the social relationships between posts, a residual graph attention network encoder is used to encode the posts according to their initial content representations and the social relationships between them, integrating the posts' social relationship information and textual content information. The residual graph attention network encoder can be viewed as an information propagation model that learns node representations in the post-level social relationship network by aggregating information from neighboring nodes connected by edges. Moreover, compared with a traditional graph convolutional network (GCN) encoder, the residual graph attention network encoder can assign different weights to different neighbors of the same node, raising the attention weight of important neighbors and lowering the weight of weakly related neighbors, and thus learns more accurate node representations.
formally, the residual graph attention network encoder represents the initial content of the nodeAdjacency matrix corresponding to post-level social relationship networkAs input, where D is the dimension of the node feature representation and N is the number of nodes. The propagation rule formula (2) and formula (3) of the residual map attention network encoder are shown as follows:
wherein H (l) Is a hidden representation of a residual map attention network encoder at the l-th layer, A is an adjacency matrix corresponding to a post-level social relationship network, A ij Representative post s i And s j I is an identity matrix,is the adjacency matrix corresponding to the post-level social relationship network after the attention weight has been added,indicating post s after increasing attention weight i And s j The weight of the relationship between, σ (-) represents a non-lineA sexual activation function;is a post s i And post s j Attention scores at the l-th level in between; w is a group of (l) And b (l) Is the learning parameter of the residual image attention network encoder at the l-th layer; to further integrate the initial content representation of the post, the initial content representation of the post is X ═ X 1 ,…,x N ]As input to the residual graph attention network encoder, let H (0) X; wherein the calculation of the attention weight employs scaling the dot product attention [1] The general attention mechanism is expanded to a multi-head attention mechanism by mapping the potential representation to K different subspaces, K representing the total number of attention heads in the multi-head attention mechanism, each subspace being referred to as an attention head, and calculating an attention weight in each subspace separately:
wherein h is i And h j Respectively representing posts s i And post s j Vector representation encoded by a residual map attention network encoder;andrespectively representing posts s in the kth attention head i And s j Attention scores therebetween and normalized attention weights; (.) T Representing a transpose operation; d h Is the dimension of the implicit representation in the attention calculation process, where the superscript (l) representing the number of layers is omitted and a superscript head is used k To indicate the kth attention head;andis the corresponding learning parameter in the kth attention head; obtaining K attention weights through calculation of a formula (4) and a formula (5); k represents the total number of attention heads, K in total; k represents the kth thereof. Adopting maximum pooling operation to automatically select the strongest relation in all subspaces as the real relation between two post nodes, and unifying the attention weights in K attention heads into a final attention score:
α ij representing a post s i And s j Final attention weight in between; the connection between each layer in the common graph attention network is replaced by residual connection to form a residual graph attention network, so that the residual graph attention network can directly transmit the input information to the output layer, and therefore, the encoding rule of the residual graph attention network encoder is modified into the following form:
where f (-) is a mapping function, implemented by a feed-forward neural network with a nonlinear activation function:
f(H (l) )=σ(W f H (l) +b f ) (8)
wherein W f And b f Is a corresponding learning parameter in the mapping function, sigma (-) represents a nonlinear activation function; during the encoding process, the depth of the residual map attention network encoderDetermining information transfer in post level social relationship networkThe distance of broadcast, residual map attention network encoder encodes the post according to the encoding rules of formula (7) and formula (8), and the output of the last layer thereofI.e. a vector representation of the encoded post, whereinPost s encoded by network encoder representing residual map attention i A vector representation of (a);
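The following PyTorch sketch shows one layer of the residual graph attention encoder as read from formulas (2)-(8): per-head scaled dot-product scores restricted to neighbors plus self-loops, max-pooling over heads, and a residual mapping branch. It is an illustrative reading of the patent's description under stated assumptions, not a reference implementation; all class and parameter names are hypothetical.

```python
# One residual graph attention layer, following formulas (2)-(8) above.
import torch
import torch.nn as nn

class ResidualGATLayer(nn.Module):
    def __init__(self, dim: int, num_heads: int, d_h: int):
        super().__init__()
        self.W_q = nn.Linear(dim, num_heads * d_h, bias=False)  # W_Q per head
        self.W_k = nn.Linear(dim, num_heads * d_h, bias=False)  # W_K per head
        self.W = nn.Linear(dim, dim)   # W^(l), b^(l) of formula (2)
        self.f = nn.Linear(dim, dim)   # mapping function f(.), formula (8)
        self.num_heads, self.d_h = num_heads, d_h

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        N = H.shape[0]
        q = self.W_q(H).view(N, self.num_heads, self.d_h).permute(1, 0, 2)
        k = self.W_k(H).view(N, self.num_heads, self.d_h).permute(1, 0, 2)
        e = q @ k.transpose(1, 2) / self.d_h ** 0.5   # formula (4), K x N x N
        mask = (A + torch.eye(N, device=A.device)) > 0  # attend only to A + I
        e = e.masked_fill(~mask, float("-inf"))
        alpha = torch.softmax(e, dim=-1)              # formula (5), per head
        alpha = alpha.max(dim=0).values               # formula (6), max-pool heads
        # alpha is zero outside the A + I support, so it realizes A-hat directly.
        return torch.relu(alpha @ self.W(H)) + torch.relu(self.f(H))  # (7)+(8)
```

Usage would be, e.g., `layer = ResidualGATLayer(768, num_heads=4, d_h=64); H1 = layer(X, A_noisy)`.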
further, step S3 is specifically as follows:
(301) Reconstruct the real social relationship network and the posts' content with a reconstruction-based decoder.
So that the learned vector representations of posts contain both textual content information and the social relationship information between posts, a decoder with two reconstruction objectives is designed. On the one hand, the decoder reconstructs the real social relationship network without noisy relationships to capture the social relationship information between posts; on the other hand, it reconstructs the textual content contained in the posts, capturing their textual content information and enriching the post vector representations.
For reconstruction of the real social relationship network, the decoder predicts whether a social relationship exists between two post nodes from their vector representations; specifically, the probability that a social relationship exists between two nodes is predicted from the inner product of their vector representations:
$$\tilde{A}_{ij}=\sigma\big(h_i^{T}h_j\big) \tag{9}$$
where $(\cdot)^T$ is the transpose operation on a vector representation. For each pair of posts $s_i$ and $s_j$, the decoder predicts the probability of a social relationship between them, where $\tilde{A}$ is the adjacency matrix of the post-level social relationship network output by the decoder, $\tilde{A}_{ij}$ is the decoder-predicted probability that a social relationship exists between $s_i$ and $s_j$, $h_i$ and $h_j$ are the vector representations of posts $s_i$ and $s_j$ encoded by the residual graph attention network encoder, and $\sigma(\cdot)$ is a nonlinear activation function.
for text content reconstruction, the relationship between reconstructed posts and words is proposed, and text content information of the posts is reserved by reconstructing the words contained in each post; since each post typically contains several words, the text content reconstruction process is modeled as a multi-label classification task:
whereinAndis a learning parameter of the decoder, V represents the vocabulary size;is a prediction result of a decoder, whereinRepresenting a post s i Containing the word w j The probability of (d);
designing corresponding loss functions aiming at the two reconstruction targets respectively, wherein the overall training target comprises two parts, the first partA part of the loss is the loss of reconstructing the real social relationship network, denoted as L g Calculating the predicted resultBinary cross-entropy loss between adjacency matrices a corresponding to real post-level social relationship networks:
the second partial loss function is the loss of reconstructed post content, denoted L c Calculating the prediction result of the decoderWith true result s i Binary cross entropy loss between:
s ij for real training labels, posts s are represented i Whether or not to contain the word w j If post s i Containing the word w j Then s ij 1, otherwise s ij 0. Finally, the two partial losses are combined using an equilibrium parameter λ to obtain a final loss function L:
L=λL g +(1-λ)L c (13)
training a residual error graph attention network encoder and a residual error graph attention network decoder according to a loss function, and obtaining accurate post representation H [ [ H ] ] fusing social relationship information and text content information and removing noise relationship after training is finished 1 ,h 3 ,…,h N ];
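A compact sketch of the decoder and the joint training objective of formulas (9)-(13); the names `word_head`, `dgae_loss`, and `lam` are assumptions, and the binary cross-entropy is computed on logits for numerical stability.

```python
# Decoder with two reconstruction objectives and the combined loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.word_head = nn.Linear(dim, vocab_size)  # W_c, b_c of formula (10)

    def forward(self, H: torch.Tensor):
        A_logits = H @ H.T              # formula (9) before the sigmoid
        S_logits = self.word_head(H)    # formula (10) before the sigmoid
        return A_logits, S_logits

def dgae_loss(A_logits, S_logits, A_true, bow_true, lam: float = 0.8):
    """L = lam * L_g + (1 - lam) * L_c, formulas (11)-(13)."""
    L_g = F.binary_cross_entropy_with_logits(A_logits, A_true)    # formula (11)
    L_c = F.binary_cross_entropy_with_logits(S_logits, bow_true)  # formula (12)
    return lam * L_g + (1 - lam) * L_c                            # formula (13)
```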
Further, step S4 specifically includes:
(401) Extract the summary from the posts' vector representations with a sparse-reconstruction-based summary extractor.
To extract representative, important posts as the final summary, a summary extractor based on sparse reconstruction is adopted. Formally, given the accurate post representations $H=[h_1,h_2,\dots,h_N]$ encoded by the residual graph attention network encoder with noisy relationships removed, the summary extraction process is modeled as a sparse reconstruction process:
$$\min_{V}\;\big\|H-VH\big\|_F^{2}+\beta\big\|V\big\|_{2,1}+\gamma\big\|S\odot V\big\|_F^{2},\quad\text{s.t. }V_{ii}=0 \tag{14}$$
where $\|\cdot\|_F$ is the Frobenius norm and $V\in\mathbb{R}^{N\times N}$ is the reconstruction coefficient matrix, each element $V_{i,j}$ of which represents the contribution of post $s_j$ to reconstructing post $s_i$. To avoid extracting repeated, redundant content, a similarity matrix $S\in\mathbb{R}^{N\times N}$ is introduced to remove redundant information: if the cosine similarity between posts $s_i$ and $s_j$ is above a threshold $\eta$, then $S_{ij}=1$, and otherwise $S_{ij}=0$; $\odot$ denotes the Hadamard product. To prevent a post from reconstructing itself, the diagonal elements of the reconstruction coefficient matrix $V$ are constrained to 0 during reconstruction. $\beta$ and $\gamma$ are hyper-parameters controlling the weights of the corresponding regularization terms, and $H$ is the accurate post representation. $\|\cdot\|_{2,1}$ is the L2,1 norm, defined as:
$$\big\|V\big\|_{2,1}=\sum_{i=1}^{N}\sqrt{\sum_{j=1}^{N}V_{ij}^{2}} \tag{15}$$
Adding the L2,1 constraint to the reconstruction coefficient matrix $V$ makes each of its rows sparse, i.e., most elements of each row are 0, which means each post can be reconstructed by only a limited number of posts and thereby limits the length of the summary. The final score of each post is defined as the sum of the post's contributions to reconstructing all other posts:
$$\mathrm{score}(s_i)=\sum_{j=1}^{N}V_{ji} \tag{16}$$
where $\mathrm{score}(s_i)$ is the final score of post $s_i$. Finally, all posts are ranked by their final scores; the highest-scoring post is iteratively selected and added to the final summary set, and this process repeats until the summary length limit is reached.
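A hedged sketch of the sparse-reconstruction extractor of formulas (14)-(16). The patent does not specify an optimizer, so plain gradient descent is used here for illustration; a dedicated proximal method for the L2,1 term would be more faithful, and all function and parameter names are assumptions.

```python
# Score posts by sparse reconstruction and return the summary indices.
import torch

def extract_summary(H: torch.Tensor, sim: torch.Tensor, eta: float = 0.1,
                    beta: float = 1.0, gamma: float = 1.0,
                    steps: int = 500, lr: float = 1e-2, budget: int = 5):
    N = H.shape[0]
    S = (sim > eta).float()                        # redundancy indicator matrix
    V = (0.01 * torch.randn(N, N)).requires_grad_(True)
    opt = torch.optim.Adam([V], lr=lr)
    off_diag = 1.0 - torch.eye(N)
    for _ in range(steps):
        Vd = V * off_diag                          # enforce V_ii = 0
        recon = ((H - Vd @ H) ** 2).sum()                        # reconstruction error
        row_sparsity = (Vd.pow(2).sum(dim=1) + 1e-8).sqrt().sum()  # L2,1 norm, (15)
        redundancy = ((S * Vd) ** 2).sum()                       # redundancy penalty
        loss = recon + beta * row_sparsity + gamma * redundancy  # objective (14)
        opt.zero_grad()
        loss.backward()
        opt.step()
    scores = (V.detach() * off_diag).sum(dim=0)    # score(s_i): column sums, (16)
    return torch.topk(scores, budget).indices      # posts forming the summary
```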
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
1. The invention can extract summaries without any labeled data. By introducing the social relationship information between posts and capturing the social relationship features between them, it alleviates the feature sparsity caused by the short content of individual posts.
2. The invention provides a denoising graph autoencoder structure that can automatically identify and remove unreliable noisy relationships in a social relationship network without labeled data, alleviating the errors they cause and thereby improving the reliability and accuracy of post representations. After learning accurate post representations, a sparse reconstruction framework identifies the importance and redundancy of each post, and finally the posts with high importance and low redundancy are extracted to form the final summary.
3. Compared with existing summarization models, the social media content summaries obtained by the invention improve performance on the ROUGE evaluation metrics; meanwhile, the experimental results show that the denoising model can effectively reduce the proportion of noisy relationships in a social network, improve the network structure, and improve summary accuracy.
4. Compared with a traditional graph convolutional network encoder, the residual graph attention network encoder used in the invention can assign different weights to different neighbors of the same node, raising the attention weight of important neighbors, lowering the weight of irrelevant neighbors, and learning more accurate post representations. Since the post vectors obtained by the residual graph attention network encoder contain both textual content information and social relationship information, the summary extraction process can identify the importance and novelty of posts from both content and social relationships, producing higher-quality summaries.
5. Since different attention heads capture relationship information between nodes from different subspaces, and the relationships of two nodes in different attention heads may differ greatly, the invention employs a max-pooling operation to automatically select the strongest relationship across all subspaces as the true relationship between two nodes, unifying the attention weights of the K heads into a final attention score.
6. Ordinary graph attention networks often suffer from over-smoothing; the invention further replaces the connections between layers of the ordinary graph attention network with residual connections, forming the residual graph attention network, which can pass input information directly to the output layer.
7. The invention designs a decoder with two reconstruction objectives, so that the learned post vector representations contain both textual content information and the social relationship information between posts. Because the post feature representations contain both kinds of information, the summarization process can identify the importance and novelty of posts from both textual content and social structure, producing summary content with high information volume, high diversity, and wide coverage.
Drawings
Fig. 1 is a schematic diagram of the overall architecture of the unsupervised social media summarization method based on a denoising graph autoencoder provided by the invention.
FIG. 2 shows the performance achieved by the invention under each topic of the two datasets.
FIG. 3 shows the distribution of false relationships, potential relationships, and their sum in the social networks.
FIGS. 4a and 4b show the influence of different denoising strategies in the noise function and of different noise ratios on the results.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides an unsupervised social media summarization method based on a denoising graph autoencoder, and its performance is evaluated on two mainstream social media datasets. The overall framework of the method is shown in Fig. 1. In Fig. 1, the post-level social relationship network and the post set are the inputs of the model. The noise function adds noisy relationship instances to the post-level social relationship network to obtain a pseudo social relationship network with noisy relationships; the post set is encoded by the BERT model to obtain the posts' initial content representations. The initial content representations of the posts and the noisy pseudo social relationship network are fed together into the residual graph attention network encoder for encoding, which finally outputs post representations fusing social relationships and textual content. The whole model is trained with the real social relationship network reconstruction and textual content reconstruction objectives; after training, accurate post representations with the noisy relationships removed are obtained. The learned accurate post representations are then fed into the sparse-reconstruction-based summary extractor, which extracts the final summary.
(1) Post-level social relationship network construction
In this embodiment, data from mainstream social media platforms at home and abroad are selected for experimental verification: Twitter [11] and Sina Weibo [12]. For each topic in the data, the posting time of the posts falls within a 5-day window. For the Twitter data, whose main text language is English, there are 44,034 posts and 11,240 users; every user published at least one post and has at least one social relationship, and each topic has 4 standard reference summaries for evaluating the results. In the experiments, links, user names, and other special characters in posts are removed, stemming and stop-word removal are performed, and posts shorter than 3 words are filtered out. For the microblog data, Sina Weibo is one of the most popular social media platforms in China; this embodiment uses data collected from Sina Weibo containing 130k posts and 126k users across 10 different topics, where the posts are organized in a tree structure according to interaction relationships (such as replies and reposts), and each topic has 3 standard reference summaries for evaluating the results. Statistics of the two preprocessed datasets are shown in Table 1. The experiments adopt the ROUGE evaluation standard and mainly report four evaluation results: ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU.
Formally, let $\mathcal{S}=\{s_1,\dots,s_N\}$ denote the set of posts, where $N$ is the number of posts and $s_i$ ($1\le i\le N$) is the $i$-th post; let $\mathcal{U}=\{u_1,\dots,u_M\}$ denote the set of users, containing $M$ users in total, where $u_i$ ($1\le i\le M$) is the $i$-th user. For a user $u_i$, let $\mathcal{N}(u_i)$ denote the set of $u_i$'s neighbor users, i.e., the users who have a direct social relationship with $u_i$, and let $\mathcal{S}(u_i)$ denote the set of all posts published by $u_i$. The post-level social relationship network $\mathcal{G}=(\mathcal{V},\varepsilon)$ is built according to the following rules, where $\mathcal{V}$ is the node set (each node represents a post) and $\varepsilon$ is the edge set (each edge represents a social relationship between posts). Expression-consistency relationship: if $s_i,s_j\in\mathcal{S}(u_k)$, where $u_k$ is the $k$-th user, an edge $e_{ij}\in\varepsilon$ is created between $s_i$ and $s_j$. Expression-contagion relationship: if $s_i\in\mathcal{S}(u_k)$ and $s_j\in\mathcal{S}(u_m)$ with $u_m\in\mathcal{N}(u_k)$ or $u_k\in\mathcal{N}(u_m)$, an edge $e_{ij}\in\varepsilon$ is created between $s_i$ and $s_j$. The network $\mathcal{G}$ built from these two rules contains only the node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\varepsilon=\{e_{11},e_{12},\dots,e_{NN}\}$; its adjacency matrix is denoted $A\in\mathbb{R}^{N\times N}$, where $A_{ij}>0$ indicates that post nodes $s_i$ and $s_j$ have a social relationship connection and otherwise $A_{ij}=0$.
Table 1 social media data set details
(2) Noise distribution observation
In order to analyze the distribution of noisy relationships in the constructed post-level social relationship network, this embodiment provides a simple method for estimating their distribution in the network. Generally, two posts with a social relationship are considered to have a false relationship if their content relevance is below a set threshold $\theta$; two posts without a social relationship are considered to have a potential relationship if their content relevance is above the threshold $\theta$. The cosine similarity between the TF-IDF representations of two posts is taken as their content relevance. Let $A$ be the adjacency matrix of the constructed post-level social relationship network, where $A_{ij}>0$ indicates that post nodes $s_i$ and $s_j$ have a social relationship connection and otherwise $A_{ij}=0$. For each node pair $(s_i,s_j)$ in the social relationship network: if they have a social relationship ($A_{ij}>0$) and their content relevance $\Phi_{ij}$ is below the threshold $\theta$, posts $s_i$ and $s_j$ are considered to have a false relationship; if there is no social relationship connection between them ($A_{ij}=0$) and their content relevance $\Phi_{ij}$ is above the threshold $\theta$, they are considered to have a potential relationship. In the experiments, the TF-IDF representation of each post is computed, and the cosine similarity of the TF-IDF representations of two posts is taken as their content relevance: the higher the cosine similarity, the higher the content relevance, and vice versa. These statistics can preliminarily reflect the distribution of noisy relationships in the post-level social relationship network. Since the strength of social relationships may differ greatly across social relationship networks, the average over all social relationships is used as the value of the threshold $\theta$:
$$\theta=\frac{1}{|\varepsilon|}\sum_{e_{ij}\in\varepsilon}\Phi_{ij}$$
where $\Phi_{ij}$ is the content relevance between posts $s_i$ and $s_j$. The final statistics for the Twitter and microblog datasets are shown in Table 2. The results show the average percentage of noisy relationships in the post-level social relationship network across all topics; the average noise percentage in the table covers both false relationships and potential relationships.
TABLE 2 twitter and microblog data set noise relationship distribution statistics in social relationship networks
Data set | Ratio of false relations | Ratio of potential relationships | Average noise ratio |
Twitter data set | 38.61% | 55.79% | 55.37% |
Microblog data set | 83.17% | 52.66% | 52.67% |
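The estimation procedure described above can be sketched as follows, assuming scikit-learn TF-IDF vectors and the mean content relevance over connected pairs as the threshold $\theta$; the returned ratios (false relationships among connected pairs, potential relationships among unconnected pairs) are a simplified reading of the procedure, and the function name is hypothetical.

```python
# Estimate the ratios of false and potential relationships in a post network.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def noise_ratios(posts: list, A: np.ndarray):
    Phi = cosine_similarity(TfidfVectorizer().fit_transform(posts))
    iu = np.triu_indices(len(posts), k=1)
    connected = A[iu] > 0
    theta = Phi[iu][connected].mean()     # threshold: mean over social relations
    false_rel = (connected & (Phi[iu] < theta)).sum() / max(connected.sum(), 1)
    latent = ((~connected) & (Phi[iu] > theta)).sum() / max((~connected).sum(), 1)
    return false_rel, latent
```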
(3) Denoising graph autoencoder
First, the pre-trained BERT model is used to extract features of the posts' textual content:
$$x_i=\mathrm{BERT}(s_i) \tag{1}$$
where $s_i$ is the $i$-th post and $x_i$ is its initial content representation. Encoding all $N$ posts finally yields the matrix $X\in\mathbb{R}^{N\times D}$, where $D$ is the dimension of each post's feature vector. For the input post-level social relationship network $\mathcal{G}$, the noise function adds noisy relationship instances to $\mathcal{G}$ to construct a pseudo social relationship network $\tilde{\mathcal{G}}$; the real social relationship network $\mathcal{G}$ and the corresponding pseudo social relationship network $\tilde{\mathcal{G}}$ form the paired training data $(\mathcal{G},\tilde{\mathcal{G}})$. After the posts' initial content representations are obtained and the pseudo social relationship network is constructed, the initial content representations $X$ and the pseudo social relationship network $\tilde{\mathcal{G}}$ are fed together into the residual graph attention network encoder. Formally, the encoder propagates information according to the following rules:
$$H^{(l+1)}=\sigma\big(\hat{A}H^{(l)}W^{(l)}+b^{(l)}\big) \tag{2}$$
$$\hat{A}_{ij}=\alpha_{ij}\,(A+I)_{ij} \tag{3}$$
where $H^{(l)}$ is the hidden representation of the encoder at layer $l$; $A$ is the adjacency matrix of the post-level social relationship network, with $A_{ij}$ the relationship between posts $s_i$ and $s_j$; $I$ is the identity matrix; $\hat{A}$ is the adjacency matrix after attention weights are applied, with $\hat{A}_{ij}$ the attention-weighted relationship between $s_i$ and $s_j$; $\sigma(\cdot)$ is a nonlinear activation function; $e_{ij}^{(l)}$ is the layer-$l$ attention score between posts $s_i$ and $s_j$; and $W^{(l)}$ and $b^{(l)}$ are the learnable parameters of the encoder at layer $l$. To further integrate the posts' initial content representations, $X=[x_1,\dots,x_N]$ is used as the encoder input, i.e., $H^{(0)}=X$. The attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head mechanism by mapping the latent representations into $K$ different subspaces, each called an attention head, with the attention weight computed separately in each subspace:
$$e_{ij}^{head_k}=\frac{\big(W_Q^{head_k}h_i\big)^{T}\big(W_K^{head_k}h_j\big)}{\sqrt{d_h}} \tag{4}$$
$$\alpha_{ij}^{head_k}=\frac{\exp\big(e_{ij}^{head_k}\big)}{\sum_{m}\exp\big(e_{im}^{head_k}\big)} \tag{5}$$
where $h_i$ and $h_j$ are the vector representations of posts $s_i$ and $s_j$ encoded by the residual graph attention network encoder; $e_{ij}^{head_k}$ and $\alpha_{ij}^{head_k}$ are the attention score and normalized attention weight between $s_i$ and $s_j$ in the $k$-th attention head; $(\cdot)^T$ is the transpose operation; $d_h$ is the dimension of the implicit representation in the attention computation (the layer superscript $(l)$ is omitted and the superscript $head_k$ denotes the $k$-th of the $K$ attention heads); and $W_Q^{head_k}$ and $W_K^{head_k}$ are the corresponding learnable parameters of the $k$-th head. Formulas (4) and (5) yield $K$ attention weights. A max-pooling operation automatically selects the strongest relationship across all subspaces as the true relationship between two post nodes, unifying the attention weights of the $K$ heads into a final attention score:
$$\alpha_{ij}=\max\big(\alpha_{ij}^{head_1},\dots,\alpha_{ij}^{head_K}\big) \tag{6}$$
where $\alpha_{ij}$ is the final attention weight between posts $s_i$ and $s_j$. The connections between layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network, which can pass input information directly to the output layer; the encoding rule of the encoder is therefore modified to:
$$H^{(l+1)}=\sigma\big(\hat{A}H^{(l)}W^{(l)}+b^{(l)}\big)+f\big(H^{(l)}\big) \tag{7}$$
where $f(\cdot)$ is a mapping function implemented by a feed-forward neural network with a nonlinear activation function:
$$f\big(H^{(l)}\big)=\sigma\big(W_f H^{(l)}+b_f\big) \tag{8}$$
where $W_f$ and $b_f$ are the learnable parameters of the mapping function and $\sigma(\cdot)$ is a nonlinear activation function. During encoding, the depth $L$ of the encoder determines the distance over which information propagates in the post-level social relationship network; the encoder encodes the posts according to the encoding rules of formulas (7) and (8), and the output of its last layer, $H^{(L)}=[h_1,\dots,h_N]$, is the vector representation of the encoded posts, where $h_i$ is the vector representation of post $s_i$ encoded by the residual graph attention network encoder.
after the vector representation of the post is obtained by the residual image attention network encoder, decoding the vector representation of the post by a decoder, and reconstructing a real social relationship network which does not contain a noise relationship, so as to learn, identify and remove the noise relationship in the pseudo social relationship network; in the training process, the model is trained according to the following loss function:
wherein,post s representing decoder prediction i And s j Probability of social relationship between them, A ij As posts s i And post s j True social relationship situation between them;is a prediction result of a decoder, whereinRepresenting a post s i Containing the word w j The probability of (d); s ij For real training labels, posts s are represented i Whether or not to contain the word w j If post s i Containing the word w j Then s ij 1, otherwise s ij 0. Wherein L is g Is the loss of reconstructing the original network structure, L c Reconstructing the loss of the original post content, and finally combining the two losses by using a balance parameter lambda to obtain the final loss L:
L=λL g +(1-λ)L c (11)
training the whole model of the text according to the loss function, taking the initial content representation of the real social relationship network and the posts as input in a testing stage after the training is finished, and coding the input through a residual image attention network coder to obtain an accurate post vector table which integrates social relationship information and text content information and removes noise relationshipIs given by H ═ H 1 ,h 2 ,…,h N ];
(4) Sparse-reconstruction-based summary extraction
After the accurate post vector representations $H=[h_1,h_2,\dots,h_N]$ are obtained, a sparse-reconstruction-based framework identifies the importance of the posts; the reconstruction process is modeled as:
$$\min_{V}\;\big\|H-VH\big\|_F^{2}+\beta\big\|V\big\|_{2,1}+\gamma\big\|S\odot V\big\|_F^{2},\quad\text{s.t. }V_{ii}=0 \tag{12}$$
where the symbols are as described above. The final importance score of each post is computed as:
$$\mathrm{score}(s_i)=\sum_{j=1}^{N}V_{ji} \tag{13}$$
The posts are then ranked according to their importance scores, and the highest-scoring posts are iteratively extracted and added to the candidate summary set until the summary length limit is reached.
In the specific implementation, the hyper-parameters are set in advance. The post representation dimension $D$ is set to 768. Since the probability distributions of the two kinds of noise often differ across social networks, the probabilities of relationship insertion and relationship loss in the noise function are set to 0.4 and 0.1 for the Twitter data; for the microblog data, both noise-function probabilities are set to 0.3. The balance parameter $\lambda$ between the two loss parts in the final loss function is set to 0.8. In the summary selection stage, the hyper-parameters are set to $\beta=\gamma=1$, and the redundancy threshold $\eta$ is set to 0.1.
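For reference, the hyper-parameter values stated in this paragraph might be collected into a single configuration as below; the key names are illustrative, not part of the patent.

```python
# Hyper-parameter settings of this embodiment, gathered in one place.
CONFIG = {
    "post_dim_D": 768,                            # BERT representation dimension
    "twitter": {"p_insert": 0.4, "p_drop": 0.1},  # noise-function probabilities
    "weibo":   {"p_insert": 0.3, "p_drop": 0.3},
    "lambda_balance": 0.8,                        # weight between L_g and L_c
    "beta": 1.0, "gamma": 1.0,                    # sparse-reconstruction regularizers
    "redundancy_threshold_eta": 0.1,
}
```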
To verify the effectiveness of the method of the invention (DSNSum), it is compared with two types of methods. The first type uses only the text content on social media to extract the summary, and includes:
Centroid [2]: uses centrality-based features to identify the sentences most similar to the cluster center as the summary.
LSA [3]: decomposes the feature matrix with the SVD technique and identifies the importance of each post according to the magnitudes of the singular values after matrix decomposition.
LexRank [4]: a PageRank-like graph ranking algorithm; it first builds a similarity network from the content similarity between posts, then runs the PageRank-like ranking algorithm on that network to identify the importance of each post node, and extracts the most important posts as the summary.
DSDR [5]: treats the summarization process as a reconstruction task and extracts the most representative posts as the summary by minimizing the reconstruction loss.
MDS-Sparse [6]: extracts a multi-document summary with a sparse-coding-based technique, minimizing the loss of reconstructing the original documents under a sparsity constraint to ensure the conciseness and importance of the summary.
PacSum [8]: a graph-based summarization method that extracts sentence features with BERT and models documents as a directed graph structure, taking into account the relative position information between sentences.
Spectral [9]: proposes a spectrum-based hypothesis, defines the concept of spectral importance, and extracts the sentences with higher spectral importance as the summary.
The second type uses not only the textual content features of posts but also the social relationship information between them, and includes:
SNSR [7]: based on sociological theory, models the social relationships between posts as a regularization term and introduces it into the sparse reconstruction framework, so that social relationships additionally guide the summary extraction process.
SCMGR [10]: uses a graph convolutional network to encode, on the social relationship network between posts, post representations that fuse textual content and social structure, and feeds the learned fused representations into a sparse reconstruction framework to extract important posts.
The experimental performance is evaluated with the ROUGE standard, specifically the four metrics ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-SU. ROUGE-N measures the n-gram overlap between the output summary and the reference summary; the experiments use the ROUGE-1 and ROUGE-2 standards. ROUGE-L measures the longest common subsequence between the output summary and the reference summary. ROUGE-SU measures the matching of unigrams and skip-bigrams (word pairs that allow gaps between words) between the output summary and the reference summary. In the subsequent experiments, these four metrics are denoted R-1, R-2, R-L, and R-SU, respectively.
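As an aside, ROUGE-1/2/L can be computed with the `rouge-score` package as sketched below; ROUGE-SU is not implemented in that package and requires the original ROUGE toolkit. The example strings are placeholders.

```python
# Minimal ROUGE evaluation sketch (covers R-1, R-2, R-L only).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the reference summary", "the system summary")
print({name: round(result.fmeasure, 4) for name, result in scores.items()})
```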
Table 3 shows the experimental results of the model and all comparison methods on the two datasets; a higher ROUGE score indicates better model performance. Tables 4 and 5 show the degradation (ablation) experiment results of the model on the Twitter [11] and microblog data, respectively, where DSNSum is the result of the complete model, w/o denoising is the performance after the denoising module is removed, and w/o GAT is the performance after the residual graph attention encoder is removed.
TABLE 3 Performance of the method of the present invention and other methods on twitter and microblog data sets
TABLE 4 results of degradation experiments on the twitter data by the method of the present invention
Twitter data | R-1 | R-2 | R-L | R-SU* |
DSNSum | 46.51 | 14.29 | 44.16 | 20.76 |
w/o denoising | 45.02 | 13.33 | 42.72 | 19.83 |
w/o GAT | 44.14 | 12.65 | 41.68 | 19.10 |
TABLE 5 degradation experiment results of the method of the present invention on microblog data
Microblog data | R-1 | R-2 | R-L | R-SU* |
DSNSum | 37.01 | 10.98 | 14.22 | 13.06 |
w/o denoising | 35.31 | 9.76 | 13.43 | 12.07 |
w/o GAT | 34.36 | 8.93 | 13.29 | 11.12 |
As the results in Table 3 show, the method of the present invention achieves the best performance on the twitter data, exceeding all comparison methods; on the microblog data it is slightly below the SCMGR model on the R-L metric but exceeds all comparison models on the remaining metrics. These results demonstrate the effectiveness of the proposed method.

In the ablation experiments (Tables 4 and 5), removing either module degrades performance, confirming that each module contributes to the full model. Removing the denoising module lowers performance, which shows that noise relationships inject extraneous information into the summarization process and harm the summary; the denoising module improves summary quality by identifying and removing the noise relationships in the network. Removing the graph attention network degrades performance even more, indicating that exploiting the social relationship information in the post-level social relationship network effectively aids content analysis in the social media setting: on the one hand, the graph attention network alleviates the insufficient content of individual posts by aggregating related background information from neighboring nodes in the post-level social relationship network; on the other hand, the topology of the network provides additional sociological cues for identifying important posts.
To further analyze whether the proposed denoising graph autoencoder module indeed removes noise relationships and improves the network structure, additional verification experiments were conducted. With the post representations held fixed (posts encoded by the same pre-trained BERT model), the proportion of noise relationships remaining in the network after denoising was computed; the results are shown in Table 6.
Table 6. Proportion of noise relationships in the denoised network on the twitter and microblog data; the values in parentheses give the drop relative to before denoising
Dataset | False relationship rate | Potential relationship rate | Overall noise rate |
Twitter data | 13.60% (↓25.01%) | 54.93% (↓0.86%) | 54.50% (↓0.87%) |
Microblog data | 45.29% (↓37.88%) | 49.48% (↓3.18%) | 46.57% (↓6.10%) |
As the table shows, with the post content representations unchanged, the overall noise rate in the denoised network is reduced, demonstrating the effectiveness of the denoising process. The false relationship rate drops by 25.01% on the twitter data and 37.88% on the microblog data after denoising, indicating that the denoising module is particularly effective at removing false relationships from the network.
To verify whether the post representations learned by the denoising graph autoencoder (DGAE) are better than the original BERT representations, the distribution of noise relationships in the network is compared, with the post-level social relationship network held fixed, between the DGAE representations learned by the proposed method and the BERT representations. Because the value of the threshold θ strongly affects the measured noise distribution, the experiments report the noise distribution under different values of θ. Specifically, θ is computed according to the following formula:
$$\theta = \min \Phi + \delta \cdot (\max \Phi - \min \Phi)$$
where Φ is the semantic similarity matrix between posts and δ is a tuning parameter. The experimental results are shown in Fig. 3.
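For illustration, a small sketch of how θ and the two noise rates could be computed from a post-similarity matrix Φ and an adjacency matrix A, under the false/potential relationship definitions used in this method; the function and variable names are assumptions made for this example.

```python
import numpy as np

def noise_rates(phi: np.ndarray, adj: np.ndarray, delta: float):
    """Compute theta = min(phi) + delta * (max(phi) - min(phi)) and the
    false/potential relationship rates in the network."""
    theta = phi.min() + delta * (phi.max() - phi.min())
    connected = adj > 0
    np.fill_diagonal(connected, False)
    # false relationship: edge exists but content relevance is below theta
    false_rel = connected & (phi < theta)
    # potential relationship: no edge but content relevance is above theta
    potential = (~connected) & (phi > theta)
    np.fill_diagonal(potential, False)
    n = adj.shape[0]
    false_rate = false_rel.sum() / max(connected.sum(), 1)
    non_edges = n * (n - 1) - connected.sum()
    potential_rate = potential.sum() / max(non_edges, 1)
    return theta, false_rate, potential_rate
```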
As Fig. 3 shows, as the threshold θ increases, the potential relationship rate decreases while the false relationship rate increases, and the overall noise rate remains at a high level throughout. After DGAE denoising, the potential relationship rate drops sharply and the false relationship rate also stays low. Most importantly, the total noise rate shows a clear downward trend relative to before denoising, demonstrating that DGAE effectively removes noise relationships from the network.

The x-axis in Fig. 3 represents the value of the tuning parameter δ. Subplots (a) and (c) correspond to representations encoded with the BERT model, and subplots (b) and (d) to representations learned by the denoising graph autoencoder.
Additional experiments analyze how the order and the proportions of the two noise relationships in the noise function affect model performance. The trend in model performance is observed while varying the probabilities with which the two noise relationships are added in the noise function. Whether the order in which the two noise relationships are added affects performance is also examined, comparing insertion-first-then-loss against loss-first-then-insertion. The results are shown in Figs. 4a and 4b.

Figs. 4a and 4b show the influence of the noising order and of the addition probabilities of the two noise relationships in the noise function on the experimental results. Fig. 4a shows the case where false relationships are added before potential relationships, and Fig. 4b the case where potential relationships are added before false relationships. The horizontal axis is the insertion probability of the noise relationship and the vertical axis the loss probability.
The above contents are intended to schematically illustrate the technical solution of the present invention, and the present invention is not limited to the above-described embodiments. Those skilled in the art can make various changes in form and detail without departing from the spirit and scope of the invention as defined by the appended claims.
Reference documents:
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
[2] Dragomir Radev, Sasha Blair-Goldensohn, and Zhu Zhang. 2001. Experiments in Single and Multi-Document Summarization Using MEAD. In First Document Understanding Conference. 1–8.
[3] Yihong Gong and Xin Liu. 2001. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 19–25.
[4] Gunes Erkan and Dragomir Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22 (2004), 457–479.
[5] Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document Summarization Based on Data Reconstruction. In Twenty-Sixth AAAI Conference on Artificial Intelligence. 620–626.
[6] He Liu, Hongliang Yu, and Zhi-Hong Deng. 2015. Multi-Document Summarization Based on Two-Level Sparse Representation Model. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 196–202.
[7] Ruifang He and Xingyi Duan. 2018. Twitter Summarization Based on Social Network and Sparse Reconstruction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. 5787–5794.
[8] Hao Zheng and Mirella Lapata. 2019. Sentence Centrality Revisited for Unsupervised Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 6236–6247.
[9] Kexiang Wang, Baobao Chang, and Zhifang Sui. 2020. A Spectral Method for Unsupervised Multi-Document Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 435–445.
[10] Huanyu Liu, Ruifang He, Liangliang Zhao, Haocheng Wang, and Ruifang Wang. 2021. SCMGR: Using Social Context and Multi-Granularity Relations for Unsupervised Social Summarization. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management. 1058–1068.
[11] Ruifang He, Liangliang Zhao, and Huanyu Liu. 2020. TWEETSUM: Event-oriented Social Summarization Dataset. In Proceedings of the 28th International Conference on Computational Linguistics. 5731–5736.
[12] Jing Li, Wei Gao, Zhongyu Wei, Baolin Peng, and Kam-Fai Wong. 2015. Using Content-level Structures for Summarizing Microblog Repost Trees. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2168–2178.
Claims (7)
1. An unsupervised social media summarization method based on a denoising graph autoencoder, characterized by comprising the following steps:
S1, constructing a post-level social relationship network according to sociological theory, defining the noise-free relationships in the post-level social relationship network, i.e., the real social relationship network, and obtaining the content encoding of each post with a pre-trained BERT model as the post's initial content representation;
S2, defining two noise relationship types, the false relationship and the potential relationship, according to the social behaviors and habits of users; adding instances of false relationships and potential relationships to the original post-level social relationship network through corresponding noise functions, thereby constructing a post-level social relationship network containing noise relationships, i.e., a pseudo social relationship network; sampling several generated pseudo social relationship networks, and feeding the sampled pseudo social relationship network instances together with the initial content representations of the posts into a residual graph attention network encoder, which comprises a multi-head attention mechanism and encodes the posts according to their initial content representations and social relationships to obtain vector representations of the posts;
S3, constructing a decoder, the decoder and the residual graph attention network encoder together forming the denoising graph autoencoder; the decoder reconstructs the real social relationship network from the vector representations of the posts so as to capture the social relationship information among posts, and simultaneously reconstructs the semantic relationship between each post and the words it contains so as to capture the text content information of the posts; because the real social relationship network without noise relationships is reconstructed, the encoder and decoder learn to exclude the noise relationships in the post-level social relationship network, finally yielding accurate post representations;
S4, according to the post representations obtained in step S3, selecting the final summary with a summary extractor based on sparse reconstruction: iteratively selecting the highest-scoring post and adding it to the final summary set, and repeating this process until the summary length limit is reached.
2. The unsupervised social media summarization method based on a denoising graph autoencoder according to claim 1, wherein step S1 is as follows: the post-level social relationship network consists of a node set and an edge set; each node in the node set represents a post, and each edge in the edge set represents a social relationship between the corresponding posts. Posts carry two kinds of social relationship: the expression consistency relationship and the expression infectivity relationship. The expression consistency relationship is the relationship among posts published by the same user; when building the post-level social relationship network, an edge is established between post nodes that have an expression consistency relationship. The expression infectivity relationship is the relationship among posts published by users who have a direct interactive relationship, where a direct interactive relationship means a follow, forward, or comment interaction between users; when building the post-level social relationship network, an edge is established between post nodes that have an expression infectivity relationship.
3. The unsupervised social media summarization method based on a denoising graph autoencoder as claimed in claim 2, wherein in step S1:
(101) The post-level social relationship network is formally described as follows. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ denote the set of posts, where N is the number of posts and $s_i$ ($1 \le i \le N$) is the i-th post; let $\mathcal{U} = \{u_1, u_2, \ldots, u_M\}$ denote the set of users, containing M users in total, where $u_i$ ($1 \le i \le M$) is the i-th user. For user $u_i$, let $\mathcal{N}(u_i)$ denote the set of neighbor users of $u_i$, i.e., the users having a direct social relationship with $u_i$, and let $\mathcal{S}_{u_i}$ denote the set of all posts published by user $u_i$. The post-level social relationship network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is built according to the following rules, where $\mathcal{V}$ is the node set with each node corresponding to one post, and $\mathcal{E}$ is the set of edges between nodes with each edge corresponding to a social relationship between posts. Expression consistency relationship: if posts $s_i, s_j \in \mathcal{S}_{u_k}$, where $u_k$ is the k-th user, then an edge $e_{ij} \in \mathcal{E}$ is established between posts $s_i$ and $s_j$. Expression infectivity relationship: if posts $s_i \in \mathcal{S}_{u_k}$ and $s_j \in \mathcal{S}_{u_l}$, and $u_l \in \mathcal{N}(u_k)$ or $u_k \in \mathcal{N}(u_l)$, then an edge $e_{ij} \in \mathcal{E}$ is established between posts $s_i$ and $s_j$. The post-level social relationship network $\mathcal{G}$ constructed by these two rules contains only the post node set $\mathcal{V}$ and the relationships between nodes, i.e., the edge set $\mathcal{E} = \{e_{11}, e_{12}, \ldots, e_{NN}\}$. The adjacency matrix of the constructed post-level social relationship network is denoted $A \in \mathbb{R}^{N \times N}$, where $A_{ij} > 0$ indicates that post nodes $s_i$ and $s_j$ have a social relationship connection, and otherwise $A_{ij} = 0$;
(102) The content encoding of each post is obtained with the pre-trained BERT model and used as the post's initial content representation, as follows:
For each post $s_i$, the post is fed into the pre-trained BERT model, and the last-layer representation of the sentence-start symbol is taken as the initial content representation of the post, as shown in equation (1):

$$x_i = \mathrm{BERT}(s_i) \quad (1)$$

where $x_i$ denotes the initial content representation of post $s_i$; the initial content representations of all N posts are finally collected as $X = [x_1, \ldots, x_N]$.
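The two construction rules of (101) and the BERT encoding of (102) can be sketched as follows. This is a minimal illustration rather than the exact implementation: the author/neighbor data layout and the model name are assumptions, and the BERT call uses the Hugging Face transformers library.

```python
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

def build_post_graph(posts, author_of, neighbors):
    """author_of[i] -> user id of post i; neighbors[u] -> set of users having
    direct follow/forward/comment interactions with user u (assumed layout)."""
    n = len(posts)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            ui, uj = author_of[i], author_of[j]
            same_user = ui == uj                      # expression consistency
            interact = (uj in neighbors.get(ui, set())
                        or ui in neighbors.get(uj, set()))  # expression infectivity
            if same_user or interact:
                A[i, j] = A[j, i] = 1.0
    return A

def bert_initial_representations(posts, model_name="bert-base-uncased"):
    """x_i = BERT(s_i): last-layer [CLS] vector of each post (equation (1))."""
    tok = BertTokenizer.from_pretrained(model_name)
    bert = BertModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        enc = tok(posts, padding=True, truncation=True, return_tensors="pt")
        out = bert(**enc).last_hidden_state[:, 0]     # [CLS] position
    return out   # X = [x_1, ..., x_N]
```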
4. The unsupervised social media summarization method based on a denoising graph autoencoder according to claim 1, wherein in step S2:
(201) the two noise relationships, the false relationship and the potential relationship, are defined as follows:
(a) false relationship: if a social relationship exists between two posts but their relevance in content is below a set threshold, the social relationship between the two posts is defined as a false relationship;
(b) potential relationship: if no social relationship exists between two posts but their relevance in content is above a set threshold, a potential relationship is defined between the two posts;
setting a noise function corresponding to the false relation as relation insertion, and setting a noise function of the potential relation as relation loss, wherein the specific steps are as follows:
(c) relationship insertion: randomly adding an edge to any two unconnected post nodes in the post level social relationship network, and connecting the two nodes;
(d) loss of relationship: randomly removing edges between any two connected post nodes in the post level social relationship network;
a pseudo social relationship network is constructed as training data by adding instances of noisy relationships to a real social relationship network.
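A minimal sketch of the noise function in this claim, producing one sampled pseudo social relationship network from the real adjacency matrix; the two probabilities are illustrative assumptions, not values fixed by the method. Sampling this function several times yields the multiple pseudo social relationship network instances used as encoder input in step S2.

```python
import numpy as np

def add_noise(A: np.ndarray, p_insert: float = 0.1, p_drop: float = 0.1,
              rng=None) -> np.ndarray:
    """Relation insertion: randomly connect unconnected post pairs.
    Relation loss: randomly remove existing edges."""
    rng = rng or np.random.default_rng()
    A_noisy = A.copy()
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)                 # upper triangle, no self-loops
    edges = A[iu] > 0
    insert = (~edges) & (rng.random(len(edges)) < p_insert)
    drop = edges & (rng.random(len(edges)) < p_drop)
    vals = A_noisy[iu]
    vals[insert] = 1.0                           # relation insertion
    vals[drop] = 0.0                             # relation loss
    A_noisy[iu] = vals
    A_noisy.T[iu] = vals                         # keep the matrix symmetric
    return A_noisy
```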
5. The unsupervised social media summarization method based on a denoising graph autoencoder as claimed in claim 4, wherein in step S2 a residual graph attention network encoder encodes each post according to the posts' initial content representations and the social relationships between posts, thereby integrating the text content information and the social relationship information of the posts. The residual graph attention network encoder can be viewed as an information propagation model that learns node representations in the post-level social relationship network by aggregating information from neighboring nodes, where neighboring nodes are nodes connected by an edge in the post-level social relationship network. The details are as follows:
The residual graph attention network encoder takes the initial content representations of the nodes $X \in \mathbb{R}^{N \times D}$ and the adjacency matrix $A \in \mathbb{R}^{N \times N}$ of the post-level social relationship network as input, where D is the dimension of the node feature representation and N is the number of posts. The propagation rules of the residual graph attention network encoder are shown in equations (2) and (3):

$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big) \quad (2)$$

$$\hat{A}_{ij} = \alpha_{ij} \cdot \mathbb{1}\big[(A + I)_{ij} > 0\big] \quad (3)$$

where $H^{(l)}$ is the hidden representation of the residual graph attention network encoder at the l-th layer; A is the adjacency matrix of the post-level social relationship network, with $A_{ij}$ representing the connection between posts $s_i$ and $s_j$; I is the identity matrix; $\hat{A}$ is the adjacency matrix of the post-level social relationship network after the attention weights have been added, with $\hat{A}_{ij}$ denoting the attention-weighted relation between posts $s_i$ and $s_j$; σ(·) is a nonlinear activation function; $e_{ij}^{(l)}$ is the attention score between posts $s_i$ and $s_j$ at the l-th layer; and $W^{(l)}$ and $b^{(l)}$ are the learnable parameters of the encoder at the l-th layer. To further integrate the initial content representations of the posts, $X = [x_1, \ldots, x_N]$ is used as the input of the residual graph attention network encoder, i.e., $H^{(0)} = X$. The attention weights are computed with scaled dot-product attention [1], and the ordinary attention mechanism is extended to a multi-head attention mechanism by mapping the latent representations into K different subspaces, where K is the total number of heads of the multi-head attention mechanism and each subspace is called an attention head; the attention weight is computed separately in each subspace:

$$e_{ij}^{\mathrm{head}_k} = \frac{\big(W_Q^{\mathrm{head}_k} h_i\big)^{T} \big(W_K^{\mathrm{head}_k} h_j\big)}{\sqrt{d_h}} \quad (4)$$

$$\alpha_{ij}^{\mathrm{head}_k} = \frac{\exp\big(e_{ij}^{\mathrm{head}_k}\big)}{\sum_{m} \exp\big(e_{im}^{\mathrm{head}_k}\big)} \quad (5)$$

where $h_i$ and $h_j$ are the vector representations of posts $s_i$ and $s_j$ encoded by the residual graph attention network encoder; $e_{ij}^{\mathrm{head}_k}$ and $\alpha_{ij}^{\mathrm{head}_k}$ are, respectively, the attention score and the normalized attention weight between posts $s_i$ and $s_j$ in the k-th attention head; $(\cdot)^T$ denotes the transpose operation; $d_h$ is the dimension of the hidden representation in the attention computation; the layer superscript (l) is omitted here and the superscript $\mathrm{head}_k$ indicates the k-th attention head; and $W_Q^{\mathrm{head}_k}$ and $W_K^{\mathrm{head}_k}$ are the corresponding learnable parameters of the k-th attention head. Equations (4) and (5) yield K attention weights; a max-pooling operation is adopted to automatically select the strongest relation across all subspaces as the real relation between two post nodes, unifying the attention weights of the K attention heads into a final attention score:

$$\alpha_{ij} = \max_{k = 1, \ldots, K} \alpha_{ij}^{\mathrm{head}_k} \quad (6)$$

where $\alpha_{ij}$ is the final attention weight between posts $s_i$ and $s_j$. The connections between the layers of an ordinary graph attention network are replaced with residual connections to form the residual graph attention network encoder, so that the encoder can pass input information directly to the output layer; the encoding rule of the residual graph attention network encoder is therefore modified into the following form:

$$H^{(l+1)} = \sigma\big(\hat{A} H^{(l)} W^{(l)} + b^{(l)}\big) + f\big(H^{(l)}\big) \quad (7)$$

where f(·) is a mapping function implemented by a feed-forward neural network with a nonlinear activation function:

$$f\big(H^{(l)}\big) = \sigma\big(W_f H^{(l)} + b_f\big) \quad (8)$$

where $W_f$ and $b_f$ are the corresponding learnable parameters of the mapping function and σ(·) is a nonlinear activation function. During encoding, the depth of the residual graph attention network encoder determines the information propagation distance in the post-level social relationship network. The encoder encodes the posts according to the encoding rules of equations (7) and (8), and the output of the last layer $H = [h_1, \ldots, h_N]$ is the vector representation of the encoded posts, where $h_i$ denotes the representation of post $s_i$ encoded by the residual graph attention network encoder, used in the subsequent sparse-reconstruction-based summary extraction process.
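A minimal PyTorch sketch of one layer of the residual graph attention network encoder following equations (2)–(8): K scaled dot-product attention heads (equations (4)–(5)), max-pooling over heads (equation (6)), masking by the graph with self-loops, and a residual feed-forward mapping (equations (7)–(8)). The fused per-head projections, dimensions, and activation choices are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGATLayer(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads, self.d_h = heads, dim // heads
        self.W_q = nn.Linear(dim, dim)      # per-head query maps, fused
        self.W_k = nn.Linear(dim, dim)      # per-head key maps, fused
        self.W = nn.Linear(dim, dim)        # W^(l) of equation (2)
        self.f = nn.Linear(dim, dim)        # mapping f(.) of equation (8)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        N, _ = H.shape
        q = self.W_q(H).view(N, self.heads, self.d_h)
        k = self.W_k(H).view(N, self.heads, self.d_h)
        # equation (4): scaled dot-product scores per head, shape (heads, N, N)
        e = torch.einsum("ihd,jhd->hij", q, k) / self.d_h ** 0.5
        mask = (A + torch.eye(N, device=H.device)) > 0   # graph + self-loops
        e = e.masked_fill(~mask, float("-inf"))
        alpha = F.softmax(e, dim=-1)                     # equation (5)
        alpha = alpha.max(dim=0).values                  # equation (6): max over heads
        H_next = torch.relu(alpha @ self.W(H))           # equation (2), attention-weighted
        return H_next + torch.relu(self.f(H))            # equations (7)-(8): residual path
```

Stacking several such layers gives the encoder; the number of layers controls how far information propagates in the post-level network, matching the depth remark above.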
6. The unsupervised social media summarization method based on a denoising graph autoencoder according to claim 1, wherein step S3 is as follows:
A decoder is set up with two reconstruction targets: on one hand the decoder reconstructs the real social relationship network without noise relationships to capture the social relationship information among posts; on the other hand it reconstructs the text content contained in the posts, thereby capturing the text content information of the posts and further enriching their vector representations;
For the reconstruction of the real social relationship network, the decoder predicts whether a social relationship exists between two post nodes from the vector representations of the two nodes; specifically, the probability that a social relationship exists between two nodes is predicted from the inner product of their vector representations:

$$\hat{A}_{ij} = \sigma\big(h_i^{T} h_j\big) \quad (9)$$

where $(\cdot)^T$ denotes the transpose operation on a vector representation. For each pair of posts $s_i$ and $s_j$, the decoder predicts the probability of a social relationship between them, where $\hat{A}$ is the adjacency matrix of the post-level social relationship network output by the decoder, $\hat{A}_{ij}$ is the predicted probability that a social relationship exists between posts $s_i$ and $s_j$, $h_i$ and $h_j$ are the vector representations of posts $s_i$ and $s_j$ encoded by the residual graph attention network encoder, and σ(·) denotes the nonlinear activation function;
For text content reconstruction, the relationship between posts and words is reconstructed, preserving the text content information of each post by reconstructing the words it contains. Since each post typically contains several words, the text content reconstruction process is modeled as a multi-label classification task:

$$\hat{s}_i = \sigma\big(W_c h_i + b_c\big) \quad (10)$$

where $W_c \in \mathbb{R}^{V \times Z}$ and $b_c \in \mathbb{R}^{V}$ are the learnable parameters of the decoder, Z is the dimension of the post vector representations produced by the encoder, and V is the vocabulary size; $\hat{s}_i$ is the prediction result of the decoder, in which $\hat{s}_{ij}$ denotes the probability that post $s_i$ contains word $w_j$;
Corresponding loss functions are designed for the two reconstruction targets, so the overall training objective comprises two parts. The first part is the loss of reconstructing the real social relationship network, denoted $L_g$, computed as the binary cross-entropy between the prediction $\hat{A}$ and the adjacency matrix A of the real social relationship network:

$$L_g = -\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \Big[ A_{ij} \log \hat{A}_{ij} + \big(1 - A_{ij}\big) \log\big(1 - \hat{A}_{ij}\big) \Big] \quad (11)$$

The second part is the loss of reconstructing the post content, denoted $L_c$, computed as the binary cross-entropy between the decoder prediction $\hat{s}_i$ and the true result $s_i$:

$$L_c = -\frac{1}{NV} \sum_{i=1}^{N} \sum_{j=1}^{V} \Big[ s_{ij} \log \hat{s}_{ij} + \big(1 - s_{ij}\big) \log\big(1 - \hat{s}_{ij}\big) \Big] \quad (12)$$
where $s_{ij}$ is the true training label indicating whether post $s_i$ contains word $w_j$: if post $s_i$ contains word $w_j$ then $s_{ij} = 1$, otherwise $s_{ij} = 0$. Finally, the two losses are combined with a balance parameter λ to obtain the final loss function L:

$$L = \lambda L_g + (1 - \lambda) L_c \quad (13)$$
The residual graph attention network encoder and the decoder are trained according to this loss function; after training, accurate post representations $H = [h_1, h_2, \ldots, h_N]$ are obtained that fuse social relationship information with text content information and from which the noise relationships have been removed.
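A sketch of the decoder and the joint objective of equations (9)–(13): an inner-product decoder over the clean adjacency matrix and a multi-label word-reconstruction head, combined with binary cross-entropy losses. Shapes and the default λ are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGAEDecoder(nn.Module):
    def __init__(self, z_dim: int, vocab_size: int):
        super().__init__()
        self.word_head = nn.Linear(z_dim, vocab_size)   # W_c, b_c of equation (10)

    def forward(self, H: torch.Tensor):
        A_hat = torch.sigmoid(H @ H.T)                  # equation (9): inner products
        S_hat = torch.sigmoid(self.word_head(H))        # equation (10): word probabilities
        return A_hat, S_hat

def dgae_loss(A_hat, S_hat, A_true, S_true, lam: float = 0.5):
    """Equations (11)-(13): BCE on the clean graph plus BCE on post-word labels."""
    L_g = F.binary_cross_entropy(A_hat, A_true)         # reconstruct real social network
    L_c = F.binary_cross_entropy(S_hat, S_true)         # reconstruct post content
    return lam * L_g + (1 - lam) * L_c                  # equation (13)
```

Because `A_true` is the real (noise-free) adjacency matrix while the encoder sees noised graphs, training this objective pushes the encoder to ignore the injected noise relationships.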
7. The unsupervised social media summarization method based on a denoising graph autoencoder as claimed in claim 1, wherein step S4 is as follows:
Given the accurate, noise-free post representations $H = [h_1, h_2, \ldots, h_N]$ encoded by the residual graph attention network encoder, the summary extraction process is modeled as a sparse reconstruction process:

$$\min_{V} \; \| H - (S \odot V) H \|_F^2 + \beta \|V\|_{2,1} + \gamma \|V\|_F^2, \quad \text{s.t. } V_{ii} = 0 \quad (14)$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $V \in \mathbb{R}^{N \times N}$ is the reconstruction coefficient matrix, each element $V_{i,j}$ of which represents the contribution of post $s_j$ to reconstructing post $s_i$. To prevent extracting repeated, redundant content, a similarity matrix $S \in \mathbb{R}^{N \times N}$ is introduced to remove redundant information: if the cosine similarity between posts $s_i$ and $s_j$ is higher than a specified threshold η, then $S_{ij} = 0$, otherwise $S_{ij} = 1$; $\odot$ denotes the Hadamard product. To prevent a post from reconstructing itself, the diagonal elements of the reconstruction coefficient matrix V are constrained to 0 during reconstruction; β and γ are hyper-parameters controlling the weights of the corresponding regularization terms; H is the accurate post representation; and $\|\cdot\|_{2,1}$ denotes the L2,1 norm, defined as follows:

$$\|V\|_{2,1} = \sum_{i=1}^{N} \sqrt{\sum_{j=1}^{N} V_{ij}^2} \quad (15)$$
Adding the L2,1 constraint to the reconstruction coefficient matrix V makes each of its rows sparse, i.e., most elements of each row are 0, which means each post can be reconstructed by only a limited number of posts, thereby limiting the length of the summary. The final score of each post is defined as the sum of its contributions to reconstructing all the other posts:

$$\mathrm{score}(s_i) = \sum_{j=1}^{N} V_{j,i} \quad (16)$$

where $\mathrm{score}(s_i)$ denotes the final score of post $s_i$. Finally all posts are ranked by their final scores, the highest-scoring post is iteratively selected and added to the final summary set, and this process is repeated until the summary length limit is reached.
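A sketch of the sparse-reconstruction extractor of equations (14)–(16), solved here by gradient-based optimization (Adam) for simplicity, since the claim does not fix an optimizer; η, β, γ, the learning rate, the iteration count, and the final top-k selection (in place of fully iterative selection) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def extract_summary(H: torch.Tensor, budget: int, eta: float = 0.9,
                    beta: float = 0.1, gamma: float = 0.1,
                    lr: float = 0.01, steps: int = 500):
    N = H.shape[0]
    # similarity mask S: block reconstruction between near-duplicate posts
    Hn = F.normalize(H, dim=1)
    S = (Hn @ Hn.T < eta).float()
    S.fill_diagonal_(0.0)                        # enforces V_ii = 0 in S * V
    V = torch.zeros(N, N, requires_grad=True)
    opt = torch.optim.Adam([V], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        M = S * V                                # Hadamard-masked coefficients
        recon = ((H - M @ H) ** 2).sum()         # Frobenius reconstruction loss
        l21 = V.norm(dim=1).sum()                # row-wise L2,1 norm, equation (15)
        loss = recon + beta * l21 + gamma * (V ** 2).sum()
        loss.backward()
        opt.step()
    scores = (S * V).detach().abs().sum(dim=0)   # equation (16): column sums
    return torch.topk(scores, budget).indices    # highest-scoring posts up to budget
```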