CN112883716B - Twitter abstract generation method based on topic correlation

Info

Publication number: CN112883716B
Application number: CN202110151630.9A
Authority: CN
Inventors: 陈子忠; 曹洋洋; 夏书银
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2021-02-03
Filing date: 2021-02-03
Publication date: 2022-05-03
Anticipated expiration: 2041-02-03
Also published as: CN112883716A

Abstract

The invention discloses a topic-relevance-based method for generating Twitter summaries. The method comprises: building a word bank for each topic from the distribution of nouns across the topics; calculating the relevance between a tweet and a given topic from the topic-specific word bank of each topic and a trained word vector model; calculating the public acceptance from the network interaction information; combining the public acceptance and the topic relevance to obtain the final significance of the tweet; and removing redundancy with a maximal marginal relevance algorithm and outputting the summary. The method selects summary tweets on the basis of topic relevance and tweet significance and controls the redundancy of the final summary, so that the generated summary jointly accounts for topic, diversity and social recognition and achieves higher topic relevance, novelty and generality.

Description

Twitter abstract generation method based on topic correlation

Technical Field

The invention relates to text summarization technology in natural language processing and is used to automatically generate topic summaries of tweet statements. Specifically, given a particular topic and several text tweets, it produces a summary relevant to that topic.

Background

The rapid growth of social network media and self-media has driven research on summarization methods for condensing massive amounts of data. Because no large-scale public data set exists for social network data, most summarization research on such data still relies on traditional unsupervised methods. Methods based on statistical features work mainly from the relative position of sentences, word frequency and similar features; they are easy to implement, but the features they obtain are often relatively simple. Methods based on graph models treat the sentences of a text as nodes and the similarity scores between sentences as edges, compute the significance of each node from the nodes and the edge weights, and select the most significant sentences as the summary. Methods based on data reconstruction convert the text into a two-dimensional matrix and use matrix reconstruction to find the n sentences that best reconstruct the source text, which then serve as the summary. Most recent research on Twitter summarization combines static and dynamic social network data, but still uses a traditional method as the underlying algorithm.

Existing Twitter summarization research summarizes the statements made about a certain topic or event, and the summarization of tweets with respect to a given topic has rarely been studied. In addition, existing automatic summarization methods do not exploit the common characteristics of large-scale social network data.

Disclosure of Invention

To address the fact that existing summary generation methods take neither specific topics nor social network data into account, the invention builds, on a statistical basis, large-scale social network data sets for different topics and then designs a summary generation method based on a topic thesaurus.

In order to achieve the above object, the technical solution adopted by the present invention is a twitter summary generation method based on topic correlation, comprising the following steps:

1) Preprocess and clean the original data to obtain the tweet sets, and extract the network interaction information of the tweets.

2) Count the word frequencies of the nouns, verbs and adjectives in each tweet set, take the words whose frequency ranks in the top 1% as candidate topic words, and remove the candidates whose frequency in other topics is higher than k; the remaining words form the final topic word set.

3) From the topics, select the one closest to the source text as the given topic, and calculate the relevance of each tweet to the given topic from the topic word set.

4) Calculate the public acceptance from the network interaction information.

5) Combine the public acceptance and the topic relevance to obtain the final significance of the tweet, expressed as RankScore = ω·SS_T + (1-ω)·R, where SS_T is the relevance measure of a sentence to topic T, R is the public acceptance, and ω is a hyperparameter.

6) Remove redundancy with a maximal marginal relevance algorithm and output the summary.

By adopting the technical scheme, the invention has the following beneficial technical effects:

The invention provides a new topic relevance measure suited to the topical nature and data sparsity of Twitter platform data. Using the topic-specific word bank of each topic and a trained word vector model, the relevance between a tweet and a given topic can be calculated, so that summaries closer to the target topic are selected.

By building a word bank for each topic, the invention better accounts for the distinctiveness of the statements made on different topics and for the distribution of the whole data set.

The invention adopts a new maximal marginal relevance algorithm to reduce redundant information, taking both the coverage and the diversity of the summary into account. The resulting summary therefore condenses the information better and has more novel content.

The method makes effective use of social network data by folding it into a public acceptance score, which serves as one criterion for selecting summary tweets. For a tweet published by a user, the amount of public interaction reflects how much attention people pay to the tweet and how much they agree with it. In general, a high level of attention and agreement suggests that the text is more fluent and more informative, and the purpose of text summarization is to select sentences with high information coverage, novelty and generality. Incorporating the interaction information into the algorithm therefore yields summaries that are more informative and more fluent.

In conclusion, the method selects summary tweets on the basis of topic relevance and tweet significance and controls the redundancy of the final summary, so that the generated summary jointly accounts for topic, diversity and social recognition. The resulting summary therefore has higher topic relevance, novelty and generality.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Given the topical nature and sparsity of social network data, most studies first screen the tweets by topic to reduce sparsity and then summarize the screened tweets. Summaries extracted from the tweets of a given topic have better topic relevance, yet previous research has usually focused on the generality of the summary and its coverage of the source text, and few studies have considered the topic relevance of the summary itself. In social network data, a statement that someone publishes usually relates to some topic, and the topics discussed differ between users and between time periods. When a set of statements is summarized and the topic of the summary is specified, a summary that is more relevant to that topic is naturally desired. The invention therefore designs a summarization method that takes the topic into account. Trained on large-scale social network data, the method predefines a number of prior topics and topic word banks. Its technical effect is as follows: given a prior topic and several tweets on that topic, a summary relevant to that topic is obtained.

The term frequency-inverse document frequency (TF-IDF) algorithm characterizes, to some extent, the importance of a word in a piece of text. Its main idea is that the more often a word appears in one piece of text and the less often it appears across the whole document collection, the more important it is. Because social network data as a whole usually covers many topics, the text data of each topic can be treated as a separate class, and the frequency distribution of words will differ between classes. We therefore consider a word that appears often in one class but rarely in the others to be a "common word" of that topic class. In this way a word bank specific to each topic can be built. Tweets are also rich in social network interaction information: each tweet has a retweet count, a like count, a comment count and so on, and this social recognition reflects, to some extent, how fluent, complete and general the wording of the tweet is, so it is used as one criterion for selecting summary tweets. Based on the above, the invention designs a Twitter summarization method based on social network interaction information and topic relevance. With reference to FIG. 1, its specific steps are as follows:

1. Data preparation: Because published tweet corpora and summary corpora are lacking, we collected a number of tweets through the public Twitter API during 2019 and 2020. After the original data are obtained, sparsity is first reduced: the word frequencies of the nouns in all tweets are counted, and the top n topic nouns are selected as hot topic words. The tweets are then filtered by these prior topic words: if a statement in the corpus relates to one of the n topics, or its hashtags relate to one of the topics, it is assigned to the category of that topic. This finally yields n tweet sets, each relating to one topic.
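The sparsity-reduction step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the invention: the nouns of each tweet are assumed to have been extracted already, a tweet is matched to a topic by simple whitespace tokens (with a leading # stripped), and the names `hot_topic_words` and `group_by_topic` are hypothetical.

```python
from collections import Counter, defaultdict


def hot_topic_words(noun_lists, n):
    """Return the n most frequent nouns across all tweets.

    noun_lists holds one list of nouns per tweet (assumed to be
    produced beforehand by a part-of-speech tagger).
    """
    counts = Counter(noun for nouns in noun_lists for noun in nouns)
    return [word for word, _ in counts.most_common(n)]


def group_by_topic(tweets, topics):
    """Assign each tweet to every topic named in its text or hashtags."""
    groups = defaultdict(list)
    for tweet in tweets:
        # Lowercase whitespace tokens; "#Election" becomes "election".
        tokens = {tok.lower().lstrip("#") for tok in tweet.split()}
        for topic in topics:
            if topic in tokens:
                groups[topic].append(tweet)
    return dict(groups)
```

Tweets that mention none of the hot topic words fall into no group and are thereby dropped, which is the intended effect of the filtering.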

2. Data cleaning: Noisy information such as hashtags, @ mentions, URLs and numbers at the end of a tweet is removed first, and tweets with fewer than m words are then discarded. The like, retweet and comment counts of each tweet are extracted with regular expressions, and a count that cannot be extracted is set to 0. This yields n processed tweet sets. A word vector model is then trained on the cleaned data set with the skip-gram model, for use in the later calculation of similarity between words.
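A minimal Python sketch of the cleaning step is given below. The regular expressions, the choice to delete whole hashtag and mention tokens, and the `label: number` layout of the raw interaction record are all assumptions made for illustration; the text does not specify them. Training the skip-gram model is not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_OR_MENTION_RE = re.compile(r"[#@]\w+")
TRAILING_NUM_RE = re.compile(r"[\s\d]+$")


def clean_tweet(text):
    """Remove URLs, hashtags, @ mentions and numbers at the end of a tweet."""
    text = URL_RE.sub(" ", text)
    text = TAG_OR_MENTION_RE.sub(" ", text)
    text = TRAILING_NUM_RE.sub("", text)
    return " ".join(text.split())


def clean_corpus(tweets, m):
    """Clean every tweet and drop those with fewer than m words."""
    cleaned = (clean_tweet(t) for t in tweets)
    return [t for t in cleaned if len(t.split()) >= m]


def extract_count(raw, label):
    """Read a count such as 'likes: 12' from a raw record; 0 if absent."""
    match = re.search(rf"{re.escape(label)}\s*[:=]\s*(\d+)", raw)
    return int(match.group(1)) if match else 0
```

For example, `extract_count("likes: 12 retweets: 3", "comments")` falls back to 0 because no comment count is present in the record.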

3. Building the topic word sets: Words of different parts of speech in the data set are first identified with the Stanza named entity recognition tool. The word frequencies of the nouns, verbs and adjectives in each tweet set are counted, and the words whose frequency ranks in the top 1% are taken as candidate topic words. Because some words may be common nouns or may be strongly related to several topics, candidates whose frequency in other topics is greater than k are removed, and the remaining words form the final topic word set.
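The construction of the topic word sets can be sketched as follows. The sketch assumes that part-of-speech filtering has already been done, so each topic maps to a flat list of its nouns, verbs and adjectives; it also reads "top 1%" as the top 1% of distinct words by frequency, rounded up to at least one word. The function name and default values are hypothetical.

```python
import math
from collections import Counter


def topic_word_sets(topic_tokens, top_fraction=0.01, k=5):
    """Build one topic word set per topic.

    topic_tokens maps a topic name to the list of noun, verb and
    adjective tokens found in that topic's tweet set.
    """
    counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}
    result = {}
    for topic, counter in counts.items():
        # Candidates: the most frequent distinct words of this topic.
        n_candidates = max(1, math.ceil(len(counter) * top_fraction))
        candidates = [word for word, _ in counter.most_common(n_candidates)]
        # Drop candidates whose frequency in any other topic exceeds k.
        result[topic] = {
            word
            for word in candidates
            if all(counts[other][word] <= k for other in counts if other != topic)
        }
    return result
```

A word such as "people" that is frequent in several topics is removed from all of them, which leaves only the words that distinguish one topic from the rest.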

4. Topic relevance: Once the topic word set is obtained, the relevance of a tweet to a given topic is calculated as follows:

sim(a,b) = (a·b^T)/(|a|·|b|)

s(w,t_i) = sim(emb[t_i], emb[w]), t_i ∈ T_words

F(w,T) = max{s(w,t_1), s(w,t_2), ..., s(w,t_n)}

Here, the sim function calculates the cosine similarity between two word vectors, and a and b denote the two word vectors; sr is a regularization term on sentence length, used to offset the tendency of the model to select shorter sentences; L is the set of nouns, verbs and adjectives in a sentence, L_i refers to the i-th sentence, and m denotes the maximum tweet quantity in the tweet set; the function s(w, t_i) calculates the similarity between word w and word t_i; F(w, T) is the degree of membership of word w in topic T; emb is the word embedding model that converts a word id into a word vector, namely the word vector model trained with the skip-gram method; T_words is the set of topic words of a topic; SS_T is the relevance measure of a sentence to topic T, and σ is an adjustable hyperparameter; n denotes the number of tweets in the source text, and L[i] denotes the i-th word in L.
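The word-level part of this calculation, sim and F(w, T), can be sketched in Python as below. The embedding table is assumed to be a plain dictionary from word to vector, standing in for the trained skip-gram model, and words missing from it are given a membership of 0, which is a choice made here rather than something the text states. The sentence-level measure SS_T and the length term sr are left out because their formulas are not given above.

```python
import math


def cosine(a, b):
    """Cosine similarity sim(a, b) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def membership(word, topic_words, emb):
    """F(w, T): the highest similarity between w and any topic word."""
    if word not in emb:
        return 0.0
    sims = [cosine(emb[word], emb[t]) for t in topic_words if t in emb]
    return max(sims, default=0.0)
```

With toy vectors `{"vaccine": [1, 0], "virus": [0, 1], "dose": [1, 1]}`, the membership of "dose" in the topic {"vaccine", "virus"} is the cosine of 45 degrees, about 0.707.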

5. Public acceptance: If a tweet has more retweets, likes and comments than other tweets, its relative acceptance within the document is considered higher than that of the others. It is calculated as follows:

R_i = α·c_i + β·re_i + γ·l_i

where c_i, re_i and l_i are the min-max (dispersion) normalized values of the like count, retweet count and comment count of the i-th tweet, respectively; α, β and γ are adjustable hyperparameters satisfying α + β + γ = 1; and R_i denotes the public acceptance of the i-th tweet.
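A short Python sketch of this score follows. It reads the standardization as min-max scaling of each count to the range [0, 1] over the tweet set, maps a constant column to 0, and uses illustrative weights; the default values of alpha, beta and gamma are assumptions, not values given in the text.

```python
def min_max(values):
    """Scale a list of counts to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def public_acceptance(likes, retweets, comments, alpha=0.4, beta=0.3, gamma=0.3):
    """R_i = alpha*c_i + beta*re_i + gamma*l_i for every tweet i."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    c, re_, l = min_max(likes), min_max(retweets), min_max(comments)
    return [alpha * ci + beta * ri + gamma * li for ci, ri, li in zip(c, re_, l)]
```

Because each normalized count lies in [0, 1] and the weights sum to 1, every R_i also lies in [0, 1], which keeps it on a scale that can be blended with another bounded score.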

6. Combining the public acceptance and the topic relevance information, the final significance of a tweet is:

RankScore = ω·SS_T + (1-ω)·R

where ω is an adjustable hyperparameter that balances the two kinds of information.
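As a small worked illustration, the combination can be written in Python as below. Restricting ω to the interval [0, 1] is an assumption made here so that the score is a weighted average of the two inputs; the function names are hypothetical.

```python
def rank_score(ss_t, r, omega=0.5):
    """RankScore = omega * SS_T + (1 - omega) * R."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega is assumed to lie in [0, 1]")
    return omega * ss_t + (1.0 - omega) * r


def rank_tweets(tweets, ss_t_scores, r_scores, omega=0.5):
    """Return the tweets sorted by descending RankScore."""
    scored = [
        (rank_score(s, r, omega), tweet)
        for tweet, s, r in zip(tweets, ss_t_scores, r_scores)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tweet for _, tweet in scored]
```

With ω = 0.75, a tweet with SS_T = 0.8 and R = 0.2 scores 0.75·0.8 + 0.25·0.2 = 0.65, so topic relevance dominates the ranking.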

7. Redundancy penalty strategy: To keep the redundancy of the selected summary as small as possible, the invention adopts a modified maximal marginal relevance (MMR) strategy, as follows:

1) Initialize the sets:

A denotes the set used to store the summary, B denotes the set of tweets sorted by significance score, x_i denotes the i-th tweet, and n denotes the total number of tweets. The significance score of each tweet is calculated as S_i = RankScore_i.

2) Take the i-th element x_i from set B. If x_i satisfies

len(set(x_i) ∩ set(s^*)) < k, s^* ∈ A

then move x_i from set B to set A; otherwise delete x_i from set B. Here set(x_i) denotes the set of words in x_i after duplicate words are removed, len() counts the words shared by x_i and s^*, and k is a hyperparameter that sets the threshold on the number of shared words.

3) Repeat step 2 until

or the number of tweets in set A reaches the expected summary length.
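The redundancy removal procedure can be sketched in Python as follows. This is a minimal illustration rather than the exact procedure of the invention: it assumes lowercase whitespace tokenization, applies the overlap condition to every tweet already in A, and assumes the loop also ends once B has been exhausted. The names `select_summary` and `max_len` are hypothetical.

```python
def select_summary(ranked_tweets, k, max_len):
    """Greedy redundancy removal over tweets sorted by descending score.

    A tweet is kept only if it shares fewer than k distinct words
    with every tweet that has already been selected.
    """
    summary = []                       # set A: the selected tweets
    candidates = list(ranked_tweets)   # set B: the remaining tweets
    while candidates and len(summary) < max_len:
        x = candidates.pop(0)          # highest-scoring remaining tweet
        words = set(x.lower().split())
        if all(len(words & set(s.lower().split())) < k for s in summary):
            summary.append(x)          # move x from B to A
        # otherwise x is discarded from B
    return summary
```

Because the candidates are visited in score order, the most significant tweet is always kept, and each later tweet is admitted only if it adds sufficiently different wording.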

Claims

1. A Twitter summary generation method based on topic relevance, characterized by comprising the following steps:

1) preprocessing and cleaning the original data to obtain tweet sets, and extracting the network interaction information of the tweets;

2) counting the word frequencies of the nouns, verbs and adjectives appearing in each tweet set, taking the words whose frequency ranks in the top 1% as candidate topic words, and removing the candidate topic words whose frequency in other topics is higher than k, the remaining words forming the final topic word set;

3) selecting, from the topics, the topic closest to the source text as the given topic, and calculating the relevance of a tweet to the given topic from the topic word set by the following method:

sim(a,b) = (a·b^T)/(|a|·|b|)

s(w,t_i) = sim(emb[t_i], emb[w]), t_i ∈ T_words

F(w,T) = max{s(w,t_1), s(w,t_2), ..., s(w,t_n)}

wherein the sim function calculates the cosine similarity between two word vectors, and a and b denote the two word vectors; sr is the regularization term on sentence length; L is the set of nouns, verbs and adjectives in the current sentence, L_i refers to the i-th sentence, and m denotes the maximum tweet quantity in the tweet set; the function s(w, t_i) calculates the similarity between word w and word t_i; F(w, T) is the degree of membership of word w in topic T; T_words is the set of topic words of a topic; emb is the word embedding model that converts a word id into a word vector; SS_T is the relevance measure of a sentence to topic T, σ is an adjustable hyperparameter, n denotes the number of tweets in the source text, and L[i] denotes the i-th word in L;

4) calculating the public acceptance from the network interaction information according to the following formula: R_i = α·c_i + β·re_i + γ·l_i, wherein c_i, re_i and l_i are the min-max (dispersion) normalized values of the like count, retweet count and comment count of the i-th tweet, respectively, α, β and γ are adjustable hyperparameters satisfying α + β + γ = 1, and R_i denotes the public acceptance of the i-th tweet;

5) combining the public acceptance and the topic relevance to obtain the final significance of the tweet, expressed as RankScore = ω·SS_T + (1-ω)·R, wherein SS_T is the relevance measure of a sentence to topic T, R is the public acceptance, and ω is a hyperparameter;

2. The topic correlation-based twitter summary generation method according to claim 1, wherein the preprocessing in step 1) comprises: first reducing the sparsity of the original data by counting the noun word frequencies over all tweets and selecting the top n topic nouns as hot topic words; and then filtering the tweets by these prior topic words, wherein a statement in the corpus that relates to one of the n topics, or whose hashtags relate to one of the n topics, is assigned to the category of that topic, finally obtaining n tweet sets, each relating to one topic.

3. The topic correlation-based twitter summary generation method according to claim 2, wherein the data cleaning in step 1) comprises removing hashtags, @ mentions, URLs and numbers at the end of each tweet, and then removing tweets with fewer than m words.

4. The topic correlation-based twitter summary generation method according to claim 1 or 3, wherein extracting the network interaction information of the tweets comprises extracting the like, retweet and comment counts with regular expressions.

5. The topic correlation-based twitter summary generation method according to claim 1, wherein: the word embedding model is obtained by training a skip-gram model by using the cleaned data set.

6. The topic correlation-based twitter summary generation method according to claim 1, wherein: the specific steps of the maximum marginal relevance algorithm for redundancy removal processing are as follows:

1) initializing the sets:

A denotes the set used to store the summary, B denotes the set of tweets sorted by significance score, x_i denotes the i-th tweet, and n denotes the total number of tweets;

2) taking the i-th element x_i from set B, and if x_i satisfies:

len(set(x_i) ∩ set(s^*)) < k, s^* ∈ A

then moving x_i from set B to set A, and otherwise deleting x_i from set B; wherein the len function counts the words shared by x_i and s^*, the set function removes duplicate elements, set(x_i) denotes the set of words in x_i after duplicate words are removed, and k denotes a threshold on the number of shared words;

3) repeat step 2 until

or the number of tweets in set A reaches the expected summary length.