CN112883716B - Twitter abstract generation method based on topic correlation - Google Patents

Info

Publication number
CN112883716B
CN112883716B CN202110151630.9A CN202110151630A CN112883716B CN 112883716 B CN112883716 B CN 112883716B CN 202110151630 A CN202110151630 A CN 202110151630A CN 112883716 B CN112883716 B CN 112883716B
Authority
CN
China
Prior art keywords
topic
word
tweet
correlation
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110151630.9A
Other languages
Chinese (zh)
Other versions
CN112883716A (en)
Inventor
陈子忠
曹洋洋
夏书银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202110151630.9A
Publication of CN112883716A
Application granted
Publication of CN112883716B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Twitter summary generation method based on topic correlation. The method builds a word bank for each topic from the distribution of nouns in that topic; calculates the correlation between a tweet and a given topic using the topic-specific word bank and a trained word vector model; calculates public acceptance from the network interaction information; combines public acceptance and topic correlation to obtain the final significance of the tweet; and removes redundancy with a maximum marginal relevance algorithm before outputting the summary. Tweets are thus selected for the summary according to topic correlation and tweet significance, and the redundancy of the final summary is controlled, so that the generated summary jointly accounts for topic, diversity and social recognition and achieves higher topic relevance, novelty and generality.

Description

Twitter abstract generation method based on topic correlation
Technical Field
The invention relates to text summarization technology in natural language processing and is used to automatically generate topic-oriented summaries of tweets. Specifically, given a particular topic and several tweets, a summary relevant to that topic is obtained.
Background
The rapid development of social network media and self-media has promoted research on summarization for condensing massive data. Because no large-scale public data set exists for social network data, summarization research on such data currently relies mostly on traditional unsupervised methods. Methods based on statistical features work mainly from the relative position of sentences, word frequency features and the like; they are easy to implement, but the features obtained are often relatively simple. Methods based on graph models treat the sentences in a text as nodes and the similarity scores between sentences as edges, calculate the significance of each node from the nodes and the edge weights, and select the sentences with high significance as the summary. Methods based on data reconstruction convert the text into a two-dimensional matrix and use matrix reconstruction to find the n sentences that best reconstruct the source text as the summary. Most recent research on Twitter summarization combines static and dynamic data from social networks but still uses a traditional method as the underlying algorithm.
Existing Twitter summarization research summarizes statements about a certain topic or a certain event, and summarization for a given topic has rarely been studied. In addition, existing automatic summarization methods do not exploit the common characteristics of large-scale social network data.
Disclosure of Invention
To address the fact that existing summary generation methods introduce neither specific topics nor social network data, the invention builds, based on statistics, large-scale social network data covering different topics, and further designs a summary generation method based on a topic word bank.
In order to achieve the above object, the technical solution adopted by the present invention is a twitter summary generation method based on topic correlation, comprising the following steps:
1) Preprocess and clean the original data to obtain tweet sets, and extract the network interaction information of each tweet.
2) Count the word frequencies of the nouns, verbs and adjectives in each tweet set, take the words whose frequency ranks in the top 1% as candidate topic words, and filter out the candidates whose word frequency in other topics is higher than k; the remaining words form the final topic word set.
3) Select, from the topics, the topic closest to the source text as the given topic, and calculate the relevance of each tweet to the given topic from the topic word set.
4) Calculate public acceptance from the network interaction information.
5) Combine public acceptance and topic relevance to obtain the final significance of the tweet, expressed as $\mathrm{RankScore} = \omega \cdot SS_T + (1 - \omega) \cdot R$, where SS_T is the relevance measure from a sentence to topic T, R is public acceptance, and ω is a hyperparameter.
6) Remove redundancy with a maximum marginal relevance algorithm and output the summary.
By adopting the technical scheme, the invention has the following beneficial technical effects:
The invention provides a new topic relevance measure suited to the topical nature and data sparsity of Twitter data. Using the topic-specific word bank of each topic and a trained word vector model, the correlation between a tweet and a given topic can be calculated, so that a summary closer to the target topic is selected.
By building a word bank for each topic, the invention better accounts for the distinctiveness of the language used in different topics and for the distribution of the whole data set.
The invention adopts a new maximum marginal relevance algorithm to reduce redundant information while taking the coverage and diversity of the summary into account, thereby obtaining a summary that condenses the information better and has more novel content.
The method makes effective use of social network data by condensing it into a public acceptance score that serves as one criterion for selecting the summary. For a tweet published by a user, the amount of public interaction represents how much attention people pay to the tweet and how far they endorse it. Generally, the level of attention and endorsement that a piece of information receives indicates how fluent and informative the text is, and the purpose of text summarization is to select sentences with high information coverage, novelty and generality. Integrating the interaction information into the algorithm therefore yields summaries that are richer in information and more fluent.
In conclusion, the method selects tweets for the summary according to topic correlation and tweet significance and controls the redundancy of the final summary, so that the generated summary jointly accounts for topic, diversity and social recognition and achieves higher topic relevance, novelty and generality.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
Given the topical nature and sparsity of social network data, most studies first screen tweets by topic to reduce sparsity and then summarize the screened tweets. Extracting a summary from the tweets of a given topic gives better topic relevance, yet previous research has usually focused on the generality of the summary and its coverage of the source text and has rarely considered the topic relevance of the summary. In social network data, a statement that a person publishes usually relates to a certain topic, and the topics discussed differ between users and between time periods. When a body of text is summarized and the topic of the summary is specified, a summary more relevant to that topic is naturally desired. The invention therefore designs a topic-aware summarization method. The method predefines several prior topics and topic word banks on the basis of training on large-scale social network data. Its technical effect is as follows: given a prior topic and several tweets on that topic, a summary relevant to the topic is obtained.
The term frequency-inverse document frequency (TF-IDF) algorithm characterizes, to some extent, the importance of a word in a text. Its main idea is that the more frequently a word appears in one text and the less frequently it appears across the whole document collection, the more important it is. Since social network data as a whole usually covers several topics, the text of each topic can be regarded as a separate class, and the frequency distribution of words necessarily differs between classes. A word that appears often in one class but rarely in the others is therefore considered a "common word" of that topic class, and a word bank specific to each topic can be built in this way. Tweets also carry rich social network interaction information: each tweet has, for example, a retweet count, a like count and a comment count. Social recognition reflects to some extent the fluency, completeness and generality of a tweet's wording, so it is used as one criterion for selecting the summary. Based on the above, the invention designs a Twitter summarization method based on social network interaction information and topic relevance. With reference to FIG. 1, the specific steps are as follows:
1. Data preparation: because published tweet corpora and summary corpora are lacking, several tweets were collected through the public Twitter API during 2019 and 2020. After the original data are obtained, sparsity reduction is performed first: the word frequencies of the nouns in all tweets are counted, and the top n topic nouns are selected as hot topic words. The tweets are then filtered by these prior topic words: if a statement in the corpus relates to one of the n topics, or its hashtag relates to one of them, the statement is assigned to the category of that topic. Finally, n tweet sets are obtained, each relating to one topic.
2. Data cleaning: noisy information such as hashtags, @ mentions, URLs and numbers at the end of a tweet is removed first, and tweets with fewer than m words are then discarded. The like, retweet and comment counts of each tweet are extracted with regular expressions, and a count that cannot be extracted is set to 0. This yields n processed tweet sets. A word vector model is then trained on the cleaned data set with the skip-gram model, for later use in calculating the similarity between words.
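A minimal sketch of the cleaning step, using only Python's standard `re` module. The exact patterns, the decision to drop a hashtag together with its text, and the raw layout of the interaction string (for example "12 Retweets 340 Likes") are assumptions made for illustration; the source does not specify them. Skip-gram training is not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# One or more whitespace-separated numbers at the very end of the tweet.
TRAILING_NUM_RE = re.compile(r"(?:\s+\d+)+\s*$")


def clean_tweet(text):
    """Strip URLs, @ mentions, hashtags and trailing numbers from one tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = TRAILING_NUM_RE.sub("", text)
    return " ".join(text.split())


def clean_corpus(tweets, m):
    """Clean every tweet and drop those with fewer than m words."""
    cleaned = (clean_tweet(t) for t in tweets)
    return [t for t in cleaned if len(t.split()) >= m]


def extract_count(raw, label):
    """Read the number that precedes `label`; fall back to 0 if it is absent."""
    match = re.search(rf"(\d+)\s*{re.escape(label)}", raw)
    return int(match.group(1)) if match else 0
```

For example, `extract_count("12 Retweets 340 Likes", "Replies")` returns 0 because no reply count is present, matching the rule that a count that cannot be extracted is set to 0.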
3. Building the topic word set: words of different parts of speech in the data set are first identified with the Stanza named entity recognition tool. The word frequencies of the nouns, verbs and adjectives appearing in each tweet set are counted, and the words whose frequency ranks in the top 1% are taken as candidate topic words. Because some words may be common nouns or strongly related to several topics, candidate topic words whose word frequency in other topics is greater than k are filtered out, and the remaining words form the final topic word set.
4. Topic relevance: once the topic word set is obtained, the relevance of a tweet to a given topic is calculated as follows:
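The topic word set construction can be sketched as below. Several details are assumptions, since the source leaves them open: "top 1%" is read as the top 1% of distinct words ranked by frequency (with at least one word kept), "word frequency" is read as a raw count, and the part-of-speech filtering is taken to have happened already. The function names are hypothetical.

```python
import math
from collections import Counter


def candidate_words(words, top_fraction=0.01):
    """Return the most frequent distinct words, keeping the top `top_fraction`."""
    counts = Counter(words)
    keep = max(1, math.ceil(len(counts) * top_fraction))
    return [word for word, _ in counts.most_common(keep)]


def build_topic_word_sets(topic_words, k, top_fraction=0.01):
    """Build one word set per topic.

    `topic_words` maps a topic to the list of nouns, verbs and adjectives
    found in its tweet set. A candidate is dropped when its count in any
    other topic is greater than k.
    """
    counts = {topic: Counter(words) for topic, words in topic_words.items()}
    result = {}
    for topic, words in topic_words.items():
        result[topic] = {
            word
            for word in candidate_words(words, top_fraction)
            if all(counts[other][word] <= k for other in counts if other != topic)
        }
    return result


# Toy data (hypothetical); top_fraction is raised so the tiny vocabulary yields candidates.
toy = {
    "sports": ["game", "game", "game", "team", "team", "today"],
    "politics": ["vote", "vote", "vote", "today", "today", "game"],
}
strict = build_topic_word_sets(toy, k=0, top_fraction=0.5)
loose = build_topic_word_sets(toy, k=1, top_fraction=0.5)
```

With k = 0, "game" is removed from the sports word set because it also occurs once in the politics tweets; raising k to 1 keeps it.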
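$$\mathrm{sim}(a, b) = \frac{a \cdot b^{T}}{|a| \cdot |b|}$$
[Equation image BDA0002931670820000031 (sentence-length regularization term Sr) is not reproduced in this text.]
$$s(w, t_i) = \mathrm{sim}(\mathrm{emb}[t_i], \mathrm{emb}[w]), \quad t_i \in T_{words}$$
$$F(w, T) = \max\{s(w, t_1), s(w, t_2), \ldots, s(w, t_n)\}$$
[Equation image BDA0002931670820000032 (sentence-to-topic relevance SS_T) is not reproduced in this text.]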
where the sim function calculates the cosine similarity between two word vectors, and a and b denote the two word vectors; Sr is a regularization term on sentence length, used to offset the model's tendency to select shorter sentences; L is the set of nouns, verbs and adjectives in a sentence, L_i refers to the i-th sentence, and m denotes the maximum tweet amount in the tweet set; the function s(w, t_i) computes the similarity between word w and word t_i; F(w, T) is the degree of membership of word w in topic T; emb is the word embedding model that converts a word id into a word vector, namely the word vector model trained with the skip-gram method; T_words is the set of topic words of a topic; SS_T is the relevance measure from a sentence to topic T, and σ is an adjustable hyperparameter; n denotes the number of tweets in the source text, and L[i] denotes the i-th word in L.
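The similarity and membership functions defined above can be sketched in plain Python. The embedding table `emb` is a hypothetical dictionary from words to vectors standing in for the trained skip-gram model. The formulas for Sr and SS_T are not available in this text, so `mean_membership` below is only an assumed stand-in (a plain average of F(w, T) over the content words of a tweet) and should not be read as the patent's definition of SS_T.

```python
import math


def sim(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def membership(word, topic_words, emb):
    """F(w, T): highest similarity between `word` and any topic word."""
    if word not in emb:
        return 0.0
    scores = [sim(emb[t], emb[word]) for t in topic_words if t in emb]
    return max(scores, default=0.0)


def mean_membership(content_words, topic_words, emb):
    """ASSUMED aggregate: average F(w, T) over L (not the patent's SS_T)."""
    if not content_words:
        return 0.0
    total = sum(membership(w, topic_words, emb) for w in content_words)
    return total / len(content_words)


# Toy two-dimensional embeddings (hypothetical values).
emb = {
    "goal": [1.0, 0.0],
    "match": [1.0, 0.0],
    "vote": [0.0, 1.0],
    "team": [1.0, 1.0],
}
```

In this toy table "match" has membership 1.0 in a topic whose word set is {"goal"}, while "vote" has membership 0.0, so a tweet whose content words are "match" and "vote" averages 0.5.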
5. Public acceptance: if a tweet has more retweets, likes and comments than other tweets, its relative acceptance within the document is considered higher than theirs. The calculation formula is:
$$R_i = \alpha \cdot c_i + \beta \cdot re_i + \gamma \cdot l_i$$
where c_i, re_i and l_i are the min-max (dispersion) normalized values of the like count, retweet count and comment count of the i-th tweet, respectively; α, β and γ are adjustable hyperparameters satisfying α + β + γ = 1; and R_i denotes the public acceptance of the i-th tweet.
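The public acceptance formula can be illustrated with a short sketch. It reads "dispersion standardization" as min-max normalization over the tweets of one set, and it maps a constant column to zeros; that edge-case choice and the function names are assumptions.

```python
def min_max(values):
    """Min-max normalize a list of counts to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column carries no information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def public_acceptance(likes, retweets, comments, alpha, beta, gamma):
    """R_i = alpha * c_i + beta * re_i + gamma * l_i for every tweet i."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    c = min_max(likes)
    re_ = min_max(retweets)
    l = min_max(comments)
    return [alpha * ci + beta * ri + gamma * li for ci, ri, li in zip(c, re_, l)]


# Toy counts for three tweets (hypothetical).
scores = public_acceptance(
    likes=[0, 10, 5], retweets=[0, 4, 4], comments=[2, 2, 2],
    alpha=0.5, beta=0.3, gamma=0.2,
)
```

The second tweet has the most likes and retweets and therefore the highest score; the comment counts are identical and contribute nothing after normalization.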
6. Combining public acceptance and topic relevance, the final significance of a tweet is:
$$\mathrm{RankScore} = \omega \cdot SS_T + (1 - \omega) \cdot R$$
where ω is an adjustable hyperparameter that balances the two kinds of information.
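As a small worked illustration of the weighting, the sketch below combines a relevance score and an acceptance score and sorts tweets by the result. Restricting ω to the interval [0, 1] is an assumption; the source only calls it an adjustable hyperparameter. The function names are hypothetical.

```python
def rank_score(relevance, acceptance, omega):
    """RankScore = omega * SS_T + (1 - omega) * R."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega is assumed to lie in [0, 1]")
    return omega * relevance + (1.0 - omega) * acceptance


def rank_tweets(tweets, relevances, acceptances, omega):
    """Return (score, tweet) pairs sorted from most to least significant."""
    scored = [
        (rank_score(s, r, omega), tweet)
        for tweet, s, r in zip(tweets, relevances, acceptances)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Tweet "a" is popular but off-topic; tweet "b" is on-topic but less popular.
ranking = rank_tweets(["a", "b"], [0.2, 0.9], [0.9, 0.1], omega=0.75)
```

With ω = 0.75 the on-topic tweet "b" scores 0.7 against 0.375 for "a"; lowering ω shifts the ranking toward popularity.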
7. Redundancy penalty strategy: to keep the redundancy of the selected summary as small as possible, the invention adopts an improved maximum marginal relevance (MMR) strategy, as follows:
1) Initialize the sets:
[Equation image BDA0002931670820000041 (initialization of sets A and B) is not reproduced in this text.]
A is the set used to store the summary, B is the set of tweets sorted by significance score, x_i denotes the i-th tweet, and n denotes the total number of tweets. The significance score of each tweet is calculated as S_i = RankScore(i);
2) Take the i-th element x_i from set B. If x_i satisfies
$$\mathrm{len}(\mathrm{set}(x_i) \cap \mathrm{set}(s^{*})) < k, \quad s^{*} \in A$$
then move x_i from set B to set A; otherwise delete x_i from set B. Here set(x_i) is the word set obtained by removing duplicate words from x_i, len() counts the words shared by x_i and s*, and k is the overlap threshold on the word sets.
3) Repeat step 2) until
[Equation image BDA0002931670820000042 (termination condition) is not reproduced in this text.]
or the number of tweets in set A reaches the expected summary length.
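The redundancy filter in steps 1) to 3) can be sketched as follows. Because the initialization and termination formulas are not available in this text, the sketch assumes that A starts empty and that the loop ends once every tweet in B has been examined. Lowercasing and whitespace tokenization are also assumptions, as is the function name.

```python
def select_summary(ranked_tweets, k, max_len):
    """Pick tweets in rank order, skipping any that overlap too much with A.

    `ranked_tweets` plays the role of set B (already sorted by RankScore,
    highest first); the returned list plays the role of set A.
    """
    summary = []  # set A, assumed to start empty
    for tweet in ranked_tweets:
        if len(summary) >= max_len:
            break  # expected summary length reached
        words = set(tweet.lower().split())
        # Keep the tweet only if it shares fewer than k words with every
        # tweet already selected.
        if all(len(words & set(s.lower().split())) < k for s in summary):
            summary.append(tweet)
    return summary


ranked = [
    "the team won the final",
    "the team won again",
    "voters head to polls",
]
picked = select_summary(ranked, k=3, max_len=2)
```

Here the second tweet shares three words with the first and is discarded, so the summary contains the first and third tweets.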

Claims (6)

1. A Twitter summary generation method based on topic correlation, characterized by comprising the following steps:
1) preprocessing and cleaning the original data to obtain tweet sets, and extracting the network interaction information of each tweet;
2) counting the word frequencies of the nouns, verbs and adjectives appearing in each tweet set, taking the words whose frequency ranks in the top 1% as candidate topic words, and filtering out the candidate topic words whose word frequency in other topics is higher than k, the remaining words forming the final topic word set;
3) selecting, from the topics, the topic closest to the source text as the given topic, and calculating the relevance of a tweet to the given topic from the topic word set by the following method:
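$$\mathrm{sim}(a, b) = \frac{a \cdot b^{T}}{|a| \cdot |b|}$$
[Equation image FDA0003499999500000011 (sentence-length regularization term Sr) is not reproduced in this text.]
$$s(w, t_i) = \mathrm{sim}(\mathrm{emb}[t_i], \mathrm{emb}[w]), \quad t_i \in T_{words}$$
$$F(w, T) = \max\{s(w, t_1), s(w, t_2), \ldots, s(w, t_n)\}$$
[Equation image FDA0003499999500000012 (sentence-to-topic relevance SS_T) is not reproduced in this text.]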
wherein the sim function calculates the cosine similarity between two word vectors, and a and b denote the two word vectors; Sr is the regularization term on sentence length; L is the set of nouns, verbs and adjectives in the current sentence, L_i refers to the i-th sentence, and m denotes the maximum tweet amount in the tweet set; the function s(w, t_i) computes the similarity between word w and word t_i; F(w, T) is the degree of membership of word w in topic T; T_words is the set of topic words of a topic; emb is the word embedding model that converts a word id into a word vector; SS_T is the relevance measure from a sentence to topic T, σ is an adjustable hyperparameter, n denotes the number of tweets in the source text, and L[i] denotes the i-th word in L;
4) calculating public acceptance from the network interaction information according to the formula $R_i = \alpha \cdot c_i + \beta \cdot re_i + \gamma \cdot l_i$, wherein c_i, re_i and l_i are the min-max (dispersion) normalized values of the like count, retweet count and comment count of the i-th tweet, respectively, α, β and γ are adjustable hyperparameters satisfying α + β + γ = 1, and R_i denotes the public acceptance of the i-th tweet;
5) combining public acceptance and topic relevance to obtain the final significance of the tweet, expressed as $\mathrm{RankScore} = \omega \cdot SS_T + (1 - \omega) \cdot R$, wherein SS_T is the relevance measure from a sentence to topic T, R is public acceptance, and ω is a hyperparameter;
6) removing redundancy with a maximum marginal relevance algorithm and outputting the summary.
2. The Twitter summary generation method based on topic correlation according to claim 1, wherein the preprocessing in step 1) comprises: first performing sparsity reduction on the original data by counting the noun word frequencies of all tweets and selecting the top n topic nouns as hot topic words; then filtering the tweets by these prior topic words, such that if a statement in the corpus relates to one of the n topics, or the hashtag it carries relates to one of the n topics, the statement is assigned to the category of that topic; and finally obtaining n tweet sets, each relating to one topic.
3. The Twitter summary generation method based on topic correlation according to claim 2, wherein the data cleaning in step 1) comprises removing hashtags, @ mentions, URLs and numbers at the end of a tweet, and then discarding tweets with fewer than m words.
4. The Twitter summary generation method based on topic correlation according to claim 1 or 3, wherein extracting the network interaction information of a tweet comprises extracting the like, retweet and comment counts with regular expressions.
5. The Twitter summary generation method based on topic correlation according to claim 1, wherein the word embedding model is obtained by training a skip-gram model on the cleaned data set.
6. The Twitter summary generation method based on topic correlation according to claim 1, wherein the maximum marginal relevance algorithm removes redundancy by the following specific steps:
1) initializing the sets:
[Equation image FDA0003499999500000021 (initialization of sets A and B) is not reproduced in this text.]
wherein A is the set used to store the summary, B is the set of tweets sorted by significance score, x_i denotes the i-th tweet, and n denotes the total number of tweets;
2) taking the i-th element x_i from set B, and if x_i satisfies
$$\mathrm{len}(\mathrm{set}(x_i) \cap \mathrm{set}(s^{*})) < k, \quad s^{*} \in A$$
moving x_i from set B to set A, and otherwise deleting x_i from set B; wherein the len function counts the words shared by x_i and s*, the set function removes duplicate elements, set(x_i) is the word set obtained by removing duplicate words from x_i, and k is the overlap threshold on the word sets;
3) repeating step 2) until
[Equation image FDA0003499999500000022 (termination condition) is not reproduced in this text.]
or the number of tweets in set A reaches the expected summary length.
CN202110151630.9A 2021-02-03 2021-02-03 Twitter abstract generation method based on topic correlation Active CN112883716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110151630.9A CN112883716B (en) 2021-02-03 2021-02-03 Twitter abstract generation method based on topic correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110151630.9A CN112883716B (en) 2021-02-03 2021-02-03 Twitter abstract generation method based on topic correlation

Publications (2)

Publication Number Publication Date
CN112883716A (en) 2021-06-01
CN112883716B (en) 2022-05-03

Family

ID=76057037

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110151630.9A Active CN112883716B (en) 2021-02-03 2021-02-03 Twitter abstract generation method based on topic correlation

Country Status (1)

Country Link
CN (1) CN112883716B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254624B (en) * 2021-12-01 2023-01-31 马上消费金融股份有限公司 Method and system for determining website type

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916904A (en) * 2006-09-01 2007-02-21 北大方正集团有限公司 Method of abstracting single file based on expansion of file
US8078450B2 (en) * 2006-10-10 2011-12-13 Abbyy Software Ltd. Method and system for analyzing various languages and constructing language-independent semantic structures
US8930376B2 (en) * 2008-02-15 2015-01-06 Yahoo! Inc. Search result abstract quality using community metadata
CN102622411A (en) * 2012-02-17 2012-08-01 清华大学 Structured abstract generating method
CN103150333B (en) * 2013-01-26 2016-01-13 安徽博约信息科技有限责任公司 Opinion leader identification method in microblog media
CN105740448B (en) * 2016-02-03 2019-06-25 天津大学 More microblogging timing abstract methods towards topic
CN106126620A (en) * 2016-06-22 2016-11-16 北京鼎泰智源科技有限公司 Method of Chinese Text Automatic Abstraction based on machine learning
CN106886567B (en) * 2017-01-12 2019-11-08 北京航空航天大学 Microblogging incident detection method and device based on semantic extension
CN108920611B (en) * 2018-06-28 2019-10-01 北京百度网讯科技有限公司 Article generation method, device, equipment and storage medium
CN110362674B (en) * 2019-07-18 2020-08-04 中国搜索信息科技股份有限公司 Microblog news abstract extraction type generation method based on convolutional neural network
CN110990676A (en) * 2019-11-28 2020-04-10 福建亿榕信息技术有限公司 Social media hotspot topic extraction method and system
CN111125349A (en) * 2019-12-17 2020-05-08 辽宁大学 Graph model text abstract generation method based on word frequency and semantics
CN112100317B (en) * 2020-09-24 2022-10-14 南京邮电大学 Feature keyword extraction method based on theme semantic perception

Also Published As

Publication number Publication date
CN112883716A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
CN108197111B (en) Text automatic summarization method based on fusion semantic clustering
Ombabi et al. Deep learning framework based on Word2Vec and CNN for users interests classification
CN111177365A (en) Unsupervised automatic abstract extraction method based on graph model
Tiwari et al. Ensemble approach for twitter sentiment analysis
Banik et al. Toxicity detection on bengali social media comments using supervised models
CN113962293A (en) LightGBM classification and representation learning-based name disambiguation method and system
Zhou et al. Neural storyline extraction model for storyline generation from news articles
Chang et al. A METHOD OF FINE-GRAINED SHORT TEXT SENTIMENT ANALYSIS BASED ON MACHINE LEARNING.
Widjanarko et al. Multi document summarization for the Indonesian language based on latent dirichlet allocation and significance sentence
Arifin et al. Emotion detection of tweets in Indonesian language using non-negative matrix factorization
CN112883716B (en) Twitter abstract generation method based on topic correlation
Cai et al. Indonesian automatic text summarization based on a new clustering method in sentence level
De Saa et al. Self-reflective and introspective feature model for hate content detection in sinhala youtube videos
Zhu et al. NUDTSNA at TREC 2015 Microblog Track: A Live Retrieval System Framework for Social Network based on Semantic Expansion and Quality Model.
Trad et al. A framework for authorial clustering of shorter texts in latent semantic spaces
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Gu et al. Controllable citation text generation
Konkaew et al. Automatic tag recommendation approach with keyphrase extraction and word embedding techniques
CN108256055B (en) Topic modeling method based on data enhancement
Shyang et al. A text augmentation approach using similarity measures based on neural sentence embeddings for emotion classification on microblogs
Yu et al. Hot event detection for social media based on keyword semantic information
Steuber et al. Embedding semantic anchors to guide topic models on short text corpora
CN112527964B (en) Microblog abstract generation method based on multi-mode manifold learning and social network characteristics
Jiang et al. Parallel dynamic topic modeling via evolving topic adjustment and term weighting scheme
Alfarra et al. Graph-based Growing self-organizing map for Single Document Summarization (GGSDS)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant