CN115617981A - Information level abstract extraction method for short text of social network - Google Patents


Info

Publication number
CN115617981A
CN115617981A
Authority
CN
China
Prior art keywords
abstract
clustering
node
text
social network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211069136.9A
Other languages
Chinese (zh)
Inventor
谢文波
陈俊秀
王欣
陈斌
陈端兵
李艳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Petroleum University
Original Assignee
Southwest Petroleum University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Petroleum University filed Critical Southwest Petroleum University
Priority to CN202211069136.9A priority Critical patent/CN115617981A/en
Publication of CN115617981A publication Critical patent/CN115617981A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses an information-level abstract extraction method for short texts of a social network. Aiming at short text data of the social network, it performs hierarchical clustering by constructing a multi-level clustering index tree and extracts an abstract of the text data on the basis of the clustering result. Firstly, data cleaning and preprocessing are performed on the social-network text data set; secondly, feature extraction is carried out with an LDA model, which effectively avoids the curse of dimensionality; then the feature data are fed into an RSC algorithm model to generate a hierarchical clustering tree, and the clustering results are merged and optimized to obtain a multi-level clustering index tree; finally, information-level abstract extraction is performed based on the multi-level clustering index tree. Because an extractive summarization algorithm can pick out the most general sentences in short texts, the method uses the TextRank algorithm to extract the abstract. The invention helps people acquire information quickly and effectively, and benefits data analysis of social network platforms and supervision of social networks.

Description

Information level abstract extraction method for short text of social network
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method for extracting an information hierarchy abstract facing short text data of a social network.
Background
With the rapid development of science and technology, text data on the Internet grows exponentially. People can acquire information more conveniently and quickly, but it is very difficult to quickly find and extract useful information from massive text data. Text clustering technology automatically divides text data into appropriate category sets, reveals the distribution of content in a disordered data set, and narrows the range of information retrieval. Abstract extraction technology presents text content in a concise, time-saving and efficient manner, and a hierarchical abstract can help people acquire information quickly and effectively, distinguish right from wrong in the text, and make sound value judgments.
The existing text clustering technology has the following defects:
(i) The running time is long, so tasks with high timeliness requirements cannot be handled.
(ii) The clustering result is ambiguous, and it is difficult to extract effective information from it.
Therefore, the existing problems focus on how to effectively improve algorithm efficiency and result clarity so as to cope with the data-explosion problem of the big-data era. The present invention uses the RSC algorithm, which can effectively solve the above problems.
For defect one, the time complexity of the RSC algorithm used by the invention is O(n log n), which incurs less time overhead than the O(n²) time complexity of traditional clustering algorithms.
For the second defect, the RSC algorithm used by the invention is a hierarchical clustering algorithm, and the result obtained by the algorithm is a clustering tree, so that the hierarchical organization relation of data can be well shown, and the RSC algorithm has good data interpretation capability.
Disclosure of Invention
The invention aims to provide an information level abstract extraction method for short texts of a social network, which is used for helping to quickly and effectively acquire information and improving the management efficiency of a social network platform.
A social network short text oriented information hierarchy abstract extraction method comprises the following steps:
1. pre-processing the collected data set;
2. performing feature extraction on the text data;
3. performing hierarchical clustering on the text data;
4. constructing a multi-level clustering index tree based on the clustering result;
5. and extracting the information abstract layer by layer from bottom to top.
In order to achieve the purpose, the invention provides a method for extracting information level abstract of short text of a social network, which is characterized by comprising the following steps:
for the step 1, according to the text characteristics of the short text of the social network, targeted data cleaning operation needs to be performed on the text data, noise data such as webpage links, expressions, topics, reprint sources, location addresses and the like are filtered, and meanwhile, words with frequent occurrence times need to be deleted, and a stop word list more suitable for the data set is constructed.
And for the step 2, in order to avoid overlarge feature dimension, an LDA model is adopted to extract the subject features of the text.
For step 3, the RSC algorithm is used to perform hierarchical clustering on the short texts of the social network. On the data set, the algorithm detects reciprocal nearest neighbor (RNN) nodes and iteratively builds the clustering tree through two main processes, the creation and the pruning of clustering subtrees:
3.1 Creation of clustered subtrees
Each piece of text data is taken as a data point; nodes are linked one after another, and the building of a subtree stops when either of two stop conditions is met;
3.2) Pruning of clustering subtrees
After the construction of the clustering subtree is completed, in order to avoid the occurrence of the slender subtree, pruning operation needs to be carried out on the slender subtree.
For the step 4, the constructing of the multi-level clustering index tree based on the clustering result comprises:
4.1) Constructing a hierarchical clustering index tree
The clustering result obtained by the RSC algorithm is a clustering tree with dispersed nodes. The nodes in the clustering tree fall into three types: (1) non-leaf nodes with only one child; (2) non-leaf nodes with two or more children; (3) leaf nodes. The three node types are handled as follows when constructing the multi-level clustering index tree:
(i) If the node is a non-leaf node with only one child, judging the child type of the node, and if the node is also a non-leaf node, recursively creating the child of the node; if the child is the leaf node, creating a child node on the basis of the leaf node for displaying the text of the leaf node;
(ii) If the node is a non-leaf node with two or more children, traversing the child nodes to repeat the step (i);
(iii) If the node is a leaf node, create a child node beneath it to display the text data corresponding to the leaf node.
4.2) Optimized merging of the hierarchical clustering index tree
In the multi-level clustering index tree, many non-leaf nodes with only one child increase the depth of the clustering tree while contributing little to the hierarchical division; such nodes can even be regarded as nodes of the same hierarchy, so they are merged in the multi-level clustering index tree.
For the step 5, constructing the clustering tree with the information abstract according to the optimized and combined multi-level clustering index tree requires filling the text data corresponding to the index into the clustering tree structure, and generating the abstract of the text data corresponding to the child node at the non-leaf node:
5.1 Text selection and processing
The text data used for information summarization needs to be reprocessed to filter out data which easily causes reading obstacles, such as network expressions, characters, special symbols and the like.
5.2 Preparation of word vectors
In the process of generating information abstracts layer by layer from bottom to top, abstract extraction is carried out with the TextRank algorithm, which requires sentence features. GloVe is a word-representation tool based on global word-frequency statistics; using GloVe for the vectorized representation of words lets the vectors carry as much semantic and syntactic information as possible. Compared with feature-creation methods such as TF-IDF and Word2Vec, GloVe does not produce overly large feature dimensions and makes full use of the corpus.
5.3) Extracting the abstract
And finally, extracting the most general and representative sentences from each level of the clustering tree as the information abstract of the text.
Drawings
FIG. 1 is a flow chart of an embodiment.
Detailed Description
The following description of the present invention is provided in conjunction with fig. 1 to provide those skilled in the art with a better understanding of the present invention. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.
The invention uses the RSC algorithm to hierarchically cluster microblog text data and the TextRank algorithm to extract abstracts from the clustering results, providing an information-level abstract extraction method for short texts in a social network. Firstly, the social-network text data set is cleaned and preprocessed; secondly, feature extraction is carried out on the data, where using an LDA model effectively avoids the curse of dimensionality; then the feature data are fed into an RSC algorithm model to generate a hierarchical clustering tree, and the clustering results are merged and optimized to obtain a multi-level clustering index tree; finally, information-level abstract extraction is performed on the text data based on the multi-level clustering index tree, using the extractive summarization algorithm TextRank.
A social network short text oriented information hierarchy abstract extraction method comprises the following steps:
1. pre-processing the collected data set;
2. performing feature extraction on the text data;
3. performing hierarchical clustering on the text data;
4. constructing a multi-level clustering index tree based on the clustering result;
5. and extracting the information abstract layer by layer from bottom to top.
The invention relates to a social network short text oriented information level abstract extraction method, which comprises the following steps:
1. data pre-processing
The invention uses the microblogPCU data set, which was originally collected to detect spammers on microblogs; the data can also be used to study machine-learning methods and social networks. Specifically, the CSV data in the microblogPCU data set, i.e., the microblog content information posted by users, is used, and the data description is shown in Table 1.
Attribute	Description
post_id	ID of the microblog post
post_time	Release time of the microblog
content	Microblog content
poster_id	User ID of the microblog's poster
poster_url	Personal homepage address of the poster
repost_num	Number of reposts of the microblog
comment_num	Number of comments on the microblog
reposter_post_id	Reposter ID
inner_flag	Built-in label
TABLE 1
Since the original data is unstructured Chinese text data and there are many irregular representations in social network texts, such as some network expressions, facial characters, expressions, special symbols, etc., the texts need to be cleaned and preprocessed. The data preprocessing comprises four steps:
(1) filtering the noise data: and observing original data, and deleting noise data including webpage links, expressions, titles, topics, reprint sources, positioning addresses and the like by using a regular expression, and deleting redundant words with frequent occurrence times.
(2) Removing stop words: starting from the stop-word lexicon of the Machine Intelligence Laboratory of Sichuan University, the invention supplements and refines the existing stop-word list according to the specifics of the data, constructing a stop-word list better suited to this data set.
(3) Chinese word segmentation: the processing of Chinese text data requires word segmentation, and the invention adopts a jieba word segmentation tool to perform word segmentation.
(4) Word-count screening: because the data used by the method date from before 2015, the microblog texts are all within 140 characters. Since very short texts carry little value, microblog texts whose length is less than 10 after word segmentation and stop-word removal are deleted.
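The four preprocessing steps can be sketched in Python. The regular expressions and the tiny stop-word set below are illustrative stand-ins for the real ones, and a production pipeline would segment with jieba before removing stop words:

```python
import re

# Illustrative stop words; the real list extends the Sichuan University lexicon.
STOPWORDS = frozenset({"的", "了", "转发微博"})

def clean_weibo(text, min_len=10):
    """Clean one microblog post and return None if it is too short."""
    text = re.sub(r"https?://\S+", "", text)           # (1) web-page links
    text = re.sub(r"@[\w\u4e00-\u9fff]+", "", text)    # (1) @user mentions
    text = re.sub(r"#[^#]*#", "", text)                # (1) #topic# tags
    text = re.sub(r"\[[^\]]*\]", "", text)             # (1) [emoticon] codes
    text = re.sub(r"\s+", " ", text).strip()
    # (2)+(3): a real pipeline segments with jieba, then drops stop words;
    # here whitespace-separated chunks stand in for segmented words.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    text = "".join(tokens)
    return text if len(text) >= min_len else None      # (4) length screening
```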
2. Feature extraction
To avoid the problem of overly large feature dimensions, the LDA topic model is adopted for topic feature extraction. Class libraries such as scikit-learn, Spark MLlib and gensim can all train an LDA topic model, and their usage is basically similar; here LDA feature extraction is carried out with scikit-learn. The number of latent topics is set to 5, i.e., each microblog has 5 topic features, yielding an n × 5 LDA feature matrix (n is the number of microblogs).
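The step that feeds the LDA model can be illustrated with a plain-Python document-term matrix; an actual run would pass this matrix to scikit-learn's LatentDirichletAllocation with n_components=5 (not called here, so the sketch stays dependency-free):

```python
from collections import Counter

def doc_term_matrix(segmented_docs):
    """Build the n x V word-count matrix that LDA takes as input;
    LDA then compresses each row into the 5 topic weights used as features."""
    vocab = sorted({w for doc in segmented_docs for w in doc})
    col = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in segmented_docs]
    for i, doc in enumerate(segmented_docs):
        for w, c in Counter(doc).items():
            matrix[i][col[w]] = c
    return matrix, vocab

docs = [["天气", "不错"], ["天气", "预报", "天气"]]
matrix, vocab = doc_term_matrix(docs)
```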
3. Hierarchical clustering of text
When the RSC algorithm is used for clustering analysis, two processes of constructing a clustering subtree and pruning are mainly executed in an iteration mode, and a clustering result obtained through the hierarchical clustering algorithm is a hierarchical clustering tree.
3.1) Construction of clustering subtrees
The data set contains text data posted by microblog users, and each piece of data (a word vector) is regarded as a data node. First, an empty subtree and a candidate set containing all nodes are initialized. If the candidate set is not empty, one node is selected at random and linked to its nearest neighbor, i.e., its first-order nearest neighbor; assuming each node has only one nearest neighbor, the first-order nearest neighbor is further linked to its own nearest neighbor, and so on, generating a linked list. The construction of a subtree stops under two conditions:
(i) The k-th order nearest neighbor and the (k-2)-th order nearest neighbor form a pair of RNN nodes;
(ii) The k-th order nearest neighbor is not in the candidate set.
When condition (i) is satisfied, a new subtree is formed: the (k-2)-th and (k-1)-th order nearest neighbors, as a pair of RNNs, become the support nodes of the clustering subtree. At this point a root node is constructed manually as the representative node of the subtree and linked to the support nodes, and all nodes of the subtree are deleted from the candidate set. If the candidate set is still not empty after deletion, a node is again selected at random and the process repeats; subtree construction stops when the candidate set is empty. When condition (ii) holds, the linked list is linked to an existing subtree.
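A minimal sketch of this subtree-construction loop, with a deterministic start node instead of a random one and a caller-supplied distance function (both simplifications of the procedure described above):

```python
def build_subtrees(points, dist):
    """Grow nearest-neighbor chains until the newest link is the reciprocal
    nearest neighbor of the previous one (stop condition i) or leaves the
    candidate set (stop condition ii)."""
    candidates = set(range(len(points)))
    assigned = {}            # node -> subtree index
    subtrees = []
    while candidates:
        chain = [min(candidates)]   # deterministic stand-in for random choice
        while True:
            last = chain[-1]
            # nearest neighbor of the chain's last node among all other nodes
            nn = min((j for j in range(len(points)) if j != last),
                     key=lambda j: dist(points[last], points[j]))
            if len(chain) >= 2 and nn == chain[-2]:
                # condition (i): chain[-2] and chain[-1] are a pair of RNNs
                tid = len(subtrees)
                subtrees.append({"support": (chain[-2], chain[-1]),
                                 "members": list(chain)})
                for n in chain:
                    assigned[n] = tid
                    candidates.discard(n)
                break
            if nn not in candidates:
                # condition (ii): attach the chain to nn's existing subtree
                tid = assigned[nn]
                subtrees[tid]["members"].extend(chain)
                for n in chain:
                    assigned[n] = tid
                    candidates.discard(n)
                break
            chain.append(nn)
    return subtrees
```

With the 1-D points below, the two tight pairs end up as two subtrees whose support nodes are the mutual nearest neighbors.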
3.2) Pruning of clustering subtrees
After a subtree is constructed, it must be pruned to avoid elongated subtrees. For any node i belonging to an existing subtree with support nodes p and q, the depth of node i is the average of its shortest path lengths to p and q, which can be regarded as its path length to the root node. When each generated subtree is pruned, all nodes whose depth exceeds a threshold are cut off according to their path lengths, and each pruned node is linked to an artificially created root node, forming a special subtree that consists of only one node and one root.
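The depth rule can be sketched directly: breadth-first hop counts stand in for the shortest path lengths, and the threshold is a free parameter (the original does not state its value):

```python
from collections import deque

def bfs_dist(adj, src):
    """Shortest path lengths (in hops) from src over an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def prune(adj, p, q, threshold):
    """Depth of node i = average of its distances to the support nodes p
    and q; nodes deeper than threshold are cut off into singleton subtrees."""
    dp, dq = bfs_dist(adj, p), bfs_dist(adj, q)
    kept, pruned = [], []
    for node in adj:
        depth = (dp[node] + dq[node]) / 2
        (pruned if depth > threshold else kept).append(node)
    return kept, pruned
```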
3.3) Representative-node evaluation indexes
In the process of generating clustering subtrees, when stop condition (i) is satisfied, the more representative node of a pair of RNN nodes must be selected. The RSC algorithm provides four indexes for evaluating RNN nodes: the degree of the node, the neighbor average degree of the node, the step centrality, and the distance centrality. The degree and the neighbor average degree are easy to understand: within one cluster, a node with a larger degree is more important than a node with a smaller degree, and if a node's neighbors have large degrees, the node itself is more important, meaning it lies at the center of a dense region of the data. The degree and neighbor average degree of node i are defined as follows:
$$d_i = \sum_j r_{ij}, \qquad \bar{d}_i = \frac{1}{d_i} \sum_j r_{ij} d_j$$
where $d_i$ is the degree of node i, $\bar{d}_i$ is the neighbor average degree of node i, and $r_{ij}$ are the elements of the relationship matrix described by the graph structure of the data points. To further measure the centrality of RNNs, two indexes, step centrality and distance centrality, are adopted. The step centrality of node i is defined as follows:
$$C_s(i) = \frac{|\tau| - 1}{\sum_{j \in \tau} s_{ij}}$$
where $|\tau|$ is the number of nodes in the cluster $\tau$ containing node i, and $s_{ij}$ is the shortest path length between node i and node j. The distance centrality of node i is defined as follows:
$$C_d(i) = \frac{|\tau| - 1}{\sum_{j \in \tau} dist_{ij}}$$
where $dist_{ij}$ is the distance between node i and node j. To measure the representativeness of nodes more fairly and reasonably, a mixed index combining the four evaluation indexes is proposed; it is calculated following relative-importance index methods for complex networks and is defined as follows:
(mixed-index formula combining the four indexes above; reproduced only as an image in the original)
if the mixed index score of a node is higher, the node is more representative.
3.4) Clustering results
The result clustering tree is represented by a dictionary, and comprises a plurality of key-value pairs, and each non-leaf node and its children are represented by the key-value pairs. All microblog texts are regarded as nodes in the tree, and the names of the nodes are represented by the row indexes of the microblog texts. In the result clustering tree, keys represent non-leaf nodes, and the values of the keys represent children of the non-leaf nodes, represented by a set.
Assume a tree T: {1: {0},2: {1,3,12},3: {4,5},5: {6,7} }, non-leaf nodes in tree T are {1,2,3,5}, wherein nodes 1 and 3 are children of node 2, node 5 is a child of node 3, and the remaining nodes {0,12,4,6,7} are leaf nodes, then the root node of tree T is node 2.
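Reading such a dictionary tree is straightforward; the helper below recovers the non-leaf nodes, leaves and root for the example T:

```python
def analyze(tree):
    """tree: dict mapping each non-leaf node to the set of its children.
    Leaves are children that never appear as keys; the root is the
    non-leaf node that is nobody's child."""
    non_leaves = set(tree)
    children = set().union(*tree.values())
    return non_leaves, children - non_leaves, non_leaves - children

T = {1: {0}, 2: {1, 3, 12}, 3: {4, 5}, 5: {6, 7}}
non_leaves, leaves, roots = analyze(T)
```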
4. Building a hierarchical clustering index tree
In the clustering tree with dispersed nodes, because of the storage structure, different types of nodes are handled differently during the recursive process of linking nodes into a nested clustering tree. The node types must therefore be identified; the nodes of the clustering tree are divided into the following three types: (1) non-leaf nodes with only one child; (2) non-leaf nodes with two or more children; (3) leaf nodes. The three node types are handled as follows when constructing the clustering index tree:
(i) If the node is a non-leaf node with only one child, judging the child type of the node, and if the node is also a non-leaf node, recursively creating the child of the node; if the child is a leaf node, a child node is created on the basis of the leaf node for displaying the text of the leaf node.
(ii) If the node is a non-leaf node having two or more children, then step (i) is repeated traversing its child nodes.
(iii) If the node is a leaf node, create a child node beneath it to display the text data corresponding to the leaf node.
In the multi-level index tree, a plurality of non-leaf nodes with only one child can be found, the nodes can increase the depth of the clustering tree, but the division capability of the hierarchy is not obvious, and the nodes can be even regarded as nodes of the same hierarchy, so that the nodes in the index tree are merged.
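The merging step can be sketched as collapsing every chain of single-child non-leaf nodes; display children attached to leaves are outside the scope of this sketch:

```python
def merge_single_child(tree):
    """Collapse chains of non-leaf nodes that have exactly one (non-leaf)
    child, which add depth without adding any real hierarchical division."""
    tree = {k: set(v) for k, v in tree.items()}
    changed = True
    while changed:
        changed = False
        for node, kids in list(tree.items()):
            if len(kids) == 1:
                (child,) = kids
                if child in tree:      # single non-leaf child: absorb it
                    tree[node] = tree.pop(child)
                    changed = True
                    break
    return tree
```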
5. Abstract extraction
The clustering tree with the information abstract is constructed according to the multi-level clustering index tree, and only the text data corresponding to the index is filled into the clustering tree structure, and the abstract of the text data corresponding to the child node is generated at the non-leaf node.
5.1) Data selection and processing
Part of the text data in the clustering result is selected for abstract extraction, and this data needs basic text cleaning so that noise affects the extraction as little as possible. Unlike the earlier preprocessing, where all punctuation was deleted, abstract extraction must retain some punctuation in order to split sentences, while still removing data that hinders reading, such as network vocabulary, facial characters and special symbols.
5.2) Preparing word vectors
The method comprises the steps of training GloVe word vectors by using Chinese Wikipedia text corpus which is subjected to word segmentation and word stop removal, outputting word vector files, obtaining vectors of all constituent words of each sentence from the GloVe word vector files, and then taking the average value of the vectors to obtain the merging vector of the sentences as the feature vector of the sentence.
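The sentence-feature step is just an element-wise mean; the toy embedding dict below stands in for the trained GloVe vectors:

```python
def sentence_vector(words, embeddings, dim):
    """Feature vector of a sentence: the element-wise mean of the vectors
    of its known words; a zero vector if none are known."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

toy_glove = {"天气": [1.0, 0.0], "不错": [3.0, 2.0]}  # stand-in vectors
```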
5.3) Creating the similarity matrix
The similarity between sentences is calculated by using cosine similarity to find out the similarity points between sentences, an n-order zero matrix (n is the total number of sentences) is firstly defined, and the similarity matrix is initialized by calculating the cosine similarity between sentences. The sentence similarity matrix size for the experimental data was 238 × 238.
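Initialising the similarity matrix from cosine similarity, keeping the diagonal at zero as in the zero-matrix initialisation above:

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is a zero vector."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def similarity_matrix(vectors):
    """n x n matrix of pairwise cosine similarities; the diagonal stays 0
    so a sentence never votes for itself."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = cosine(vectors[i], vectors[j])
    return sim
```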
5.4) Extracting the abstract
Traditional extractive summarization methods are generally based on a graph model: sentences or phrases are scored by constructing a topological structure, with the goal of extracting the most general sentences of a text to form the abstract. TextRank is a classic extractive algorithm that replaces web pages with sentences and replaces web-page transition probabilities with similarity scores between sentences, storing the sentence similarity scores in a matrix analogous to the transition matrix of the PageRank algorithm. Extracting an abstract with TextRank mainly comprises the following steps:
(1) combining all documents into a whole;
(2) segmenting the text into a plurality of individual sentences;
(3) finding a word vector representation for each sentence;
(4) calculating the similarity between sentences by using cosine similarity to obtain a similarity matrix;
(5) converting the similarity matrix into a graph with sentences as nodes and similarity scores as the weights of edges;
(6) and calculating a TextRank score, and outputting a certain number of sentences or phrases to form a summary according to the score ranking.
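Steps (1)-(6) reduce to a small power iteration over the similarity graph. This is a minimal sketch of the TextRank scoring; the damping factor 0.85 and 50 iterations are conventional choices, not values stated in the original:

```python
def textrank(sim, d=0.85, iters=50):
    """PageRank power iteration over the sentence-similarity graph:
    similarity scores act as edge weights in place of the web-page
    transition probabilities."""
    n = len(sim)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(scores[j] * sim[j][i] / out_weight[j]
                       for j in range(n) if out_weight[j])
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

def top_sentences(sentences, sim, k=10):
    """Rank sentences by TextRank score and return the top k as the abstract."""
    scores = textrank(sim)
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in order[:k]]
```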
The sentence similarity matrix is used for constructing a graph structure, sentences are used as nodes, similarity scores are used as the weights of edges, the TextRank values of all the sentences are obtained through a PageRank algorithm, the sentences are sorted according to the TextRank values, and the top 10 sentences with the highest scores are taken out to serve as abstracts, as shown in a table 2.
TABLE 2 (the extracted summary sentences are reproduced only as an image in the original)
According to the abstract extraction results in Table 2, although microblog texts are short, general sentences expressing the main idea of a microblog do exist in the text data, so the extractive summarization algorithm works well on short texts.
In conclusion, the information level abstract extraction method for the short text of the social network has the advantages of high calculation efficiency and good abstract extraction effect. In an actual production environment, given short text data of the existing social network, the hierarchical information abstract result can be quickly obtained by using the method, the information can be quickly and effectively acquired, and the data analysis of a social network platform and the supervision work of the social network are facilitated.

Claims (9)

1. A social-network short-text-oriented information hierarchy abstract extraction method, characterized in that the short texts are hierarchically clustered and abstract extraction is performed on the text data corresponding to the clustering result, thereby obtaining a hierarchical abstract.
2. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 1, wherein the method comprises:
the information hierarchical clustering processing flow is roughly divided into three parts of data preprocessing, text feature extraction and short text hierarchical clustering, and the text data of the existing public data set facing the social network is subjected to feature analysis, data cleaning and preprocessing, so that the subject feature of the text is extracted and hierarchical clustering is carried out, and a hierarchical clustering tree is output;
the abstract extracting part needs to reprocess the text data, calculates the abstract score of the text based on the graph structure, and extracts the generalized sentences in the text as the abstract according to the score.
3. The method for extracting the short text information hierarchical abstract facing the social network according to claim 2, wherein noise data is filtered by observing characteristics of the original data during data cleaning, followed by preprocessing such as stop-word removal, Chinese word segmentation and word-count screening.
4. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 2, wherein an LDA topic model is used for extracting the topic features, so that the problem of overlarge feature dimension can be effectively avoided, and the topic feature matrix of the obtained text is output.
5. The method for extracting the short text-oriented information hierarchical abstract of the social network as claimed in claim 2, wherein the short text hierarchical clustering process requires iteratively performing two processes of constructing a clustering subtree and pruning the clustering subtree: when a clustering subtree is constructed, two nodes which are nearest neighbors to each other need to be found, and two stop conditions are met; after the sub-tree construction is completed, in order to avoid the occurrence of the elongated sub-tree, it is necessary to prune it.
6. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 5, wherein in the clustering process, the representative node of each layer is selected by calculating four indexes, and the four evaluation indexes of the representative node comprise:
the first is the node degree, in a cluster, a node with a larger degree is more important than a node with a smaller degree, and a node with a larger degree indicates that the node is positioned in the center of the data dense region; second, the neighbor mean degree of a node, that is, the mean value of the degrees of the neighbors of the node; third, step centrality; fourthly, the distance centrality is obtained;
the representativeness (centrality) of the node is further measured through the step length centrality and the distance centrality, a mixed index is provided by combining the four evaluation indexes, the mixed index is obtained according to a relative importance index calculation method in the complex data network, and the higher the mixed index score of the node is, the more representative the node is.
7. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 5, wherein, in order to avoid generating too many slender subtrees, the obtained clustering trees are combined to construct a multi-level clustering index tree, and in the obtained clustering result, the categories of the nodes of the clustering trees comprise: there is only one non-leaf node of a child, there are two or more non-leaf nodes of children and leaf nodes; in the recursion process of forming the nested clustering tree by linking all nodes, the processing modes of different types of nodes are different, and finally, the multi-level clustering index tree is obtained.
8. The social network short-text-oriented information hierarchical abstract extraction method as claimed in claim 2, wherein the short-text abstract extraction comprises:
for the preprocessed text data, constructing a multi-level information-abstract clustering tree based on the multi-level index tree, filling the text data corresponding to each index into the clustering-tree structure, and generating at each non-leaf node a text abstract of its child nodes; when generating an abstract, all texts are first merged into a whole and split into single sentences; sentence feature vectors are then generated and the similarity between sentences is computed; finally the similarity matrix is converted into a graph structure, abstract scores are computed, and the highest-scoring sentence is output as the abstract.
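The sentence-graph scoring pipeline described above resembles TextRank. A minimal sketch follows, using bag-of-words cosine similarity in place of the word-vector sentence features the patent describes; the function name and damping constant are illustrative.

```python
import numpy as np

def summarize(sentences, iters=50, d=0.85):
    """Pick the highest-scoring sentence: build sentence vectors, form a
    cosine-similarity graph, and run PageRank-style power iteration."""
    vocab = sorted({w for s in sentences for w in s.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.split():
            vecs[i, idx[w]] += 1
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T                 # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)
    rowsum = sim.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0] = 1
    m = sim / rowsum                    # row-normalized transition matrix
    score = np.ones(len(sentences)) / len(sentences)
    for _ in range(iters):
        score = (1 - d) / len(sentences) + d * (m.T @ score)
    return sentences[int(score.argmax())]

sents = ["the cat sat on the mat",
         "the cat ate fish",
         "dogs bark loudly",
         "the cat sat quietly"]
print(summarize(sents))
```

The unrelated sentence receives no incoming similarity weight and ends up with the lowest score, so one of the overlapping sentences is returned as the abstract.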
9. The social network short-text-oriented information hierarchical abstract extraction method as claimed in claim 8, wherein a Chinese text corpus, after jieba word segmentation and stop-word removal, is used to train Chinese word vectors with a GloVe model, and a word-vector file is output for computing the feature vectors of sentences.
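Computing a sentence feature vector from a GloVe-format word-vector file can be sketched as below. The tiny in-memory "file" is illustrative, not real GloVe output; in the described pipeline the tokens would come from jieba segmentation after stop-word removal.

```python
import numpy as np

def load_word_vectors(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into a dict of arrays."""
    vecs = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = np.array(parts[1:], dtype=float)
    return vecs

def sentence_vector(words, vecs):
    """Sentence feature vector as the mean of its known word vectors."""
    known = [vecs[w] for w in words if w in vecs]
    if not known:
        return np.zeros(next(iter(vecs.values())).shape)
    return np.mean(known, axis=0)

# tiny illustrative word-vector 'file' contents (not real GloVe output)
lines = ["社交 0.1 0.2", "网络 0.3 0.4", "文本 0.5 0.6"]
vecs = load_word_vectors(lines)
print(sentence_vector(["社交", "网络"], vecs))  # [0.2 0.3]
```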
CN202211069136.9A 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network Pending CN115617981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211069136.9A CN115617981A (en) 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211069136.9A CN115617981A (en) 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network

Publications (1)

Publication Number Publication Date
CN115617981A true CN115617981A (en) 2023-01-17

Family

ID=84857232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211069136.9A Pending CN115617981A (en) 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network

Country Status (1)

Country Link
CN (1) CN115617981A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150348A (en) * 2023-04-23 2023-05-23 南京邮电大学 Mixed unsupervised abstract generation method for long text

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
Thakkar et al. Graph-based algorithms for text summarization
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN105045875B (en) Personalized search and device
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN109299286A (en) The Knowledge Discovery Method and system of unstructured data
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
CN115617981A (en) Information level abstract extraction method for short text of social network
CN114238653A (en) Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
CN115982390B (en) Industrial chain construction and iterative expansion development method
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN108197295B (en) Application method of attribute reduction in text classification based on multi-granularity attribute tree
de Oliveira et al. A syntactic-relationship approach to construct well-informative knowledge graphs representation
CN115599915A (en) Long text classification method based on TextRank and attention mechanism
CN114722304A (en) Community search method based on theme on heterogeneous information network
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks
Ahmad et al. News article summarization: Analysis and experiments on basic extractive algorithms
CN114036907A (en) Text data amplification method based on domain features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination