CN115617981A - Information level abstract extraction method for short text of social network - Google Patents


Info

Publication number
CN115617981A
CN115617981A
Authority
CN
China
Prior art keywords
abstract
clustering
node
text
social network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211069136.9A
Other languages
Chinese (zh)
Inventor
谢文波
陈俊秀
王欣
陈斌
陈端兵
李艳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Petroleum University
Original Assignee
Southwest Petroleum University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Petroleum University filed Critical Southwest Petroleum University
Priority to CN202211069136.9A priority Critical patent/CN115617981A/en
Publication of CN115617981A publication Critical patent/CN115617981A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G06F16/35 Clustering; Classification

Abstract

The invention discloses an information-level abstract extraction method for short texts of a social network. Aiming at short text data of the social network, it performs hierarchical clustering by constructing a multi-level clustering index tree and extracts an abstract of the text data on the basis of the clustering result. Firstly, data cleaning and preprocessing are performed on the social-network text data set; secondly, feature extraction is carried out with an LDA model, which effectively avoids the curse of dimensionality; then the feature data are fed into an RSC algorithm model to generate a hierarchical clustering tree, and the clustering results are merged and optimized to obtain a multi-level clustering index tree; finally, information-level abstract extraction is performed based on the multi-level clustering index tree. Because an extractive summarization algorithm can pick out the most general sentences in short texts, the method uses the TextRank algorithm to extract the abstract. The invention helps people acquire information quickly and effectively, and benefits data analysis of social network platforms and supervision of social networks.

Description

Information level abstract extraction method for short text of social network
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method for extracting an information hierarchy abstract facing short text data of a social network.
Background
With the rapid development of science and technology, text data on the Internet grows exponentially. People can acquire information more conveniently and quickly, but it is very difficult to quickly find and extract useful information from massive text data. Text clustering technology automatically divides text data into appropriate category sets, reveals the distribution of content in a disordered data set, and narrows the range of information retrieval. Abstract extraction technology presents text content in a concise, time-saving and efficient manner, and a hierarchical abstract can help people acquire information quickly and effectively, distinguish right from wrong in the text, and make sound value judgments.
The existing text clustering technology has the following defects:
(i) The running time is long, so tasks with high timeliness requirements cannot be handled.
(ii) The clustering result is ambiguous, and it is difficult to extract effective information from it.
Therefore, the existing problems focus on how to effectively improve algorithm efficiency and result clarity so as to cope with the data-explosion problem of the big-data era. The present invention uses the RSC algorithm, which can effectively solve the above problems.
For defect one, the time complexity of the RSC algorithm used by the invention is O(n log n), which incurs less time overhead than the O(n²) time complexity of traditional clustering algorithms.
For the second defect, the RSC algorithm used by the invention is a hierarchical clustering algorithm, and the result obtained by the algorithm is a clustering tree, so that the hierarchical organization relation of data can be well shown, and the RSC algorithm has good data interpretation capability.
Disclosure of Invention
The invention aims to provide an information level abstract extraction method for short texts of a social network, which is used for helping to quickly and effectively acquire information and improving the management efficiency of a social network platform.
A social network short text oriented information hierarchy abstract extraction method comprises the following steps:
1. pre-processing the collected data set;
2. performing feature extraction on the text data;
3. performing hierarchical clustering on the text data;
4. constructing a multi-level clustering index tree based on the clustering result;
5. and extracting the information abstract layer by layer from bottom to top.
In order to achieve the purpose, the invention provides a method for extracting information level abstract of short text of a social network, which is characterized by comprising the following steps:
for the step 1, according to the text characteristics of the short text of the social network, targeted data cleaning operation needs to be performed on the text data, noise data such as webpage links, expressions, topics, reprint sources, location addresses and the like are filtered, and meanwhile, words with frequent occurrence times need to be deleted, and a stop word list more suitable for the data set is constructed.
And for the step 2, in order to avoid overlarge feature dimension, an LDA model is adopted to extract the subject features of the text.
For step 3, the RSC algorithm is used to perform hierarchical clustering on the short texts of the social network. On the data set, the algorithm detects reciprocal nearest neighbor (RNN) nodes and iteratively builds the clustering tree through two main processes, the creation and the pruning of clustering subtrees:
3.1 Creation of clustered subtrees
Each piece of text data is taken as a data point; nodes are linked one after another, and the building of a subtree stops when either of two stop conditions is met;
3.2) Pruning of clustering subtrees
After the construction of the clustering subtree is completed, in order to avoid the occurrence of the slender subtree, pruning operation needs to be carried out on the slender subtree.
For the step 4, the constructing of the multi-level clustering index tree based on the clustering result comprises:
4.1) Constructing a hierarchical clustering index tree
The clustering result obtained by the RSC algorithm is a clustering tree with dispersed nodes. The nodes in the clustering tree fall into three types: (1) non-leaf nodes with only one child; (2) non-leaf nodes with two or more children; (3) leaf nodes. The three node types are handled as follows when constructing the multi-level clustering index tree:
(i) If the node is a non-leaf node with only one child, judging the child type of the node, and if the node is also a non-leaf node, recursively creating the child of the node; if the child is the leaf node, creating a child node on the basis of the leaf node for displaying the text of the leaf node;
(ii) If the node is a non-leaf node with two or more children, traversing the child nodes to repeat the step (i);
(iii) If the node is a leaf node, create a child node beneath it to display the text data corresponding to the leaf node.
4.2) Optimized merging of the hierarchical clustering index tree
In the multi-level clustering index tree, many non-leaf nodes with only one child increase the depth of the clustering tree while contributing little to the hierarchical division; such nodes can even be regarded as nodes of the same hierarchy, so they are merged in the multi-level clustering index tree.
For the step 5, constructing the clustering tree with the information abstract according to the optimized and combined multi-level clustering index tree requires filling the text data corresponding to the index into the clustering tree structure, and generating the abstract of the text data corresponding to the child node at the non-leaf node:
5.1 Text selection and processing
The text data used for information summarization needs to be reprocessed to filter out data which easily causes reading obstacles, such as network expressions, characters, special symbols and the like.
5.2 Preparation of word vectors
In the process of generating information abstracts layer by layer from bottom to top, abstract extraction is carried out with the TextRank algorithm, which requires sentence features. GloVe is a word-representation tool based on global word-frequency statistics; using GloVe for the vectorized representation of words lets the vectors carry as much semantic and syntactic information as possible. Compared with feature-creation methods such as TF-IDF and Word2Vec, GloVe does not produce overly large feature dimensions and makes full use of the corpus.
5.3) Extracting the abstract
And finally, extracting the most general and representative sentences from each level of the clustering tree as the information abstract of the text.
Drawings
FIG. 1 is a flow chart of an embodiment.
Detailed Description
The following description of the present invention is provided in conjunction with fig. 1 to provide those skilled in the art with a better understanding of the present invention. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.
The invention uses the RSC algorithm to hierarchically cluster microblog text data and the TextRank algorithm to extract abstracts from the clustering results, providing an information-level abstract extraction method for short texts in a social network. Firstly, the social-network text data set is cleaned and preprocessed; secondly, feature extraction is carried out on the data, where using an LDA model effectively avoids the curse of dimensionality; then the feature data are fed into an RSC algorithm model to generate a hierarchical clustering tree, and the clustering results are merged and optimized to obtain a multi-level clustering index tree; finally, information-level abstract extraction is performed on the text data based on the multi-level clustering index tree, using the extractive summarization algorithm TextRank.
A social network short text oriented information hierarchy abstract extraction method comprises the following steps:
1. pre-processing the collected data set;
2. performing feature extraction on the text data;
3. performing hierarchical clustering on the text data;
4. constructing a multi-level clustering index tree based on the clustering result;
5. and extracting the information abstract layer by layer from bottom to top.
The invention relates to a social network short text oriented information level abstract extraction method, which comprises the following steps:
1. data pre-processing
The invention uses the microblogPCU data set, which was originally collected to detect spammers on microblogs; the data can also be used to study machine-learning methods and social networks. Specifically, the CSV data in the microblogPCU data set, i.e., the microblog content information posted by users, is used, and the data description is shown in Table 1.
Attribute	Description
post_id	ID of the microblog post
post_time	Release time of the microblog
content	Microblog content
poster_id	User ID of the microblog's poster
poster_url	Personal homepage address of the poster
repost_num	Number of reposts of the microblog
comment_num	Number of comments on the microblog
reposter_post_id	Reposter ID
inner_flag	Built-in label
TABLE 1
Since the original data is unstructured Chinese text data and there are many irregular representations in social network texts, such as some network expressions, facial characters, expressions, special symbols, etc., the texts need to be cleaned and preprocessed. The data preprocessing comprises four steps:
(1) filtering the noise data: and observing original data, and deleting noise data including webpage links, expressions, titles, topics, reprint sources, positioning addresses and the like by using a regular expression, and deleting redundant words with frequent occurrence times.
(2) Removing stop words: starting from the stop-word lexicon of the Machine Intelligence Laboratory of Sichuan University, the invention supplements and refines the existing stop-word list according to the specifics of the data, constructing a stop-word list better suited to this data set.
(3) Chinese word segmentation: the processing of Chinese text data requires word segmentation, and the invention adopts a jieba word segmentation tool to perform word segmentation.
(4) Word-count screening: because the data used by the method date from before 2015, the microblog texts are all within 140 characters. Since very short texts carry little value, microblog texts whose length is less than 10 after word segmentation and stop-word removal are deleted.
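The four preprocessing steps can be sketched in Python. The regular expressions and the tiny stop-word set below are illustrative stand-ins for the real ones, and a production pipeline would segment with jieba before removing stop words:

```python
import re

# Illustrative stop words; the real list extends the Sichuan University lexicon.
STOPWORDS = frozenset({"的", "了", "转发微博"})

def clean_weibo(text, min_len=10):
    """Clean one microblog post and return None if it is too short."""
    text = re.sub(r"https?://\S+", "", text)           # (1) web-page links
    text = re.sub(r"@[\w\u4e00-\u9fff]+", "", text)    # (1) @user mentions
    text = re.sub(r"#[^#]*#", "", text)                # (1) #topic# tags
    text = re.sub(r"\[[^\]]*\]", "", text)             # (1) [emoticon] codes
    text = re.sub(r"\s+", " ", text).strip()
    # (2)+(3): a real pipeline segments with jieba, then drops stop words;
    # here whitespace-separated chunks stand in for segmented words.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    text = "".join(tokens)
    return text if len(text) >= min_len else None      # (4) length screening
```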
2. Feature extraction
To avoid the problem of overly large feature dimensions, the LDA topic model is adopted for topic feature extraction. Class libraries such as scikit-learn, Spark MLlib and gensim can all train an LDA topic model, and their usage is basically similar; here LDA feature extraction is carried out with scikit-learn. The number of latent topics is set to 5, i.e., each microblog has 5 topic features, yielding an n × 5 LDA feature matrix (n is the number of microblogs).
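The step that feeds the LDA model can be illustrated with a plain-Python document-term matrix; an actual run would pass this matrix to scikit-learn's LatentDirichletAllocation with n_components=5 (not called here, so the sketch stays dependency-free):

```python
from collections import Counter

def doc_term_matrix(segmented_docs):
    """Build the n x V word-count matrix that LDA takes as input;
    LDA then compresses each row into the 5 topic weights used as features."""
    vocab = sorted({w for doc in segmented_docs for w in doc})
    col = {w: j for j, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in segmented_docs]
    for i, doc in enumerate(segmented_docs):
        for w, c in Counter(doc).items():
            matrix[i][col[w]] = c
    return matrix, vocab

docs = [["天气", "不错"], ["天气", "预报", "天气"]]
matrix, vocab = doc_term_matrix(docs)
```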
3. Hierarchical clustering of text
When the RSC algorithm is used for clustering analysis, two processes of constructing a clustering subtree and pruning are mainly executed in an iteration mode, and a clustering result obtained through the hierarchical clustering algorithm is a hierarchical clustering tree.
3.1) Construction of clustering subtrees
The data set contains text data posted by microblog users, and each piece of data (a word vector) is regarded as a data node. First, an empty subtree and a candidate set containing all nodes are initialized. If the candidate set is not empty, one node is selected at random and linked to its nearest neighbor, i.e., its first-order nearest neighbor; assuming each node has only one nearest neighbor, the first-order nearest neighbor is further linked to its own nearest neighbor, and so on, generating a linked list. The construction of a subtree stops under two conditions:
(i) The k-th order nearest neighbor and the (k-2)-th order nearest neighbor form a pair of RNN nodes;
(ii) The k-th order nearest neighbor is not in the candidate set.
When condition (i) is satisfied, a new subtree is formed: the (k-2)-th and (k-1)-th order nearest neighbors, as a pair of RNNs, become the support nodes of the clustering subtree. At this point a root node is constructed manually as the representative node of the subtree and linked to the support nodes, and all nodes of the subtree are deleted from the candidate set. If the candidate set is still not empty after deletion, a node is again selected at random and the process repeats; subtree construction stops when the candidate set is empty. When condition (ii) holds, the linked list is linked to an existing subtree.
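A minimal sketch of this subtree-construction loop, with a deterministic start node instead of a random one and a caller-supplied distance function (both simplifications of the procedure described above):

```python
def build_subtrees(points, dist):
    """Grow nearest-neighbor chains until the newest link is the reciprocal
    nearest neighbor of the previous one (stop condition i) or leaves the
    candidate set (stop condition ii)."""
    candidates = set(range(len(points)))
    assigned = {}            # node -> subtree index
    subtrees = []
    while candidates:
        chain = [min(candidates)]   # deterministic stand-in for random choice
        while True:
            last = chain[-1]
            # nearest neighbor of the chain's last node among all other nodes
            nn = min((j for j in range(len(points)) if j != last),
                     key=lambda j: dist(points[last], points[j]))
            if len(chain) >= 2 and nn == chain[-2]:
                # condition (i): chain[-2] and chain[-1] are a pair of RNNs
                tid = len(subtrees)
                subtrees.append({"support": (chain[-2], chain[-1]),
                                 "members": list(chain)})
                for n in chain:
                    assigned[n] = tid
                    candidates.discard(n)
                break
            if nn not in candidates:
                # condition (ii): attach the chain to nn's existing subtree
                tid = assigned[nn]
                subtrees[tid]["members"].extend(chain)
                for n in chain:
                    assigned[n] = tid
                    candidates.discard(n)
                break
            chain.append(nn)
    return subtrees
```

With the 1-D points below, the two tight pairs end up as two subtrees whose support nodes are the mutual nearest neighbors.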
3.2) Pruning of clustering subtrees
After a subtree is constructed, it must be pruned to avoid elongated subtrees. For any node i belonging to an existing subtree with support nodes p and q, the depth of node i is the average of its shortest path lengths to p and q, which can be regarded as its path length to the root node. When each generated subtree is pruned, all nodes whose depth exceeds a threshold are cut off according to their path lengths, and each pruned node is linked to an artificially created root node, forming a special subtree that consists of only one node and one root.
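The depth rule can be sketched directly: breadth-first hop counts stand in for the shortest path lengths, and the threshold is a free parameter (the original does not state its value):

```python
from collections import deque

def bfs_dist(adj, src):
    """Shortest path lengths (in hops) from src over an adjacency dict."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def prune(adj, p, q, threshold):
    """Depth of node i = average of its distances to the support nodes p
    and q; nodes deeper than threshold are cut off into singleton subtrees."""
    dp, dq = bfs_dist(adj, p), bfs_dist(adj, q)
    kept, pruned = [], []
    for node in adj:
        depth = (dp[node] + dq[node]) / 2
        (pruned if depth > threshold else kept).append(node)
    return kept, pruned
```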
3.3) Representative-node evaluation indexes
In the process of generating clustering subtrees, when stop condition (i) is satisfied, the more representative node of a pair of RNN nodes must be selected. The RSC algorithm provides four indexes for evaluating RNN nodes: the degree of the node, the neighbor average degree of the node, the step centrality, and the distance centrality. The degree and the neighbor average degree are easy to understand: within one cluster, a node with a larger degree is more important than a node with a smaller degree, and if a node's neighbors have large degrees, the node itself is more important, meaning it lies at the center of a dense region of the data. The degree and neighbor average degree of node i are defined as follows:
$$d_i = \sum_j r_{ij}, \qquad \bar{d}_i = \frac{1}{d_i} \sum_j r_{ij} d_j$$
where $d_i$ is the degree of node i, $\bar{d}_i$ is the neighbor average degree of node i, and $r_{ij}$ are the elements of the relationship matrix described by the graph structure of the data points. To further measure the centrality of RNNs, two indexes, step centrality and distance centrality, are adopted. The step centrality of node i is defined as follows:
$$C_s(i) = \frac{|\tau| - 1}{\sum_{j \in \tau} s_{ij}}$$
where $|\tau|$ is the number of nodes in the cluster $\tau$ containing node i, and $s_{ij}$ is the shortest path length between node i and node j. The distance centrality of node i is defined as follows:
$$C_d(i) = \frac{|\tau| - 1}{\sum_{j \in \tau} dist_{ij}}$$
where $dist_{ij}$ is the distance between node i and node j. To measure the representativeness of nodes more fairly and reasonably, a mixed index combining the four evaluation indexes is proposed; it is calculated following relative-importance index methods for complex networks and is defined as follows:
(mixed-index formula combining the four indexes above; reproduced only as an image in the original)
if the mixed index score of a node is higher, the node is more representative.
3.4) Clustering results
The result clustering tree is represented by a dictionary, and comprises a plurality of key-value pairs, and each non-leaf node and its children are represented by the key-value pairs. All microblog texts are regarded as nodes in the tree, and the names of the nodes are represented by the row indexes of the microblog texts. In the result clustering tree, keys represent non-leaf nodes, and the values of the keys represent children of the non-leaf nodes, represented by a set.
Assume a tree T: {1: {0},2: {1,3,12},3: {4,5},5: {6,7} }, non-leaf nodes in tree T are {1,2,3,5}, wherein nodes 1 and 3 are children of node 2, node 5 is a child of node 3, and the remaining nodes {0,12,4,6,7} are leaf nodes, then the root node of tree T is node 2.
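Reading such a dictionary tree is straightforward; the helper below recovers the non-leaf nodes, leaves and root for the example T:

```python
def analyze(tree):
    """tree: dict mapping each non-leaf node to the set of its children.
    Leaves are children that never appear as keys; the root is the
    non-leaf node that is nobody's child."""
    non_leaves = set(tree)
    children = set().union(*tree.values())
    return non_leaves, children - non_leaves, non_leaves - children

T = {1: {0}, 2: {1, 3, 12}, 3: {4, 5}, 5: {6, 7}}
non_leaves, leaves, roots = analyze(T)
```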
4. Building a hierarchical clustering index tree
In the clustering tree with dispersed nodes, because of the storage structure, different types of nodes are handled differently during the recursive process of linking nodes into a nested clustering tree. The node types must therefore be identified; the nodes of the clustering tree are divided into the following three types: (1) non-leaf nodes with only one child; (2) non-leaf nodes with two or more children; (3) leaf nodes. The three node types are handled as follows when constructing the clustering index tree:
(i) If the node is a non-leaf node with only one child, judging the child type of the node, and if the node is also a non-leaf node, recursively creating the child of the node; if the child is a leaf node, a child node is created on the basis of the leaf node for displaying the text of the leaf node.
(ii) If the node is a non-leaf node having two or more children, then step (i) is repeated traversing its child nodes.
(iii) If the node is a leaf node, create a child node beneath it to display the text data corresponding to the leaf node.
In the multi-level index tree, a plurality of non-leaf nodes with only one child can be found, the nodes can increase the depth of the clustering tree, but the division capability of the hierarchy is not obvious, and the nodes can be even regarded as nodes of the same hierarchy, so that the nodes in the index tree are merged.
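The merging step can be sketched as collapsing every chain of single-child non-leaf nodes; display children attached to leaves are outside the scope of this sketch:

```python
def merge_single_child(tree):
    """Collapse chains of non-leaf nodes that have exactly one (non-leaf)
    child, which add depth without adding any real hierarchical division."""
    tree = {k: set(v) for k, v in tree.items()}
    changed = True
    while changed:
        changed = False
        for node, kids in list(tree.items()):
            if len(kids) == 1:
                (child,) = kids
                if child in tree:      # single non-leaf child: absorb it
                    tree[node] = tree.pop(child)
                    changed = True
                    break
    return tree
```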
5. Abstract extraction
The clustering tree with the information abstract is constructed according to the multi-level clustering index tree, and only the text data corresponding to the index is filled into the clustering tree structure, and the abstract of the text data corresponding to the child node is generated at the non-leaf node.
5.1) Data selection and processing
Part of the text data in the clustering result is selected for abstract extraction, and this data needs basic text cleaning so that noise affects the extraction as little as possible. Unlike the earlier preprocessing, where all punctuation was deleted, abstract extraction must retain some punctuation in order to split sentences, while still removing data that hinders reading, such as network vocabulary, facial characters and special symbols.
5.2) Preparing word vectors
The method comprises the steps of training GloVe word vectors by using Chinese Wikipedia text corpus which is subjected to word segmentation and word stop removal, outputting word vector files, obtaining vectors of all constituent words of each sentence from the GloVe word vector files, and then taking the average value of the vectors to obtain the merging vector of the sentences as the feature vector of the sentence.
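The sentence-feature step is just an element-wise mean; the toy embedding dict below stands in for the trained GloVe vectors:

```python
def sentence_vector(words, embeddings, dim):
    """Feature vector of a sentence: the element-wise mean of the vectors
    of its known words; a zero vector if none are known."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

toy_glove = {"天气": [1.0, 0.0], "不错": [3.0, 2.0]}  # stand-in vectors
```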
5.3) Creating the similarity matrix
The similarity between sentences is calculated by using cosine similarity to find out the similarity points between sentences, an n-order zero matrix (n is the total number of sentences) is firstly defined, and the similarity matrix is initialized by calculating the cosine similarity between sentences. The sentence similarity matrix size for the experimental data was 238 × 238.
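Initialising the similarity matrix from cosine similarity, keeping the diagonal at zero as in the zero-matrix initialisation above:

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is a zero vector."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den if den else 0.0

def similarity_matrix(vectors):
    """n x n matrix of pairwise cosine similarities; the diagonal stays 0
    so a sentence never votes for itself."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim[i][j] = cosine(vectors[i], vectors[j])
    return sim
```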
5.4) Extracting the abstract
Traditional extractive summarization methods are generally based on a graph model: sentences or phrases are scored by constructing a topological structure, with the goal of extracting the most general sentences of a text to form the abstract. TextRank is a classic extractive algorithm that replaces web pages with sentences and replaces web-page transition probabilities with similarity scores between sentences, storing the sentence similarity scores in a matrix analogous to the transition matrix of the PageRank algorithm. Extracting an abstract with TextRank mainly comprises the following steps:
(1) combining all documents into a whole;
(2) segmenting the text into a plurality of individual sentences;
(3) finding a word vector representation for each sentence;
(4) calculating the similarity between sentences by using cosine similarity to obtain a similarity matrix;
(5) converting the similarity matrix into a graph with sentences as nodes and similarity scores as the weights of edges;
(6) and calculating a TextRank score, and outputting a certain number of sentences or phrases to form a summary according to the score ranking.
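Steps (1)-(6) reduce to a small power iteration over the similarity graph. This is a minimal sketch of the TextRank scoring; the damping factor 0.85 and 50 iterations are conventional choices, not values stated in the original:

```python
def textrank(sim, d=0.85, iters=50):
    """PageRank power iteration over the sentence-similarity graph:
    similarity scores act as edge weights in place of the web-page
    transition probabilities."""
    n = len(sim)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(scores[j] * sim[j][i] / out_weight[j]
                       for j in range(n) if out_weight[j])
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

def top_sentences(sentences, sim, k=10):
    """Rank sentences by TextRank score and return the top k as the abstract."""
    scores = textrank(sim)
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in order[:k]]
```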
The sentence similarity matrix is used for constructing a graph structure, sentences are used as nodes, similarity scores are used as the weights of edges, the TextRank values of all the sentences are obtained through a PageRank algorithm, the sentences are sorted according to the TextRank values, and the top 10 sentences with the highest scores are taken out to serve as abstracts, as shown in a table 2.
TABLE 2 (the extracted summary sentences are reproduced only as an image in the original)
According to the abstract extraction results in Table 2, although microblog texts are short, general sentences expressing the main idea of a microblog do exist in the text data, so the extractive summarization algorithm works well on short texts.
In conclusion, the information level abstract extraction method for the short text of the social network has the advantages of high calculation efficiency and good abstract extraction effect. In an actual production environment, given short text data of the existing social network, the hierarchical information abstract result can be quickly obtained by using the method, the information can be quickly and effectively acquired, and the data analysis of a social network platform and the supervision work of the social network are facilitated.

Claims (9)

1. A social-network short-text-oriented information hierarchy abstract extraction method, characterized in that the short texts are hierarchically clustered and abstract extraction is performed on the text data corresponding to the clustering result, thereby obtaining a hierarchical abstract.
2. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 1, wherein the method comprises:
the information hierarchical clustering processing flow is roughly divided into three parts of data preprocessing, text feature extraction and short text hierarchical clustering, and the text data of the existing public data set facing the social network is subjected to feature analysis, data cleaning and preprocessing, so that the subject feature of the text is extracted and hierarchical clustering is carried out, and a hierarchical clustering tree is output;
the abstract extracting part needs to reprocess the text data, calculates the abstract score of the text based on the graph structure, and extracts the generalized sentences in the text as the abstract according to the score.
3. The method for extracting the short text information hierarchical abstract facing the social network according to claim 2, wherein noise data is filtered by observing characteristics of the original data during data cleaning, followed by preprocessing such as stop-word removal, Chinese word segmentation and word-count screening.
4. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 2, wherein an LDA topic model is used for extracting the topic features, so that the problem of overlarge feature dimension can be effectively avoided, and the topic feature matrix of the obtained text is output.
5. The method for extracting the short text-oriented information hierarchical abstract of the social network as claimed in claim 2, wherein the short text hierarchical clustering process requires iteratively performing two processes of constructing a clustering subtree and pruning the clustering subtree: when a clustering subtree is constructed, two nodes which are nearest neighbors to each other need to be found, and two stop conditions are met; after the sub-tree construction is completed, in order to avoid the occurrence of the elongated sub-tree, it is necessary to prune it.
6. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 5, wherein in the clustering process, the representative node of each layer is selected by calculating four indexes, and the four evaluation indexes of the representative node comprise:
the first is the node degree, in a cluster, a node with a larger degree is more important than a node with a smaller degree, and a node with a larger degree indicates that the node is positioned in the center of the data dense region; second, the neighbor mean degree of a node, that is, the mean value of the degrees of the neighbors of the node; third, step centrality; fourthly, the distance centrality is obtained;
the representativeness (centrality) of the node is further measured through the step length centrality and the distance centrality, a mixed index is provided by combining the four evaluation indexes, the mixed index is obtained according to a relative importance index calculation method in the complex data network, and the higher the mixed index score of the node is, the more representative the node is.
7. The method for extracting the short text-oriented information hierarchy abstract of the social network as claimed in claim 5, wherein, in order to avoid generating too many slender subtrees, the obtained clustering trees are combined to construct a multi-level clustering index tree, and in the obtained clustering result, the categories of the nodes of the clustering trees comprise: there is only one non-leaf node of a child, there are two or more non-leaf nodes of children and leaf nodes; in the recursion process of forming the nested clustering tree by linking all nodes, the processing modes of different types of nodes are different, and finally, the multi-level clustering index tree is obtained.
8. The social network short-text-oriented information hierarchical abstract extraction method as claimed in claim 2, wherein the short-text abstract extraction comprises:
for the preprocessed text data, constructing a multi-level information-abstract clustering tree based on the multi-level index tree, filling the text data corresponding to each index into the clustering-tree structure, and generating at each non-leaf node a text abstract of its child nodes; when generating an abstract, all texts are first merged into a whole and split into single sentences; sentence feature vectors are then generated and the similarity between sentences is computed; finally the similarity matrix is converted into a graph structure, abstract scores are computed, and the highest-scoring sentence is output as the abstract.
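The sentence-graph scoring pipeline described above resembles TextRank. A minimal sketch follows, using bag-of-words cosine similarity in place of the word-vector sentence features the patent describes; the function name and damping constant are illustrative.

```python
import numpy as np

def summarize(sentences, iters=50, d=0.85):
    """Pick the highest-scoring sentence: build sentence vectors, form a
    cosine-similarity graph, and run PageRank-style power iteration."""
    vocab = sorted({w for s in sentences for w in s.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.split():
            vecs[i, idx[w]] += 1
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T                 # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)
    rowsum = sim.sum(axis=1, keepdims=True)
    rowsum[rowsum == 0] = 1
    m = sim / rowsum                    # row-normalized transition matrix
    score = np.ones(len(sentences)) / len(sentences)
    for _ in range(iters):
        score = (1 - d) / len(sentences) + d * (m.T @ score)
    return sentences[int(score.argmax())]

sents = ["the cat sat on the mat",
         "the cat ate fish",
         "dogs bark loudly",
         "the cat sat quietly"]
print(summarize(sents))
```

The unrelated sentence receives no incoming similarity weight and ends up with the lowest score, so one of the overlapping sentences is returned as the abstract.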
9. The social network short-text-oriented information hierarchical abstract extraction method as claimed in claim 8, wherein a Chinese text corpus, after jieba word segmentation and stop-word removal, is used to train Chinese word vectors with a GloVe model, and a word-vector file is output for computing the feature vectors of sentences.
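Computing a sentence feature vector from a GloVe-format word-vector file can be sketched as below. The tiny in-memory "file" is illustrative, not real GloVe output; in the described pipeline the tokens would come from jieba segmentation after stop-word removal.

```python
import numpy as np

def load_word_vectors(lines):
    """Parse GloVe-style text lines ('word v1 v2 ...') into a dict of arrays."""
    vecs = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vecs[parts[0]] = np.array(parts[1:], dtype=float)
    return vecs

def sentence_vector(words, vecs):
    """Sentence feature vector as the mean of its known word vectors."""
    known = [vecs[w] for w in words if w in vecs]
    if not known:
        return np.zeros(next(iter(vecs.values())).shape)
    return np.mean(known, axis=0)

# tiny illustrative word-vector 'file' contents (not real GloVe output)
lines = ["社交 0.1 0.2", "网络 0.3 0.4", "文本 0.5 0.6"]
vecs = load_word_vectors(lines)
print(sentence_vector(["社交", "网络"], vecs))  # [0.2 0.3]
```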
CN202211069136.9A 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network Pending CN115617981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211069136.9A CN115617981A (en) 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211069136.9A CN115617981A (en) 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network

Publications (1)

Publication Number Publication Date
CN115617981A true CN115617981A (en) 2023-01-17

Family

ID=84857232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211069136.9A Pending CN115617981A (en) 2022-09-02 2022-09-02 Information level abstract extraction method for short text of social network

Country Status (1)

Country Link
CN (1) CN115617981A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150348A (en) * 2023-04-23 2023-05-23 南京邮电大学 Mixed unsupervised abstract generation method for long text

Similar Documents

Publication Publication Date Title
CN109189942B (en) Construction method and device of patent data knowledge graph
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
Thakkar et al. Graph-based algorithms for text summarization
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN105045875B (en) Personalized search and device
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN114218389A (en) Long text classification method in chemical preparation field based on graph neural network
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN109299286A (en) The Knowledge Discovery Method and system of unstructured data
CN116362243A (en) Text key phrase extraction method, storage medium and device integrating incidence relation among sentences
CN115617981A (en) Information level abstract extraction method for short text of social network
CN114238653A (en) Method for establishing, complementing and intelligently asking and answering knowledge graph of programming education
CN115982390B (en) Industrial chain construction and iterative expansion development method
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN108197295B (en) Application method of attribute reduction in text classification based on multi-granularity attribute tree
de Oliveira et al. A syntactic-relationship approach to construct well-informative knowledge graphs representation
CN115599915A (en) Long text classification method based on TextRank and attention mechanism
CN114722304A (en) Community search method based on theme on heterogeneous information network
Tohalino et al. Using virtual edges to extract keywords from texts modeled as complex networks
Ahmad et al. News article summarization: Analysis and experiments on basic extractive algorithms
CN114036907A (en) Text data amplification method based on domain features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination