CN110781289B

CN110781289B - Text visualization method for preserving unstructured text semantics

Info

Publication number: CN110781289B
Application number: CN201911081479.5A
Authority: CN
Inventors: 周锋; 汪文君; 李小勇
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2022-07-15
Anticipated expiration: 2039-11-07
Also published as: CN110781289A

Abstract

The invention provides a social media text visualization method for preserving unstructured text semantics, comprising the following steps: step S101, performing word segmentation, filtering and part-of-speech tagging on an input text and obtaining the dependency relationships among its words; step S102, constructing a syntax binary tree based on the part-of-speech tags and the dependency relationships among words, calculating the emotion polarity of each text, and dividing the text set into a positive class and a negative class; step S103, generating vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence relationships of words within each text, so that semantics are preserved; step S104, allocating visual space according to the weights of the positive and negative text sets, and designing the visual fonts and colors; step S105, using a layout algorithm to show the semantic relationships within each sequence pattern and among sequence patterns; and step S106, introducing interaction design so that the user can focus on local details. The method visualizes social media text and clearly presents the emotional orientation, the opinion semantics and the degree of public support expressed in the text, so that text information is presented effectively and text analysis is facilitated.

Description

Text visualization method for preserving unstructured text semantics

Technical Field

The invention relates to the technical field of data visualization, and in particular to a text visualization method for preserving unstructured text semantics.

Background

The traditional disciplines related to data visualization include scientific visualization and information visualization, whose purpose is to extract information and insight from big data and to display them in an intuitive way. Within visualization, the visualization of text information is an important research branch. Text visualization displays, vividly and intuitively, the semantic features contained in large amounts of text, such as word frequency, word importance, the logical structure of a text, topic clustering across multiple texts, and trends in how topics change over time.

A typical text visualization technique is the word cloud (or tag cloud), in which extracted keywords are sorted according to a certain rule (e.g., word frequency), laid out according to a certain rule, and distinguished by graphic attributes such as font size, color or typeface, so as to visualize the keywords. Once topic popularity could be perceived well, research attention turned to showing the semantics contained in the text, namely its logical structure and narrative mode. Since then, a series of models for visualizing the semantic structure of text have been proposed. For example, DAViewer shows the narrative structure of a text as a tree to visualize its semantics, and also lists similarity statistics among texts, the retrieval structure of the texts and the specific text content; DocuBurst shows the semantic structure of a text as a radial, circular layout.

Existing visualization models achieve good results to some extent, but each focuses either on perceiving the overall content or on reflecting text semantics, which limits their capacity for text analysis. The invention provides a novel visual structure that conveys emotional tendency, preserves the semantic content of unstructured text, and presents both to users in an intuitive, visual way, so that public opinion analysts and ordinary users can better perceive the text information.

Disclosure of Invention

In view of this, the present invention provides a text visualization method for preserving unstructured text semantics, which comprises the following steps:

step S101, performing word segmentation, filtering and part-of-speech tagging on an input text, and obtaining the dependency relationships among its words;

step S102, calculating the emotion polarity of each text based on the part-of-speech tags and the dependency relationships among words, and dividing the text set into a positive class and a negative class;

step S103, generating vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence relationships of words within each text, so that semantics are preserved;

step S104, allocating visual space according to the weights of the positive and negative text sets, and designing the visual fonts and colors;

step S105, using a layout algorithm to display the semantic relationships within each sequence pattern and among sequence patterns;

and step S106, introducing interaction design so that the user can focus on local details.

The specific method for calculating the emotion polarity of a single text in step S102 is as follows: first, a syntactic parser is used to parse the text and obtain the dependency relationships among words and the emotion polarity of each word; then a syntax binary tree is constructed for the sentence based on the obtained dependency relationships; finally, using the dependency relationships among words and a rule-based method, the judgment of sentence emotion is converted into a symbolic calculation over the tree.

The specific method for constructing the syntax binary tree in step S102 is as follows: first, create an empty stack and read in the first word of the sentence; second, if there is no next word, jump to the fifth step, otherwise read in the next word; third, check the dependency relationship between the two nodes at the top of the stack: if a dependency exists, generate a parent node, compute the part of speech of the parent node according to the emotion calculation rule, and go to the next step; if no dependency exists, jump to the second step; fourth, if the stack still holds at least two nodes, jump to the third step, otherwise jump to the second step; fifth, output the emotion polarity of the node in the stack, which is the emotion polarity of the whole text.

The specific method for generating the vocabulary sequence patterns in step S103 is as follows: in the initial state, the sequence pattern generation tree contains only the given initial sequence; in each iteration, the highest-frequency sequence pattern is popped and a sequence pattern containing one more word than it is sought; the new sequence pattern becomes the left child of the original pattern's node in the tree, and the original sequence pattern becomes the right child of that node; the frequency of the original sequence pattern is thereby divided into two parts, a part containing the new sequence pattern and a part not containing it; the process repeats until the number of words that remain to be displayed reaches 0.

The specific method for allocating the visual space and designing the visual interface in step S104 is as follows: of the two classes of text, positive and negative, the class with the larger weight is placed above and the class with the smaller weight below, each occupying a share of the area proportional to its weight; the positive and negative text sets use edges of different colors to connect nodes, and the occurrence frequency is encoded twice, by font size and by transparency.
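The allocation rule for step S104 can be illustrated with a minimal sketch. The function names, the use of a vertical split of a drawing area of a given height, the linear scaling, and the font-size and opacity ranges are all assumptions made for illustration; the source only states that area follows the weight ratio and that frequency is encoded by font size and transparency.

```python
def allocate_space(pos_weight, neg_weight, height):
    """Split a drawing area of the given height between the two classes.

    The heavier class goes on top and each class gets a share of the
    height proportional to its weight (vertical split is an assumption).
    """
    total = pos_weight + neg_weight
    if total <= 0:
        raise ValueError("at least one class must have a positive weight")
    ordered = sorted(
        [("positive", pos_weight), ("negative", neg_weight)],
        key=lambda item: -item[1],
    )
    top_height = height * ordered[0][1] / total
    return [
        {"class": ordered[0][0], "y": 0.0, "height": top_height},
        {"class": ordered[1][0], "y": top_height, "height": height - top_height},
    ]


def encode_frequency(freq, max_freq, font=(12.0, 36.0), alpha=(0.4, 1.0)):
    """Map a frequency to both font size and opacity (hypothetical ranges)."""
    t = freq / max_freq if max_freq > 0 else 0.0
    return {
        "font_size": font[0] + t * (font[1] - font[0]),
        "opacity": alpha[0] + t * (alpha[1] - alpha[0]),
    }


regions = allocate_space(pos_weight=30, neg_weight=70, height=600)
style = encode_frequency(freq=5, max_freq=10)
```

Here the negative class, being heavier, is assigned the upper 70% of the area, and a word at half the maximum frequency receives a mid-range font size and opacity.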

The specific method of the layout algorithm in step S105 is as follows: a force-directed layout is adopted, in which the horizontal order of the words of a sequence pattern matches their order within the pattern, and two sequence patterns that are subsequences of the same sequence pattern are arranged vertically.
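One way to express these two layout rules is as separation constraints handed to a constraint-based force-directed layout. The sketch below only generates such constraints as plain dictionaries; the field names (`axis`, `left`, `right`, `gap`), the gap values, the function name and the way sibling patterns are paired are assumptions for illustration and should be checked against whatever layout library is actually used.

```python
def layout_constraints(patterns, siblings, x_gap=40, y_gap=25):
    """Build separation constraints for a constraint-based force layout.

    patterns: dict mapping a pattern name to its ordered tuple of words.
    siblings: pairs of pattern names derived from the same parent pattern.
    """
    constraints = []
    seen = set()
    # Horizontal rule: consecutive words of a pattern keep their order.
    for words in patterns.values():
        for left, right in zip(words, words[1:]):
            if (left, right) not in seen:
                seen.add((left, right))
                constraints.append(
                    {"axis": "x", "left": left, "right": right, "gap": x_gap}
                )
    # Vertical rule (assumed reading): words that differ between two
    # sibling patterns are stacked one above the other.
    for upper, lower in siblings:
        only_upper = [w for w in patterns[upper] if w not in patterns[lower]]
        only_lower = [w for w in patterns[lower] if w not in patterns[upper]]
        for a, b in zip(only_upper, only_lower):
            constraints.append({"axis": "y", "left": a, "right": b, "gap": y_gap})
    return constraints


demo_patterns = {
    "p1": ("phone", "is", "good"),
    "p2": ("phone", "is", "bad"),
}
demo_constraints = layout_constraints(demo_patterns, siblings=[("p1", "p2")])
```

For the two demo patterns this yields three horizontal constraints (phone before is, is before good, is before bad) and one vertical constraint that places "good" above "bad".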

The specific method of the interaction design in step S106 is as follows: in the initial state, the model displays a composite view of all sequence patterns; when the user hovers the mouse over a word, the words that belong to the same sequence pattern as the hovered word are highlighted and all other words are dimmed, so that the semantics of that sequence pattern are shown clearly; at the same time, the model displays, in a floating layer, the text that contains the sequence pattern and has the highest weight, so as to reveal more detailed information.
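The hover behavior can be sketched independently of any rendering code. In the sketch below, the function names, the data shapes, and the tie-breaking choice (when a word occurs in several patterns, the highest-frequency one is selected) are assumptions; the source does not say how text weights are computed, so they are passed in as given numbers.

```python
def is_subsequence(pattern, tokens):
    """True if the pattern's words occur in tokens in the same order."""
    remaining = iter(tokens)
    return all(word in remaining for word in pattern)


def on_hover(word, patterns, texts):
    """Decide what to highlight, dim and show when a word is hovered.

    patterns: list of (pattern_tuple, frequency).
    texts: list of (tokens, weight, raw_text).
    """
    containing = [p for p in patterns if word in p[0]]
    if not containing:
        return None
    # Assumption: pick the most frequent pattern that contains the word.
    pattern, _ = max(containing, key=lambda p: p[1])
    all_words = {w for p, _ in patterns for w in p}
    matches = [t for t in texts if is_subsequence(pattern, t[0])]
    best = max(matches, key=lambda t: t[1], default=None)
    return {
        "highlight": set(pattern),
        "dim": all_words - set(pattern),
        "tooltip": best[2] if best else None,
    }


hover_patterns = [(("phone", "is", "good"), 2), (("phone", "is", "bad"), 1)]
hover_texts = [
    (["phone", "is", "good"], 3, "phone is good"),
    (["the", "phone", "is", "really", "good"], 10, "the phone is really good"),
    (["phone", "is", "bad"], 7, "phone is bad"),
]
hover_state = on_hover("good", hover_patterns, hover_texts)
```

The returned dictionary can then drive the view: words in `highlight` keep full emphasis, words in `dim` are faded, and `tooltip` holds the text for the floating layer.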

Drawings

Some specific embodiments of the invention will be described in detail hereinafter, by way of illustration and not limitation, with reference to the accompanying drawings. The steps and main algorithms of the method are described in the form of pseudo codes in the attached drawings. Those skilled in the art will appreciate that these figures are not necessarily straightforward to implement. The objects and features of the present invention will become more apparent in view of the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of the visualization algorithm of a text visualization method for preserving unstructured text semantics according to an embodiment of the present invention.

Fig. 2 is a syntax binary tree construction algorithm according to an embodiment of the present invention.

Fig. 3 is a vocabulary sequence pattern generation algorithm according to an embodiment of the present invention.

Detailed Description

In order to make the gist of the present invention easier to understand, the invention is further described below with reference to the accompanying drawings and examples. In the following description, numerous specific details and examples are set forth in order to provide a thorough understanding of the present invention. The invention can be embodied in many forms other than those described herein, and those skilled in the art will appreciate that it is not limited to the specific examples and figures disclosed below, since various modifications can be made without departing from the scope of the invention.

While the present invention has been described with reference to the particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It will be understood by those skilled in the art that variations and modifications of the embodiments of the present invention can be made without departing from the scope and spirit of the invention.

Fig. 1 shows a flowchart of the visualization algorithm of a text visualization method for preserving unstructured text semantics according to an embodiment of the present invention. The method comprises the following steps. Step S101: perform word segmentation, filtering and part-of-speech tagging on the input text. Step S102: construct a syntax tree based on the part-of-speech tags and the dependency relationships among words, compute the emotion polarity of each parent node from the bottom up according to the calculation rule until the root node is reached, whose polarity is the emotion polarity of the text, and divide the text set into a positive class and a negative class. Step S103: generate vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence relationships of words within each text, so that semantics are preserved. Step S104: allocate visual space according to the weights of the positive and negative text sets, and design the visual fonts and colors. Step S105: use a layout algorithm to display the semantic relationships within each sequence pattern and among sequence patterns. Step S106: introduce interaction design so that the user can focus on local details.

In this implementation, existing word segmentation tools are used to segment the text, filter stop words, tag parts of speech and obtain the dependency relationships among words, and the visualization is drawn with a force-directed method based on d3.js and cola.js.

Fig. 2 shows the construction algorithm of the syntax binary tree according to an embodiment of the present invention. The input of the algorithm is a single segmented text together with the dependency relationships among its words, and the output is the emotion polarity of the text. The algorithm has five steps. First, create an empty stack and read in the first word of the sentence. Second, if there is no next word, jump to the fifth step; otherwise read in the next word. Third, check the dependency relationship between the two nodes at the top of the stack: if a dependency exists, generate a parent node, compute the part of speech of the parent node according to the emotion calculation rule, and go to the next step; if no dependency exists, jump to the second step. Fourth, if the stack still holds at least two nodes, jump to the third step; otherwise jump to the second step. Fifth, output the emotion polarity of the node in the stack, which is the emotion polarity of the whole text.
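The five steps can be read as a shift-reduce loop over a stack. The following is a minimal sketch under stated assumptions, not the pseudocode of Fig. 2: the names `Node` and `build_syntax_tree` are hypothetical, dependencies are given as pairs of word indices, a merged node is treated as dependent on another node if any word it covers is linked to a word of the other, and the emotion calculation rule is passed in as a `combine` function.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Set, Tuple


@dataclass
class Node:
    label: object                      # emotion label of this node
    span: FrozenSet[int]               # indices of the words the node covers
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_syntax_tree(
    labels: List[object],
    deps: Set[Tuple[int, int]],
    combine: Callable[[object, object], object],
) -> List[Node]:
    """Return the stack left after reading all words (ideally one root)."""

    def linked(a: Node, b: Node) -> bool:
        # Assumption: nodes are dependent if any of their words are linked.
        return any(
            (i, j) in deps or (j, i) in deps for i in a.span for j in b.span
        )

    stack: List[Node] = []
    for index, label in enumerate(labels):       # steps 1 and 2: read words
        stack.append(Node(label, frozenset([index])))
        # Steps 3 and 4: merge while the top two nodes are dependent.
        while len(stack) >= 2 and linked(stack[-2], stack[-1]):
            right = stack.pop()
            left = stack.pop()
            parent_label = combine(left.label, right.label)
            stack.append(Node(parent_label, left.span | right.span, left, right))
    return stack                                  # step 5: read off the root


# Toy input: three words labelled 0, 0, 1 with dependencies 1-2 and 0-2.
keep_non_neutral = lambda l, r: l if l != 0 else r   # stand-in rule
tree_stack = build_syntax_tree([0, 0, 1], {(1, 2), (0, 2)}, keep_non_neutral)
```

In the toy input the second and third words merge first, then the result merges with the first word, leaving a single root whose label is the polarity of the text. If several nodes remain on the stack, the sketch returns them all, since the source does not say how that case is resolved.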

The emotion calculation rules in the preferred embodiment are as follows. In a simple sentence, words expressing positive emotion are marked as 1, words expressing negative emotion are marked as -1, neutral words are marked as 0, degree words are marked as one, and negation words are marked as "!". During calculation, 0 and "!" change the polarity to its reverse, and when a positive polarity and a negative polarity meet, the left polarity is taken. In a compound sentence, adversative (turning) conjunctions are marked as R and ordinary conjunctions are marked as L; L does not change the polarity, and R takes the polarity on its right.
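A small sketch can make the symbolic calculation concrete, though the rule as translated is ambiguous in places, so several choices below are assumptions rather than the patented rule: the symbol `"D"` stands in for the degree-word marker, neutral (0) and degree labels are assumed to pass the other operand's polarity through unchanged, `"!"` is assumed to flip it, two adjacent modifiers are reduced by flip parity, and an L conjunction is assumed to fall back to the simple-sentence rule.

```python
NEG = "!"   # negation marker from the rule
DEG = "D"   # hypothetical stand-in for the degree-word marker


def combine(left, right):
    """Combine the labels of two sibling nodes (assumed semantics)."""
    modifiers = (NEG, DEG)
    if left in modifiers and right in modifiers:
        # Two modifiers in a row: keep a flip only if exactly one is NEG.
        return NEG if (left == NEG) != (right == NEG) else DEG
    if left == NEG:
        return -right            # negation reverses the polarity
    if right == NEG:
        return -left
    if left == DEG:
        return right             # degree words keep the polarity
    if right == DEG:
        return left
    # Both numeric: neutral gives way, otherwise the left polarity wins.
    return left if left != 0 else right


def join_clauses(left, connective, right):
    """Join two clause polarities with an L or R conjunction."""
    if connective == "R":
        return right             # adversative: take the right polarity
    return combine(left, right)  # assumption for ordinary conjunctions


not_good = combine(NEG, 1)
good_but_bad = join_clauses(1, "R", -1)
```

Under these assumptions a negated positive word yields -1, and a positive clause followed by an adversative conjunction and a negative clause yields -1 as well.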

Fig. 3 illustrates the high-frequency vocabulary sequence generation algorithm according to an embodiment of the present invention. The inputs of the algorithm are an initial sequence pattern s, the regularized text set D and the number N of words that the visualization must present; the output is the set L of high-frequency vocabulary sequence patterns. The main idea of the algorithm is to construct a sequence pattern generation tree, which in the initial state contains only the given sequence. In each iteration, the highest-frequency sequence pattern is popped from the stack and a sequence pattern containing one more word than it is sought. The new sequence pattern becomes the left child of the original pattern's node in the tree, and the original sequence pattern becomes the right child of that node. The frequency of the original sequence pattern is thereby divided into two parts: a part containing the new sequence pattern and a part not containing it. In this way the original text set is split repeatedly, producing leaf sequence patterns, until the number of words that remain to be displayed reaches 0.
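The splitting loop can be sketched as follows. This is an illustrative reading, not the pseudocode of Fig. 3: the names `grow_patterns` and `is_subsequence` are hypothetical, a pattern is treated as an ordered (not necessarily contiguous) subsequence of a text, the frequency of a pattern is taken as the number of texts in its group, candidates are found by brute-force insertion of one word, and a priority queue stands in for the structure from which the highest-frequency pattern is popped.

```python
import heapq
from itertools import count


def is_subsequence(pattern, tokens):
    """True if the pattern's words occur in tokens in the same order."""
    remaining = iter(tokens)
    return all(word in remaining for word in pattern)


def grow_patterns(initial, texts, n_words):
    """Split the text set into leaf sequence patterns, adding n_words words."""
    tie = count()  # tie-breaker so the heap never compares patterns
    heap = [(-len(texts), next(tie), tuple(initial), list(texts))]
    finished = []
    while n_words > 0 and heap:
        _, _, pattern, group = heapq.heappop(heap)   # most frequent leaf
        best, best_hits = None, []
        vocabulary = sorted({w for t in group for w in t} - set(pattern))
        for word in vocabulary:
            for pos in range(len(pattern) + 1):
                candidate = pattern[:pos] + (word,) + pattern[pos:]
                hits = [t for t in group if is_subsequence(candidate, t)]
                if len(hits) > len(best_hits):
                    best, best_hits = candidate, hits
        if best is None:                              # nothing left to add
            finished.append((pattern, len(group)))
            continue
        rest = [t for t in group if not is_subsequence(best, t)]
        # Left child: texts containing the new pattern.
        heapq.heappush(heap, (-len(best_hits), next(tie), best, best_hits))
        # Right child: texts that only match the original pattern.
        if rest:
            heapq.heappush(heap, (-len(rest), next(tie), pattern, rest))
        n_words -= 1
    finished.extend((p, len(g)) for _, _, p, g in heap)
    return sorted((f for f in finished if f[0]), key=lambda f: -f[1])


corpus = [
    ["phone", "is", "good"],
    ["phone", "is", "good"],
    ["phone", "is", "bad"],
    ["screen", "broke"],
]
leaf_patterns = grow_patterns((), corpus, n_words=3)
```

With a budget of three words on this toy corpus, the loop grows the pattern "is", then "phone is", then splits it into "phone is good" (two texts) and the remaining "phone is" group (one text); texts that match no non-empty pattern are dropped from the result.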

Claims

1. A social media text visualization method for preserving unstructured text semantics is characterized by comprising the following steps:

step S102, constructing a syntax binary tree based on the dependency relationship between part of speech labels and vocabularies, calculating the emotion polarity of each text, and dividing a text set into a positive type and a negative type;

the specific method for building the syntax binary tree comprises the following steps: first, creating an empty stack and reading in the first word of the sentence; second, if there is no next word, jumping to the fifth step, otherwise reading in the next word; third, checking the dependency relationship between the two nodes at the top of the stack, and if a dependency exists, generating a parent node, calculating the part of speech of the parent node according to the emotion calculation rule, and entering the next step; if no dependency exists, jumping to the second step; fourth, if the stack still holds at least two nodes at this moment, jumping to the third step, otherwise jumping to the second step; fifth, outputting the emotion polarity of the node in the stack, namely the emotion polarity of the whole text; wherein the emotion calculation rule in the third step is as follows: in a simple sentence, words expressing positive emotion are marked as 1, words expressing negative emotion are marked as -1, neutral words are marked as 0, degree words are marked as one, and negation words are marked as "!"; during calculation, 0 and "!" change the polarity to its reverse, and when a positive polarity and a negative polarity meet, the left polarity is taken; in a compound sentence, adversative (turning) conjunctions are marked as R and ordinary conjunctions are marked as L, L does not change the polarity, and R takes the polarity on its right;

the specific method for generating the vocabulary sequence patterns comprises the following steps: in the initial state, the sequence pattern generation tree contains only the given sequence; in each iteration, the highest-frequency sequence pattern is popped and a sequence pattern containing one more word than it is sought; the new sequence pattern becomes the left child of the original pattern's node in the tree, and the original sequence pattern becomes the right child of that node; the frequency of the original sequence pattern is divided into two parts, namely a part containing the new sequence pattern and a part not containing the new sequence pattern; and the process repeats until the number of words that remain to be displayed is 0;

2. The social media text visualization method for preserving unstructured text semantics as claimed in claim 1, wherein the specific method for calculating the emotion polarity of a single text in step S102 is: first, using a syntactic parser to parse the text and obtain the dependency relationships among words and the emotion polarity of each word; then constructing a syntax binary tree for the sentence based on the obtained dependency relationships; and converting the judgment of sentence emotion into a symbolic calculation over the tree, using the dependency relationships among words and a rule-based method.

3. The method for visualizing social media text with preserved unstructured text semantics as claimed in claim 1, wherein the specific method for allocating visual space and designing visual interface in step S104 is as follows: in the two types of texts with positive and negative polarities, the text with larger weight is positioned above, the text with smaller weight is positioned below, and occupies the area proportion corresponding to the weight ratio, the positive and negative text sets adopt edges with different colors to connect nodes, and the occurrence frequency is secondarily coded by using the font size and the transparency.

4. The method for visualizing the text of the social media with the unstructured text semantics maintained as claimed in claim 1, wherein the specific method of the layout algorithm in the step S105 is as follows: the horizontal layout sequence of the sequence patterns is consistent with the sequence in the sequence patterns, and if the two pattern sequences belong to the subsequences of one pattern sequence, the two pattern sequences are vertically arranged in layout.

5. The method as claimed in claim 1, wherein the specific method of the interaction design in step S106 is: in the initial state, the model displays a composite view of all sequence patterns; when a user hovers the mouse over a word, the words that belong to the same sequence pattern as that word are highlighted and the remaining words are dimmed, so as to clearly display the semantics of that sequence pattern; and at the same time, the model displays, in a floating layer, the text that contains the sequence pattern and has the highest weight, so as to reveal more detailed information.