CN110781289B - Text visualization method for reserving unstructured text semantics - Google Patents

Publication number
CN110781289B
Authority
CN
China
Prior art keywords
text
sequence
pattern
vocabulary
polarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911081479.5A
Other languages
Chinese (zh)
Other versions
CN110781289A (en)
Inventor
周锋
汪文君
李小勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201911081479.5A
Publication of CN110781289A
Application granted
Publication of CN110781289B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a social media text visualization method that preserves the semantics of unstructured text, comprising the following steps: step S101, performing word segmentation, filtering, and part-of-speech tagging on the input text and obtaining the dependency relationships between words; step S102, constructing a syntax binary tree based on the part-of-speech tags and the dependency relationships between words, calculating the emotion polarity of each text, and dividing the text set into a positive class and a negative class; step S103, generating vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence of words within each text, so that semantics are preserved; step S104, allocating visual space according to the weights of the positive and negative text sets, and designing the visual fonts and colors; step S105, using a layout algorithm to show the semantic relationships within each sequence pattern and between sequence patterns; and step S106, introducing interaction design so that the user can focus on local details. The method visualizes social media text and clearly presents the emotional orientation, the opinion semantics, and the degree of public support expressed in the text, so that text information is presented effectively and text analysis is facilitated.

Description

Text visualization method for reserving unstructured text semantics
Technical Field
The invention relates to the technical field of data visualization, and in particular to a text visualization method that preserves the semantics of unstructured text.
Background
The traditional disciplines related to data visualization include scientific visualization and information visualization. Their purpose is to extract information and insight from big data and to display it in an intuitive way. Within visualization, the visualization of text information is an important research branch. Text information visualization displays, vividly and intuitively, the semantic features contained in a large amount of text, such as word frequency, word importance, the logical structure of a text, topic clustering across multiple texts, and the trend of topic change over time.
A typical text visualization technique is the word cloud (or tag cloud), in which extracted keywords are sorted by some rule (e.g., word frequency), laid out according to some rule, and distinguished by graphic attributes such as font size, color, or typeface, so that the keywords are visualized. Once topic popularity could be perceived well, research attention turned to showing the semantics contained in the text, namely its logical structure and narrative mode. A series of visualization models for text semantic structure has since been proposed. For example, DAViewer shows the narrative structure of a text as a tree to visualize its semantics, and uses lists to show similarity statistics among texts, the retrieval structure of the text, and the specific text content; DocuBurst shows the semantic structure of a text as a radial, circular layout.
Existing visualization models achieve good results to a certain extent, but each attends either to perception of the overall content or to reflecting the text semantics, which limits its text analysis capability. The invention provides a novel visual structure that conveys emotional tendency while preserving the semantic content of unstructured text, and displays it to users in an intuitive, visual way, so that public opinion analysts and ordinary users can better perceive the text information.
Disclosure of Invention
In view of this, the present invention provides a text visualization method that preserves the semantics of unstructured text, comprising the following steps:
step S101, performing word segmentation, filtering, and part-of-speech tagging on the input text, and obtaining the dependency relationships between words;
step S102, calculating the emotion polarity of each text based on the part-of-speech tags and the dependency relationships between words, and dividing the text set into a positive class and a negative class;
step S103, generating vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence of words within each text, so that semantics are preserved;
step S104, allocating visual space according to the weights of the positive and negative text sets, and designing the visual fonts and colors;
step S105, using a layout algorithm to show the semantic relationships within each sequence pattern and between sequence patterns;
step S106, introducing interaction design so that the user can focus on local details.
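As a small illustration of the hand-off between steps S102 and S104, the sketch below splits texts that already carry a polarity into the two classes and computes a weight for each. It assumes that a class's weight is simply its share of the texts and that neutral texts are dropped; the source does not define either point, and the function name `partition_by_polarity` is hypothetical.

```python
def partition_by_polarity(scored_texts):
    """Split (text, polarity) pairs into positive and negative sets with weights.

    Assumption: the weight of a class is its share of the non-neutral texts,
    and texts with polarity 0 are ignored.
    """
    positive = [text for text, polarity in scored_texts if polarity > 0]
    negative = [text for text, polarity in scored_texts if polarity < 0]
    total = len(positive) + len(negative)
    weights = {
        "positive": len(positive) / total if total else 0.0,
        "negative": len(negative) / total if total else 0.0,
    }
    return positive, negative, weights


scored = [("great phone", 1), ("bad battery", -1), ("love it", 1), ("it is a phone", 0)]
positive, negative, weights = partition_by_polarity(scored)
print(weights)
```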
The specific method for calculating the emotion polarity of a single text in step S102 is as follows: first, a syntactic parser is used to obtain the dependency relationships between words and the emotion polarity of each individual word; then a syntax binary tree is constructed for the sentence based on the obtained dependency relationships, so that, using the dependency relationships between words and a rule-based method, judging the emotion of a sentence is converted into a symbolic calculation over the tree.
The specific method for constructing the syntax binary tree in step S102 is as follows: in the first step, create an empty stack and read in the first word of the sentence; in the second step, if there is no next word, jump to the fifth step, otherwise read in the next word; in the third step, read the dependency relationship of the two nodes on top of the stack, and if a dependency relationship exists, generate a parent node, calculate the label of the parent node according to the emotion calculation rule, and proceed to the next step, whereas if no dependency relationship exists, jump to the second step; in the fourth step, if two or more nodes remain in the stack, jump to the third step, otherwise jump to the second step; in the fifth step, output the emotion polarity of the node in the stack, which is the emotion polarity of the whole text.
The specific method for generating the vocabulary sequence patterns in step S103 is as follows: in the initial state, the sequence pattern spanning tree contains only the given sequence. In each iteration, the highest-frequency sequence pattern is popped and a sequence pattern containing one more word than it is sought. The new sequence pattern becomes the left child of the original pattern's node in the tree, and the original sequence pattern becomes its right child. The frequency of the original sequence pattern is thereby divided into two parts, namely the part containing the new sequence pattern and the part not containing it. This process repeats until the number of words still to be displayed becomes 0.
The specific method for allocating the visual space and designing the visual interface in step S104 is as follows: of the two classes of text with positive and negative polarity, the class with the larger weight is placed above and the class with the smaller weight below, each occupying a share of the area proportional to its weight; the positive and negative text sets use edges of different colors to connect nodes, and the occurrence frequency is redundantly encoded by font size and transparency.
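The allocation rule can be sketched as follows. This is only an illustration under stated assumptions: the canvas is split along its height, the scales are linear, and the names `allocate_regions` and `encode_frequency` as well as the font-size and opacity ranges are hypothetical rather than taken from the source.

```python
def allocate_regions(pos_weight, neg_weight, height):
    """Place the heavier polarity class on top and split the height by weight."""
    total = pos_weight + neg_weight
    if total <= 0:
        raise ValueError("at least one class must have a positive weight")
    # sorted() is stable, so on a tie the positive class stays on top.
    ordered = sorted(
        [("positive", pos_weight), ("negative", neg_weight)],
        key=lambda item: -item[1],
    )
    regions, top = [], 0.0
    for polarity, weight in ordered:
        region_height = height * weight / total
        regions.append({"polarity": polarity, "top": top, "height": region_height})
        top += region_height
    return regions


def encode_frequency(freq, max_freq, font_range=(12.0, 48.0), opacity_range=(0.4, 1.0)):
    """Map a word frequency to a font size and an opacity on linear scales."""
    ratio = freq / max_freq
    font_size = font_range[0] + ratio * (font_range[1] - font_range[0])
    opacity = opacity_range[0] + ratio * (opacity_range[1] - opacity_range[0])
    return font_size, opacity


regions = allocate_regions(pos_weight=30, neg_weight=70, height=600)
font_size, opacity = encode_frequency(freq=5, max_freq=10)
```

Encoding frequency twice, by size and by opacity, keeps frequent words legible even when neighbouring words have similar sizes.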
The specific method of the layout algorithm in step S105 is as follows: a force-directed layout is adopted, in which the horizontal order of the words of a sequence pattern is consistent with their order within the sequence pattern, and if two pattern sequences are both subsequences of the same pattern sequence, the two pattern sequences are arranged vertically in the layout.
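One way to read these two layout rules is as constraints handed to a constraint-aware force-directed layout. The sketch below only builds plain constraint records; the field names (`axis`, `left`, `right`, `gap`), the `parent_of` mapping, and the row-index convention are hypothetical, and nodes are identified by their words for brevity, whereas a real layout would need distinct node ids.

```python
from collections import defaultdict


def horizontal_constraints(pattern, gap=40):
    """Require each word of a pattern to sit to the left of the word that follows it."""
    return [
        {"axis": "x", "left": left, "right": right, "gap": gap}
        for left, right in zip(pattern, pattern[1:])
    ]


def vertical_rows(parent_of):
    """Give patterns that share a parent pattern consecutive row indices.

    parent_of maps each pattern (a tuple of words) to the pattern it was
    grown from; siblings are stacked vertically in insertion order.
    """
    siblings = defaultdict(list)
    for child, parent in parent_of.items():
        siblings[parent].append(child)
    return {
        child: row
        for children in siblings.values()
        for row, child in enumerate(children)
    }


constraints = horizontal_constraints(("i", "love", "cats"))
rows = vertical_rows({("i", "love"): ("i",), ("i", "hate"): ("i",)})
```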
The specific method of the interaction design in step S106 is as follows: in the initial state, the model displays a composite view of all sequence patterns. When the user hovers the mouse over a word, the words that belong to the same sequence pattern as that word are highlighted and the remaining words are dimmed, so that the semantics of that sequence pattern are shown clearly. At the same time, the model displays, in a floating layer, the text that contains the sequence pattern and has the highest weight, so as to reveal more detailed information.
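The selection logic behind this interaction can be sketched independently of any drawing library. The sketch assumes that each text carries a numeric weight, that a pattern matches a text when it occurs as an ordered subsequence, and that a word belonging to several patterns selects the first one; the function name `on_hover` and the returned field names are hypothetical.

```python
def is_subsequence(pattern, tokens):
    """True if the words of pattern occur in tokens in the same order."""
    remaining = iter(tokens)
    return all(word in remaining for word in pattern)


def on_hover(word, patterns, weighted_texts):
    """Decide what to highlight, what to dim, and which text to show in the tooltip."""
    pattern = next((p for p in patterns if word in p), None)
    if pattern is None:
        return None
    all_words = {w for p in patterns for w in p}
    highlighted = set(pattern)
    matching = [(weight, tokens) for weight, tokens in weighted_texts
                if is_subsequence(pattern, tokens)]
    best = max(matching, key=lambda item: item[0], default=None)
    return {
        "highlight": highlighted,
        "dim": all_words - highlighted,
        "tooltip": best[1] if best else None,
    }


patterns = [("i", "love"), ("we", "hate")]
weighted_texts = [
    (3, ["i", "really", "love", "tea"]),
    (5, ["i", "love", "cats"]),
    (9, ["we", "hate", "rain"]),
]
state = on_hover("love", patterns, weighted_texts)
```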
Drawings
Some specific embodiments of the invention will be described in detail hereinafter, by way of illustration and not limitation, with reference to the accompanying drawings. The steps and main algorithms of the method are described in the drawings in the form of pseudocode. Those skilled in the art will appreciate that these figures are not necessarily straightforward to implement. The objects and features of the present invention will become more apparent in view of the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the visualization algorithm of a text visualization method for preserving unstructured text semantics according to an embodiment of the present invention.
Fig. 2 is a syntax binary tree construction algorithm according to an embodiment of the present invention.
Fig. 3 is a vocabulary sequence pattern generation algorithm according to an embodiment of the present invention.
Detailed Description
In order to make the gist of the present invention more comprehensible, the invention is further described below with reference to the accompanying drawings and examples. In the following description, numerous specific details and examples are set forth in order to provide a more thorough understanding of the present invention. The invention can be embodied in many forms other than those described herein, and those skilled in the art will appreciate that it is not limited to the specific examples and figures disclosed below, since various modifications can be made without departing from the scope of the invention.
While the present invention has been described with reference to the particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It will be understood by those skilled in the art that variations and modifications of the embodiments of the present invention can be made without departing from the scope and spirit of the invention.
Fig. 1 shows a flowchart of the visualization algorithm of a text visualization method for preserving unstructured text semantics according to an embodiment of the present invention. The method comprises the following steps. Step S101: perform word segmentation, filtering, and part-of-speech tagging on the input text. Step S102: construct a syntax tree based on the part-of-speech tags and the dependency relationships between words, calculate the emotion polarity of each parent node from the bottom up according to the calculation rule until the root node is reached, whose polarity is the emotion polarity of the text, and divide the text set into a positive class and a negative class. Step S103: generate vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence of words within each text, so that semantics are preserved. Step S104: allocate visual space according to the weights of the positive and negative text sets, and design the visual fonts and colors. Step S105: use a layout algorithm to show the semantic relationships within each sequence pattern and between sequence patterns. Step S106: introduce interaction design so that the user can focus on local details.
In the implementation, an existing word segmentation tool is used to segment the text, filter stop words, tag parts of speech, and obtain the dependency relationships between words, and the visualization is drawn with a force-directed method based on d3.js and cola.js.
Fig. 2 shows the construction algorithm of the syntax binary tree according to an embodiment of the present invention. The input of the algorithm is a single segmented text together with the dependency relationships between its words, and the output is the emotion polarity of the text. The algorithm has five steps. In the first step, an empty stack is created and the first word of the sentence is read in. In the second step, if there is no next word, jump to the fifth step; otherwise read in the next word. In the third step, read the dependency relationship of the two nodes on top of the stack; if a dependency relationship exists, generate a parent node, calculate the label of the parent node according to the emotion calculation rule, and proceed to the next step; if no dependency relationship exists, jump to the second step. In the fourth step, if two or more nodes remain in the stack at this point, jump to the third step; otherwise jump to the second step. In the fifth step, output the emotion polarity of the node in the stack, which is the emotion polarity of the whole text.
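The five steps read like a shift-reduce pass over the sentence, which the following sketch tries to capture. It rests on several assumptions that the text does not state: words are identified by their index, dependencies are given as (head, dependent) index pairs, a merged node is represented by its governing word, and any nodes left over at the end are folded left to right. The names `Node`, `reduce_sentence`, and `toy_combine` are hypothetical, and `toy_combine` is only a stand-in for the emotion calculation rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple, Union

Label = Union[int, str]  # 1, -1, 0, or a marker such as "!"


@dataclass
class Node:
    head: int     # index of the word that governs this subtree
    label: Label  # emotion label computed for the subtree


def reduce_sentence(labels: List[Label],
                    arcs: Set[Tuple[int, int]],
                    combine: Callable[[Label, Label], Label]) -> Label:
    """Shift-reduce over one sentence and return the label of the final node."""
    if not labels:
        raise ValueError("empty sentence")
    stack = [Node(0, labels[0])]                     # step 1: first word
    for index in range(1, len(labels)):              # step 2: read the next word
        stack.append(Node(index, labels[index]))
        while len(stack) >= 2:                       # step 4: two or more nodes remain
            left, right = stack[-2], stack[-1]
            if (left.head, right.head) in arcs:      # step 3: left governs right
                head = left.head
            elif (right.head, left.head) in arcs:    # step 3: right governs left
                head = right.head
            else:
                break                                # no dependency: back to step 2
            del stack[-2:]
            stack.append(Node(head, combine(left.label, right.label)))
    # Step 5. Assumption: nodes that were never merged are folded left to right.
    result = stack[0].label
    for node in stack[1:]:
        result = combine(result, node.label)
    return result


def toy_combine(left: Label, right: Label) -> Label:
    """Stand-in rule for the demo: "!" flips a numeric label, 0 defers to the other side."""
    if left == "!" and isinstance(right, int):
        return -right
    if right == "!" and isinstance(left, int):
        return -left
    return left if left != 0 else right


# "not good": word 1 ("good") governs word 0 ("not").
polarity = reduce_sentence(["!", 1], {(1, 0)}, toy_combine)
print(polarity)
```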
The emotion calculation rules in the preferred embodiment are as follows. Within a single sentence, words expressing positive emotion are marked 1, words expressing negative emotion are marked -1, neutral words are marked 0, degree words are marked as one, and negation words are marked !. During calculation, 0 and ! change the polarity to its reverse, and when a positive polarity and a negative polarity meet, the left polarity is taken. In a compound sentence, adversative (turning) words are marked R and ordinary conjunctions are marked L; L does not change the polarity, and R takes the right polarity.
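A possible encoding of these rules as a pairwise combination function is sketched below. The translated rule text is ambiguous about how neutral and degree labels behave, so the sketch makes its own assumptions and should not be read as the patented rule: only "!" reverses a numeric polarity, neutral (0) and degree labels leave the other operand unchanged, and an "L" conjunction falls back to the single-sentence rule. The names `NEG`, `DEG`, `combine`, and `combine_clauses` are hypothetical.

```python
NEG = "!"    # negation marker, as in the rule text
DEG = "deg"  # placeholder for the degree-word marker (assumed spelling)


def combine(left, right):
    """Combine the labels of two sibling nodes into the label of their parent."""
    # Assumption: negation reverses a numeric polarity and otherwise survives as "!".
    for marker, other in ((left, right), (right, left)):
        if marker == NEG:
            return -other if isinstance(other, int) else NEG
    # Assumption: neutral and degree labels leave the other operand unchanged.
    for marker, other in ((left, right), (right, left)):
        if marker == DEG or marker == 0:
            return other
    # Positive meets negative: take the left polarity (equal polarities also land here).
    return left


def combine_clauses(left, connective, right):
    """Combine two clause polarities joined by an "L" or "R" connective."""
    if connective == "R":   # adversative word: the right clause wins
        return right
    return combine(left, right)  # assumption: "L" defers to the ordinary rule


print(combine(NEG, 1), combine(1, -1), combine_clauses(1, "R", -1))
```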
Fig. 3 illustrates the high-frequency vocabulary sequence generation algorithm according to an embodiment of the present invention. The inputs of the algorithm are an initial sequence pattern s, the normalized text set D, and the number N of words that the visualization should present; the output is the set L of high-frequency vocabulary sequence patterns. The main idea of the algorithm is to construct a sequence pattern spanning tree, which in the initial state contains only the given sequence. In each iteration, the highest-frequency sequence pattern is popped off the stack and a sequence pattern containing one more word than it is sought. The new sequence pattern becomes the left child of the original pattern's node in the tree, and the original pattern sequence becomes its right child. The frequency of the original pattern sequence is thereby divided into two parts, namely the part containing the new sequence pattern and the part not containing it. In this way the original text set is split again and again, continually generating leaf pattern sequences, until the number of words still to be displayed becomes 0.
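The splitting process can be sketched as follows. This is an interpretation rather than the algorithm of Fig. 3: it assumes a pattern matches a text when its words occur in order (as a subsequence), uses a priority queue keyed on support in place of the stack mentioned above, counts each grown word against N, and breaks ties alphabetically. The function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def leftmost_match(pattern, tokens):
    """Positions of the leftmost occurrence of pattern as a subsequence, or None."""
    positions, start = [], 0
    for word in pattern:
        try:
            start = tokens.index(word, start) + 1
        except ValueError:
            return None
        positions.append(start - 1)
    return positions


def best_extension(pattern, texts):
    """Most frequent (gap, word) that grows pattern by one word, or None."""
    counts = Counter()
    for tokens in texts:
        bounds = [-1] + leftmost_match(pattern, tokens) + [len(tokens)]
        seen = set()
        for gap in range(len(pattern) + 1):
            for word in tokens[bounds[gap] + 1:bounds[gap + 1]]:
                seen.add((gap, word))
        counts.update(seen)  # each text votes once per candidate
    if not counts:
        return None
    return min(counts.items(), key=lambda item: (-item[1], item[0]))[0]


def generate_patterns(texts, n_words, seed=()):
    """Split the text set until n_words words have been added; return leaf patterns."""
    order = count()  # tie-breaker so the heap never compares patterns or text lists
    matched = [t for t in texts if leftmost_match(seed, t) is not None]
    heap = [(-len(matched), next(order), tuple(seed), matched)]
    leaves, remaining = [], n_words
    while heap and remaining > 0:
        _, _, pattern, support = heapq.heappop(heap)
        extension = best_extension(pattern, support)
        if extension is None:
            leaves.append((pattern, support))
            continue
        gap, word = extension
        grown = pattern[:gap] + (word,) + pattern[gap:]
        with_new = [t for t in support if leftmost_match(grown, t) is not None]
        without = [t for t in support if leftmost_match(grown, t) is None]
        heapq.heappush(heap, (-len(with_new), next(order), grown, with_new))   # left child
        if without:
            heapq.heappush(heap, (-len(without), next(order), pattern, without))  # right child
        remaining -= 1
    leaves.extend((pattern, support) for _, _, pattern, support in heap)
    return [(pattern, len(support)) for pattern, support in leaves if pattern]


texts = [["i", "love", "cats"], ["i", "love", "dogs"],
         ["i", "hate", "rain"], ["we", "love", "cats"]]
result = generate_patterns(texts, n_words=2)
print(result)
```

Each split moves the texts that contain the grown pattern to the left child and leaves the rest with the original pattern, so the supports of the leaves always add up to the number of texts that have been assigned a non-empty pattern.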

Claims (5)

1. A social media text visualization method for preserving unstructured text semantics, characterized by comprising the following steps:
step S101, performing word segmentation, filtering, and part-of-speech tagging on the input text, and obtaining the dependency relationships between words;
step S102, constructing a syntax binary tree based on the part-of-speech tags and the dependency relationships between words, calculating the emotion polarity of each text, and dividing the text set into a positive class and a negative class;
wherein the specific method for constructing the syntax binary tree is as follows: in the first step, create an empty stack and read in the first word of the sentence; in the second step, if there is no next word, jump to the fifth step, otherwise read in the next word; in the third step, read the dependency relationship of the two nodes on top of the stack, and if a dependency relationship exists, generate a parent node, calculate the label of the parent node according to the emotion calculation rule, and proceed to the next step, whereas if no dependency relationship exists, jump to the second step; in the fourth step, if two or more nodes remain in the stack at this point, jump to the third step, otherwise jump to the second step; in the fifth step, output the emotion polarity of the node in the stack, which is the emotion polarity of the whole text; wherein the emotion calculation rule in the third step is as follows: within a single sentence, words expressing positive emotion are marked 1, words expressing negative emotion are marked -1, neutral words are marked 0, degree words are marked as one, and negation words are marked !; during calculation, 0 and ! change the polarity to its reverse, and when a positive polarity and a negative polarity meet, the left polarity is taken; in a compound sentence, adversative (turning) words are marked R and ordinary conjunctions are marked L, L does not change the polarity, and R takes the right polarity;
step S103, generating vocabulary sequence patterns separately for the positive and negative texts, based on word frequency and the co-occurrence of words within each text, so that semantics are preserved;
wherein the specific method for generating the vocabulary sequence patterns is as follows: in the initial state, the sequence pattern spanning tree contains only the given sequence; in each iteration, the highest-frequency sequence pattern is popped and a sequence pattern containing one more word than it is sought; the new sequence pattern becomes the left child of the original pattern's node in the tree, and the original pattern sequence becomes its right child; the frequency of the original pattern sequence is divided into two parts, namely the part containing the new sequence pattern and the part not containing it; and this process repeats until the number of words still to be displayed is 0;
step S104, allocating visual space according to the weights of the positive and negative text sets, and designing the visual fonts and colors;
step S105, using a layout algorithm to show the semantic relationships within each sequence pattern and between sequence patterns;
step S106, introducing interaction design so that the user can focus on local details.
2. The social media text visualization method for preserving unstructured text semantics as claimed in claim 1, wherein the specific method for calculating the emotion polarity of a single text in step S102 is as follows: first, a syntactic parser is used to obtain the dependency relationships between words and the emotion polarity of each individual word; then a syntax binary tree is constructed for the sentence based on the obtained dependency relationships, so that, using the dependency relationships between words and a rule-based method, judging the emotion of a sentence is converted into a symbolic calculation over the tree.
3. The social media text visualization method for preserving unstructured text semantics as claimed in claim 1, wherein the specific method for allocating the visual space and designing the visual interface in step S104 is as follows: of the two classes of text with positive and negative polarity, the class with the larger weight is placed above and the class with the smaller weight below, each occupying a share of the area proportional to its weight; the positive and negative text sets use edges of different colors to connect nodes, and the occurrence frequency is redundantly encoded by font size and transparency.
4. The social media text visualization method for preserving unstructured text semantics as claimed in claim 1, wherein the specific method of the layout algorithm in step S105 is as follows: the horizontal order of the words of a sequence pattern is consistent with their order within the sequence pattern, and if two pattern sequences are both subsequences of one pattern sequence, the two pattern sequences are arranged vertically in the layout.
5. The social media text visualization method for preserving unstructured text semantics as claimed in claim 1, wherein the specific method of the interaction design in step S106 is as follows: in the initial state, the model displays a composite view of all sequence patterns; when the user hovers the mouse over a word, the words that belong to the same sequence pattern as that word are highlighted and the remaining words are dimmed, so that the semantics of that sequence pattern are shown clearly; at the same time, the model displays, in a floating layer, the text that contains the sequence pattern and has the highest weight, so as to reveal more detailed information.
CN201911081479.5A 2019-11-07 2019-11-07 Text visualization method for reserving unstructured text semantics Expired - Fee Related CN110781289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911081479.5A CN110781289B (en) 2019-11-07 2019-11-07 Text visualization method for reserving unstructured text semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911081479.5A CN110781289B (en) 2019-11-07 2019-11-07 Text visualization method for reserving unstructured text semantics

Publications (2)

Publication Number Publication Date
CN110781289A CN110781289A (en) 2020-02-11
CN110781289B true CN110781289B (en) 2022-07-15

Family

ID=69390046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911081479.5A Expired - Fee Related CN110781289B (en) 2019-11-07 2019-11-07 Text visualization method for reserving unstructured text semantics

Country Status (1)

Country Link
CN (1) CN110781289B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523289B (en) * 2020-04-24 2023-05-09 支付宝(杭州)信息技术有限公司 Text format generation method, device, equipment and readable medium
CN111859146B (en) * 2020-07-30 2024-02-23 网易(杭州)网络有限公司 Information mining method and device and electronic equipment
CN116522901B (en) * 2023-06-29 2023-09-15 金锐同创(北京)科技股份有限公司 Method, device, equipment and medium for analyzing attention information of IT community

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955451A (en) * 2014-05-15 2014-07-30 北京优捷信达信息科技有限公司 Method for judging emotional tendentiousness of short text
CN105930368A (en) * 2016-04-13 2016-09-07 深圳大学 Emotion classification method and system
CN107305539A (en) * 2016-04-18 2017-10-31 南京理工大学 A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN110287319A (en) * 2019-06-13 2019-09-27 南京航空航天大学 Students' evaluation text analyzing method based on sentiment analysis technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2571373C2 (en) * 2014-03-31 2015-12-20 Общество с ограниченной ответственностью "Аби ИнфоПоиск" Method of analysing text data tonality

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
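Keyword Based Tweet Extraction and Detection of Related Topics; Amrutha Benny and Mintu Philip; Procedia Computer Science; 20150423; Vol. 46; pp. 364-371 *
Sentiment analysis of Chinese microblogs based on extended dictionaries and rules; Li Jidong (李继东); China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology; 20180615 (No. 06, 2018); I138-2168 *
Parallel graph layout algorithm for large-scale graph data; Cheng Zhiyuan (程致远) et al.; Big Data Research (《大数据》); 20160531; Vol. 2 (No. 5); pp. 12-21 *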

Also Published As

Publication number Publication date
CN110781289A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
Cui et al. Text-to-viz: Automatic generation of infographics from proportion-related natural language statements
Van Ham et al. Mapping text with phrase nets
Cao et al. Introduction to text visualization
CN110781289B (en) Text visualization method for reserving unstructured text semantics
CN102016837B (en) System and method for classification and retrieval of Chinese-type characters and character components
US7904455B2 (en) Cascading cluster collages: visualization of image search results on small displays
KR20210116379A (en) Method, apparatus for text generation, device and storage medium
US9177407B2 (en) Method and system for assembling animated media based on keyword and string input
TWI472933B (en) Method and computer program products for reconstruction of lists in a document
Smith et al. Evaluating visual representations for topic understanding and their effects on manually generated topic labels
Dunst et al. The graphic narrative corpus (GNC): design, annotation, and analysis for the digital humanities
Merlino et al. 25 An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News
Ying et al. MetaGlyph: Automatic generation of metaphoric glyph-based visualization
CN109508448A (en) Short information method, medium, device are generated based on long article and calculate equipment
Karim et al. A step towards information extraction: Named entity recognition in Bangla using deep learning
CN108491381B (en) Syntax analysis method of Chinese binary structure
CN110008807A (en) A kind of training method, device and the equipment of treaty content identification model
JP2008009671A (en) Data display device, data display method and data display program
Van Enschot et al. Taming our wild data: On intercoder reliability in discourse research
Riehmann et al. Visualizing a thinker's life
Ishihara et al. Analyzing visual layout for a non-visual presentation-document interface
CN113240485A (en) Training method of text generation model, and text generation method and device
Tohalino et al. Using citation networks to evaluate the impact of text length on the identification of relevant concepts
McLean Davies et al. Reading in the (post) digital age: Large digital databases and the future of literature in secondary classrooms.
JP2009140113A (en) Dictionary editing device, dictionary editing method, and computer program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220715