CN116521858A - Context semantic sequence comparison method based on dynamic clustering and visualization - Google Patents

Publication number
CN116521858A
Authority
CN
China
Prior art keywords
context
word
sequence
clustering
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310445169.7A
Other languages
Chinese (zh)
Other versions
CN116521858B (en)
Inventor
马滨
任军霞
李响
唐嘉成
仇斌杰
赵建波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tiandao Jinke Co ltd
Zhejiang Zhelixin Credit Reporting Co ltd
Original Assignee
Tiandao Jinke Co ltd
Zhejiang Zhelixin Credit Reporting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tiandao Jinke Co ltd, Zhejiang Zhelixin Credit Reporting Co ltd filed Critical Tiandao Jinke Co ltd
Priority to CN202310445169.7A priority Critical patent/CN116521858B/en
Publication of CN116521858A publication Critical patent/CN116521858A/en
Application granted granted Critical
Publication of CN116521858B publication Critical patent/CN116521858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a context semantic sequence comparison method based on dynamic clustering and visualization. It provides a system, ContextWing, that supports pairwise comparison of continuously evolving context sequence patterns between two data streams. The computational model generates dynamic topics and sequence patterns and computes public attention and pairwise relatedness. The system also includes a novel multi-layer, two-wing visual metaphor that intuitively displays sequence patterns merged across different contexts, revealing the temporal and semantic differences between two sequences. Interactive tools support selecting a center word and its context keywords to iteratively generate patterns for focused exploration. In addition, the system supports both static and streaming analysis and a wide range of application scenarios.

Description

Context semantic sequence comparison method based on dynamic clustering and visualization
Technical Field
The invention relates to the technical field of data analysis, in particular to a context semantic sequence comparison method based on dynamic clustering and visualization.
Background
With the rapid development of social media, many people express their views and spread important news by posting messages. These messages arrive as a data stream, and the collection of tweets containing the same keyword forms a social media data stream. To help social science researchers and public opinion analysts quickly make sense of large volumes of social media data, it is important to provide opinion summaries that embed social media information. A visual summary of the tweets lets users grasp the text data quickly.
Word clouds are a common way to provide visual summaries of text data. However, a word cloud offers limited contextual information and no links between keywords, so it cannot convey the meaning of sentences. We therefore extract the keyword sequences that appear in order in sentences as summaries of the tweets. Because many tweets contain the same sequence, we define such a sequence as a "pattern". For example, consider "the election debate starts at 9 on Monday evening" and "the election debate will start on Monday". People phrase things differently, but these tweets all mention the same keywords in the same order, "election-debate-Monday-start", so this frequently occurring semantic sequence is a pattern. Patterns are highly diverse, and the differences between them must be compared to understand opinions. Since patterns belong to different time periods, they must also be compared over time. Finally, to help analyze public attitudes, the relationships between patterns and different data streams must be compared. Visualization techniques can support these complex comparisons.
Visual comparison of text is a widely studied topic. However, there is currently no analysis method that supports comparing the time-varying and semantic features of sequences simultaneously, and across different data streams. First, it is difficult to combine semantic comparison with dynamic comparison in sequence analysis. Some scholars use tree structures to address sequence comparison, helping people quickly grasp the basic concepts and ideas; however, this approach is limited to static text sequence data and does not support temporal comparison. Work that supports comparing time trends across multiple tag clouds, in turn, cannot support sequence comparison because it lacks connections between keywords. It is therefore difficult to visualize temporal and semantic comparisons of sequences at the same time. Second, comparing semantics and dynamics across different data streams is challenging. Some work addresses pairwise visual comparison of multiple items between two data streams, but it cannot be applied to sequences to display more context and connections. Third, beyond historical social media data, real-time analysis of real-world streaming data is both more challenging and more important, because it requires fast modeling methods and dynamic visualization to reveal features within a short time. In general, no existing visualization technique supports comparing both the temporal and the semantic aspects of sequence patterns across two data streams while also supporting real-time pattern analysis.
Disclosure of Invention
The invention aims to realize simultaneous visualization of time and semantic comparison of text sequences and realize semantic and dynamic comparison among different data streams, and provides a context semantic sequence comparison method based on dynamic clustering and visualization.
To achieve the purpose, the invention adopts the following technical scheme:
The method comprises dynamically clustering continuously updated tweets in real-time stream data using a dynamic clustering method based on BERTopic and KMeans++, and then performing visual analysis on the dynamic stream, wherein the visual analysis specifically comprises the following steps:
S1, extracting the context keywords of a center word selected by a user by calculating the similarity between each word in the tweets and the center word; and calculating the public attention of the context keywords with respect to the center word;
S2, calculating the relatedness between the context keywords and two key entities, and visualizing it;
S3, generating semantic sequence patterns from the center word and its context keyword set by an iterative search method, and visualizing them.
Preferably, the method for dynamically clustering continuously updated tweets using the dynamic clustering method based on BERTopic and KMeans++ comprises the following steps:
A1, using the BERTopic model to identify the context keywords in the continuously updated tweets according to the center word given by the user, obtaining the context keywords to be clustered at the initial time t;
A2, initializing the clusters at time t using the KMeans++ algorithm; after the first clustering is completed, passing the cluster centers on to the clustering at time t+1;
A3, at each clustering, judging whether the top m context keywords of a cluster from one clustering time also appear in a cluster from the adjacent clustering time; if so, merging the two clusters, ranking the context keywords in the merged cluster by class-based TF-IDF score, and taking the set of the top x context keywords as the updated cluster;
A4, clustering the context keywords identified at all times by repeating steps A2-A3, and taking the context in which the top y context keywords of the finally merged clusters occur as the object of visual analysis.
Preferably, in step S1, the cosine similarity between the center word and each word in the tweets is calculated, and the top n ranked words are taken as the context keyword set.
Preferably, in step S1, the method of calculating the public attention of a context keyword of the center word comprises the following steps:
S11, calculating the public attention, which is expressed by the following formula (1):
in formula (1), k represents the center word selected by the user or the system;
c represents the context keyword;
n represents the total number of tweets in the dataset;
u_i(c, k) is an inclusion condition indicating whether the i-th tweet contains both c and k: u_i(c, k) = 1 if so, and 0 otherwise;
u_i(c, -k) indicates whether the i-th tweet contains c but not k: u_i(c, -k) = 1 if so, and 0 otherwise;
η_i indicates whether the i-th tweet has been forwarded: η_i = 1 if so, and 0 otherwise;
r_i represents the number of times the i-th tweet has been forwarded;
S12, visualizing the context keywords according to the calculated public attention.
Preferably, in step S2, the relatedness is expressed by the following formula (2):
in formula (2), the two frequency terms represent the co-occurrence frequency of context keyword i with key entity A and with key entity B, respectively, at time t;
rank represents the rank of the co-occurrence frequency difference of context keyword i among all i ∈ W_t;
N_t represents the total number of context keywords of the center word at time t;
W_t represents the set of all context keywords of the center word at time t.
Preferably, in step S3, the method for generating the semantic sequence pattern comprises the following steps:
S31, forming an initial sequence, which comprises the center word and the context keyword selected by the user, kept in the order in which they appear in the tweets;
S32, traversing each context keyword in the keyword set: for each candidate, forming a new semantic sequence by adding the candidate to the initial sequence and counting the co-occurrence frequency of that new sequence in the tweets; adding the candidate whose new sequence has the highest co-occurrence frequency to the initial sequence, thereby extending it, and removing that keyword from the keyword set;
S33, taking the new semantic sequence obtained by the extension in step S32 as the initial sequence, returning to step S31, and continuing to extend the initial sequence from the remaining keyword set until the extended sequence reaches a preset sequence length; the finally obtained semantic sequence is the generated semantic sequence pattern.
The ContextWing system provided by the invention supports pairwise comparison of continuously evolving context sequence patterns between two data streams. The computational model generates dynamic topics and sequence patterns and computes public attention and pairwise relatedness. The system also includes a novel multi-layer, two-wing visual metaphor that intuitively displays sequence patterns merged across different contexts, revealing the temporal and semantic differences between two sequences. Interactive tools support selecting a center word and its context keywords to iteratively generate patterns for focused exploration. In addition, the system supports both static and streaming analysis and a wide range of application scenarios.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a system interface diagram of social media context text visualization provided by an embodiment of the present invention;
FIG. 2 is an enlarged view of the interface of the topic view shown in area A of FIG. 1;
FIG. 3 is an enlarged view of the interface of the control view shown in area B of FIG. 1;
FIG. 4 is a partial enlarged view of the pattern view shown in area C of FIG. 1;
FIG. 5 is a histogram of the number of tweets displayed in the area a1 of FIG. 2;
FIG. 6 is an interface schematic of the dynamic word cloud shown in area a2 of FIG. 2;
FIG. 7 is an enlarged interface view of a detail view of the original tweet shown in area D of FIG. 1;
FIG. 8 is a schematic diagram of the visual metaphor design provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of a semantic merging method of a visual metaphor;
FIG. 10 is a system architecture flow diagram of a visual analysis interface;
FIG. 11 is an example diagram of a topic view.
Detailed Description
The technical scheme of the invention is further described below by the specific embodiments with reference to the accompanying drawings.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to be limiting of the present patent; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if the terms "upper", "lower", "left", "right", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, only for convenience in describing the present invention and simplifying the description, rather than indicating or implying that the apparatus or elements being referred to must have a specific orientation, be constructed and operated in a specific orientation, so that the terms describing the positional relationships in the drawings are merely for exemplary illustration and should not be construed as limiting the present patent, and that the specific meaning of the terms described above may be understood by those of ordinary skill in the art according to specific circumstances.
In the description of the present invention, unless explicitly stated and limited otherwise, the term "coupled" or the like should be interpreted broadly, as it may be fixedly coupled, detachably coupled, or integrally formed, as indicating the relationship of components; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between the two parts or interaction relationship between the two parts. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
An embodiment of the invention provides a context semantic sequence comparison method based on dynamic clustering and visualization, which comprises the following analysis process:
The invention proposes a social media context text visualization system named ContextWing, shown in FIG. 1. It integrates a computational model and a novel visual design with a symmetrical wing structure. The design connects sequences that share the same center word (e.g., "china" in FIG. 1) and merges them according to shared context keywords (e.g., "ukraine" and "trademark" in FIG. 1); after merging, color and paired layers are used to show their semantic differences and similarities more clearly. The sequences are arranged vertically from top to bottom, corresponding to different time periods. Keywords in a pattern are connected from left to right in the order in which they appear in the text. The positions and colors of the different layers encode the relatedness between semantic information and the key entities (e.g., "person A" and "person B" in a social media event), so that people's leanings toward the key entities can be understood by comparing the relationship between the semantics and the paired entities. The visual design thus lets the user make pairwise visual comparisons of the temporal and semantic features of the context at the same time, overcoming the limitations of word clouds and word trees.
The system interface in FIG. 1 comprises four regions, A, B, C and D, which correspond to four views: the topic view, the control view, the pattern view and the detail view of the original tweets, shown in FIGS. 2-4 and FIG. 7, respectively. The topic view further includes the histogram of the number of tweets shown in FIG. 5 and the dynamic word cloud shown in FIG. 6. In the topic view displayed in region A (FIG. 2), the user can select different center words and context keywords to generate semantic sequence patterns. The control view displayed in region B (FIG. 3) shows the center words and context keywords the user has selected, and its built-in reset and return functions let the user reset the selection or return to the topic view of FIG. 2. The pattern view displayed in region C (FIG. 4), i.e., the visual wing metaphor design, visualizes the generated semantic sequence patterns.
The key technical content of the invention is how to generate semantic sequence patterns, how to analyze patterns in different data streams in real time, and how to visualize the analysis results. The principles behind these key techniques are explained in the following three parts:
1. Building the computational model and analyzing data stream patterns
The built calculation model mainly bears the following calculation functions: keyword classification computation, pairwise correlation computation, public attention computation, context semantic sequence pattern generation, and data analysis of different data streams according to the generated patterns.
1. Keyword classification calculation
In a static setting (where the data is historical and not updated in real time), Word2Vec is used to obtain a vector for each word in the original tweets. (Word2Vec is a neural network model that converts words into vectors; this reduces the processing of text content to vector operations in a vector space, where similarity between vectors represents semantic similarity between texts.) The cosine similarity between vectors is then calculated to find keywords similar to the center word given by the user; a higher similarity indicates a stronger semantic relation between the two words. Because the given tweets are historical data, the center word can be specified using prior knowledge, so the clustering around it better matches experts' expectations. Since each cluster usually contains a large number of words, the invention keeps the top n most frequent words as the keywords for visualization. Similarly, the context keywords of each center word can be extracted based on the cosine similarity of the vectors, yielding the top n words with the highest similarity. To keep the topic view legible, the top 20-30 context keywords are typically selected.
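As an illustration of the top-n selection described above, the following minimal Python sketch ranks candidate words by cosine similarity to a center word. The toy three-dimensional vectors and the function names `cosine_similarity` and `top_context_keywords` are assumptions made for the example; in practice the vectors would come from a trained Word2Vec model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_context_keywords(
    center: str, vectors: Dict[str, Sequence[float]], n: int
) -> List[Tuple[str, float]]:
    """Return the n words most similar to `center`, best first."""
    center_vec = vectors[center]
    scored = [
        (word, cosine_similarity(center_vec, vec))
        for word, vec in vectors.items()
        if word != center
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy vectors standing in for Word2Vec output.
toy_vectors = {
    "china": [1.0, 0.0, 0.0],
    "trade": [0.9, 0.1, 0.0],
    "ukraine": [0.6, 0.8, 0.0],
    "weather": [0.0, 0.0, 1.0],
}
keywords = top_context_keywords("china", toy_vectors, n=2)
```

With these toy vectors, "trade" and "ukraine" are kept and "weather" is dropped, since its vector is orthogonal to that of the center word.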
The most straightforward way to quantify the relationship between a center word and its context keywords is to count how often they co-occur in the text. In practice, however, we found that co-occurrence frequency alone conveys limited information. To improve the subsequent pairwise comparison of sequences, the invention proposes using public attention to represent how closely the center word and a context keyword are related. Public attention is used to measure the distance between the center word and the context keyword, and this distance accurately reflects the popularity of the forwarded tweets. It is expressed by the following formula (2):
in formula (2), k represents the center word selected by the user or the system;
c represents a context keyword;
n represents the total number of tweets in the dataset;
u_i(c, k) is an inclusion condition indicating whether the i-th tweet contains both c and k: u_i(c, k) = 1 if so, and 0 otherwise;
u_i(c, -k) indicates whether the i-th tweet contains c but not k: u_i(c, -k) = 1 if so, and 0 otherwise;
η_i indicates whether the i-th tweet has been forwarded: η_i = 1 if so, and 0 otherwise;
r_i represents the number of times the i-th tweet has been forwarded.
In formula (2), both the numerator and the denominator are empirical estimates of the number of forwards under the respective inclusion conditions. This helps describe the distance between the center word and each context keyword: if the value is positive, c and k are more closely related and the public attention is higher; if it is negative, they are less closely related and the public attention is lower.
2. Paired correlation computation
Each event necessarily has two key entities, which are the focus of discussion and strongly influence the direction of public opinion. The invention quantifies how strongly each of the two data streams is related to a keyword according to their co-occurrence frequency (the "relatedness"), with one relatedness value for key entity A and one for key entity B.
The invention proposes a method for calculating this relatedness, given by the following formula (1):
in formula (1), the two frequency terms represent the co-occurrence frequency of context keyword i with key entity A and with key entity B, respectively, at time t;
rank represents the rank of the co-occurrence frequency difference calculated for context keyword i among all i ∈ W_t;
N_t represents the total number of context keywords of the center word at time t;
W_t represents the set of context keywords of the center word at time t.
If the relatedness to key entity A is close to 1, context keyword i is more related to key entity A, or to the data stream containing key entity A, at time t.
The calculation is illustrated by the following example:
Suppose the context keyword i is "apple", its co-occurrence frequency with key entity A (e.g., name A) is 10, and its co-occurrence frequency with key entity B (e.g., name B) is 5, so the co-occurrence frequency difference is 10 - 5 = 5. Assume that 4 words other than "apple" also have such a co-occurrence frequency difference, and that "apple" ranks second when the differences are sorted from largest to smallest; the relatedness of "apple" to key entity A then follows from formula (1).
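The ranking step in this example can be sketched as follows. Only the co-occurrence frequency difference and its rank are computed, since those are the quantities the text spells out; mapping the rank to the final relatedness value is left to formula (1). The frequencies for words other than "apple" and the function name `rank_by_cooccurrence_difference` are hypothetical.

```python
from typing import Dict, Tuple


def rank_by_cooccurrence_difference(
    freq_a: Dict[str, int], freq_b: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Map each keyword to (difference, rank), where rank 1 is the largest
    difference between its co-occurrence with entity A and with entity B."""
    words = set(freq_a) | set(freq_b)
    diffs = {w: freq_a.get(w, 0) - freq_b.get(w, 0) for w in words}
    # Sort by descending difference; ties are broken alphabetically (an assumption).
    ordered = sorted(diffs, key=lambda w: (-diffs[w], w))
    return {w: (diffs[w], position) for position, w in enumerate(ordered, start=1)}


# "apple" uses the numbers from the text; the other four words are made up.
freq_with_a = {"apple": 10, "tax": 12, "vote": 4, "rally": 3, "poll": 1}
freq_with_b = {"apple": 5, "tax": 3, "vote": 2, "rally": 3, "poll": 4}

ranking = rank_by_cooccurrence_difference(freq_with_a, freq_with_b)
n_t = len(ranking)  # total number of context keywords at time t
```

Here "apple" has a difference of 5 and ranks second among the five context keywords, matching the situation described in the example.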
3. Generating context semantic sequence patterns
To summarize the original tweets more concisely, the invention specifies that a semantic sequence consists of verbs, nouns and adjectives; the sequence length may be 4 (4 words) or adjusted as needed. A sequence that recurs is a sequence pattern. Pattern generation is a search process, as follows:
Assume the center word and the context keyword selected by the user are denoted centralKeyword and w, respectively. After the relevance calculation described above, the context keywords ranked in the top n by relevance to the center word form the keyword set. First, an initial sequence is formed, consisting of the center word centralKeyword and the context keyword w as a 2-tuple. The order of centralKeyword and w in the initial sequence matches the order in which they appear in the tweets, so the initial sequence is either w-centralKeyword or centralKeyword-w.
Then, each context keyword in the keyword set is traversed. For each candidate, a new semantic sequence is formed by adding the candidate to the initial sequence, and the co-occurrence frequency of that new sequence in the tweets is counted. The candidate whose new sequence has the highest co-occurrence frequency is taken out of the keyword set and added to the initial sequence, which changes from a 2-tuple to a 3-tuple and is thereby extended. To make the coverage of tweet text by the semantic sequence more flexible, the invention also introduces a skip value: the position of the newly added context keyword relative to the keywords in the current tuple is allowed to vary within the range of the skip value. The skip value is set according to the sequence length l. For example, assuming the sequence length l is 20, the skip value is 11, and the most recently added word of the current tuple is at position 5, the newly added context keyword may be located anywhere from position 1 to position 16. By setting the skip value, the relative distance between keywords adapts to the length of the text, so the coverage of tweet text by the semantic sequence can be set more flexibly.
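The search described above can be sketched as a greedy extension loop. This is a simplified illustration under stated assumptions, not the patented implementation: tweets are pre-tokenized lists, a sequence "occurs" in a tweet when its keywords appear in order with at most `skip` positions between neighbors, a candidate may be inserted at any position of the current sequence, and the window in `allowed_window` is taken as the last position plus or minus the skip value, clipped at 1 (which reproduces the numbers 5, 11 and 1 to 16 from the text). All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def allowed_window(last_position: int, skip: int) -> Tuple[int, int]:
    """Positions a new keyword may occupy, assuming last position +/- skip, clipped at 1."""
    return max(1, last_position - skip), last_position + skip


def contains_in_order(tokens: Sequence[str], pattern: Sequence[str], skip: int) -> bool:
    """True if `pattern` occurs in `tokens` in order, neighbors at most `skip` apart."""
    def search(start: int, idx: int, prev: int) -> bool:
        if idx == len(pattern):
            return True
        for pos in range(start, len(tokens)):
            if prev >= 0 and pos - prev > skip:
                break
            if tokens[pos] == pattern[idx] and search(pos + 1, idx + 1, pos):
                return True
        return False
    return search(0, 0, -1)


def support(pattern: Sequence[str], tweets: Sequence[Sequence[str]], skip: int) -> int:
    """Number of tweets in which the pattern occurs."""
    return sum(contains_in_order(t, pattern, skip) for t in tweets)


def grow_pattern(initial: Sequence[str], keywords: Sequence[str],
                 tweets: Sequence[Sequence[str]], length: int = 4, skip: int = 3) -> List[str]:
    """Greedily extend `initial` with the keyword giving the most frequent new sequence."""
    seq = list(initial)
    remaining = [k for k in keywords if k not in seq]
    while len(seq) < length and remaining:
        best = None  # (count, candidate sequence, keyword)
        for kw in remaining:
            for pos in range(len(seq) + 1):
                cand = seq[:pos] + [kw] + seq[pos:]
                count = support(cand, tweets, skip)
                if best is None or count > best[0]:
                    best = (count, cand, kw)
        if best[0] == 0:
            break  # no extension occurs in any tweet
        seq = best[1]
        remaining.remove(best[2])  # filter the used keyword out of the set
    return seq


tweets = [
    "election debate starts monday".split(),
    "the election debate will start monday".split(),
    "election debate monday night".split(),
    "watch the debate tonight".split(),
]
pattern = grow_pattern(["election", "debate"], ["monday", "starts", "night", "tonight"],
                       tweets, length=3, skip=3)
```

On this toy data the 2-tuple "election-debate" is extended with "monday", because the resulting sequence occurs in three tweets while every other extension occurs in at most one.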
4. Data analysis of different data streams based on generated semantic sequence patterns
Compared with the static setting, streaming data analysis faces many difficulties: it requires faster and more accurate computation on the one hand, and flexible visualization support on the other. Moreover, because event topics continually change, are inherited and disappear, clustering becomes more complex. To solve this problem, the invention adopts a dynamic clustering method based on BERTopic and KMeans++ to process continuously updated tweets in real time, as follows:
First, the BERTopic model is used to generate semantic vectors of the documents in a high-dimensional space, which are then reduced in dimension by UMAP (Uniform Manifold Approximation and Projection, a manifold learning technique for dimension reduction). (BERTopic is a topic modeling technique that uses transformers and c-TF-IDF to create dense clusters, allowing topics to be interpreted while keeping the important words in the topic descriptions.) Since the BERTopic model does not support dynamic clustering on streaming datasets, the invention combines it with the KMeans++ algorithm, one of the fastest clustering algorithms applicable to streaming data. After the KMeans++ algorithm produces the topic clusters of each event, class-based TF-IDF vectors are used to generate the topic representations.
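For reference, the class-based TF-IDF weighting documented for BERTopic scores a word x in class c as tf(x, c) · log(1 + A / f(x)), where tf(x, c) is the normalized frequency of x in class c, f(x) is its frequency over all classes, and A is the average number of words per class. The sketch below implements that weighting in plain Python on hypothetical toy data; it is illustrative only, does not call the BERTopic library, and the function name `class_tf_idf` is an assumption.

```python
import math
from collections import Counter
from typing import Dict, List


def class_tf_idf(class_words: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score every word of every class with tf(x, c) * log(1 + A / f(x))."""
    if not class_words:
        return {}
    per_class = {name: Counter(words) for name, words in class_words.items()}
    overall: Counter = Counter()
    for counts in per_class.values():
        overall.update(counts)
    # A: average number of words per class.
    avg_words = sum(overall.values()) / len(per_class)
    scores: Dict[str, Dict[str, float]] = {}
    for name, counts in per_class.items():
        class_total = sum(counts.values())
        scores[name] = {
            word: (count / class_total) * math.log(1 + avg_words / overall[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical clusters: each class is the bag of words of its tweets.
toy_clusters = {
    "trade": ["tariff", "tariff", "export", "china"],
    "war": ["ukraine", "troops", "china", "china"],
}
scores = class_tf_idf(toy_clusters)
top_trade_word = max(scores["trade"], key=scores["trade"].get)
```

In the toy data "china" occurs in both clusters, so within the "trade" cluster it scores lower than "export" even though both occur once there; "tariff" becomes the top keyword of that cluster.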
There are two reasons why the Word2Vec-based method is not used in the streaming setting. First, word vector generation depends on the corpus of each minute, so the vector of the same word changes from one minute's data to the next. The cluster centers therefore cannot be passed on to the next minute unless a large sliding window is set and the whole window is treated as one bag of words, which introduces a lag behind real time. The invention instead uses a transformer-based pre-trained model, which produces the same vector for a word in every minute. The cluster centers can thus be passed on to the next minute, achieving real-time clustering and yielding coherent topics. Second, the Word2Vec-based method needs initial keywords to extract highly similar words, which requires prior knowledge of the event's topics; an automatic clustering method is needed to help users learn about upcoming topics. The invention therefore adopts BERTopic+KMeans++ to dynamically cluster the continuously updated tweets.
The process of dynamically clustering continuously updated tweets with the BERTopic+KMeans++ method is described in detail below:
The basic principle of the dynamic KMeans algorithm is to initialize the cluster centers with the previous clustering result. When the first minute of data arrives, KMeans++ is used to initialize the clusters; after this first clustering is completed, the cluster centers are passed on to the clustering of the next minute. This retains the information obtained in the previous step and improves clustering efficiency. Considering users' limited capacity to absorb information that changes in real time, at most 6 topics are generated per clustering, with at most 20 context keywords per topic.
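The warm-start idea can be sketched in plain Python as follows: the first minute is seeded with KMeans++ and every later minute starts from the previous minute's centers. The two-dimensional points stand in for document embeddings, and the function names, the fixed iteration count and the random seed are assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points: Sequence[Point], k: int, rng: random.Random) -> List[List[float]]:
    """KMeans++ seeding: later centers are drawn with probability proportional to D^2."""
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(weights)
        if total == 0.0:  # all points coincide with existing centers
            centers.append(list(rng.choice(points)))
            continue
        threshold = rng.uniform(0.0, total)
        running, chosen = 0.0, points[-1]
        for p, w in zip(points, weights):
            running += w
            if running >= threshold:
                chosen = p
                break
        centers.append(list(chosen))
    return centers


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def lloyd(points: Sequence[Point], centers: Sequence[Point],
          iterations: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Standard Lloyd iterations starting from the given centers (copied, not mutated)."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        labels = assign(points, centers)
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign(points, centers)


def cluster_minute(points: Sequence[Point], k: int,
                   prev_centers: Optional[Sequence[Point]] = None, seed: int = 0):
    """First minute: KMeans++ seeding. Later minutes: warm start from previous centers."""
    if prev_centers is None:
        start = kmeanspp_init(points, k, random.Random(seed))
    else:
        start = prev_centers
    return lloyd(points, start)


minute_1 = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
minute_2 = [[0.5, 0.5], [1.0, 1.0], [0.0, 1.0], [10.5, 10.5], [11.0, 11.0], [10.0, 11.0]]
centers_1, labels_1 = cluster_minute(minute_1, k=2)
centers_2, labels_2 = cluster_minute(minute_2, k=2, prev_centers=centers_1)
```

Because the second minute starts from the first minute's centers, each cluster keeps its index from one minute to the next, which is what allows topics to be tracked over time.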
To obtain coherent topics, the clusters are checked after each minute's clustering. If the top 25% of the keywords of one cluster also appear in a cluster from the adjacent minute, the cluster at the current time is merged with that cluster, the keywords in the merged cluster are ranked by class-based TF-IDF score, and the set of the top x context keywords is taken as the updated cluster. If the top 25% of the keywords do not appear in the other cluster, no update is performed for that cluster.
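A minimal sketch of this merge rule is given below. It makes several assumptions that the text does not fix: a cluster is represented as a mapping from keyword to class-based TF-IDF score, the test requires every one of the top 25% keywords of the earlier cluster to appear in the current cluster, and the merged score of a shared keyword is taken as the larger of its two scores (a stand-in for recomputing class-based TF-IDF on the merged cluster). The names `top_fraction` and `merge_if_coherent` are hypothetical.

```python
import math
from typing import Dict, List, Optional

Cluster = Dict[str, float]  # keyword -> class-based TF-IDF score


def top_fraction(cluster: Cluster, fraction: float) -> List[str]:
    """Highest-scoring keywords making up the given fraction of the cluster (at least one)."""
    ranked = sorted(cluster, key=lambda w: (-cluster[w], w))
    count = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:count]


def merge_if_coherent(previous: Cluster, current: Cluster,
                      fraction: float = 0.25, top_x: int = 20) -> Optional[Cluster]:
    """Merge the two clusters if the leading keywords of `previous` all occur in
    `current`; return the top_x keywords of the merged cluster, or None if no merge."""
    leaders = top_fraction(previous, fraction)
    if not all(word in current for word in leaders):
        return None
    merged = dict(current)
    for word, score in previous.items():
        # Assumption: keep the larger score instead of recomputing c-TF-IDF.
        merged[word] = max(score, merged.get(word, 0.0))
    ranked = sorted(merged, key=lambda w: (-merged[w], w))[:top_x]
    return {w: merged[w] for w in ranked}


previous_minute = {"tariff": 0.9, "trade": 0.8, "export": 0.5, "steel": 0.4}
current_minute = {"tariff": 0.7, "sanction": 0.95, "trade": 0.6, "bank": 0.3}
unrelated = {"storm": 0.9, "flood": 0.5}

updated = merge_if_coherent(previous_minute, current_minute, top_x=3)
not_merged = merge_if_coherent(unrelated, current_minute)
```

In the toy data the leading keyword of the earlier cluster, "tariff", also appears in the current cluster, so the two are merged and the three best-scoring keywords are kept; the unrelated cluster shares no leading keyword and is left alone.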
With the above method, the context keywords identified by the BERTopic model at all time steps are clustered, and the contexts in which the top y context keywords of the finally merged clusters occur are taken as the objects of visual analysis.
To assess the performance of the BERTopic+KMeans++ clustering method described above, the present invention evaluates BERTopic+KMeans++ against BoW+KMeans++. The bag-of-words (BoW) model puts all words into one bag regardless of grammar and word order, i.e., each word is treated as independent. BERTopic is based on a word vector model, i.e., a neural network model that takes the positional relationships between words into account and, after training on a large corpus, maps each word into a high-dimensional vector (thousands to tens of thousands of dimensions). The invention tests the normalized mutual information (NMI), a measure of the similarity between two labelings of the same data. Where |U_i| is the number of samples in cluster U_i and |V_j| is the number of samples in cluster V_j, the mutual information between the clusterings U and V is
$$MI(U,V)=\sum_{i=1}^{|U|}\sum_{j=1}^{|V|}\frac{|U_i\cap V_j|}{N}\log\frac{N\,|U_i\cap V_j|}{|U_i|\,|V_j|}$$
where N is the total number of samples. Normalized mutual information (NMI) normalizes the mutual information (MI) score so that the result lies between 0 (no mutual information) and 1 (perfect correlation).
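As a worked illustration of the measure, the following sketch computes MI and NMI for two label lists using only the standard library. The normalizer is an assumption: the text does not say which one is used, so the arithmetic mean of the two entropies is chosen here, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(u, v):
    """MI between two labelings of the same samples (natural logarithm)."""
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * math.log(n * nij / (cu[a] * cv[b]))
               for (a, b), nij in cuv.items())


def entropy(labels):
    """Shannon entropy of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(u, v):
    """MI divided by the arithmetic mean of the two entropies (assumed normalizer)."""
    hu, hv = entropy(u), entropy(v)
    if hu == 0 and hv == 0:  # both labelings put everything in one cluster
        return 1.0
    return mutual_information(u, v) / ((hu + hv) / 2)


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # same grouping, renamed labels
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # groupings share no information
```

The first call shows that NMI is invariant to how the clusters are named, which is why it suits the comparison of clustering output against class labels; the second shows the other extreme, where knowing one labeling says nothing about the other.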
The quality of the resulting clustering is judged against the class labels. Each method was run 5 times and the median was calculated. The median NMI of BERTopic+KMeans++ was 0.61 [0.60-0.62] and the median NMI of BoW+KMeans++ was 0.42 [0.39-0.43], indicating that BERTopic+KMeans++ is superior to the BoW method in terms of clustering results. As for computational efficiency, the BERTopic+KMeans++ method was tested on the dataset of the case study, which contains an average of 80 tweets per minute. BERTopic+KMeans++ was found to process 1 minute of data in 6-7 seconds. Thus, for many social media event datasets with around 800 tweets per minute, the BERTopic+KMeans++ method is feasible in terms of both clustering quality and time efficiency.
2. Visualizing analysis results
The design principles, visual coding and concrete construction process of the wing metaphors in the semantic sequence pattern view (as shown in fig. 4) will be described below:
The present invention proposes a new design that visualizes the changing sequence patterns of contexts and supports interactions for comparing them. In Contextwing, the main metaphors are wings and feathers, as shown in Fig. 8.
Wing metaphor: the wings visualize the connections between the sequence patterns of the center word. In the horizontal direction, the wings form a left-right symmetrical structure. The words on the left wing are words that appear in the text before the center word, and the words on the right wing are words that appear after it.
Feather metaphor: each pair of horizontally symmetric feathers in the wing (also referred to as a layer) shows the sequence patterns merged under the same context keyword. The color and vertical position of a feather represent the correlation between its context keyword and the two key entities. The horizontal position represents the degree of public attention.
Next, the construction of the semantic sequence pattern view is described.
1. Build the feather layers described above for the selected context keywords. The present invention assigns each selected context keyword to a layer. All layers have the same length, and their width is automatically fine-tuned according to the number of selected words. The color and vertical position of a layer encode the pairwise correlation produced by the computational model, and thus represent a pairwise comparison. As shown in Fig. 8, the lower a layer is positioned, the more closely it is associated with key entity A; conversely, the higher it is positioned, the closer it is to key entity B. The horizontal position of a layer is a quantification of public attention to the center word and its selected context keyword: the closer a layer is horizontally to the center keyword, the more attention it has received. Furthermore, the width of the link to a layer represents the total frequency of the patterns on that layer.
2. Lay out the context keywords on each layer. As shown in a of Fig. 8, the present invention places words on the feathers around the center keyword in their order of appearance, and arranges the patterns from the top to the bottom of each layer in chronological order. The context keywords are vertically aligned with the center word. Keywords on the same horizontal line as the center word form a pattern. The time scale at the side of a layer indicates the time period corresponding to the pattern in the same row. The size of a keyword encodes the frequency of the pattern up to and including that word; the frequency of the last keyword therefore represents the frequency of the whole pattern.
3. Merge the selected context keywords. During layout, many repeated keywords are found in the same column, which makes it hard to compare the different semantic information in the sequences. For example, a selected context keyword such as "flu" does not stand out because "pandemic" is also repeated and lies closer to the center (a in Fig. 9). It is therefore necessary to reduce the influence of other context keywords and to emphasize the context keyword of the selected layer. As shown in b of Fig. 9, the present invention merges the repeated keywords in the same column, which preserves the overall structure and avoids misunderstanding. After merging, however, the information on how the frequency of the context keyword evolves is lost. The invention therefore also adds a mini trend graph that visualizes the change in word frequency over time to enrich the display.
4. Add connecting lines between words. Following the idea of a tree structure, the present invention connects the words of the same pattern with lines to make the pattern easier to read. As shown in a of Fig. 9, a context keyword may be the final word of a pattern ("flu" in Fig. 9). If repeated words are merged, the position of that word may become blank, and it may look as if there is no word on that level, which leads to misunderstanding. The present invention therefore adds a line connecting the context keyword to the next blank position on the same horizontal line to indicate the presence of the context keyword, as shown in Fig. 9.
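The merging of repeated keywords in step 3 can be sketched as a run-length grouping over one column of a layer. This is only an illustration under stated assumptions: a column is modeled as a top-to-bottom list of words (one per pattern row, `None` for an empty cell), only adjacent repeats are merged, and the per-row frequencies are retained so that a mini trend graph can be drawn afterwards. The function name and the dictionary layout are hypothetical.

```python
from itertools import groupby


def merge_column(words, freqs):
    """Collapse runs of identical adjacent words in one column.
    words: words from top row to bottom row, None for an empty cell.
    freqs: frequency of the corresponding pattern in each row.
    Returns one entry per merged cell with the rows it spans and the
    per-row frequencies that feed the mini trend graph."""
    merged, row = [], 0
    for word, group in groupby(zip(words, freqs), key=lambda wf: wf[0]):
        items = list(group)
        if word is not None:
            merged.append({
                "word": word,
                "rows": (row, row + len(items) - 1),  # first and last row covered
                "trend": [f for _, f in items],       # frequency over time
            })
        row += len(items)
    return merged


column = merge_column(["pandemic", "pandemic", "pandemic", "flu"], [5, 8, 3, 2])
```

Here the three "pandemic" cells collapse into one cell spanning rows 0 to 2, while its frequencies 5, 8 and 3 survive in `trend`, which is the information the mini trend graph is meant to restore.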
The Contextwing system provided by the invention comprises a topic view, a control view, a mode view and a detail view, as shown in Fig. 10.
1. Topic view
The present invention provides a topic view for selecting keywords as input to the mode view. As shown in Fig. 11, the top of the view is a histogram showing the percentage change in the tweets of the two data streams. The symbols of the two streams are placed at the top and bottom of the view, respectively. The topics are marked with different colors on the buttons at the lower left. In the bubble chart, keywords are aggregated and divided into several time periods. Since each keyword can itself generate a pattern wing structure, the invention designs the keyword bubble as a wing-shaped glyph. Its size and opacity represent the frequency of occurrence of the word, and its color represents the topic to which the word belongs. The vertical position of a bubble represents its correlation with the two key entities. Some important indicators, such as frequency and sentiment distribution, are displayed in a tooltip. To make topic coherence easy to observe, the invention adds connecting lines between keywords that appear frequently in different time periods. The user can hover over a keyword to inspect its frequency and relevance. A time period can be selected by brushing the histogram, and the data of the selected period is then re-aggregated and displayed in multiple bubble pools.
The design of the topic view also extends to the stream setting. The histogram, bubble pools, and topic buttons are updated synchronously at a preset interval (e.g., 1 minute). According to the modeling result, if a new topic appears, the old topic is replaced and the new one is highlighted with a new color. The color and name of the topic buttons always correspond to the categories of the updated bubbles, which helps the user perceive the dynamic change of topics more intuitively. Under dynamic change, however, it is difficult for the user to keep a mental map of previous information. The invention therefore combines the histogram with the bubble chart to help the user view real-time and historical data. The user can also click the "pause" button to pause or resume updates.
2. Control view
The invention provides dataset options and analysis modes, and the user can switch between the static and stream analysis modes. Furthermore, starting from the topic view (shown in Fig. 11), there are two ways to explore the center word and its context keywords; since every context keyword can itself serve as a center word with its own context keywords, each keyword can be clicked to drill down further in exploration mode. The user can click "change mode" to open the iteration control panel (shown in Fig. 3), which offers two modes. (1) Exploration mode: the user can continuously drill down into the context keywords of a center word by clicking. (2) Analysis mode: the user can click a keyword to make it the center word and then select its context keywords. To keep the information consistent, the color of a selected keyword in the control view still represents its topic. Clicking "Go Pattern" then shows the derived patterns in the mode view on the right. For example, Fig. 1 shows a selection operation in analysis mode, which supports selecting the context keywords of a center word. During this process, the user can click a bubble again to update the selection, and click "back" or "restart" to return to the previous or the initial state, thereby carrying out an iterative exploration.
3. Mode view
Once the wing structure is constructed, the user can compare patterns from different aspects. The present invention provides the following four interactions for detailed comparison. First, to support temporal comparison of the patterns of the same context keyword, the user can hover over any keyword; the corresponding patterns are highlighted and the other patterns are hidden, so that the user can better observe the patterns of a single layer at different time stamps. Second, to compare patterns across the selected context keywords, the user can click any time scale to highlight the patterns of the different layers in that time period. Third, when the user hovers the mouse over the side of a layer, a mini sparkline (a small line chart without axes) is displayed, showing how the frequency of the selected context keyword evolves over the whole period. Finally, the invention provides a tooltip: the user can click any keyword to view the frequency and sentiment distribution of each pattern. The mode view also supports real-time updates, displaying the patterns at the same time interval as in the topic view, vertically aligned to correspond to the most recent time steps up to the current time.
4. Detail view
To help the user understand the patterns, the invention provides a detail view (Fig. 7) that displays information about the original tweets, such as their time and sentiment score. When the user selects a pattern in the mode view, the original tweets are displayed in the detail view. In addition, the user can select a time period and type in words of interest.
It should be understood that the above description is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be apparent to those skilled in the art that various modifications, equivalents, variations, and the like can be made to the present invention. However, such modifications are intended to fall within the scope of the present invention without departing from the spirit of the present invention. In addition, some terms used in the specification and claims of the present application are not limiting, but are merely for convenience of description.

Claims (6)

1. A context semantic sequence comparison method based on dynamic clustering and visualization, characterized in that, for real-time stream data, continuously updated tweets are dynamically clustered by a dynamic clustering method based on BERTopic and KMeans++, and visual analysis is then performed on the dynamic stream, wherein the visual analysis specifically comprises the steps of:
S1, according to a center word selected by a user, extracting context keywords of the center word by calculating the similarity between each word in the tweets and the center word, and calculating the public attention of the context keywords and the center word;
S2, calculating the association between the context keywords and two key entities, and visualizing it;
S3, according to the center word and its context keyword set, generating semantic sequence patterns by an iterative search method and visualizing them.
2. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein dynamically clustering the continuously updated tweets by the dynamic clustering method based on BERTopic and KMeans++ comprises the steps of:
A1, according to the center word given by the user, identifying the context keywords in the continuously updated tweets with the BERTopic model, to obtain the context keywords to be clustered at the initial time t;
A2, initializing the clusters at time t with the KMeans++ algorithm, and after the first clustering is completed, passing the cluster centers on to the clustering at time t+1;
A3, at each clustering, judging whether the top m of the context keywords of one cluster are also present in the other cluster, and if so, merging the two clusters, sorting the context keywords in the merged cluster by class-based TF-IDF score, and taking the set formed by the top-x ranked context keywords as the updated cluster;
A4, completing the clustering of the context keywords identified at all times by the method of steps A2-A3, and taking the contexts in which the top y context keywords of the finally merged cluster occur as the objects of visual analysis.
3. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S1, the similarity between the center word and each word in the tweets is calculated by a cosine similarity calculation method, and the top n ranked words are taken as the context keyword set.
4. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S1, the method of calculating the public attention of the context keywords of the center word comprises the steps of:
S11, calculating the public attention, the calculation method being expressed by the following formula (1):
in formula (1), k represents the center word selected by the user or the system;
c represents the context keyword;
n represents the total number of tweets in the dataset;
u_i(c, k) is an inclusion condition indicating whether the i-th tweet contains both c and k: if so, u_i(c, k) = 1, otherwise 0;
u_i(c, -k) indicates whether the i-th tweet contains c but not k: if so, u_i(c, -k) = 1, otherwise 0;
η_i indicates whether the i-th tweet has been forwarded: if so, η_i = 1, otherwise 0;
r_i represents the number of times the i-th tweet has been forwarded;
S12, visualizing according to the calculated public attention.
5. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S2, the association is expressed by the following formula (2):
in formula (2), the two frequency terms respectively represent the co-occurrence frequency of the context keyword i with key entity A and with key entity B at time t;
rank represents the rank of the difference between the co-occurrence frequencies of the context keyword i among all i ∈ W_t;
N_t represents the total number of context keywords of the center word at time t;
W_t represents the set of all context keywords of the center word at time t.
6. The context semantic sequence comparison method based on dynamic clustering and visualization according to claim 1, wherein in step S3, the method of generating the semantic sequence pattern comprises the steps of:
S31, forming an initial sequence, wherein the initial sequence comprises the center word and the context keyword selected by the user, kept in their order of appearance in the tweets;
S32, traversing each context keyword in the keyword set, and for each one calculating the co-occurrence frequency in the tweets of the new semantic sequence formed when that word is added to the initial sequence; finding the word with the largest co-occurrence frequency, adding it to the initial sequence to expand the sequence, and removing the newly added context keyword from the keyword set;
S33, taking the new semantic sequence obtained by the expansion in step S32 as the initial sequence, returning to step S31, and continuing to expand the initial sequence from the remaining filtered keyword set until the expanded sequence reaches a preset sequence length, the finally obtained new semantic sequence being taken as the generated semantic sequence pattern.
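The iterative expansion of steps S31 to S33 can be sketched as a greedy search. The sketch rests on assumptions that the claim does not spell out: tweets are given as token lists, a candidate word may be attached at either end of the current sequence, and the co-occurrence frequency of a sequence is counted as the number of tweets that contain its words in that order (not necessarily adjacent). The function names and the sample tweets are hypothetical.

```python
def contains_in_order(tokens, seq):
    """True if seq occurs in tokens as an ordered subsequence."""
    it = iter(tokens)
    return all(word in it for word in seq)


def expand_sequence(tweets, initial, candidates, max_len):
    """Greedily grow the initial sequence one context keyword at a time."""
    seq = list(initial)
    remaining = set(candidates) - set(initial)
    while len(seq) < max_len and remaining:
        best = None  # (count, new sequence, added word)
        for word in sorted(remaining):  # sorted for deterministic tie-breaking
            for new_seq in ([word] + seq, seq + [word]):
                count = sum(contains_in_order(t, new_seq) for t in tweets)
                if best is None or count > best[0]:
                    best = (count, new_seq, word)
        if best[0] == 0:  # no tweet supports any further extension
            break
        seq = best[1]
        remaining.discard(best[2])  # filter the word that was just added
    return seq


tweets = [
    ["new", "flu", "pandemic", "spreads", "fast"],
    ["flu", "pandemic", "spreads"],
    ["the", "flu", "pandemic", "is", "over"],
]
pattern = expand_sequence(tweets, ["flu", "pandemic"],
                          {"spreads", "new", "over"}, max_len=4)
```

In the first round "spreads" is appended because two tweets support "flu pandemic spreads"; in the second round "new" is prepended because one tweet supports the four-word sequence, after which the preset length of 4 is reached.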
CN202310445169.7A 2023-04-20 2023-04-20 Context semantic sequence comparison method based on dynamic clustering and visualization Active CN116521858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310445169.7A CN116521858B (en) 2023-04-20 2023-04-20 Context semantic sequence comparison method based on dynamic clustering and visualization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310445169.7A CN116521858B (en) 2023-04-20 2023-04-20 Context semantic sequence comparison method based on dynamic clustering and visualization

Publications (2)

Publication Number Publication Date
CN116521858A true CN116521858A (en) 2023-08-01
CN116521858B CN116521858B (en) 2024-04-30

Family

ID=87407620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310445169.7A Active CN116521858B (en) 2023-04-20 2023-04-20 Context semantic sequence comparison method based on dynamic clustering and visualization

Country Status (1)

Country Link
CN (1) CN116521858B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796754A (en) * 2023-04-20 2023-09-22 浙江浙里信征信有限公司 Visual analysis method and system based on time-varying context semantic sequence pair comparison

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
US20180096057A1 (en) * 2016-10-03 2018-04-05 Sap Se Collecting event related tweets
CN110543559A (en) * 2019-06-28 2019-12-06 谭浩 Method for generating interview report, computer-readable storage medium and terminal device
CN110909153A (en) * 2019-10-22 2020-03-24 中国船舶重工集团公司第七0九研究所 Knowledge graph visualization method based on semantic attention model
CN115470344A (en) * 2022-08-24 2022-12-13 西南财经大学 Video barrage and comment theme fusion method based on text clustering

Also Published As

Publication number Publication date
CN116521858B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
Yang et al. Image-text multimodal emotion classification via multi-view attentional network
Bordes et al. Translating embeddings for modeling multi-relational data
Yang et al. Interactive steering of hierarchical clustering
CN105095433B (en) Entity recommended method and device
El-Assady et al. Semantic concept spaces: Guided topic model refinement using word-embedding projections
CN112966091B (en) Knowledge map recommendation system fusing entity information and heat
KR102636493B1 (en) Medical data verification method, apparatus and electronic device
Nabati et al. Multi-sentence video captioning using content-oriented beam searching and multi-stage refining algorithm
CN116521858B (en) Context semantic sequence comparison method based on dynamic clustering and visualization
Wang et al. Detecting hot topics from academic big data
CN115481325A (en) Personalized news recommendation method and system based on user global interest migration perception
Mai et al. A unimodal representation learning and recurrent decomposition fusion structure for utterance-level multimodal embedding learning
Suresh et al. Data mining and text mining—a survey
WO2024139925A1 (en) Method and system for constructing visualization graph based on natural language
CN118069927A (en) News recommendation method and system based on knowledge perception and user multi-interest feature representation
Liu et al. Scanning, attention, and reasoning multimodal content for sentiment analysis
Park et al. Survey and challenges of story generation models-A multimodal perspective with five steps: Data embedding, topic modeling, storyline generation, draft story generation, and story evaluation
CN116796754A (en) Visual analysis method and system based on time-varying context semantic sequence pair comparison
Suresh An innovative and efficient method for Twitter sentiment analysis
Zhai et al. MLSFF: Multi-level structural features fusion for multi-modal knowledge graph completion
Marchenko et al. Examining the historical development of techno-scientific biomedical communication in Russia
Zbakh et al. An online reversed French Sign Language dictionary based on a learning approach for signs classification
Tamrakar et al. Student sentiment analysis using classification with feature extraction techniques
El-Gayar Automatic generation of image caption based on semantic relation using deep visual attention prediction
Schaffer et al. Interactive interfaces for complex network analysis: An information credibility perspective

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant