CN107368506A - Unstructured data analysis system and method - Google Patents

Publication number
CN107368506A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610496280.9A
Other languages
Chinese (zh)
Other versions
CN107368506B (en)
Inventor
汪晓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Stewart Feiteng Co
Original Assignee
Stewart Feiteng Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/151,572 (US10452698B2)
Application filed by Stewart Feiteng Co
Priority to CN202011265115.5A (CN112732878A)
Publication of CN107368506A
Publication of CN107368506B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An unstructured data analysis system, including: an unstructured data analysis algorithm that is resident on a server and accessible via a browser, the unstructured data analysis algorithm operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in a presentation layer, an exploration layer, and an annotation layer. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The presentation layer displays one or more of: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer.

Description

Unstructured data analysis system and method
Cross-reference to related applications
The present patent application claims the benefit of priority of co-pending U.S. Provisional Patent Application No. 62/159,662, filed on May 11, 2015, and entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS INCLUDING A VISUALIZATION INTERFACE," and U.S. Provisional Patent Application No. 62/159,683, filed on May 11, 2015, and entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS INCLUDING NATURAL LANGUAGE PROCESSING AND STATISTICS FUNCTIONS," the contents of both of which are incorporated in full by reference herein.
Technical field
The present invention relates generally to methods and systems for analyzing large text corpora and unstructured data. More specifically, the present invention relates to methods and systems that use visual analytics and topic modeling, a visualization interface, and natural language processing and statistics functions to analyze large text corpora and unstructured data.
Background
Managing large and growing collections of text information and unstructured data is a challenging problem. Data repositories of knowledge-rich text information have become widespread, creating massive amounts of data to be organized, mined, and analyzed. As the number of documents grows, learning the meaning of a text corpus becomes cognitively costly and time-consuming.
For researchers in the field of natural language processing (NLP), the challenge of automatically summarizing large text corpora has been a principal concern. To summarize text corpora, researchers have developed techniques such as latent semantic analysis (LSA), which extracts and represents the contextual-usage meaning of words. LSA produces a concept space that can be used for document classification and clustering. More recently, probabilistic topic models have emerged as a powerful new technique for finding semantically meaningful topics in unstructured text collections. To further provide visual summaries of text corpora, researchers from the knowledge discovery and visualization communities have developed tools and techniques that support the visualization and exploration of large text corpora based on both LSA and probabilistic topic models.
Although probabilistic topic models have demonstrated their advantages in terms of interpretability and semantic association, few interactive visualization systems use such models to support the exploration and analysis of text corpora. Exemplar-based visualization and probabilistic latent semantic visualization methods project documents onto a two-dimensional (2D) chart while estimating the topics of the text corpus. Although the document clusters follow the selected labels well, there is almost no opportunity for interactive exploration and analysis of the document clusters. One exception is the time-based visualization system TIARA, which uses the ThemeRiver metaphor to visualize the topical content of a text collection summarized by topic. Through analysis with the TIARA system, users can answer questions such as: What are the major topics in the document corpus? And how do the topics evolve over time?
However, when analyzing large text corpora, there are many other real-world questions that current text analysis visualization systems have difficulty answering. In particular, questions about the relationships between topics and documents are difficult to answer with existing tools. Such questions include: What are the characteristics of the documents based on their topic distributions? And which documents address multiple topics at once (and what are those topics)? In the field of science policy, for example, a document with multiple topics can indicate an interdisciplinary publication (that is, one covering more than one body of knowledge). Similarly, in the context of social media analysis, a document with multiple topics can represent a unique news article that relates to several different trending topics.
To overcome the shortcomings associated with existing methods and systems, and to help users understand large text corpora more effectively, the present invention provides a novel visual analytics system that integrates a state-of-the-art probabilistic topic model, latent Dirichlet allocation (LDA), with interactive visualization. To describe a document corpus, the methods and systems of the present invention first extract a set of semantically meaningful topics using LDA. Unlike most traditional clustering techniques, which assign a document to a specific cluster, the LDA model accounts for the different topical aspects of each individual document. This permits efficient full-text analysis of larger documents that may contain multiple topics. To highlight this property of the model, the methods and systems of the present invention use the parallel coordinates metaphor to present the probability distribution of documents across topics. This presentation allows users to discover single-topic and multi-topic documents, as well as the relative importance of each topic to a document of interest. In addition, since most text corpora are inherently temporal, the systems and methods of the present invention also depict the evolution of topics over time.
In addition, the present invention enables companies, including analysts, marketers, business leaders, information technologists, and C-level executives, to obtain actionable insights from any type of text data. The technology allows people to base their decision processes on data. The technology ingests text data and, through deep computation and statistical algorithms, identifies the themes, topics, and emerging issues in each data set. The results are displayed in an interactive visual form so that anyone in the company can analyze the data holistically or at a fine level of detail. All types of text data can be analyzed: internal data (such as email, chat, surveys, call center records, and focus groups) or external data (such as social media, review sites, forums, and news sites). The technology can handle many languages, ensuring that feedback from around the globe can be analyzed. Highly customizable features are also available for tuning the analysis. Most companies are sitting on a treasure trove of unstructured text data but have almost no ability to mine it for information.
Summary of the invention
Again, in various example embodiments, the methods and systems of the present invention tightly integrate interactive visualization with a state-of-the-art probabilistic topic model. Specifically, to address the questions posed above, the methods and systems of the present invention use the parallel coordinates (PC) metaphor to present the probability distribution of documents across topics. This carefully chosen presentation shows not only how many topics a document relates to, but also the importance of each topic to the document. In addition, the methods and systems of the present invention provide a rich set of interactions that help users automatically partition a document collection based on the number of topics in each document. Besides showing the relationships between topics and documents, the methods and systems of the present invention also support other tasks necessary for understanding a document collection, such as summarizing the major topics of the collection and showing how the topics evolve over time.
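The partitioning of a document collection by topic count can be illustrated with a minimal sketch. It assumes each document already has a topic probability vector (for example, from LDA) and that a topic "counts" for a document when its probability meets a fixed cutoff. The function names and the 0.2 threshold are hypothetical choices for illustration and do not come from the patent.

```python
from collections import defaultdict

def count_salient_topics(topic_probs, threshold=0.2):
    """Number of topics whose probability in one document meets the cutoff."""
    return sum(1 for p in topic_probs if p >= threshold)

def partition_by_topic_count(doc_topic, threshold=0.2):
    """Group document ids by how many topics they are substantially about."""
    groups = defaultdict(list)
    for doc_id, probs in doc_topic.items():
        groups[count_salient_topics(probs, threshold)].append(doc_id)
    return dict(groups)

# Hypothetical per-document topic distributions over three topics.
doc_topic = {
    "doc_a": [0.90, 0.05, 0.05],  # dominated by one topic
    "doc_b": [0.50, 0.45, 0.05],  # split across two topics
    "doc_c": [0.34, 0.33, 0.33],  # spread across all three
}
groups = partition_by_topic_count(doc_topic)
```

Here `groups` maps a topic count to the documents with that many salient topics, which is the kind of single-topic versus multi-topic split the interactions are meant to support.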
The methods and systems of the present invention effectively address a set of questions that arise when analyzing large text corpora, including: What are the major topics that capture the document collection? What are the characteristics of the documents based on their topic distributions? Which documents address multiple topics at once? And how do the topics of interest evolve over time? To help users answer these questions, the methods and systems of the present invention first extract a set of semantically meaningful topics using the LDA model. To support visual exploration of the document collection based on the topic model, the methods and systems of the present invention use multiple coordinated views to highlight both the topical and the temporal features of the document corpus. One novel contribution of the methods and systems of the present invention is the depiction of documents by their probability distributions over topics, together with support for the interactive identification and more detailed examination of single-topic and multi-topic documents.
In one example embodiment, the present invention provides a computerized method for text data analysis, including: receiving, at one or more processors, text data to be analyzed from one or more memories; using the one or more processors, formatting the text data for subsequent analysis; using the one or more processors, applying a probabilistic topic model to the text data to extract a set of semantically meaningful topics that collectively describe all or a portion of the text data; using a keyword weighting module executed on the one or more processors, generating a topic cloud view that represents the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; using a topic ordering module executed on the one or more processors, generating a document distribution view that represents the distribution of all or a portion of the text data over a plurality of topics; using a document entropy computing module executed on the one or more processors, generating a document scatterplot view that represents how many topics all or a portion of the text data relates to; using a temporal topic trend computing module executed on the one or more processors, generating a time view that represents the occurrence of topics over time for all or a portion of the text data; and displaying to a user one or more of the topic cloud view, the document distribution view, the document scatterplot view, and the time view in an analysis of all or a portion of the text data. The text data includes one or more of: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the Internet. The probabilistic topic model produces a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of the topics. Optionally, the keywords are ordered to indicate their importance to a given topic and their relationship to one another. Optionally, keywords are highlighted to indicate their importance to multiple topics. The topics are ordered to represent their relationships. Various other illustrative functions are also provided herein.
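The text names a document entropy module but does not give its formula. One plausible reading, shown here as a sketch under that assumption, is the Shannon entropy of a document's topic distribution: a document concentrated on one topic scores near zero, and a document spread evenly over many topics scores high. The function name `topic_entropy` is hypothetical.

```python
import math

def topic_entropy(topic_probs):
    """Shannon entropy (in bits) of one document's topic distribution.

    Assumption: the input is a probability vector that sums to 1.
    Zero-probability topics contribute nothing and are skipped.
    """
    return -sum(p * math.log2(p) for p in topic_probs if p > 0)

single_topic = topic_entropy([1.0, 0.0, 0.0, 0.0])     # one topic only
two_topics = topic_entropy([0.5, 0.5, 0.0, 0.0])       # evenly split in two
four_topics = topic_entropy([0.25, 0.25, 0.25, 0.25])  # evenly split in four
```

Plotting such a score on one axis of a scatterplot would separate single-topic documents (low entropy) from multi-topic documents (high entropy), which matches the stated purpose of the document scatterplot view.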
In another example embodiment, the present invention provides a computerized system for text data analysis, including: one or more memories operable to store text data to be analyzed and one or more processors operable to receive the text data to be analyzed; an algorithm executed on the one or more processors and operable to format the text data for subsequent analysis; an algorithm executed on the one or more processors and operable to apply a probabilistic topic model to the text data to extract a set of semantically meaningful topics that collectively describe all or a portion of the text data; a keyword weighting module executed on the one or more processors and operable to generate a topic cloud view that represents the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; a topic ordering module executed on the one or more processors and operable to generate a document distribution view that represents the distribution of all or a portion of the text data over a plurality of topics; a document entropy computing module executed on the one or more processors and operable to generate a document scatterplot view that represents how many topics all or a portion of the text data relates to; a temporal topic trend computing module executed on the one or more processors and operable to generate a time view that represents the occurrence of topics over time for all or a portion of the text data; and a display operable to display to a user one or more of the topic cloud view, the document distribution view, the document scatterplot view, and the time view in an analysis of all or a portion of the text data. The text data includes one or more of: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the Internet. The probabilistic topic model produces a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of the topics. Optionally, the keywords are ordered to indicate their importance to a given topic and their relationship to one another. Optionally, keywords are highlighted to indicate their importance to multiple topics. The topics are ordered to represent their relationships. Various other illustrative functions are also provided herein.
Again, the present invention enables companies, including analysts, marketers, business leaders, information technologists, and C-level executives, to obtain actionable insights from any type of text data. The technology allows people to base their decision processes on data. The technology ingests text data and, through deep computation and statistical algorithms, identifies the themes, topics, and emerging issues in each data set. The results are displayed in an interactive visual form so that anyone in the company can analyze the data holistically or at a fine level of detail. All types of text data can be analyzed: internal data (such as email, chat, surveys, call center records, and focus groups) or external data (such as social media, review sites, forums, and news sites). The technology can handle many languages, ensuring that feedback from around the globe can be analyzed. Highly customizable features are also available for tuning the analysis. Most companies are sitting on a treasure trove of unstructured text data but have almost no ability to mine it for information.
In an additional example embodiment, the present invention provides an unstructured data analysis system, including: an unstructured data analysis algorithm that is resident on a server and accessible via a browser, the unstructured data analysis algorithm operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data includes one or more of: customer experience data, telephone data, email data, and social media data. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The external data includes one or more of: Internet data, government data, and business data. The one or more analysis tools applied to the unstructured data include one or more of: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer. The summary information is also displayed to the one or more users in a combination layer.
In another additional example embodiment, the present invention provides an unstructured data analysis method, including: providing an unstructured data analysis algorithm that is resident on a server and accessible via a browser, the unstructured data analysis algorithm operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data includes one or more of: customer experience data, telephone data, email data, and social media data. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The external data includes one or more of: Internet data, government data, and business data. The one or more analysis tools applied to the unstructured data include one or more of: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer. The summary information is also displayed to the one or more users in a combination layer.
Brief description of the drawings
The present invention is illustrated and described herein with reference to the various drawings, in which like reference numbers are used to denote like method steps/system components, as appropriate, and in which:
Fig. 1 is a schematic diagram showing one example embodiment of the visual text corpus analysis tool of the present invention;
Fig. 2 is an example display showing the topic cloud view of the visual text corpus analysis tool of the present invention;
Fig. 3 is an example display showing the document distribution view of the visual text corpus analysis tool of the present invention;
Fig. 4 is a series of charts showing document distributions over one topic, two topics, and more than two topics in accordance with the methods and systems of the present invention;
Fig. 5 is an example display showing the topic cloud view of the visual text corpus analysis tool of the present invention;
Fig. 6 is an example display showing the time view of the visual text corpus analysis tool of the present invention;
Fig. 7 is a schematic diagram showing one example embodiment of the unstructured data analysis system of the present invention;
Fig. 8 is a schematic diagram showing another example embodiment of the unstructured data analysis system of the present invention;
Fig. 9 is a schematic diagram showing an additional example embodiment of the unstructured data analysis system of the present invention;
Fig. 10 is a schematic diagram showing another example embodiment of the unstructured data analysis system of the present invention;
Fig. 11 is a schematic diagram showing one example embodiment of the presentation layer of the unstructured data analysis system of the present invention;
Fig. 12 is a schematic diagram showing one example embodiment of the exploration layer of the unstructured data analysis system of the present invention; and
Fig. 13 is a schematic diagram showing one example embodiment of the annotation layer of the unstructured data analysis system of the present invention.
Detailed description
Two lines of work, text analysis models and text visualization techniques, were the primary inspiration for the initial design of the present invention. These concepts were then refined and extended, as described in greater detail below.
The first major advance in text processing was the vector space model (VSM). In this model, a text is represented as a vector in a high-dimensional space, where each dimension is associated with one unique term in the documents. A well-known example of VSM is TF-IDF, which evaluates how important a word is to a document in a corpus. Although VSM has shown its effectiveness in practice, it has numerous inherent shortcomings in capturing inter-document and intra-document statistical structure.
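To make the TF-IDF weighting concrete, here is a minimal sketch using one common variant: term frequency normalized by document length, multiplied by the natural logarithm of the inverse document frequency. The patent does not say which variant is intended, so the exact formula and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["topic", "model", "text"],
    ["text", "visualization", "text"],
    ["topic", "visualization", "layer"],
]
weights = tf_idf(docs)
```

In the first document, "model" occurs in no other document and so receives a higher weight than "text", which is shared with the second document. Each dictionary is one sparse row of the vector space representation.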
To overcome the shortcomings of VSM, researchers introduced LSA, a factor analysis that reduces the term-document matrix to a much lower-dimensional subspace that captures most of the variance in the corpus. Although LSA overcomes some of the shortcomings of VSM, it has limitations of its own. The new feature space is difficult to interpret, because each dimension is a linear combination of a set of words from the original space.
Recognizing the limitations of LSA, researchers proposed generative probabilistic models for document modeling. For example, researchers have introduced generative models that represent the content of words and documents with probabilistic topics rather than with a purely spatial representation. One unique advantage of this representation is that each topic can be interpreted individually, since it provides a probability distribution over words that picks out a coherent cluster of related terms. The LDA model assumes a latent structure consisting of a set of topics; each document is produced by choosing a distribution over topics and then generating each word at random from a topic selected according to that distribution. For example, as shown by analyses of scientific abstracts and newspaper archives, the extracted topics capture meaningful structure in otherwise unstructured data. On the cognitive side, the LDA model performs well in predicting word association and the effects of semantic association and ambiguity across a variety of language processing and memory tasks.
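The generative process described for LDA can be sketched in a few lines. This is an illustration of the sampling story only, not an inference procedure and not the patent's implementation. The two toy topics, the symmetric Dirichlet parameter of 0.5, and all function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """topics: list of {word: probability} dicts, one per topic."""
    theta = sample_dirichlet(alpha, rng)  # per-document distribution over topics
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta ...
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # ... then pick a word from that topic's word distribution.
        vocab = list(topics[z])
        word = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)  # seeded so that repeated runs give the same draw
topics = [
    {"gene": 0.5, "cell": 0.5},
    {"market": 0.5, "stock": 0.5},
]
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and each document's `theta`.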
Because of the various advantages of the LDA model, the methods and systems of the present invention first use this model to extract a set of semantically meaningful topics from a given text corpus. The methods and systems of the present invention then present the probabilistic results in an intuitive manner, so that users can readily consume the complex model when analyzing large text corpora.
In addition to the advances in automatic text processing techniques, human intelligence still plays a key role in analyzing text corpora. A large number of visualization systems and techniques based on text processing methods have therefore been developed to keep the user in the loop.
For example, using VSM, tools have been introduced to visualize email content with the aim of depicting relationships based on conversation histories. The keywords in the visualization are generated using the TF-IDF algorithm.
Other tools allow users to visually explore text corpora through a social network metaphor based on latent semantic analysis results. Other visualization systems use multidimensional projection methods (such as principal component analysis (PCA) and/or multidimensional scaling (MDS)) to visualize text corpora. These projection techniques are similar in spirit to LSA, in that they represent text as vectors with term frequencies as their features and then identify a lower-dimensional projection space. Visualization systems based on these projection techniques include IN-SPIRE. More recently, to visualize large hierarchical document collections, others have proposed a two-level framework for topology-based projection and visualization. However, unlike most traditional clustering techniques, which assign a document to a specific cluster, the methods and systems of the present invention account for the different topical aspects of each individual document.
Since topic models first appeared, visualization systems have adopted them because of their advantages over earlier text processing techniques. Exemplar-based visualization and probabilistic latent semantic visualization tools project documents onto a static 2D chart while estimating the topics of the text corpus. Although the visualized clustering results are better than those obtained from multidimensional projection methods, they have some limitations. First, as the number of extracted topics increases, the document clusters in the 2D projection can no longer be separated by topic. In addition, these visualization tools leave almost no room for interactive mining and analysis of the document clusters. More recently, TIARA has been introduced, a time-based interactive visualization system that presents the topics extracted from a given text corpus in a time-sensitive manner. TIARA provides a good overview of how topics evolve over time. However, the relationships between documents and topics remain unclear.
Therefore, in addition to depicting the evolution of topics over time, the methods and systems of the present invention also present the probability distribution of documents across the extracted topics. The methods and systems of the present invention thus provide an overview of document characteristics based on their topic distributions and allow users to identify documents that address multiple topics at once.
The methods and systems of the present invention support exploration of a document collection at multiple levels. At the overview level, the system helps users answer questions such as: What are the major topics of the document collection? And what are the characteristics of the documents in the collection? At the facet level, the system supports activities such as identifying the temporal trend of a specific topic and identifying documents related to multiple topics of interest. At the detail level, the system provides access to the detailed content of each individual document on demand. Building on a state-of-the-art topic model, the system uses multiple coordinated views, each of which addresses one of the questions above.
Referring now particularly to Fig. 1, in an example embodiment, visualText Concordance instrument 10 of the invention Overall structure includes:Offline Text Pretreatment 12 and topic modeling module 14.Text Pretreatment module 12 can be used to phase The text for closing document 16 is placed under appropraite condition for subsequent treatment, exploration and analysis.This Text Pretreatment can include but Be not limited to from social media (for example, Twitter is puted up and Facebook profiles), books (for example, coming from Gutenberg The document of online book entry) and other documents (for example, Email, Word document etc.) text pretreatment.
As described above, topic models have several advantages over traditional text-processing techniques. Therefore, the visual text corpus analysis tool 10 of the present invention uses the probabilistic topic model in the topic modeling module 14 to summarize the relevant documents 16. More specifically, latent Dirichlet allocation (LDA) is first used to extract a set of semantically meaningful topics. LDA produces a set of latent topics, each expressed as a multinomial distribution over keywords, and assumes that each document can be described as a probabilistic mixture of these topics. P(z) is the distribution over topics z in a particular document. Assume that the text collection 16 includes D documents and T topics. Determining the topics is an iterative process that uses the visual text corpus analysis tool 10. The tool 10 enables users to interactively specify the number of topics they consider necessary in their analysis domain. Users are allowed to modify the topic modeling module 14 based on the findings and investigations from their visual interaction, so that they can change the number of topics and/or the number of iterations of the modeling process. The visual text corpus analysis tool 10 also enables users to add, remove, and merge topics in the topic modeling module 14.
Accordingly, the document collection 16 is first preprocessed to remove stop words and the like. Then, the Stanford Topic Modeling Toolbox (STMT) or the like is used to extract a set of topics from the document collection 16. The extracted topics and the probabilistic document distributions serve as input to the other visualizations.
The visual design of the tool 10 of the present invention includes four coordinated overviews, which can be displayed and operated individually or in combination on a suitable graphical user interface (GUI): (1) a document distribution view 18 that shows the probability distribution of documents across topics; (2) a topic cloud 20 that presents the content of the extracted topics; (3) a time view 22 that highlights the temporal evolution of the topics; and (4) a document scatter plot view 24 that facilitates interactive selection of single-topic versus multi-topic documents. Each of the four overviews serves a different purpose, and they are coordinated through a rich set of user interactions. In addition, when any document is selected, a detail view presents the text content of that document on demand.
To help users quickly grasp the gist of the document collection, the major topics are presented as a tag cloud in the topic cloud view 20. In the topic cloud view 20, each row displays one topic, which includes, for example, multiple keywords related to that topic. Because each topic is modeled as a multinomial distribution over keywords, the weight of each keyword indicates its importance to the topic. To encode this information in the tag cloud, the keywords are aligned from left to right, with the most important keyword placed first. In addition, since a keyword can appear in multiple topics, the display size or weight of each keyword reflects its occurrence across all topics. However, it will be apparent to those skilled in the art that other configurations can be used. An example of the topic cloud view 20 is provided in Fig. 2. To help users understand the major topics in the document collection 16, the topics are presented in a sequence in which semantically similar topics are placed close together, so that there is continuity when browsing the topics in order. Because the LDA model does not model the relationships between topics, the topics are reordered by defining a similarity measure. The visual text corpus analysis tool 10 features a similarity measure that represents the proximity of topics using the Hellinger distance function. The visual text corpus analysis tool 10 visualizes the similarity measure to give users an understanding of the semantic hierarchy of the topic distribution, and helps reduce their cognitive overload by clustering the topic space.
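The Hellinger distance mentioned above can be sketched in a few lines. This is only an illustration: it assumes the distance is applied to the topics' keyword distributions (the text does not say which distributions are compared), and the function name and sample values are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Hypothetical keyword distributions for three topics over a 4-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.2, 0.7]

# topic_a is closer to topic_b than to topic_c, so a and b would be
# placed next to each other in the topic cloud.
closer_to_b = hellinger(topic_a, topic_b) < hellinger(topic_a, topic_c)
```

A smaller distance means more similar topics, so sorting by this value places related topics in neighboring rows.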
The topic cloud view 20 also provides users with a set of interactions to help them understand the topics quickly. For example, hovering over a particular keyword causes all other occurrences of that keyword in the tag cloud to be highlighted. Users can also search for particular keywords of interest. In addition, the topic cloud view 20 is closely coordinated with all other views so that information about a specific topic can be provided quickly on demand.
The topic cloud view 20 is produced in part by an online keyword weighting module 26, which is operable to aggregate the results of the topic modeling module. It ranks the words in a given topic by the probability of each word in that topic, so that more probable words are placed at the top of the ranking queue. The probability values are those computed by the topic modeling module 14. For example, the size of a word in the topic cloud view is determined by the frequency of the word in the whole text corpus, normalized by the maximum word frequency. For example, the higher the frequency, the larger the word. For example, the tool 10 by default presents the 50 most probable words for each topic. The user can modify the number of words interactively.
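The keyword weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's actual implementation: the function name, the dictionary-based inputs, and the sample values are hypothetical.

```python
def topic_cloud_rows(topic_word_probs, corpus_word_counts, top_n=50):
    """Build one row of (word, relative_size) pairs per topic.

    topic_word_probs: list of dicts mapping word -> probability in that topic.
    corpus_word_counts: dict mapping word -> frequency in the whole corpus.
    Words are ordered by their probability in the topic; the size is the
    corpus frequency normalized by the maximum word frequency.
    """
    max_count = max(corpus_word_counts.values())
    rows = []
    for probs in topic_word_probs:
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        rows.append([(word, corpus_word_counts.get(word, 0) / max_count)
                     for word, _ in ranked])
    return rows

# Hypothetical model output for two topics.
topics = [
    {"robot": 0.5, "vision": 0.3, "arm": 0.2},
    {"brain": 0.6, "vision": 0.4},
]
counts = {"robot": 10, "vision": 20, "arm": 5, "brain": 8}
rows = topic_cloud_rows(topics, counts, top_n=2)
```

The `top_n` parameter mirrors the default of 50 words per topic that the user can change interactively.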
To provide an overview of documents as mixtures of topics, the tool 10 of the present invention highlights the distribution of each document across all extracted topics. The chosen representation converts the document probability distribution into a signal-like pattern that represents each document. More specifically, a parallel coordinates metaphor is used, in which each axis represents one topic and each line represents one document in the collection 16. This is illustrated in Fig. 3. In this arrangement, all variables (i.e., topics) are evenly spaced, and every variable shares the same value range from 0 to 1. Therefore, when examining the document distribution view 18, a document need not be understood through its value on each individual axis; it can instead be understood through its overall pattern across all axes. However, it will be apparent to those skilled in the art that other configurations can be used.
One limitation of LDA is that it does not directly model correlations between the occurrences of topics, yet in most text corpora such correlations can naturally be expected. The tool 10 of the present invention overcomes this limitation by using visualization to make correlations between topics more prominent. Fortuitously, one characteristic of the parallel coordinates visualization is that associations between adjacent axes are easier to find. Therefore, the topics can be ordered so that semantically similar topics are adjacent to each other, which makes the associations between similar topics visually prominent. The topic similarity is defined in terms of the Euclidean distance between two topics i and j over all of the documents 16:
$$\mathrm{dist}(i, j) = \sqrt{\sum_{k=1}^{D} \left( P(d_k \mid z = i) - P(d_k \mid z = j) \right)^2}$$
where $d_k$ is one of the D documents in the whole collection 16, and $P(d_k)$ is the probability distribution of the k-th document over all topics. Thus, $P(d_k \mid z = i)$ represents the probability of topic i when generating document k. When the topics are drawn as axes in the interface, the ordering starts with the topic whose probability is most concentrated and then repeatedly finds the topic most similar to the current topic based on the distance between topics. Fig. 3 illustrates the visualization of documents across topics after the topics have been reordered. The relationship between any two most similar topics (i.e., on adjacent axes) becomes visually recognizable.
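The axis ordering described above can be sketched as a greedy nearest-neighbor walk. This is a sketch under assumptions: each topic is represented by its column of per-document probabilities, the Euclidean distance is taken between those columns, and the starting topic is passed in by the caller because the text does not define precisely how the "most concentrated" topic is chosen. All names and sample values are hypothetical.

```python
import math

def topic_distance(doc_topic, i, j):
    """Euclidean distance between topics i and j over all documents.

    doc_topic[k][t] is the probability of topic t for document k.
    """
    return math.sqrt(sum((row[i] - row[j]) ** 2 for row in doc_topic))

def order_topics(doc_topic, start):
    """Greedy ordering: repeatedly append the topic nearest to the last one."""
    n_topics = len(doc_topic[0])
    order = [start]
    remaining = set(range(n_topics)) - {start}
    while remaining:
        current = order[-1]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nearest = min(sorted(remaining),
                      key=lambda t: topic_distance(doc_topic, current, t))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Hypothetical distributions of four documents over three topics.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
axis_order = order_topics(doc_topic, start=0)
```

The greedy walk does not guarantee a globally optimal ordering, but it keeps each axis next to the most similar topic not yet placed.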
The document distribution view 18 is produced in part by an online topic ordering module 28, which is operable to perform the above functions and to produce the signal representation of individual documents. This signal is an interpretation of the different nature of the documents. View 18 shows that a document with a pronounced distribution on a single topic has a particular focus on one subject, whereas a document with a distribution over 2 or 3 topics indicates a diversified focus.
When exploring the distribution of documents over topics, it is easy to see that documents exhibit different characteristics based on the number of topics they contain. Fig. 4 shows a document 30 that focuses on only one topic, a document 32 with two topics, and a document 34 with more than two topics. Different numbers of topics in a document can be interpreted as different characteristics in the context of the given document collection 16. For example, in a collection of scientific publications, a document with one topic represents a publication related to a specific field of scientific research. A document with two or more topics more likely represents an interdisciplinary research article, which typically integrates two or more bodies of specialized knowledge.
In addition, the document distribution view 18 provides a rich set of interactions, such as brushing, highlighting, etc. Brushing a range of proportions on a topic axis allows the user to select documents that have a particular probability for that specific topic. By combining the information on major topics and document characteristics from both the topic cloud view 20 and the document distribution view 18, users can effectively develop an overview of the document collection 16.
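Brushing a range on one topic axis amounts to a simple range filter over the document distributions. The sketch below assumes an inclusive range and a list-of-lists layout; the function name and sample values are hypothetical.

```python
def brush(doc_topic, topic, low, high):
    """Return indices of documents whose probability on `topic` lies in [low, high]."""
    return [k for k, row in enumerate(doc_topic) if low <= row[topic] <= high]

# Hypothetical distributions of four documents over three topics.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]

# Brushing the upper range of the first axis selects documents focused on topic 0.
focused_on_topic_0 = brush(doc_topic, topic=0, low=0.6, high=1.0)
```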
The document distribution view 18 enables users to identify documents that focus on a specific topic by brushing the upper range of a topic axis. However, in a large corpus it is less easy to identify documents related to two or more topics, because they are obscured by single-topic documents with high probability values. To alleviate this problem, all documents are laid out in a way that makes it easy to separate single-topic documents from multi-topic documents. This is the document scatter plot view 24.
As can be seen in the document distribution view 18, each document is converted into a signal-like probability distribution pattern. In this representation, documents with more topics appear noisier than documents that clearly focus on one topic. In information theory, Shannon entropy is a measure of the amount of uncertainty associated with a random variable. Treating the topics as a random variable for each document in this context, Shannon entropy can be used to distinguish clean signals from noisy signals. Therefore, the tool 10 of the present invention applies Shannon entropy to distinguish documents based on the number of topics they contain. The entropy of each document, based on its probability distribution across topics, is calculated as:
$$H(d_k) = -\sum_{i=1}^{T} P(d_k \mid z = i) \log P(d_k \mid z = i)$$
where $P(d_k)$ is the probability distribution of the k-th document over all topics. Each document can then be plotted in the document scatter plot view 24 based on its entropy and its maximum probability value over the topics (normalized to [0, 1]) (see Fig. 5). In this presentation, for example, single-topic documents (with a higher maximum value and lower entropy) appear in the upper left corner of the scatter plot, while the lower right corner captures documents with a higher number of topics (with a lower maximum value and higher entropy). Upon selection, a pie chart is displayed to depict the topic distribution of a particular document. In Fig. 5, each pie chart represents a selected document, with each color representing one topic. As shown, documents with smaller entropy appear as solid-colored pie charts, while documents with larger entropy appear with more colors, indicating that the entropy corresponds to the number of topics in the input document.
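The scatter plot coordinates can be sketched as below. Shannon entropy is computed in the standard way; dividing it by its maximum possible value, log2 of the number of topics, is an assumption made here to map the horizontal coordinate into [0, 1], and the function names are hypothetical.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)

def scatter_point(dist):
    """Return (x, y) = (normalized entropy, maximum topic probability)."""
    n = len(dist)
    max_entropy = math.log2(n) if n > 1 else 1.0
    return shannon_entropy(dist) / max_entropy, max(dist)

# A single-topic document lands at low entropy and high maximum (upper left);
# an evenly mixed document lands at high entropy and low maximum (lower right).
single_topic = scatter_point([1.0, 0.0, 0.0, 0.0])
evenly_mixed = scatter_point([0.25, 0.25, 0.25, 0.25])
```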
In summary, the document scatter plot view 24 enables users to interactively identify subgroups of documents with a desired number of topics by selecting documents in different regions. The document scatter plot view 24 is produced in part by a document entropy computation module 36, which is operable to perform the above functions and to group the documents in any given text corpus. The document scatter plot view 24 deliberately groups documents based on their entropy and visually illustrates the focus of the given corpus, suggesting whether the corpus concentrates on a single subject or on diversified subjects.
Because most document collections 16 accumulate over time, presenting this temporal information helps users understand how the topics of the corpus evolve. Referring now specifically to Fig. 6, the time view 22 is created as an interactive ThemeRiver, in which each band represents one topic. In the text corpus, each document is associated with a timestamp, so the height of each band over time can be determined by summing the distributions of the documents over that topic in each time frame. The unit of the time frame depends on the corpus: for example, one year may be an appropriate time unit for scientific publications, while one month or even one day would be more suitable for a news corpus. After the time unit is selected, the documents are divided into the corresponding time frames based on their timestamps. Then, for each time frame, the height of each topic is calculated by summing the distributions over that topic of the documents in the time frame.
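The band heights described above reduce to a per-time-frame sum of document distributions. In this sketch the input layout, a list of (time frame, distribution) pairs, and the function name are hypothetical.

```python
from collections import defaultdict

def band_heights(docs, n_topics):
    """Sum topic distributions per time frame.

    docs: iterable of (time_frame, topic_distribution) pairs.
    Returns a dict mapping time_frame -> list of band heights, one per topic,
    ordered by time frame.
    """
    heights = defaultdict(lambda: [0.0] * n_topics)
    for frame, dist in docs:
        for topic, prob in enumerate(dist):
            heights[frame][topic] += prob
    return dict(sorted(heights.items()))

# Hypothetical documents binned by publication year.
docs = [
    (2006, [0.5, 0.5]),
    (2006, [0.25, 0.75]),
    (2007, [1.0, 0.0]),
]
heights = band_heights(docs, n_topics=2)
```

With a different time unit, only the binning key changes (for example, a year-month pair for a news corpus).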
For example, the order of the topics (from top to bottom) is the same in both the topic cloud view 20 and the document distribution view 18. Colors or patterns are assigned to the topics by interpolating a color or pattern spectrum using the normalized cumulative distance between all adjacent topics. As a result, a pair of similar topics is assigned more similar colors or patterns.
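One way to read the color assignment above is to place each ordered topic at its normalized cumulative distance along a spectrum. The sketch below assumes an HSV hue ramp as the spectrum and limits the hue span to avoid wrapping back to the starting color; both choices, and the function names, are assumptions rather than details from the text.

```python
import colorsys

def spectrum_positions(adjacent_distances):
    """Normalized cumulative distance of each ordered topic, in [0, 1].

    adjacent_distances[i] is the distance between ordered topics i and i + 1,
    so the result has one more entry than the input.
    """
    total = sum(adjacent_distances)
    positions, running = [0.0], 0.0
    for d in adjacent_distances:
        running += d
        positions.append(running / total if total > 0 else 0.0)
    return positions

def topic_colors(adjacent_distances, hue_span=0.8):
    """Map each position to an RGB color along an HSV hue ramp (assumed spectrum)."""
    return [colorsys.hsv_to_rgb(hue_span * pos, 1.0, 1.0)
            for pos in spectrum_positions(adjacent_distances)]

# Topics 0 and 1 are close, so their positions (and colors) are close;
# topic 3 is far from topic 2, so it lands much further along the spectrum.
positions = spectrum_positions([1.0, 1.0, 2.0])
colors = topic_colors([1.0, 1.0, 2.0])
```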
In summary, the time view 22 provides a visual overview of how the topics of the document collection 16 evolve over time. In addition to this representation, the time view 22 also supports various interactions. Selecting a time frame (one vertical time unit) filters all documents published in the selected time frame. Similarly, for example, clicking on the intersection of a topic band and a time frame in the time view 22 selects the documents published during that time frame that have a probability of more than 30% on the selected topic. It is therefore possible to identify which documents contributed to the emergence of a topic in a particular time period. The time view 22 enriches the analysis by revealing the temporal information hidden in the document collection 16 and by allowing the user to filter based on time and topic.
The time view 22 is produced in part by a temporal topic trend computation module 38, which is operable to perform the above functions and to support the inspection of detailed documents. The time view 22 enables users to directly select, for example, the documents within a particular time range and obtain the corresponding data. By revealing the details of the associated documents, the time view 22 plays a key role in showing users the basis of the patterns and trends identified in the visualization.
When any document is selected, the tool 10 of the present invention provides the details of the actual text content of the document of interest. Since no topic model is perfect, the function of the detail view is twofold: first, it gives users the context needed for a deep understanding of the topics and the keywords associated with them; second, it helps users verify the patterns shown in the visualization.
Since understanding a large text corpus 16 can involve the use of all four views, the coordination among the views must be carefully considered. At the topic level, hovering over a topic in any view that includes a topic representation highlights the same topic in the other views. For example, if the user hovers over an axis in the document distribution view 18, the same topic is highlighted in both the topic cloud view 20 and the time view 22. The user can therefore quickly combine information about the keywords, document distribution, and temporal trend of a specific topic. In addition, the views are also coordinated by color or pattern, with each topic having the same color or pattern in all views.
At the document level, selecting any set of documents in a view that shows individual documents highlights the same set of documents in the other views. For example, a brushing operation in the document scatter plot view 24 is immediately reflected in the document distribution view 18, and vice versa. When the user selects several documents with two prominent topics (i.e., in the middle range) in the document scatter plot view 24, examining the distribution of these documents helps the user understand the combination of topics in the documents.
At the time level, filtering of the documents written or published in a particular time period is supported. For example, clicking on a time frame (i.e., one vertical time unit) in the time view 22 filters all documents published in the selected time span. Similarly, clicking on the intersection of a topic band and a time frame in the time view 22 selects those documents published during that period in which the topic makes a major contribution (e.g., a probability of more than 30%). This selection is shown in both the document distribution view 18 and the document scatter plot view 24. This function allows users to filter documents based on the time and topic of interest, and then examine the documents published in the selected time frame.
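The click on a band and time frame intersection can be sketched as a combined filter. The tuple layout, the function name, and the sample identifiers are hypothetical; the 30% default follows the example threshold given above.

```python
def select_at_intersection(docs, frame, topic, threshold=0.30):
    """Select documents published in `frame` whose probability on `topic`
    exceeds `threshold`.

    docs: iterable of (doc_id, time_frame, topic_distribution) tuples.
    """
    return [doc_id for doc_id, doc_frame, dist in docs
            if doc_frame == frame and dist[topic] > threshold]

# Hypothetical corpus: (id, year, distribution over two topics).
docs = [
    ("a", 2008, [0.9, 0.1]),
    ("b", 2008, [0.2, 0.8]),
    ("c", 2009, [0.7, 0.3]),
]
selected = select_at_intersection(docs, frame=2008, topic=0)
```

The resulting identifiers would then be highlighted in the document distribution view and the document scatter plot view.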
The tool 10 of the present invention allows users to explore and query a large document corpus 16 from multiple perspectives. Starting from the topic cloud view 20, a user can view a summary of the corpus 16 and identify topic keywords of interest. In the document distribution view 18, the user can locate a topic of interest and select the documents that focus on that topic by brushing the vertical axis. The user can then visually identify which other topics the selected set of documents relates to by examining its distribution in the document distribution view 18 and the document scatter plot view 24. In addition, the user can always examine the details of documents based on a selection. For example, if the user wants to identify interdisciplinary or multidisciplinary publications in the corpus 16, he or she can do so by selecting documents from the middle to the lower right corner of the document scatter plot view 24. Furthermore, if the user is interested in querying the corpus 16 by time, he or she can make a selection in the time view 22 by clicking on a time frame or on the intersection of a particular time frame and a topic. In summary, the tool 10 of the present invention supports interactive exploration of the text corpus 16 using multiple coordinated views. Each of the views is designed to address one of the four main questions.
To assess the effectiveness of the tool 10 of the present invention in answering the four target questions, the tool 10 was applied to explore and analyze two text corpora: scientific proposals awarded by the National Science Foundation (NSF) from 2006 to 2010, and publications in the IEEE VAST proceedings.
Case study 1. Analyzing scientific proposals. In this case study, we first describe the data we collected. We then characterize the target domain and present a set of tasks summarized from our conversations with NSF program managers. Finally, we show how the tool can assist expert users in solving these tasks.
Data collection and preparation. To examine whether the tool can support program managers in making funding decisions and managing the award portfolio, we first collected the proposals awarded from 2000 to 2010 by the Information and Intelligent Systems (IIS) division, part of the Computer and Information Science and Engineering (CISE) directorate. The collection consists of nearly 4,000 awards, with structured data on award number, directorate, division, program, program manager, principal investigator, and award date, as well as proposal abstracts in the form of unstructured text. We processed all collected abstracts, with each abstract forming a single document in the corpus. We removed a standard list of stop words. This gave us a vocabulary of 334,447 words. We then extracted 30 topics from the corpus using the LDA model.
Domain characterization. The core of NSF's mission is to keep the United States at the leading edge of discovery by funding research in traditional academic areas (including identifying broader impacts) and by funding transformative and interdisciplinary research. To achieve the former, NSF program managers need to identify suitable reviewers and panelists to ensure the best possible peer review. To perform the latter effectively, program managers need to identify emerging areas and research topics in order to fund interdisciplinary and transformative research. In addition to making funding decisions, program managers also need to manage their award portfolios. Although program managers have done this well in the past, they need new methods to help them, because of the rapidly changing nature of science and the significant growth in the number of proposals submitted. To map these high-level tasks to actionable items, we designed three tasks related to funding decisions and the award portfolio. Task 1 focuses on grouping newly submitted proposals based on their topics. This task involves understanding the major topics of the text corpus and filtering subsets of documents based on their characteristics relative to the topics. Task 2 is identifying suitable reviewers for submitted proposals, which also involves knowing whether a submission relates to multiple topics so that the right experts can be gathered. Finally, task 3 focuses on the temporal aspect of the award portfolio, which involves discovering topic trends over time.
Expert evaluation. Because NSF program managers are extremely busy, we invited a former NSF program manager to carry out our expert evaluation. The participant had two years of experience working as a program manager at NSF. At the beginning of the evaluation, we spent 30 minutes demonstrating the design and function of each visualization. We then asked the participant to perform the following three tasks using the tool.
Task 1. Group 200 recently submitted proposals based on topic. Starting from the topic cloud view, the participant quickly browsed the extracted topics to get an overview of the recently submitted proposals. Because the participant had been responsible for proposals in the fields of robotics and computer vision, she quickly focused her attention on these two topics. When selecting the proposals focused on the robotics topic, the participant skimmed the titles in the detail view to verify their relevance. Although the participant confirmed that each selected proposal was relevant, she also noticed that the positions of the proposals in the document scatter plot view were scattered. Since proposals in the lower right more likely include two or more topics, the participant was interested in knowing which other topics these proposals related to. By further filtering for the proposals that looked more interdisciplinary in the document scatter plot view, the participant found that they related to other fields such as neuroscience and social communication. When relevant documents were selected in the document distribution view, the detail view was invoked to allow the program manager to view the PIs of previously awarded proposals.
Task 2. Identify suitable reviewers. To identify reviewers, the participant first wanted to group the proposals roughly. Based on the initial exploration, the participant concluded that there were roughly two groups of proposals: one group focused on the core of the robotics field, and another group drew on bodies of knowledge from other fields such as neuroscience and social communication. To identify reviewers for the two groups of proposals, the participant wanted to find PIs from previously awarded proposals. Examining the historical data, the program manager located the robotics topic in the document distribution view. She then brushed the upper range of the axis to select the proposals related to that topic. Finally, the participant turned to the detail view to see the PIs who had previously received awards in the robotics field. For the interdisciplinary proposals in group 2, the participant went through a similar process to identify other experts from other related fields (such as neuroscience) to serve on the review panel, ensuring the best possible peer review.
Task 3. Analyze the temporal trend of the award portfolio. At the portfolio level, the former program manager was interested in seeing the temporal trends of the fields she had been responsible for in recent years. By exploring the time view, the participant found that the trend of proposals awarded in the robotics field was stable, even though the overall number of proposals awarded between 2006 and 2009 was increasing. Unlike the stable trend in robotics, the number of proposals awarded on the topic of "using technology to help disabled people" increased year by year. The former program manager commented that this view was valuable to her because it enabled her to see funding trends across different topics that are otherwise difficult to discover.
In summary, the participant considered each view in the tool to be well designed with a clear purpose. She commented that the tool could play a facilitating role in the workflow of program managers. Specifically, she liked the fact that our tool can automatically suggest the more interdisciplinary proposals, because this is difficult to judge in the traditional way. She also liked the coordination between the views, which helped her quickly combine information from different aspects of the same corpus.
Case study 2. Analyzing the VAST proceedings. As the field of visual analytics matures, it is beneficial to look back at how the field has evolved. One way to address this question is to analyze the publications accepted by the most important venue in visual analytics. In this case study, we recruited four researchers to explore the papers published in the VAST conference/symposium since the field began in 2006. Since all users were familiar with the field of visual analytics, we intended to encourage free exploration, in contrast to the well-structured tasks above. After the evaluation, we classified the participants' findings into two groups: discovering causal relationships between the temporal evolution of topics and funding sources, and studying interesting subdomains within the field of visual analytics.
Data collection and preparation. We first collected all papers published in the VAST conference/symposium from 2006 to 2010. A total of 123 publications were collected. We then parsed each publication into fields including title, authors, year of publication, abstract, body, and references. We performed topic modeling on the entire body of each article (from introduction to conclusion), with each article forming one document in the corpus. Removing standard stop words left us with a vocabulary of 317,315 words. Based on our record of the different tracks of each VAST conference, we extracted 19 topics from the corpus.
User evaluation. Of the four researchers we recruited, two were senior researchers in the field of visual analytics, and the other two were doctoral students with visual analytics as their main research interest. In this evaluation, we gave all participants a high-level task and encouraged freer exploration. After introducing the tool, we asked each participant to identify the core topics in the field and how the field had evolved over the past 5 years. We roughly classify the usage patterns into two groups: identifying rising/declining topics, and using the system as a teaching tool.
Identifying rising/declining topics. After skimming all the topics in the topic cloud view, one senior researcher commented that the topics corresponded well to the paper tracks of the VAST conferences. When examining the temporal trend of each topic, the participant noticed several clearly rising and declining patterns. For example, the topic on news video analysis initially attracted a lot of attention, but the attention decreased rapidly year by year. He also noticed a similar trend for the topic on network monitoring and analysis. Relating this pattern to his own knowledge, the participant explained the trend as follows: when the field began, the areas of focus were guided by the Department of Homeland Security (DHS), which was the main funding source at the time. Next, the participant turned to the rising patterns, which indicate the topics that have attracted attention in recent years. Specifically, since 2008, both the topic on trend and uncertainty analysis and the topic on dimensional analysis and reduction have attracted more attention. Again relating the pattern to his own knowledge, the participant commented that this was likely a result of the Foundations of Data and Visual Analytics (FODAVA) program jointly introduced by NSF and DHS.
Understanding the field of visual analytics. Another senior researcher (who was teaching a visual analytics course at the time) commented that he could see the tool being useful for his course. Students could explore all VAST publications and identify papers related to topics of interest for course presentations. Similarly, another participant wanted to see what had been done on text analysis in the field of visual analytics. He first located the topic and then selected the publications that ranked high on that topic in the document distribution view. He skimmed the article titles in the detail view and verified that all of the selected papers matched his interest. He also noticed that some papers in the selection appeared to be related to other topics, such as entity extraction and database queries. After the study, he asked for a screen capture of the detail view so that he could look up the papers he had identified during the study.
In summary, the participants felt that the tool helped them explore the evolution of the field of visual analytics and identify publications, based on their own interests, for further investigation.
It will be appreciated by those skilled in the art that the various modules and processes of the present invention are implemented using processing devices such as computers. Such a processing device may include one or more general-purpose or special-purpose processors, such as microprocessors, digital signal processors, customized processors, and field programmable gate arrays (FPGAs), together with unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the methods and systems of the present invention. Alternatively, some or all functions may be implemented by a state machine that has no stored program instructions, or in one or more application-specific integrated circuits (ASICs), in which each function or some combinations of the functions are implemented as custom logic. Of course, a combination of the above approaches can be used. Furthermore, some example embodiments may be implemented via a non-transitory computer-readable storage medium having computer-readable code stored thereon for programming a computer, server, appliance, device, etc., each of which may include a processor to perform the functions described and claimed herein. Examples of such computer-readable storage media include, but are not limited to: a hard disk, an optical storage device, a magnetic storage device, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), flash memory, etc. When stored in the non-transitory computer-readable medium, the software can include instructions executable by a processor that, in response to such execution, cause the processor and/or any other circuitry to perform a set of operations, steps, methods, processes, algorithms, etc.
Again, the present invention enables companies, including their analysts, marketers, business leaders, information technologists, and C-level employees, to obtain actionable insights from any kind of text data. The technology allows people to strengthen their decision processes on a data-driven basis. It ingests text data and, through deep computational and statistical algorithms, identifies the themes, topics, and emerging issues in each data set. The results are displayed in an interactive visual form so that anyone in the company can analyze the data holistically or granularly. All types of text data can be analyzed: internal data (such as email, chat, surveys, call centers, and focus groups) or external data (such as social media, review websites, forums, and news websites). The technology can handle a large number of languages, ensuring that feedback from around the globe can be analyzed. In addition, highly customizable features are available for tuning the analysis results. Most companies are sitting on a treasure trove of unstructured text data but have almost no ability to mine it for information.
Generally, the software of the present invention delivers deep-learning-based data analysis in a sophisticated visualization platform that reveals, analyzes, and infers actionable strategies across a broad range of business decision areas. It connects call center audio, email, news, social media, chat, transaction data, customer feedback, and analysis in a manner that is advantageous for discovering connections in the data that affect sales, customer service, operations, and risk analysis stakeholders. Structured data is also utilized, including retail transactions, survey data, personal profiles, etc., as well as national and international industry-specific, government-specific, and product-specific data sources. The software is accessible from any browser device and incorporates predictive modeling, artificial intelligence, and statistical NLP to analyze any form of unstructured data. Visualizations are provided holistically and/or granularly. The overall system 40 is shown schematically in Fig. 7. System 40 uses a high-throughput multilingual API for information tagging using sophisticated term extraction, entity indicator extraction, geospatial indicator extraction, temporal indicator extraction, and opinion sentiment analysis. System 40 also uses data-driven semantic machine learning and clustering, with automatic term association, statistical topic summarization, influencer interference, context-aware content ranking, content network association, and product-centric analysis.
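The information-tagging step described above can be illustrated with a minimal sketch. It is not the API of system 40: the function name `tag_document`, the tiny word lists, and the ISO date pattern are all assumptions made for illustration, and simple counting stands in for the trained term, entity, and sentiment models a real system would use.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lexicons; a real system would use trained models.
POSITIVE = {"great", "good", "love", "helpful", "fast"}
NEGATIVE = {"bad", "slow", "broken", "hate", "rude"}
STOPWORDS = {"the", "a", "an", "is", "was", "and", "but", "on", "in",
             "to", "of", "it", "my", "i"}

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # temporal indicators (ISO dates only)
WORD_PATTERN = re.compile(r"[a-z']+")


def tag_document(text, top_n=3):
    """Return simple tags for one document: top terms, dates, sentiment score."""
    words = WORD_PATTERN.findall(text.lower())
    # Term extraction: most frequent non-stopword tokens.
    term_counts = Counter(w for w in words if w not in STOPWORDS)
    # Opinion sentiment: positive hits minus negative hits.
    sentiment = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {
        "terms": [term for term, _ in term_counts.most_common(top_n)],
        "dates": DATE_PATTERN.findall(text),
        "sentiment": sentiment,
    }


if __name__ == "__main__":
    print(tag_document("The delivery on 2016-05-11 was slow and the support was rude, "
                       "but the app is great."))
```

Each tag is computed independently, so further indicators (for example geospatial ones) could be added as extra keys without changing the existing ones.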
Referring now specifically to Figs. 8 and 9, in an example embodiment, the present invention provides an enhanced information platform 45 that helps companies find the shortest path from data to revenue. It brings fragmented data silos together, creates a unified visual analysis layer on top, and enables users from multiple business functions to extract valuable insights effectively and collaboratively. Platform 45 sits securely on top of an organization's data lake and is compatible with multiple levels of data infrastructure. Through deep computational and statistical algorithms, it automatically ingests unstructured data (for example, email and call logs) and structured data (for example, sales, budget, and finance). It processes large numbers of feedback points and data points, and identifies the themes, topics, and emerging issues in the organization in real time. It helps to dynamically correlate customer experience trends with data from across the company. Platform 45 is fully interactive and easy to use. Anyone in the organization, from front-line employees, analysts, and salespeople to business leaders and C-level employees, can interact with the data holistically or granularly, customize their own dashboards, and share findings with others. In addition to the back-end data analysis engine, platform 45 is supported by a fully enhanced user interface (UI) experience. The present invention provides users with customizable, pixel-perfect visual dashboards, which makes presenting a user's analysis work much easier and more controllable. The rich interactivity in the exploration layer allows users to begin analyzing details quickly while keeping the surrounding contextual information.
The flexible data analysis environment of the present invention ensures that users never lose touch with the overall picture of the data while drilling down into details. This goes beyond a handful of visualizations; the user experience extends to a variety of useful data analyses and visualizations. Annotating and collaborating on analysis results is easier than ever. The present invention completely changes the way people can discover, share, and collaborate on analysis tasks. Users can annotate their findings and share them with colleagues, supporting collaboration within and across data analysis groups. In summary, the present invention strengthens decision-making by providing a true planning environment for data analysis.
Fig. 10 is a schematic diagram showing another example embodiment of the unstructured data analysis system 50 of the present invention. In general, data closely related to a business, such as customer experience data 52, telephone data 54, email data 56, social media data 58, and other data 60, is aggregated in a data repository 62 and, together with external data sources 64 such as internet data and government data, is pulled into an unstructured data parser 66. The unstructured data parser 66 resides, for example, on a network server and can be accessed via a browser. As described in detail above, the unstructured data parser 66 applies predictive modeling, artificial intelligence, and statistical NLP to the data in order to reveal, analyze, infer, and visualize actionable information. Advantageously, the actionable information can be viewed by various business units 68, stakeholders, or other users, all of whom can add to or otherwise modify the visualizations and share results via a common interactive user interface 70.
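The data flow of Fig. 10 (sources aggregated in a repository, then analyzed into summary information) might be sketched as follows. The class and function names (`Record`, `DataRepository`, `summarize`) are hypothetical, and plain word counting stands in for the predictive modeling and statistical NLP that the description attributes to the parser 66.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str  # e.g. "email", "social_media", "government" (illustrative labels)
    text: str


class DataRepository:
    """Stand-in for the data repository: aggregates records from many sources."""

    def __init__(self):
        self._records = []

    def ingest(self, source, texts):
        self._records.extend(Record(source, t) for t in texts)

    def records(self):
        return list(self._records)


def summarize(records, top_n=2):
    """Produce summary information: documents per source and most frequent words."""
    per_source = Counter(r.source for r in records)
    # Words of four or more characters only, as a crude stopword filter.
    words = Counter(w for r in records for w in r.text.lower().split() if len(w) > 3)
    return {
        "documents_per_source": dict(per_source),
        "top_words": [w for w, _ in words.most_common(top_n)],
    }


if __name__ == "__main__":
    repo = DataRepository()
    repo.ingest("email", ["Refund delayed again", "Refund request pending"])
    repo.ingest("social_media", ["refund never arrived"])
    print(summarize(repo.records()))
```

Keeping ingestion separate from analysis mirrors the figure: new internal or external sources only touch `ingest`, while the summary shown to users comes from one place.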
Fig. 11 is a schematic diagram showing an example embodiment of the presentation layer 80 of the unstructured data analysis system 50 (Fig. 8) of the present invention. In general, the presentation layer 80 allows various summary information about the unstructured data and/or the results to be displayed. For example, the presentation layer 80 is shown displaying customer experience data 82, telephone data 84, and sales data 86.
Fig. 12 is a schematic diagram showing an example embodiment of the exploration layer 90 of the unstructured data analysis system 50 (Fig. 8) of the present invention. In general, the exploration layer 90 allows various summary information about the unstructured data and/or the results to be displayed. The exploration layer 90 also allows a time granularity to be selected and further details to be shown. This drill-down also correspondingly updates the other visualizations, including the presentation layer 80. For example, a snapshot 94 is shown as selected from customer experience data 92.
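The time-granularity selection and drill-down behavior of the exploration layer can be sketched in a few lines. This is an illustrative assumption about how such a view might aggregate data, not the patented implementation; `bucket_counts`, `drill_down`, and the three granularity names are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical granularity names mapped to strftime bucket formats.
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}


def bucket_counts(dates, granularity):
    """Count items per time bucket at the chosen granularity."""
    fmt = FORMATS[granularity]
    return dict(Counter(d.strftime(fmt) for d in dates))


def drill_down(dates, granularity, bucket):
    """Keep only the items that fall inside one selected bucket."""
    fmt = FORMATS[granularity]
    return [d for d in dates if d.strftime(fmt) == bucket]


if __name__ == "__main__":
    feedback_dates = [date(2016, 5, 11), date(2016, 5, 12), date(2016, 6, 28)]
    print(bucket_counts(feedback_dates, "month"))
    selected = drill_down(feedback_dates, "month", "2016-05")
    print(bucket_counts(selected, "day"))
```

Because the drill-down returns an ordinary list, the same filtered selection can be passed to every other view, which is one simple way to keep linked visualizations in step.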
Fig. 13 is a schematic diagram showing an example embodiment of the annotation layer 100 of the unstructured data analysis system 50 (Fig. 8) of the present invention. The annotation layer 100 is configured to display various results, such as customer experience data 102, telephone data 104, email 106, social media data 108, other data 110, etc., and to receive user comments 112, which can be accessed by all users or by selected users via a shared user interface 114.
Although the present invention is illustrated and described herein with reference to preferred embodiments and specific examples thereof, it will be readily apparent to those skilled in the art that other embodiments and examples may perform similar functions and/or achieve similar results. All such equivalent embodiments and examples are within the spirit and scope of the present invention and are intended to be covered by the appended claims.
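One way to picture comments that are visible to all users or only to selected users is the sketch below. The names `Annotation`, `AnnotationLayer`, and `visible_to`, and the convention that an empty recipient set means "everyone", are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    author: str
    target: str    # hypothetical identifier of the annotated result
    comment: str
    shared_with: frozenset = frozenset()  # empty set: visible to all users


class AnnotationLayer:
    """Collects user comments and answers who may see which comment."""

    def __init__(self):
        self._annotations = []

    def add(self, author, target, comment, shared_with=None):
        note = Annotation(author, target, comment, frozenset(shared_with or ()))
        self._annotations.append(note)
        return note

    def visible_to(self, user):
        # A note is visible if it is public, written by the user, or shared with them.
        return [a for a in self._annotations
                if not a.shared_with or user == a.author or user in a.shared_with]


if __name__ == "__main__":
    layer = AnnotationLayer()
    layer.add("ana", "customer_experience", "Spike in complaints in May")
    layer.add("ben", "email", "Check refund threads", shared_with={"ana"})
    print([a.comment for a in layer.visible_to("cy")])
```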

Claims (18)

1. An unstructured data analysis system, comprising:
an unstructured data parser resident on a server and accessible via a browser, the unstructured data parser being operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users;
wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer.
2. The system according to claim 1, wherein the unstructured data includes one or more of the following: customer experience data, telephone data, email data, social media data, and transaction data.
3. The system according to claim 1, wherein the unstructured data parser is further operable to receive external data from one or more remote sources.
4. The system according to claim 3, wherein the external data includes one or more of the following: internet data, government data, and business data.
5. The system according to claim 1, wherein the one or more analysis tools applied to the unstructured data include one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining.
6. The system according to claim 1, wherein the presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information.
7. The system according to claim 1, wherein the exploration layer allows the one or more users to modify the granularity of the summary information, thereby changing the granularity of the presentation layer.
8. The system according to claim 1, wherein the one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer.
9. The system according to claim 1, wherein the summary information is also displayed to the one or more users in a combined layer.
10. An unstructured data analysis method, comprising:
providing an unstructured data parser resident on a server and accessible via a browser, the unstructured data parser being operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users;
wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer.
11. The method according to claim 10, wherein the unstructured data includes one or more of the following: customer experience data, telephone data, email data, social media data, and transaction data.
12. The method according to claim 10, wherein the unstructured data parser is further operable to receive external data from one or more remote sources.
13. The method according to claim 12, wherein the external data includes one or more of the following: internet data, government data, and business data.
14. The method according to claim 10, wherein the one or more analysis tools applied to the unstructured data include one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining.
15. The method according to claim 10, wherein the presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information.
16. The method according to claim 10, wherein the exploration layer allows the one or more users to modify the granularity of the summary information, thereby changing the granularity of the presentation layer.
17. The method according to claim 10, wherein the one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer.
18. The method according to claim 10, wherein the summary information is also displayed to the one or more users in a combined layer.
CN201610496280.9A 2015-05-11 2016-06-28 Unstructured data analysis system and method Active CN107368506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011265115.5A CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201562159662P 2015-05-11 2015-05-11
US15/151,572 2016-05-11
US15/151,572 US10452698B2 (en) 2015-05-11 2016-05-11 Unstructured data analytics systems and methods

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202011265115.5A Division CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Publications (2)

Publication Number Publication Date
CN107368506A true CN107368506A (en) 2017-11-21
CN107368506B CN107368506B (en) 2020-11-06

Family

ID=60312579

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201610496280.9A Active CN107368506B (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method
CN202011265115.5A Pending CN112732878A (en) 2015-05-11 2016-06-28 Unstructured data analysis system and method

Country Status (1)

Country Link
CN (2) CN107368506B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308498A (en) * 2008-07-03 2008-11-19 上海交通大学 Text collection visualized system
CN102750355A (en) * 2012-06-11 2012-10-24 清华大学 Visual management method for non-structured data management system
CN102929894A (en) * 2011-08-12 2013-02-13 中国人民解放军总参谋部第五十七研究所 Online clustering visualization method of text
US20140040275A1 (en) * 2010-02-09 2014-02-06 Siemens Corporation Semantic search tool for document tagging, indexing and search
US9135242B1 (en) * 2011-10-10 2015-09-15 The University Of North Carolina At Charlotte Methods and systems for the analysis of large text corpora

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004534324A (en) * 2001-07-04 2004-11-11 コギズム・インターメディア・アーゲー Extensible interactive document retrieval system with index
US7849048B2 (en) * 2005-07-05 2010-12-07 Clarabridge, Inc. System and method of making unstructured data available to structured data analysis tools
KR101481253B1 (en) * 2013-03-14 2015-01-13 한국과학기술원 Method and system for providing summery of text document using word cloud
CN103473369A (en) * 2013-09-27 2013-12-25 清华大学 Semantic-based information acquisition method and semantic-based information acquisition system
US20160071212A1 (en) * 2014-09-09 2016-03-10 Perry H. Beaumont Structured and unstructured data processing method to create and implement investment strategies

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170657A (en) * 2018-01-04 2018-06-15 陆丽娜 A kind of natural language long text generation method
CN109299286A (en) * 2018-09-28 2019-02-01 北京赛博贝斯数据科技有限责任公司 The Knowledge Discovery Method and system of unstructured data
CN110413782A (en) * 2019-07-23 2019-11-05 杭州城市大数据运营有限公司 A kind of table automatic theme classification method, device, computer equipment and storage medium
CN110413782B (en) * 2019-07-23 2022-08-26 杭州城市大数据运营有限公司 Automatic table theme classification method and device, computer equipment and storage medium
CN112883186A (en) * 2019-11-29 2021-06-01 智慧芽信息科技(苏州)有限公司 Method, system, equipment and storage medium for generating information map
CN112883186B (en) * 2019-11-29 2024-04-12 智慧芽信息科技(苏州)有限公司 Method, system, equipment and storage medium for generating information map

Also Published As

Publication number Publication date
CN112732878A (en) 2021-04-30
CN107368506B (en) 2020-11-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant