CN107368506A - Unstructured data analysis system and method - Google Patents
- Publication number: CN107368506A
- Application number: CN201610496280.9A
- Authority: CN (China)
- Prior art keywords: data, topic, document, unstructured data, layer
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An unstructured data analysis system, including: an unstructured data analysis algorithm that resides on a server and is accessible via a browser, the unstructured data analysis algorithm being operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in a presentation layer, an exploration layer, and an annotation layer. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer.
Description
Cross-reference to related applications
The present patent application/patent claims the benefit of priority of co-pending U.S. Provisional Patent Application No. 62/159,662, filed on May 11, 2015, and entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS INCLUDING A VISUALIZATION INTERFACE," and U.S. Provisional Patent Application No. 62/159,683, filed on May 11, 2015, and entitled "UNSTRUCTURED DATA ANALYTICS SYSTEMS AND METHODS INCLUDING NATURAL LANGUAGE PROCESSING AND STATISTICS FUNCTIONS," the contents of both of which are incorporated in full by reference herein.
Technical field
The present invention relates generally to methods and systems for analyzing large text corpora and unstructured data. More specifically, the present invention relates to methods and systems for analyzing large text corpora and unstructured data using visual analytics and topic modeling, a visualization interface, and natural language processing and statistical functions.
Background art
Managing large and growing collections of text information and unstructured data is a challenging problem. Repositories of knowledge-rich text information have become widespread, leaving large volumes of data to be organized, mined, and analyzed. As the number of documents grows, learning the meaning of a text corpus becomes cognitively costly and time-consuming.
For researchers in the field of natural language processing (NLP), the challenge of automatically summarizing large text corpora has long been a principal concern. To summarize text corpora, researchers have developed techniques such as latent semantic analysis (LSA) for extracting and representing the contextual-usage meaning of words. LSA produces a concept space that can be used for document classification and clustering. More recently, probabilistic topic models have emerged as a powerful new technique for finding semantically meaningful topics in unstructured text collections. To further provide visual summaries of text corpora, researchers from the knowledge discovery and visualization communities have developed tools and techniques, based on both LSA and probabilistic topic models, that support the visualization and exploration of large text corpora.
Although probabilistic topic models have demonstrated their advantages in terms of interpretability and semantic association, few interactive visualization systems use such models to support the exploration and analysis of text corpora. Exemplar-based visualization and probabilistic latent semantic visualization project documents onto a two-dimensional (2D) plot while estimating the topics of the text corpus. Although the document clusters conform well to the selected labels, there is almost no opportunity for interactive exploration and analysis of the document clusters. One exception is the time-based visualization system TIARA, which uses the ThemeRiver metaphor to visually summarize a text collection on the basis of its topic content. Through analysis with the TIARA system, users can answer questions such as: What are the main topics in the document corpus? And how do the topics evolve over time?
However, when analyzing large text corpora, there are many other real-world questions that current text-analysis visualization systems have difficulty answering. In particular, questions about the relationship between topics and documents are hard to answer with existing tools. Such questions include: What are the characteristics of the documents based on their topic distributions? And which documents address multiple topics at once (and which topics are they)? In the field of science policy, for example, a document with multiple topics can indicate an interdisciplinary publication (i.e., one covering more than one body of knowledge). Similarly, in the context of social media analysis, a document with multiple topics can represent a unique news article that relates to different hot topics.
To overcome the shortcomings associated with existing methods and systems, and to help users make sense of large text corpora more effectively, the present invention provides a novel visual analytics system that integrates a state-of-the-art probabilistic topic model, latent Dirichlet allocation (LDA), with interactive visualization. To describe a document corpus, the methods and systems of the present invention first extract a set of semantically meaningful topics using LDA. Unlike most traditional clustering techniques, which assign each document to a specific cluster, the LDA model accounts for the different topical aspects of each individual document. This permits efficient full-text analysis of larger documents that may contain multiple topics. To highlight this property of the model, the methods and systems of the present invention use the parallel coordinate metaphor to present the probability distribution of documents across topics. This presentation allows users to discover single-topic and multi-topic documents, as well as the relative importance of each topic to a document of interest. In addition, because most text corpora are inherently temporal, the systems and methods of the present invention also depict the evolution of topics over time.
In addition, the present invention enables everyone in a company, including analysts, marketers, business leaders, information technologists, and C-level employees, to obtain actionable insights from any type of text data. The technology allows people to strengthen their decision-making processes on a data-driven basis. The technology ingests text data and, through deep computational and statistical algorithms, identifies the themes, topics, and emerging issues in each data set. The results are displayed in interactive visual form so that anyone in the company can analyze the data holistically or in fine detail. All types of text data can be analyzed, whether internal data (such as e-mail, chat, surveys, call centers, and focus groups) or external data (such as social media, review sites, forums, and news sites). The technology can handle a large number of languages, ensuring that feedback from around the globe can be analyzed. Highly customizable features that allow the analysis results to be fine-tuned are also provided. Most companies are sitting on a treasure trove of unstructured text data but have almost no ability to mine that data for information.
Summary of the invention
Again, in various example embodiments, the methods and systems of the present invention tightly integrate interactive visualization with a state-of-the-art probabilistic topic model. Specifically, to address the questions raised above, the methods and systems of the present invention use the parallel coordinate (PC) metaphor to present the probability distribution of documents across topics. This carefully chosen presentation shows not only how many topics a document relates to, but also the importance of each topic to the document. In addition, the methods and systems of the present invention provide a rich set of interactions that help users automatically partition the document collection based on the number of topics in each document. Beyond showing the relationship between topics and documents, the methods and systems of the present invention also support other tasks that are essential to understanding a document collection, such as summarizing the main topics of the collection and showing how topics evolve over time.
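By way of illustration only, partitioning a document collection by the number of topics per document might be sketched as follows. The function names, the sample data, and the probability threshold of 0.2 are assumptions introduced here; the document does not state how a topic is judged to be present in a document.

```python
def count_salient_topics(doc_topic, threshold=0.2):
    """Number of topics whose probability in the document meets the threshold.

    The threshold is a hypothetical cut-off chosen for this sketch.
    """
    return sum(1 for p in doc_topic if p >= threshold)


def partition_by_topic_count(doc_topics, threshold=0.2):
    """Group document ids by how many salient topics each document has."""
    groups = {}
    for doc_id, dist in doc_topics.items():
        k = count_salient_topics(dist, threshold)
        groups.setdefault(k, []).append(doc_id)
    return groups


# Hypothetical per-document topic distributions (each sums to 1).
doc_topics = {
    "doc_a": [0.90, 0.05, 0.05],  # dominated by one topic
    "doc_b": [0.50, 0.45, 0.05],  # two strong topics
    "doc_c": [0.34, 0.33, 0.33],  # spread over three topics
}
groups = partition_by_topic_count(doc_topics)
```

Here `groups` maps a topic count to the documents with that many salient topics, which is one simple way to separate single-topic from multi-topic documents.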
The methods and systems of the present invention can effectively address a set of questions that arise when analyzing large text corpora, including: What are the main topics that capture the document collection? What are the characteristics of the documents based on their topic distributions? Which documents address multiple topics at once? And how do the topics of interest evolve over time? To help users answer these questions, the methods and systems of the present invention first extract a set of semantically meaningful topics using the LDA model. To support visual exploration of a document collection based on the topic model, the methods and systems of the present invention use multiple coordinated views to highlight both the topical and the temporal characteristics of the document corpus. One novel contribution of the methods and systems of the present invention is the depiction of documents by their probability distributions over topics, together with support for the interactive identification and more detailed examination of single-topic and multi-topic documents.
In one example embodiment, the present invention provides a computerized method for text data analysis, including: receiving text data to be analyzed from one or more memories at one or more processors; using the one or more processors, formatting the text data for subsequent analysis; using the one or more processors, applying a probabilistic topic model to the text data to extract a set of semantically meaningful topics that collectively describe all or part of the text data; using a keyword weighting module executed on the one or more processors, generating a topic cloud view that represents the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; using a topic ordering module executed on the one or more processors, generating a document distribution view that represents the distribution of all or part of the text data across a plurality of topics; using a document entropy calculation module executed on the one or more processors, generating a document scatter plot view that represents how many topics all or part of the text data may belong to; using a temporal topic trend calculation module executed on the one or more processors, generating a time view that represents how the occurrence of topics changes over time for all or part of the text data; and, in analyzing all or part of the text data, displaying to a user one or more of the topic cloud view, the document distribution view, the document scatter plot view, and the time view. The text data includes one or more of the following: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the Internet. The probabilistic topic model produces a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of the topics. Optionally, the keywords are ordered to indicate their importance to a given topic and their relationship to one another. Optionally, keywords are highlighted to indicate their importance to a plurality of topics. The topics are ordered to represent their relationships. Various other example functions are also provided herein.
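As a rough illustration of how the occurrence of topics over time could be computed for the time view, the sketch below sums each topic's probability mass per time bin. This aggregation rule, the function name `topic_trend`, and the sample data are assumptions made for illustration; the document does not specify the calculation.

```python
from collections import defaultdict


def topic_trend(documents, n_topics):
    """Sum each topic's probability mass per time bin.

    documents: iterable of (time_bin, doc_topic) pairs, where doc_topic is
    a list of n_topics probabilities for one document.
    Returns {time_bin: [mass of topic 0, mass of topic 1, ...]} sorted by bin.
    """
    trend = defaultdict(lambda: [0.0] * n_topics)
    for time_bin, doc_topic in documents:
        for t, p in enumerate(doc_topic):
            trend[time_bin][t] += p
    return dict(sorted(trend.items()))


# Hypothetical documents tagged with a publication year.
docs = [(2009, [1.0, 0.0]), (2009, [0.5, 0.5]), (2010, [0.0, 1.0])]
trend = topic_trend(docs, 2)
```

In this toy data the first topic dominates 2009 and the second dominates 2010, which is the kind of shift a time view is meant to surface.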
In another example embodiment, the present invention provides a computerized system for text data analysis, including: one or more memories and one or more processors, the memories being operable to store text data to be analyzed and the processors being operable to receive the text data to be analyzed; an algorithm executed on the one or more processors and operable to format the text data for subsequent analysis; an algorithm executed on the one or more processors and operable to apply a probabilistic topic model to the text data to extract a set of semantically meaningful topics that collectively describe all or part of the text data; a keyword weighting module executed on the one or more processors and operable to generate a topic cloud view that represents the topics as tag clouds, wherein each tag cloud is associated with a plurality of keywords; a topic ordering module executed on the one or more processors and operable to generate a document distribution view that represents the distribution of all or part of the text data across a plurality of topics; a document entropy calculation module executed on the one or more processors and operable to generate a document scatter plot view that represents how many topics all or part of the text data may belong to; a temporal topic trend calculation module executed on the one or more processors and operable to generate a time view that represents how the occurrence of topics changes over time for all or part of the text data; and a display operable to display to a user, in analyzing all or part of the text data, one or more of the topic cloud view, the document distribution view, the document scatter plot view, and the time view. The text data includes one or more of the following: text data derived from a plurality of documents, text data derived from a plurality of files, text data derived from one or more data repositories, and text data derived from the Internet. The probabilistic topic model produces a set of latent topics and represents each topic as a multinomial distribution over a plurality of keywords. The text data is described as a probabilistic mixture of the topics. Optionally, the keywords are ordered to indicate their importance to a given topic and their relationship to one another. Optionally, keywords are highlighted to indicate their importance to a plurality of topics. The topics are ordered to represent their relationships. Various other example functions are also provided herein.
Again, the present invention enables everyone in a company, including analysts, marketers, business leaders, information technologists, and C-level employees, to obtain actionable insights from any type of text data. The technology allows people to strengthen their decision-making processes on a data-driven basis. The technology ingests text data and, through deep computational and statistical algorithms, identifies the themes, topics, and emerging issues in each data set. The results are displayed in interactive visual form so that anyone in the company can analyze the data holistically or in fine detail. All types of text data can be analyzed, whether internal data (such as e-mail, chat, surveys, call centers, and focus groups) or external data (such as social media, review sites, forums, and news sites). The technology can handle a large number of languages, ensuring that feedback from around the globe can be analyzed. Highly customizable features that allow the analysis results to be fine-tuned are also provided. Most companies are sitting on a treasure trove of unstructured text data but have almost no ability to mine that data for information.
In a further example embodiment, the present invention provides an unstructured data analysis system, including: an unstructured data analysis algorithm that resides on a server and is accessible via a browser, the unstructured data analysis algorithm being operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data includes one or more of the following: customer experience data, teledata, e-mail data, and social media data. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The external data includes one or more of the following: Internet data, government data, and business data. Applying one or more analysis tools to the unstructured data includes applying one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer. The summary information is also displayed to the one or more users in a combination layer.
In yet another example embodiment, the present invention provides an unstructured data analysis method, including: providing an unstructured data analysis algorithm that resides on a server and is accessible via a browser, the unstructured data analysis algorithm being operable to receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users; wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer. The unstructured data includes one or more of the following: customer experience data, teledata, e-mail data, and social media data. The unstructured data analysis algorithm is also operable to receive external data from one or more remote sources. The external data includes one or more of the following: Internet data, government data, and business data. Applying one or more analysis tools to the unstructured data includes applying one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining. The presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information. The exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer. The one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer. The summary information is also displayed to the one or more users in a combination layer.
Brief description of the drawings
The present invention is illustrated and described herein with reference to the various drawings, in which like reference numbers are used to denote like method steps/system components, as appropriate, and in which:
Fig. 1 is a schematic diagram showing one example embodiment of the visual text corpus analysis tool of the present invention;
Fig. 2 is an example display showing the topic cloud view of the visual text corpus analysis tool of the present invention;
Fig. 3 is an example display showing the document distribution view of the visual text corpus analysis tool of the present invention;
Fig. 4 is a series of charts showing document distributions over one topic, two topics, and more than two topics in accordance with the methods and systems of the present invention;
Fig. 5 is an example display showing the topic cloud view of the visual text corpus analysis tool of the present invention;
Fig. 6 is an example display showing the time view of the visual text corpus analysis tool of the present invention;
Fig. 7 is a schematic diagram showing one example embodiment of the unstructured data analysis system of the present invention;
Fig. 8 is a schematic diagram showing another example embodiment of the unstructured data analysis system of the present invention;
Fig. 9 is a schematic diagram showing a further example embodiment of the unstructured data analysis system of the present invention;
Fig. 10 is a schematic diagram showing yet another example embodiment of the unstructured data analysis system of the present invention;
Fig. 11 is a schematic diagram showing one example embodiment of the presentation layer of the unstructured data analysis system of the present invention;
Fig. 12 is a schematic diagram showing one example embodiment of the exploration layer of the unstructured data analysis system of the present invention; and
Fig. 13 is a schematic diagram showing one example embodiment of the annotation layer of the unstructured data analysis system of the present invention.
Detailed description
Two lines of work, namely text analysis models and text visualization techniques, were the main inspirations for the preliminary design of the present invention. These concepts were then refined and extended, as described in more detail below.
The first major advance in text processing was the vector space model (VSM). In this model, a text is represented as a vector in a high-dimensional space in which each dimension is associated with one unique term in the documents. A well-known example of a VSM is TF-IDF, which evaluates how important a word is to a document in a corpus. Although VSMs have proven their effectiveness empirically, they have numerous inherent shortcomings in capturing inter-document and intra-document statistical structure.
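For readers unfamiliar with TF-IDF, a minimal sketch follows. It assumes one common weighting variant (term frequency normalized by document length, inverse document frequency as log(N / df)); the document does not say which variant is meant, and the function name and sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    corpus: list of documents, each a list of lowercase tokens.
    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical three-document corpus.
docs = [
    ["topic", "model", "text"],
    ["topic", "cloud", "view"],
    ["topic", "time", "view"],
]
scores = tf_idf(docs)
```

A term that occurs in every document ("topic") receives weight zero, while a term unique to one document ("model") outweighs one shared by two documents ("view").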
To overcome the shortcomings of VSMs, researchers introduced LSA, a factor analysis that reduces the term-document matrix to a much lower-dimensional subspace that captures most of the variance in the corpus. Although LSA overcomes some of the shortcomings of VSMs, it has limitations of its own. The new feature space is difficult to interpret because each dimension is a linear combination of a set of words from the original space.
Recognizing the limitations of LSA, researchers proposed generative probabilistic models for document modeling. For example, researchers introduced a generative model that represents the content of words and documents with probabilistic topics rather than with a purely spatial representation. One distinct advantage of this representation is that each topic is individually interpretable, providing a probability distribution over words that picks out a coherent cluster of correlated terms. The LDA model assumes a latent structure consisting of a set of topics; each document is produced by choosing a distribution over topics and then generating each word at random from a topic selected using that distribution. As shown, for example, by analyses of scientific abstracts and newspaper archives, the extracted topics capture meaningful structure in otherwise unstructured data. On the cognitive side, the LDA model performs well in predicting word association and the effects of semantic association and ambiguity in a variety of language-processing and memory tasks.
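The generative process described above can be sketched in a few lines. This is a simplified illustration, not the model as implemented in any particular toolkit: the document's topic mixture is passed in directly rather than drawn from a Dirichlet prior, and the function name, topics, and vocabulary are hypothetical.

```python
import random


def generate_document(topic_word, doc_topic, length, rng):
    """Sample `length` words for one document from an LDA-style model.

    topic_word: list of {word: probability} dicts, one per topic.
    doc_topic:  list of topic probabilities for this document.
    rng:        a random.Random instance, for reproducibility.
    """
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(length):
        # Step 1: pick a topic from the document's distribution over topics.
        z = rng.choices(topic_ids, weights=doc_topic, k=1)[0]
        # Step 2: pick a word from that topic's distribution over words.
        vocab = list(topic_word[z])
        probs = [topic_word[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words


# Two hypothetical topics and a document that uses only the first one.
topic_word = [
    {"gene": 0.5, "cell": 0.5},
    {"court": 0.5, "law": 0.5},
]
doc = generate_document(topic_word, [1.0, 0.0], 20, random.Random(0))
```

Fitting LDA runs this process in reverse: given only the observed words, it infers the topic-word and document-topic distributions that most plausibly generated them.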
Because of the various advantages of the LDA model, the methods and systems of the present invention first use this model to extract a set of semantically meaningful topics from a given text corpus. The methods and systems of the present invention then present the probabilistic results in an intuitive manner so that users can easily consume the complex model when analyzing large text corpora.
Despite the advances in automated text-processing techniques, human intelligence still plays a key role in analyzing text corpora. A large number of visualization systems and techniques based on text-processing methods have therefore been developed to keep the user in the loop. For example, using VSMs, tools have been introduced to visualize e-mail content, with the aim of depicting relationships according to conversation history. The keywords in the visualization are produced using the TF-IDF algorithm.
Other tools allow users to visually explore text corpora through a social network metaphor based on the results of latent semantic analysis. Other visualization systems use multidimensional projection methods, such as principal component analysis (PCA) and/or multidimensional scaling (MDS), to visualize text corpora. These projection techniques are similar in spirit to LSA in that they represent texts as vectors with term frequencies as their features and then identify a lower-dimensional projection space. Visualization systems based on these projection techniques include IN-SPIRE. More recently, to visualize large collections of classified documents, others have proposed a two-stage framework for topology-based projection and visualization. However, unlike most traditional clustering techniques, which assign a document to a specific cluster, the methods and systems of the present invention account for the different topical aspects of each individual document.
Since topic models were first introduced, visualization systems have adopted them because of their advantages over previous text-processing techniques. Exemplar-based visualization and probabilistic latent semantic visualization tools project documents onto a static 2D plot while estimating the topics of the text corpus. Although the visualized clustering results are better than those obtained from multidimensional projection methods, they have some limitations. First, as the number of extracted topics increases, the document clusters in the 2D projection can no longer be separated on the basis of topic. In addition, these visualization tools leave almost no room for interactive exploration and analysis of the document clusters. More recently, TIARA was introduced: a time-based interactive visualization system that presents the topics extracted from a given text corpus in a time-sensitive manner. TIARA provides a good overview of the topics with respect to their evolution over time. However, the relationship between documents and topics is not made clear.
Therefore, in addition to depicting topic evolution over time, the methods and systems of the present invention also present the probability distribution of documents across the extracted topics. The methods and systems of the present invention thus provide an overview of document characteristics based on their topic distributions and allow users to identify documents that address multiple topics at once.
The methods and systems of the present invention support the exploration of document collections at multiple levels. At the overview level, the system helps users answer the following questions: What are the main topics of the document collection? And what are the characteristics of the documents in the collection? At the facet level, the system supports activities such as identifying the temporal trend of a specific topic and identifying documents related to multiple topics of interest. At the detail level, the system provides access to the detailed content of each individual document on demand. Building on a state-of-the-art topic model, the system uses multiple coordinated views, each of which addresses one of the above questions.
Referring now specifically to Fig. 1, in one example embodiment, the overall architecture of the visual text corpus analysis tool 10 of the present invention includes an offline text pre-processing module 12 and a topic modeling module 14. The text pre-processing module 12 is operable to place the text of the relevant documents 16 in proper condition for subsequent processing, exploration, and analysis. This text pre-processing can include, but is not limited to, the pre-processing of text from social media (e.g., Twitter posts and Facebook profiles), books (e.g., documents from the Gutenberg online book catalog), and other documents (e.g., e-mails, Word documents, etc.).
As described above, topic models have several advantages over traditional text-processing techniques. The visual text corpus analysis tool 10 of the present invention therefore uses a probabilistic topic model in the topic modeling module 14 to summarize the relevant documents 16. More specifically, LDA is first used to extract a set of semantically meaningful topics. LDA produces a set of latent topics, each represented as a multinomial distribution over keywords, and assumes that each document can be described as a probabilistic mixture of these topics. P(z) is the distribution over topics z in a particular document. Assume that the text collection 16 contains D documents and T topics. Determining the topics is an iterative process using the visual text corpus analysis tool 10. The tool 10 allows users to interactively specify the number of topics they consider necessary in their analysis domain. Users are allowed to modify the topic modeling module 14 based on their findings and investigations from the visual interface, so that they can change the number of topics and/or the number of iterations that define the process. The visual text corpus analysis tool 10 also allows users to add, remove, and merge topics in the topic modeling module 14.
Accordingly, the document collection 16 is first preprocessed to remove stop words and the like. Then, the Stanford Topic Modeling Toolbox (STMT) or a similar tool is used to extract a set of topics from the document collection 16. The extracted topics and the probabilistic document distributions serve as input to the other visualizations.
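By way of non-limiting illustration, the stop-word removal step described above may be sketched as follows. This is a minimal sketch under stated assumptions: the function names `preprocess` and `build_vocabulary`, the tiny `STOP_WORDS` list and the sample documents are hypothetical and are not taken from the present description, and the topic extraction itself (for example with STMT) is not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

def build_vocabulary(documents):
    """Map each distinct remaining token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in preprocess(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

# Illustrative documents only.
docs = ["The robot is in the lab.", "Analysis of the robot vision data"]
tokenized = [preprocess(d) for d in docs]
vocabulary = build_vocabulary(docs)
```

The token lists and vocabulary produced in this way would then be passed to whichever topic modeling tool is used.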
The visual design of the tool 10 of the present invention includes four coordinated overviews, which can be displayed and operated, individually or in combination, on a suitable graphical user interface (GUI): (1) a document distribution view 18 that shows the probability distribution of documents across topics; (2) a topic cloud 20 that presents the content of the extracted topics; (3) a temporal view 22 that highlights the temporal evolution of topics; and (4) a document scatterplot view 24 that facilitates interactive selection of single-topic versus multi-topic documents. Each of the four overviews serves a different purpose, and they are coordinated through a rich set of user interactions. In addition, when any document is selected, a detail view presents the text content of that document on demand.
To help users quickly grasp the gist of the document collection, the main topics are presented as a tag cloud in the topic cloud view 20. In the topic cloud view 20, each row shows one topic, including, for example, multiple keywords related to that topic. Because each topic is modeled as a multinomial distribution over keywords, the weight of each keyword indicates its importance to the topic. To encode this information in the tag cloud, keywords are arranged from left to right, with the most important keywords placed first. In addition, because a keyword can occur in multiple topics, the display size or weight of each keyword reflects its occurrence across all topics. However, it will be apparent to those skilled in the art that other configurations can be used. An example of the topic cloud view 20 is provided in Fig. 2. To help users understand the main topics in the document collection 16, the topics are presented in a sequence such that semantically similar topics are placed close together, providing continuity when the topics are browsed in order. Because the LDA model does not model the relationships between topics, the topics are reordered by defining a similarity measure. The visual text corpus analysis tool 10 uses the Hellinger distance function as the similarity measure representing how close topics are. The visual text corpus analysis tool 10 visualizes the similarity measure to give users an understanding of the semantic level of the topic distribution, and helps reduce their cognitive overload by clustering the topic space.
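By way of non-limiting illustration, a Hellinger distance between two topics may be computed as in the following minimal sketch. It assumes that each topic is given as a list of keyword probabilities over the same keyword order; the function name `hellinger` and the sample distributions are hypothetical and are not taken from the present description.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same keywords.

    The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)

# Illustrative keyword distributions for three topics.
topic_a = [0.5, 0.5, 0.0]
topic_b = [0.5, 0.5, 0.0]
topic_c = [0.0, 0.0, 1.0]

same = hellinger(topic_a, topic_b)      # identical topics
disjoint = hellinger(topic_a, topic_c)  # topics sharing no keyword mass
```

Smaller distances indicate topics that would be placed closer together in the ordered topic cloud.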
The topic cloud view 20 also provides users with a set of interactions to help them understand the topics quickly. For example, hovering over a particular keyword highlights every other occurrence of that keyword in the tag cloud. Users can also search for particular keywords of interest. In addition, the topic cloud view 20 works closely with all other views to quickly provide information about a particular topic on demand.
The topic cloud view 20 is produced in part by an online keyword weighting module 26, which is operable to aggregate the results of the topic modeling module. It ranks the words of a given topic by the probability of each word in that topic, so that more probable words are placed toward the top of the ranking. The probability values are those computed by the topic modeling module 14. For example, the size of a word in the topic cloud view is determined by the frequency of occurrence of that word in the entire text corpus, normalized by the maximum word frequency. For example, the higher the frequency, the larger the word. For example, by default the tool 10 shows the 50 most probable words of each topic. Users can interactively modify the number of words.
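By way of non-limiting illustration, the ranking and sizing of keywords described above may be sketched as follows. The function names `top_keywords` and `display_sizes`, the font-size bounds `min_size` and `max_size`, and the sample values are hypothetical assumptions; only the ranking by in-topic probability, the default of 50 words and the normalization by the maximum word frequency come from the present description.

```python
def top_keywords(topic_word_probs, n=50):
    """Return the n most probable words of one topic, most probable first."""
    ranked = sorted(topic_word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

def display_sizes(words, corpus_freq, min_size=10.0, max_size=40.0):
    """Scale each word's size by its corpus frequency divided by the maximum frequency."""
    max_freq = max(corpus_freq.values())
    return {
        w: min_size + (max_size - min_size) * corpus_freq[w] / max_freq
        for w in words
    }

# Illustrative values: word probabilities within one topic and corpus-wide counts.
topic = {"robot": 0.4, "vision": 0.3, "sensor": 0.1}
freq = {"robot": 100, "vision": 50, "sensor": 200}

row = top_keywords(topic, n=2)     # left-to-right order within the topic row
sizes = display_sizes(row, freq)   # display size of each shown word
```

Note that the order of words in a row follows the in-topic probability, while the size follows the corpus-wide frequency, so the first word in a row is not necessarily the largest.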
To provide an overview of documents as mixtures of topics, the tool 10 of the present invention highlights the distribution of each document across all extracted topics. The chosen representation converts the document probability distribution into a signal-like pattern representing each document. More specifically, a parallel coordinates metaphor is used, in which each axis represents a topic and each line represents a document in the collection 16. This is illustrated in Fig. 3. In this arrangement, all variables (i.e. topics) are evenly spaced and share the same value range from 0 to 1. Therefore, when viewing the document distribution view 18, a document need not be understood from its value on each individual axis, but can be understood from its overall pattern across all axes. However, it will be apparent to those skilled in the art that other configurations can be used.
One limitation of LDA is that it does not directly model the correlations between topic occurrences, although in most text corpora such correlations can naturally be expected. The tool 10 of the present invention overcomes this limitation by using visualization to make the correlations between topics more salient. Conveniently, one characteristic of the parallel coordinates visualization is that associations between adjacent axes are easier to find. Therefore, the topics can be ordered so that semantically similar topics are adjacent to each other, making the associations between similar topics visually salient. The topic similarity is defined in terms of the Euclidean distance between two topics over all documents 16:

$$\mathrm{dist}(z_i, z_j) = \sqrt{\sum_{k=1}^{D} \bigl( P(d_k \mid z = i) - P(d_k \mid z = j) \bigr)^2}$$

where d_k is one of the D documents in the entire collection 16, and P(d_k) is the probability distribution of the k-th document over all topics. Thus, P(d_k | z = i) denotes the probability of topic i in generating document k. When the selected topics are drawn as axes in the interface, the ordering starts with the topic with the most concentrated probability, and the topic most similar to the current topic is then found based on the distance between topics. Fig. 3 illustrates the visualization of documents across topics after the topics have been reordered. The relationship between any two most similar topics (i.e. on adjacent axes) becomes visually identifiable.
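By way of non-limiting illustration, the ordering of topic axes by Euclidean distance described above may be sketched as a greedy nearest-neighbor procedure. The sketch assumes a document-topic matrix in which row k holds the topic distribution of document k. It further assumes, as one possible reading, that the topic "with the most concentrated probability" is the topic with the largest total probability over all documents. The function names and sample data are hypothetical.

```python
import math

def topic_distance(doc_topic, i, j):
    """Euclidean distance between topics i and j over all documents."""
    return math.sqrt(sum((row[i] - row[j]) ** 2 for row in doc_topic))

def order_topics(doc_topic):
    """Greedy axis ordering.

    Start from the topic with the largest total probability (an assumed reading
    of 'most concentrated'), then repeatedly append the not-yet-placed topic
    closest to the most recently placed one."""
    n_topics = len(doc_topic[0])
    totals = [sum(row[t] for row in doc_topic) for t in range(n_topics)]
    order = [max(range(n_topics), key=lambda t: totals[t])]
    remaining = set(range(n_topics)) - set(order)
    while remaining:
        nearest = min(sorted(remaining),
                      key=lambda t: topic_distance(doc_topic, order[-1], t))
        order.append(nearest)
        remaining.remove(nearest)
    return order

# Illustrative matrix: 4 documents (rows) by 3 topics (columns).
doc_topic = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.1, 0.6, 0.3],
]
axis_order = order_topics(doc_topic)
```

A greedy ordering of this kind does not guarantee a globally optimal arrangement, but it places each axis next to its most similar unplaced neighbor, which is the property the description relies on.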
The document distribution view 18 is produced in part by an online topic ordering module 28, which is operable to perform the above functions and to produce the signal representation of individual documents. This signal conveys the differing nature of documents. The view 18 shows that a document with a pronounced distribution on a single topic focuses specifically on that topic, whereas a document distributed over two or three topics indicates a more varied focus.
When exploring the distribution of documents over topics, it is easy to see that documents exhibit different characteristics depending on the number of topics they have. Fig. 4 shows a document 30 focusing on only one topic, a document 32 with two topics, and a document 34 with more than two topics. The different numbers of topics in documents can be interpreted as different characteristics in the context of the given document collection 16. For example, in a collection of scientific publications, a document with one topic represents a publication related to a specific field of scientific research. A document with two or more topics more likely represents an interdisciplinary research article, which typically integrates two or more bodies of specialized knowledge.
In addition, the document distribution view 18 provides a rich set of interactions, such as brushing and highlighting. Brushing a range of proportions on a topic axis allows the user to select the documents that have a particular probability for that topic. By synthesizing the information on the main topics and document characteristics from both the topic cloud view 20 and the document distribution view 18, the user can effectively develop an overview of the document collection 16.
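By way of non-limiting illustration, the effect of brushing a range on a topic axis may be sketched as a simple filter over a document-topic matrix. The function name `brush` and the sample data are hypothetical; the user-interface handling of the brush itself is not shown.

```python
def brush(doc_topic, topic, low, high):
    """Return indices of documents whose probability on `topic` lies in [low, high]."""
    return [k for k, row in enumerate(doc_topic) if low <= row[topic] <= high]

# Illustrative matrix: 3 documents (rows) by 2 topics (columns).
doc_topic = [
    [0.90, 0.10],
    [0.40, 0.60],
    [0.75, 0.25],
]
focused_on_topic_0 = brush(doc_topic, topic=0, low=0.7, high=1.0)
```

Brushing the upper range of an axis, as in this example, selects the documents that focus on that topic.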
The document distribution view 18 allows users to identify the documents that focus on a particular topic by brushing the upper range of the topic axis. However, in a large corpus it is less easy to identify documents related to two or more topics, because they are obscured by single-topic documents with high probability values. To alleviate this problem, all documents are laid out in a way that makes it easy to separate single-topic documents from multi-topic documents. This is the document scatterplot view 24.
As can be seen in the document distribution view 18, each document is converted into a signal-like probability distribution pattern. In this representation, documents with more topics appear noisier than those that clearly focus on one topic. In information theory, Shannon entropy is a measure of the amount of uncertainty associated with a random variable. Treating the topic as a random variable for each document in this context, Shannon entropy can be used to distinguish clean signals from noisy signals. Therefore, the tool 10 of the present invention applies Shannon entropy to differentiate documents based on the number of topics they have. The entropy of each document, based on its probability distribution across topics, is calculated as:

$$H(d_k) = -\sum_{i=1}^{T} P(d_k \mid z = i)\,\log P(d_k \mid z = i)$$

where P(d_k) is the probability distribution of the k-th document over all topics. Each document can then be plotted in the document scatterplot view 24 based on its entropy and its maximum probability value over the topics (normalized to [0, 1]) (see Fig. 5). In this presentation, for example, single-topic documents (with higher maximum values and lower entropy) are in the upper left corner of the scatterplot, while the lower right corner captures documents with a higher number of topics (with lower maximum values and higher entropy). On selection, a pie chart is shown to depict the topic distribution of a particular document. In Fig. 5, each pie chart represents a selected document, with each color representing a topic. As shown, documents with smaller entropy appear as solid-colored pie charts, while documents with larger entropy appear more colorful, indicating that the entropy corresponds to the number of topics in the input document.
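By way of non-limiting illustration, the scatterplot coordinates of a document may be derived from its topic distribution as in the following minimal sketch. It assumes that entropy is plotted on the horizontal axis and the maximum topic probability on the vertical axis, consistent with single-topic documents appearing in the upper left corner. Normalizing the entropy by the entropy of the uniform distribution is an assumption made here so that both coordinates fall in [0, 1]; the function names are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy (natural logarithm) of a document's topic distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def scatter_point(dist):
    """Return (normalized entropy, maximum topic probability), both in [0, 1].

    The entropy is divided by log(T), the entropy of the uniform distribution
    over T topics; this normalization is an assumption of this sketch."""
    max_entropy = math.log(len(dist))
    return entropy(dist) / max_entropy, max(dist)

# Illustrative distributions over three topics.
single_topic = scatter_point([1.0, 0.0, 0.0])        # upper left region
multi_topic = scatter_point([1 / 3, 1 / 3, 1 / 3])   # lower right region
```

Documents between these two extremes, such as those with two prominent topics, would fall in the middle range of the plot.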
In summary, the document scatterplot view 24 allows users to interactively identify subgroups of documents having a desired number of topics by selecting documents in different regions. The document scatterplot view 24 is produced in part by a document entropy calculation module 36, which is operable to perform the above functions and to group the documents in any given text corpus. The document scatterplot view 24 deliberately groups documents based on their entropy, and visually illustrates the focus of a given corpus, implying whether the corpus focuses on a single theme or on varied themes.
Because most document collections 16 accumulate over time, presenting this temporal information helps users understand how the topics of the corpus evolve. Referring now specifically to Fig. 6, the temporal view 22 is created as an interactive ThemeRiver, in which each ribbon represents a topic. In a text corpus, each document is associated with a timestamp, so the height of each ribbon over time can be determined by summing the distributions over that topic of the documents in each time frame. The unit of the time frame depends on the corpus; for example, one year may be an appropriate time unit for scientific publications, whereas one month or even one day would be more suitable for a news corpus. After the time unit is selected, the documents are divided into the corresponding time frames based on their timestamps. Then, for each time frame, the height of each topic is calculated by summing the distribution over that topic of the documents in the time frame.
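By way of non-limiting illustration, the ribbon heights of the temporal view may be computed as in the following minimal sketch. It assumes that each document is given as a pair of a time frame (here a year) and a topic distribution; the function name `ribbon_heights` and the sample data are hypothetical.

```python
from collections import defaultdict

def ribbon_heights(documents, n_topics):
    """Sum the per-topic probabilities of the documents falling in each time frame.

    `documents` is a list of (time_frame, topic_distribution) pairs."""
    heights = defaultdict(lambda: [0.0] * n_topics)
    for frame, dist in documents:
        for t, p in enumerate(dist):
            heights[frame][t] += p
    return dict(heights)

# Illustrative corpus: three documents over two yearly time frames and two topics.
documents = [
    (2006, [0.5, 0.5]),
    (2006, [0.25, 0.75]),
    (2007, [1.0, 0.0]),
]
heights = ribbon_heights(documents, n_topics=2)
```

Each entry of the result gives, for one time frame, the height of every topic ribbon in that frame.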
For example, the order of topics (from top to bottom) is the same in both the topic cloud view 20 and the document distribution view 18. Colors or patterns are assigned to topics by interpolating a color or pattern spectrum using the normalized cumulative distance between all adjacent topics. As a result, a pair of similar topics is assigned more similar colors or patterns.
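By way of non-limiting illustration, the assignment of colors from the normalized cumulative distance between adjacent topics may be sketched as follows. The sketch assumes a simple linear interpolation between two RGB end colors; the function names, the end colors and the sample distances are hypothetical, and any other color or pattern spectrum could be substituted.

```python
def topic_positions(adjacent_distances):
    """Map ordered topics to [0, 1] by normalized cumulative adjacent distance.

    `adjacent_distances[i]` is the distance between ordered topics i and i + 1."""
    total = sum(adjacent_distances)
    positions, running = [0.0], 0.0
    for d in adjacent_distances:
        running += d
        positions.append(running / total)
    return positions

def interpolate_color(pos, start=(0, 0, 255), end=(255, 0, 0)):
    """Linearly interpolate an RGB color between two hypothetical end colors."""
    return tuple(round(s + (e - s) * pos) for s, e in zip(start, end))

# Illustrative distances between four ordered topics.
positions = topic_positions([1.0, 1.0, 2.0])
colors = [interpolate_color(p) for p in positions]
```

Topics separated by a small distance receive nearby positions on the spectrum and therefore similar colors.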
In summary, the temporal view 22 provides a visual summary of how the topics of the document collection 16 evolve over time. Beyond this representation, various interactions are also supported in the temporal view 22. Selecting a time frame (one vertical time unit) filters all documents published within the selected time frame. Similarly, for example, clicking on the intersection of a topic ribbon and a time frame in the temporal view 22 selects the documents published during that time frame that have a probability of more than 30% on the selected topic. Thus, it is possible to identify which documents contributed to the emergence of a topic during a specific time period. By revealing the temporal information hidden in the document collection 16 and allowing users to filter by time and topic, the temporal view 22 adds richness to the tool.
The temporal view 22 is produced in part by a temporal topic trend calculation module 38, which is operable to perform the above functions and to support the examination of detailed documents. The temporal view 22 allows users to directly select, for example, the documents within a particular time range and obtain the corresponding data. By revealing the document details associated with this depiction, the temporal view 22 plays a key role in showing users the basis of the identified visual patterns and trends.
When any document is selected, the tool 10 of the present invention provides the details of the actual text content of the document of interest. Because any topic model is far from perfect, the function of the detail view is twofold: first, it provides users with context for an in-depth understanding of the topics and the keywords associated with them; second, it helps users verify the patterns shown in the visualizations.
Because understanding a large text corpus 16 can involve the use of all four views, the coordination among all views must be carefully considered. At the topic level, hovering over a topic in any view that involves a topic representation highlights the same topic in the other views. For example, if the user hovers over an axis in the document distribution view 18, the same topic is highlighted in both the topic cloud view 20 and the temporal view 22. The user can thus quickly synthesize information on the keywords, document distribution and temporal trend of a particular topic. In addition, the views are also coordinated through color or pattern, with each topic having the same color or pattern in all views.
At the document level, selecting any set of documents in a view that contains individual documents highlights the same set of documents in the other views. For example, a brushing operation in the document scatterplot view 24 is immediately reflected in the document distribution view 18, and vice versa. When the user selects several documents with two prominent topics (i.e. in the middle range) in the document scatterplot view 24, examining the distributions of these documents helps the user understand which topics the documents combine.
In terms of time, filtering of documents written or published within a specific time period is supported. For example, clicking on a time frame (i.e. one vertical time unit) in the temporal view 22 filters all documents published within the selected time span. Similarly, clicking on the intersection of a topic ribbon and a time frame in the temporal view 22 selects those documents published during that period to which the topic makes a major contribution (for example, a probability of more than 30%). This selection is shown in both the document distribution view 18 and the document scatterplot view 24. This function allows users to filter documents based on the time and topic of interest, and then examine the documents published within the selected time frame.
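By way of non-limiting illustration, the two temporal selections described above may be sketched as a single filter. The sketch assumes that each document is given as a pair of a time frame and a topic distribution; the function name `select_documents` and the sample data are hypothetical, and the 30% threshold is the example value given in the description.

```python
def select_documents(documents, frame, topic=None, threshold=0.30):
    """Return indices of the documents published in `frame`.

    If `topic` is given (a click on a ribbon/time-frame intersection), keep only
    documents whose probability on that topic exceeds `threshold`."""
    selected = []
    for k, (doc_frame, dist) in enumerate(documents):
        if doc_frame != frame:
            continue
        if topic is None or dist[topic] > threshold:
            selected.append(k)
    return selected

# Illustrative corpus: (time frame, topic distribution) pairs.
documents = [
    (2008, [0.9, 0.1]),
    (2008, [0.2, 0.8]),
    (2009, [0.5, 0.5]),
]
by_frame = select_documents(documents, frame=2008)
by_frame_and_topic = select_documents(documents, frame=2008, topic=1)
```

The resulting indices would then be highlighted in the other views that show individual documents.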
The tool 10 of the present invention allows users to explore and query a large document corpus 16 from multiple viewpoints. Starting from the topic cloud view 20, the user can view a summary of the corpus 16 and identify topic keywords of interest. In the document distribution view 18, the user can locate a topic of interest and select the documents focusing on that topic by brushing on the vertical axis. The user can then visually identify which other topics the selected documents are related to by examining the distributions in the document distribution view 18 and the document scatterplot view 24. In addition, the user can always view the details of documents based on the selection. For example, if the user wants to identify interdisciplinary or multidisciplinary publications in the corpus 16, he or she can do so by selecting the documents from the middle to the lower right corner of the document scatterplot view 24. In addition, if the user is interested in querying the corpus 16 by the time factor, he or she can perform the selection in the temporal view 22 by clicking on a time frame or on the intersection of a specific time frame and a topic. In summary, the tool 10 of the present invention supports interactive exploration of the text corpus 16 using multiple coordinated views. Each of the views is designed to address one of four major questions.
To evaluate the effectiveness of the tool 10 of the present invention in answering the four target questions, the tool 10 was applied to explore and analyze two text corpora: scientific proposals awarded by the National Science Foundation (NSF) from 2006 to 2010, and publications in the IEEE VAST proceedings.
Case study 1. Analyzing scientific proposals. In this case study, we first describe the data we collected. We then characterize the target domain and present a set of tasks derived from our conversations with NSF program managers. Finally, we show how the tool can assist expert users in solving these tasks.
Data collection and preparation. To examine whether the tool can support program managers in making funding decisions and managing the award portfolio, we first collected the proposals awarded from 2000 to 2010 by the Information and Intelligent Systems (IIS) Division, which is part of the Computer and Information Science and Engineering (CISE) Directorate. The collection consists of nearly 4000 awards, with structured data on the award number, directorate, division, program, program manager, principal investigator and award date, as well as the proposal abstracts in the form of unstructured text. We processed all collected abstracts, with each abstract forming a single document in the corpus. We removed a standard list of stop words. This left us with a vocabulary of 334,447 words. We then extracted 30 topics from the corpus using the LDA model.
Domain characterization. The core of NSF's mission is to keep the United States at the frontier of discovery by funding research in traditional academic fields (including identifying broader impacts) and by funding transformative and interdisciplinary research. To achieve the former, NSF program managers need to identify suitable reviewers and panelists to ensure the best possible peer review. To carry out the latter effectively, program managers need to identify emerging fields and research topics in order to fund interdisciplinary and transformative research. In addition to making funding decisions, program managers also need to manage their award portfolios. Although program managers have done this well in the past, they need new methods to help them because of the naturally fast-changing character of science and the significant growth in the number of proposals submitted. To map these high-level tasks to actionable items, we designed three tasks related to decision making and the award portfolio. Task 1 concerns grouping newly submitted proposals based on their topics. The task involves understanding the main topics of the text corpus and filtering subsets of documents based on their characteristics with respect to the topics. Task 2 is to identify suitable reviewers for the submitted proposals, which also involves knowing whether a submission involves multiple topics in order to assemble the right experts. Finally, Task 3 concerns the temporal aspect of the award portfolio, which involves discovering topic trends that develop over time.
Expert evaluation. Because NSF program managers are especially busy, we invited a former NSF program manager to take part in our expert evaluation. The participant had two years of experience working as an NSF program manager. At the beginning of the evaluation, we spent 30 minutes demonstrating the design and function of each visualization in the system. We then asked the participant to perform the following three tasks using the tool.
Task 1. Group 200 recently submitted proposals based on their topics. Starting from the topic cloud view, the participant quickly browsed the extracted topics to obtain an overview of the recently submitted proposals. Because the participant was responsible for proposals in the fields of robotics and computer vision, her attention was quickly drawn to these two topics. After selecting the proposals focusing on the robotics topic, the participant scanned the titles in the detail view to verify their relevance. Although the participant confirmed that each selected proposal was relevant, she also noticed that the positions of the proposals in the document scatterplot view were scattered. Because proposals in the lower right position are more likely to involve two or more topics, the participant was interested in knowing which other topics these proposals relate to. By further filtering, in the document scatterplot view, those proposals that appeared more interdisciplinary, the participant found that they relate to other fields such as neuroscience and social communication. When relevant documents are selected in the document distribution view, the detail view is invoked to allow the program manager to view previously funded PIs.
Task 2. Identify suitable reviewers. To identify reviewers, the participant first wanted to roughly group the proposals. Based on the initial exploration, the participant concluded that there were roughly two groups of proposals: one group focusing on the core of the robotics field, and another group using bodies of knowledge from other fields such as neuroscience and social communication. To identify reviewers for the two groups of proposals, the participant wanted to find PIs from previously awarded proposals. Examining the historical data, the program manager located the robotics topic in the document distribution view. She then brushed the upper range of the axis to select the proposals related to that topic. Finally, the participant turned to the detail view to examine the PIs previously funded in the robotics field. For the interdisciplinary proposals in group 2, the participant went through a similar process to identify other experts from the related fields (such as neuroscience) to serve on the review panel, ensuring the best possible peer review.
Task 3. Analyze the temporal trends of the award portfolio. At the portfolio level, the former program manager was interested in examining the temporal trends of the fields she had been responsible for in recent years. By exploring the temporal view, the participant found that the trend of proposals awarded in the robotics field was stable, even though the overall number of proposals awarded between 2006 and 2009 was increasing. In contrast to the stable trend in robotics, the number of proposals awarded on the topic of "using technology to help people with disabilities" increased year by year. The former program manager commented that the view was valuable to her because it enabled her to see funding trends on different topics that would otherwise be difficult to discover.
In summary, the participant considered each view in the tool to be well designed for its purpose and easy to understand. She commented that the tool could play a facilitating role in the workflow of program managers. Specifically, she liked the fact that our tool can automatically suggest the more interdisciplinary proposals, because this is difficult to judge in the traditional way. She also liked the coordination between the views, which helped her quickly synthesize information from different aspects of the same corpus.
Case study 2. Analyzing the VAST proceedings. As the field of visual analytics matures, it is beneficial to look back at how the field has evolved. One way to address this question is to analyze the publications accepted by the most important venue in visual analytics. In this case study, we recruited four researchers to explore the papers published at the VAST conference/symposium since the field began in 2006. Because all users were familiar with the field of visual analytics, we wanted to encourage free exploration, in contrast to the well-structured tasks of the preceding case study. After the evaluation, we grouped the participants' findings into two groups: discovering causal relationships between the temporal evolution of topics and funding sources, and learning about interesting subfields within the field of visual analytics.
Data collection and preparation. We first collected all papers published at the VAST conference/symposium from 2006 to 2010. A total of 123 publications were collected. We then parsed each publication into fields including title, authors, year of publication, abstract, body and references. We performed topic modeling on the entire body of each article (from introduction to conclusion), with each article forming one document in the corpus. Removing standard stop words left us with a vocabulary of 317,315 words. Based on our record of the different tracks of each VAST conference, we extracted 19 topics from the corpus.
User evaluation. Of the four researchers we recruited, two were senior researchers in the field of visual analytics, and the other two were doctoral students with visual analytics as their main research interest. In this evaluation, we gave all participants high-level tasks and encouraged freer exploration. After introducing the tool, we asked each participant to identify the core topics within the field and how the field had evolved over the past five years. We roughly grouped the usage patterns into two groups: identifying rising/declining topics, and using the system as a learning tool.
Identifying rising/declining topics. After scanning all topics in the topic cloud view, one senior researcher commented that the topics matched the paper tracks of the VAST conference well. When examining the temporal trend of each topic, the participant noticed several clearly rising and declining patterns. For example, the topic of video news analysis initially attracted much attention, but the attention declined rapidly year by year. He also noticed a similar trend for the topic of network monitoring and analysis. Relating this pattern to his own knowledge, the participant explained the trend as follows: when the field started, these focus areas were steered by the Department of Homeland Security (DHS), which was the main funding source at the time. Next, the participant turned to the rising patterns, which indicate the topics that have attracted attention in recent years. Specifically, since 2008, both the topic of trend and uncertainty analysis and the topic of dimensional analysis and reduction have attracted more attention. Again relating the pattern to his own knowledge, the participant commented that this was likely a result of the Foundations of Data and Visual Analytics (FODAVA) program jointly introduced by NSF and DHS.
Understanding the field of visual analytics. The other senior researcher (who was teaching a visual analytics course at the time) commented that he could see the tool being useful for his course. Students could explore all VAST publications and identify papers related to a topic of interest for course presentations. Similarly, another participant wanted to see what had been done on text analysis within the field of visual analytics. He first located the topic, then selected the publications ranking high on that topic in the document distribution view. He scanned the article titles in the detail view and verified that all selected papers matched his interest. He also noticed that some papers in the selection appeared to be related to other topics, such as entity extraction and database querying. After the study, he asked for a screen capture of the detail view so that he could look up the papers he had identified during the session.
In summary, the participants considered that the tool helped them explore the evolution of the field of visual analytics and identify publications for further investigation based on their own interests.
Those skilled in the art will appreciate that the various modules and processes of the present invention are implemented using a processing device such as a computer. Such a processing device may include one or more general-purpose or special-purpose processors, such as microprocessors, digital signal processors, customized processors and field programmable gate arrays (FPGAs), and unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most or all of the functions of the methods and systems of the present invention. Alternatively, some or all functions may be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of the functions are implemented as custom logic. Of course, a combination of the above approaches may be used. Furthermore, some example embodiments may be implemented as a non-transitory computer-readable storage medium having stored thereon computer-readable code for programming a computer, server, appliance, device or the like, each of which may include a processor to perform the functions described and claimed herein. Examples of such computer-readable storage media include, but are not limited to, a hard disk, an optical storage device, a magnetic storage device, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), flash memory and the like. When stored in a non-transitory computer-readable medium, the software can include instructions executable by a processor which, in response to such execution, cause the processor and/or any other circuit to perform a set of operations, steps, methods, processes, algorithms and the like.
Again, the present invention enables companies, including their analysts, marketers, business leaders, information technologists, and C-level employees, to obtain actionable insights from any kind of text data. The technology allows people to strengthen their decision processes on a data-driven basis. It ingests text data and, through deep computation and statistical algorithms, identifies the themes, topics, and emerging issues in each data set. Results are displayed in interactive visual form so that anyone in the company can analyze the data holistically or in fine detail. All types of text data can be analyzed: internal data (such as email, chat, surveys, call center records, and focus groups) or external data (such as social media, review sites, forums, and news sites). The technology can handle a large number of languages, ensuring that feedback from around the globe can be analyzed. Highly customizable features are also available for tailoring the analysis. Most companies are sitting on a treasure trove of unstructured text data, yet have almost no ability to mine that data for information.
In general, the software of the present invention delivers deep-learning-based data analysis in a sophisticated visualization platform, which uncovers, analyzes, and infers actionable strategies across a broad range of business decision areas. It is an advantageous way to find connections in data that affect stakeholders in sales, customer service, operations, and risk analysis, by linking call center audio, email, news, social media, chat, transaction data, customer feedback, and analytics. Structured data is also used, including retail transactions, survey data, personal profiles, etc., as well as national and international industry-specific, government-specific, and product-specific data sources. The software is accessible from any browser-equipped device and incorporates predictive modeling, artificial intelligence, and statistical NLP to analyze any form of unstructured data. Visualizations are provided holistically and/or in fine detail. The overall system 40 is shown schematically in Fig. 7. System 40 uses a high-throughput multilingual API for information tagging by means of sophisticated term extraction, entity indicator extraction, geospatial indicator extraction, temporal indicator extraction, and opinion and sentiment analysis. System 40 also uses data-driven semantic machine learning and clustering, with automatic term association, statistical topic summarization, influencer inference, context-aware content ranking, content network association, and product-centric analysis.
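The information tagging performed by system 40 can be illustrated with a deliberately small sketch. It is not the patented implementation: the word lists, the `Tags` structure, and the `tag_document` function are hypothetical names introduced here, and simple counting and pattern matching stand in for the term extraction, temporal indicator extraction, and sentiment analysis that the text describes only at a high level.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical toy lexicons; a production system would use trained models.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"slow", "broken", "refund", "rude"}
STOPWORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "on", "it", "in", "for"}

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # ISO dates only (assumption)
TOKEN_PATTERN = re.compile(r"[a-z]+")


@dataclass
class Tags:
    terms: list = field(default_factory=list)   # most frequent content words
    dates: list = field(default_factory=list)   # temporal indicators found
    sentiment: int = 0                           # positive minus negative hits


def tag_document(text: str, top_n: int = 3) -> Tags:
    """Attach term, temporal, and sentiment tags to one piece of text."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    content = [t for t in tokens if t not in STOPWORDS]
    terms = [term for term, _ in Counter(content).most_common(top_n)]
    dates = DATE_PATTERN.findall(text)
    sentiment = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return Tags(terms=terms, dates=dates, sentiment=sentiment)


if __name__ == "__main__":
    feedback = "The delivery was slow and the app is broken. Delivery promised on 2016-05-11."
    print(tag_document(feedback))
```

Keeping every tag type on one record means later stages, such as clustering or ranking, can consume a uniform structure regardless of which source the text came from.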
Referring now specifically to Figs. 8 and 9, in one example embodiment, the present invention provides an enhanced information platform 45 that helps companies find the shortest path from data to revenue. It brings fragmented data silos together, creates a unified visual analytics layer on top of them, and enables users from multiple business functions to extract valuable insights efficiently and collaboratively. Platform 45 sits securely on top of the organization's data lake and is compatible with multiple tiers of data infrastructure. Through deep computation and statistical algorithms, it automatically ingests unstructured data (e.g., email, call logs) and structured data (e.g., sales, budget, finance). It processes large numbers of feedback points and data points and identifies, in real time, the themes, topics, and emerging issues within the organization. It helps dynamically correlate customer experience trends with company-wide data. Platform 45 is fully interactive and easy to use. Anyone in the organization, from frontline employees, analysts, and salespeople to business leaders and C-level employees, can interact with the data holistically or in fine detail, customize their own dashboards, and share findings with others. In addition to the back-end data analysis engine, platform 45 is supported by a fully enhanced user interface (UI) experience. The present invention provides users with customizable, pixel-perfect visual dashboards, which makes presenting a user's analytical work much easier and more controllable. Rich interactions in the exploration layer allow users to start analyzing details quickly while keeping the surrounding contextual information. The flexible data analysis environment of the present invention ensures that users never lose touch with the overall picture of the data while drilling down into details. This goes beyond merely a few visualizations; the user experience extends to a variety of useful data analyses and visualizations. Annotating and collaborating on analysis results is easier than ever. The present invention completely changes the way people discover, share, and collaborate on analysis tasks. Users can annotate their findings and share them with colleagues, supporting collaboration inside and outside each data analysis group. In summary, the present invention enhances decision-making by providing a true environment for data analysis.
Figure 10 is a schematic diagram showing another example embodiment of the unstructured data analysis system 50 of the present invention. In general, data closely tied to a business enterprise, such as customer experience data 52, teledata 54, e-mail data 56, social media data 58, and other data 60, is aggregated in a data repository 62 and, together with external data sources 64 such as internet data and government data, is drawn into an unstructured data parser 66. The unstructured data parser 66 resides, for example, on a network server and is accessible via a browser. As described in detail above, the unstructured data parser 66 applies predictive modeling, artificial intelligence, and statistical NLP to the data in order to uncover, analyze, infer, and visualize actionable information. Advantageously, the actionable information can be viewed by various business 68 stakeholders or other users, all of whom can add to or otherwise modify the visualizations and share results via a common interactive user interface 70.
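The data flow of Figure 10 can be sketched as follows, under stated assumptions. `Record`, `DataRepository`, and `analyze` are hypothetical stand-ins for the data repository 62 and the unstructured data parser 66; the "analysis" here is plain word counting rather than the predictive modeling, artificial intelligence, or statistical NLP named in the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str     # e.g. "email", "social_media", "government"
    external: bool  # True for external data sources (64 in Fig. 10)
    text: str


class DataRepository:
    """Stand-in for data repository 62: aggregates internal enterprise data."""

    def __init__(self):
        self._records = []

    def add(self, source: str, text: str) -> None:
        self._records.append(Record(source, False, text))

    def records(self) -> list:
        return list(self._records)


def analyze(repository: DataRepository, external_records) -> dict:
    """Stand-in for parser 66: merges both inputs and returns summary information."""
    merged = repository.records() + list(external_records)
    per_source = Counter(r.source for r in merged)
    # Crude term statistic: count words longer than three characters.
    words = Counter(w for r in merged for w in r.text.lower().split() if len(w) > 3)
    return {
        "records_per_source": dict(per_source),
        "top_terms": [w for w, _ in words.most_common(3)],
    }


if __name__ == "__main__":
    repo = DataRepository()
    repo.add("email", "Billing error on invoice")
    repo.add("social_media", "billing portal down again")
    external = [Record("government", True, "new billing regulation announced")]
    print(analyze(repo, external))
```

The returned dictionary plays the role of the summary information that the user-facing layers would then visualize.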
Figure 11 is a schematic diagram showing an example embodiment of the presentation layer 80 of the unstructured data analysis system 50 (Fig. 8) of the present invention. In general, the presentation layer 80 allows various summary information about the unstructured data and/or the results to be displayed. For example, the presentation layer 80 is illustrated as displaying customer experience data 82, teledata 84, and sales data 86.
Figure 12 is a schematic diagram showing an example embodiment of the exploration layer 90 of the unstructured data analysis system 50 (Fig. 8) of the present invention. In general, the exploration layer 90 allows various summary information about the unstructured data and/or the results to be displayed. The exploration layer 90 also allows a time granularity to be selected so that the data is displayed in further detail. This "drilling down" also correspondingly updates the other visualizations, including the presentation layer 80. For example, a snapshot 94 is illustrated as being selected from customer experience data 92.
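One way to picture the linked drill-down is the minimal sketch below. It assumes records are `(date, text)` pairs and that linked views expose a `refresh` method; the class names, the `GRANULARITY_FORMATS` table, and the three granularity levels are illustrative choices, not details given in the source.

```python
from collections import Counter
from datetime import date

# Hypothetical granularity levels mapped to strftime bucket formats.
GRANULARITY_FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}


def bucket_counts(records, granularity: str) -> dict:
    """Count records per time bucket at the requested granularity."""
    fmt = GRANULARITY_FORMATS[granularity]
    return dict(Counter(d.strftime(fmt) for d, _ in records))


class PresentationLayer:
    """Holds the summary view that other layers keep in sync."""

    def __init__(self, records):
        self.records = records
        self.view = bucket_counts(records, "year")

    def refresh(self, granularity: str) -> None:
        self.view = bucket_counts(self.records, granularity)


class ExplorationLayer:
    """Lets a user pick a time granularity and propagates it to linked views."""

    def __init__(self, records, linked_views):
        self.records = records
        self.linked_views = linked_views
        self.granularity = "year"

    def set_granularity(self, granularity: str) -> None:
        if granularity not in GRANULARITY_FORMATS:
            raise ValueError(f"unknown granularity: {granularity}")
        self.granularity = granularity
        for view in self.linked_views:  # drilling down updates every linked view
            view.refresh(granularity)

    def snapshot(self, bucket: str) -> list:
        """Return the texts that fall into one bucket at the current granularity."""
        fmt = GRANULARITY_FORMATS[self.granularity]
        return [text for d, text in self.records if d.strftime(fmt) == bucket]


if __name__ == "__main__":
    data = [(date(2016, 5, 11), "a"), (date(2016, 5, 20), "b"), (date(2016, 6, 28), "c")]
    presentation = PresentationLayer(data)
    exploration = ExplorationLayer(data, [presentation])
    exploration.set_granularity("month")
    print(presentation.view, exploration.snapshot("2016-05"))
```

Pushing the granularity change from the exploration layer to each linked view is what keeps the summary display consistent with the detail the user is currently inspecting.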
Figure 13 is a schematic diagram showing an example embodiment of the annotation layer 100 of the unstructured data analysis system 50 (Fig. 8) of the present invention. The annotation layer 100 is configured to display various results, such as customer experience data 102, teledata 104, e-mail 106, social media data 108, other data 110, etc., and to receive user comments 112, which can be accessed via a shared user interface 114 by all users or by selected users.
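The access rule for user comments, visible either to all users or to selected users, can be sketched as below. The `Comment` and `AnnotationLayer` classes and the convention that an audience of `None` means "everyone" are assumptions made for illustration; the source does not specify how visibility is stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    author: str
    target: str                           # visualization being annotated, e.g. "email"
    text: str
    audience: Optional[frozenset] = None  # None means visible to all users (assumption)


class AnnotationLayer:
    """Collects user comments and filters them per viewer."""

    def __init__(self):
        self._comments = []

    def add_comment(self, author, target, text, audience=None) -> Comment:
        # The author can always see their own comment.
        restricted = None if audience is None else frozenset(audience) | {author}
        comment = Comment(author, target, text, restricted)
        self._comments.append(comment)
        return comment

    def visible_to(self, user, target=None) -> list:
        """Comments this user may see, optionally limited to one visualization."""
        return [
            c for c in self._comments
            if (c.audience is None or user in c.audience)
            and (target is None or c.target == target)
        ]


if __name__ == "__main__":
    layer = AnnotationLayer()
    layer.add_comment("ana", "customer_experience", "Spike in May")
    layer.add_comment("ben", "email", "Check billing thread", audience={"ana"})
    print([c.text for c in layer.visible_to("ana")])
    print([c.text for c in layer.visible_to("cho")])
```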
Although the present invention is illustrated and described herein with reference to preferred embodiments and specific examples thereof, it will be readily apparent to those skilled in the art that other embodiments and examples may perform similar functions and/or achieve similar results. All such equivalent embodiments and examples are within the spirit and scope of the present invention and are intended to be covered by the appended claims.
Claims (18)
1. An unstructured data analysis system, comprising:
an unstructured data parser resident on a server and accessible via a browser, the unstructured data parser being operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users;
wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer.
2. The system according to claim 1, wherein the unstructured data comprises one or more of the following: customer experience data, teledata, e-mail data, social media data, and transaction data.
3. The system according to claim 1, wherein the unstructured data parser is further operable to: receive external data from one or more remote sources.
4. The system according to claim 3, wherein the external data comprises one or more of the following: internet data, government data, and business data.
5. The system according to claim 1, wherein the one or more analysis tools applied to the unstructured data comprise one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining.
6. The system according to claim 1, wherein the presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information.
7. The system according to claim 1, wherein the exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer.
8. The system according to claim 1, wherein one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer.
9. The system according to claim 1, wherein the summary information is also displayed to the one or more users in a combination layer.
10. An unstructured data analysis method, comprising:
providing an unstructured data parser resident on a server and accessible via a browser, the unstructured data parser being operable to: receive unstructured data from one or more remote sources, apply one or more analysis tools to the unstructured data, and display summary information to one or more users;
wherein the summary information is displayed to the one or more users in one or more of a presentation layer, an exploration layer, and an annotation layer.
11. The method according to claim 10, wherein the unstructured data comprises one or more of the following: customer experience data, teledata, e-mail data, social media data, and transaction data.
12. The method according to claim 10, wherein the unstructured data parser is further operable to: receive external data from one or more remote sources.
13. The method according to claim 12, wherein the external data comprises one or more of the following: internet data, government data, and business data.
14. The method according to claim 10, wherein the one or more analysis tools applied to the unstructured data comprise one or more of the following: statistical algorithms, machine learning, natural language processing, and text mining.
15. The method according to claim 10, wherein the presentation layer displays one or more of the following: the unstructured data, a summary of the unstructured data, and the summary information.
16. The method according to claim 10, wherein the exploration layer allows the one or more users to modify the granularity of the summary information, thereby modifying the granularity of the presentation layer.
17. The method according to claim 10, wherein one or more users can interact with the unstructured data analysis system simultaneously via the annotation layer.
18. The method according to claim 10, wherein the summary information is also displayed to the one or more users in a combination layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011265115.5A CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562159662P | 2015-05-11 | 2015-05-11 | |
US15/151,572 | 2016-05-11 | ||
US15/151,572 US10452698B2 (en) | 2015-05-11 | 2016-05-11 | Unstructured data analytics systems and methods |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011265115.5A Division CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107368506A true CN107368506A (en) | 2017-11-21 |
CN107368506B CN107368506B (en) | 2020-11-06 |
Family
ID=60312579
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610496280.9A Active CN107368506B (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method |
CN202011265115.5A Pending CN112732878A (en) | 2015-05-11 | 2016-06-28 | Unstructured data analysis system and method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN107368506B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101308498A (en) * | 2008-07-03 | 2008-11-19 | 上海交通大学 | Text collection visualized system |
CN102750355A (en) * | 2012-06-11 | 2012-10-24 | 清华大学 | Visual management method for non-structured data management system |
CN102929894A (en) * | 2011-08-12 | 2013-02-13 | 中国人民解放军总参谋部第五十七研究所 | Online clustering visualization method of text |
US20140040275A1 (en) * | 2010-02-09 | 2014-02-06 | Siemens Corporation | Semantic search tool for document tagging, indexing and search |
US9135242B1 (en) * | 2011-10-10 | 2015-09-15 | The University Of North Carolina At Charlotte | Methods and systems for the analysis of large text corpora |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2004534324A (en) * | 2001-07-04 | 2004-11-11 | コギズム・インターメディア・アーゲー | Extensible interactive document retrieval system with index |
US7849048B2 (en) * | 2005-07-05 | 2010-12-07 | Clarabridge, Inc. | System and method of making unstructured data available to structured data analysis tools |
KR101481253B1 (en) * | 2013-03-14 | 2015-01-13 | 한국과학기술원 | Method and system for providing summery of text document using word cloud |
CN103473369A (en) * | 2013-09-27 | 2013-12-25 | 清华大学 | Semantic-based information acquisition method and semantic-based information acquisition system |
US20160071212A1 (en) * | 2014-09-09 | 2016-03-10 | Perry H. Beaumont | Structured and unstructured data processing method to create and implement investment strategies |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108170657A (en) * | 2018-01-04 | 2018-06-15 | 陆丽娜 | A kind of natural language long text generation method |
CN109299286A (en) * | 2018-09-28 | 2019-02-01 | 北京赛博贝斯数据科技有限责任公司 | The Knowledge Discovery Method and system of unstructured data |
CN110413782A (en) * | 2019-07-23 | 2019-11-05 | 杭州城市大数据运营有限公司 | A kind of table automatic theme classification method, device, computer equipment and storage medium |
CN110413782B (en) * | 2019-07-23 | 2022-08-26 | 杭州城市大数据运营有限公司 | Automatic table theme classification method and device, computer equipment and storage medium |
CN112883186A (en) * | 2019-11-29 | 2021-06-01 | 智慧芽信息科技(苏州)有限公司 | Method, system, equipment and storage medium for generating information map |
CN112883186B (en) * | 2019-11-29 | 2024-04-12 | 智慧芽信息科技(苏州)有限公司 | Method, system, equipment and storage medium for generating information map |
Also Published As
Publication number | Publication date |
---|---|
CN112732878A (en) | 2021-04-30 |
CN107368506B (en) | 2020-11-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||