CN103544255B - Text semantic relativity based network public opinion information analysis method - Google Patents
- Publication number
- CN103544255B, CN201310482522.5A
- Authority
- CN
- China
- Prior art keywords
- text
- information
- similarity
- public
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a text semantic relativity based network public opinion information analysis system. The system comprises a network public opinion information acquisition module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module and a public opinion information analysis module. The network public opinion information acquisition module is used for acquiring various public opinion information rich in content from a webpage. The public opinion information extraction module and the public opinion information preprocessing module are used for preliminarily filtering and segmenting the acquired public opinion information, extracting meta-information of a text part, creating a feature semantic network diagram of texts, and performing weighting computation and feature extraction to provide services for public opinion information mining. The public opinion information mining module is used for classifying the texts by adopting a semantic similarity based improved text clustering analysis method. The public opinion information analysis module is used for performing OLAP (on-line analytical processing) multi-dimensional statistics on mined data of the public opinion information, and analyzing public opinion evaluation indices to provide support for relevant public opinion decision making. By the system, the problem that semantic information of words in the texts is incomplete is solved, and clustering analysis and hot topic extraction of dynamic data in a large-scale network environment are realized efficiently.
Description
Technical field
The present invention relates to the technical field of network information, and in particular to a network public opinion information analysis method based on text semantic relevance.
Background art
In today's society the Internet has penetrated daily life. Instant communication tools such as microblogs, forums and blogs have become important channels through which people obtain information, express opinions and spread information. Through network platforms, public opinion information spreads rapidly and attracts wide attention; its propagation speed, reach and influence are far beyond those of traditional media. The anonymity and interactivity of cyberspace, and its freedom from constraints of time and space, make network public opinion a powerful force that can affect social development and stability. Positive network public opinion, like "positive energy", promotes social development; negative network public opinion harms social stability and can cause public opinion crises. Strengthening the monitoring, analysis and management of network public opinion information therefore has important practical significance for maintaining social order and building a harmonious society. Monitoring network public opinion information in time, making correct judgments and decisions, responding promptly, and taking effective measures to resolve public opinion crises have become the focus and the difficulty of network public opinion management.
Summary of the invention
In view of the characteristics of network public opinion information described in the background above and the problems to be solved in network public opinion management, the present invention provides a network public opinion information analysis method based on text semantic relevance.
The technical solution adopted by the present invention to solve the technical problem is a network public opinion information analysis method based on text semantic relevance. The method uses a network public opinion information analysis system that includes a network public opinion information acquisition module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database, and comprises the following steps:
a. The network public opinion information acquisition module collects various public opinion information from web pages and stores it in the public opinion information database;
b. The public opinion information extraction module and the public opinion information preprocessing module perform preliminary filtering and segmentation on the information collected in step a and extract the content information contained in the text, providing data services for public opinion information mining;
c. On the basis of step b, the public opinion information mining module uses an improved text clustering analysis method based on semantic similarity, generates category description information, and filters out the text information contained in the clustering result; category features are computed with a feature-statistics-based TFIDF word-frequency feature calculation method to obtain category feature words; nouns are selected as candidate category feature words and sorted by weight, and the candidates with larger weight values are taken as category keywords; the semantic relations between category keywords are used to form the classification results; new network public opinion topics are identified and established, and content related to existing topics is detected and tracked;
d. Finally, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c, and analyzes public opinion evaluation indices such as the attention paid to topic content and the sentiment orientation of topics.
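The keyword selection in step c can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `category_keywords`, the noun tag "n", the smoothed IDF formula and the `top_k` cut-off are all assumptions made for the example.

```python
import math
from collections import Counter


def category_keywords(clusters, top_k=3):
    """Rank noun candidates in each cluster by a TF-IDF style weight.

    clusters maps a cluster id to a list of (word, pos_tag) pairs.
    Only words tagged "n" are candidates (the tag name is an assumption).
    """
    # Term frequency of noun candidates inside each cluster.
    tf = {cid: Counter(w for w, pos in tokens if pos == "n")
          for cid, tokens in clusters.items()}
    # Number of clusters in which each candidate occurs.
    df = Counter(w for counts in tf.values() for w in counts)
    n = len(clusters)
    result = {}
    for cid, counts in tf.items():
        total = sum(counts.values()) or 1
        # Smoothed IDF (an assumed variant; the source does not give a formula).
        weights = {w: (c / total) * (math.log((1 + n) / (1 + df[w])) + 1)
                   for w, c in counts.items()}
        # Sort by descending weight, then alphabetically for stable output.
        ranked = sorted(weights, key=lambda w: (-weights[w], w))
        result[cid] = ranked[:top_k]
    return result
```

Words that are frequent in one cluster but rare across clusters rise to the top, which is the property the category keywords rely on.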
In step a, the public opinion information acquisition module collects network public opinion information sources. Unlike a general web crawler, it must not only crawl web pages but also format the page content and extract the topic and content of the public opinion; the resulting data is saved as txt or html files and stored in the public opinion information database. To avoid being blocked, the acquisition module combines three techniques: time-shared access, periodic IP address changes, and single sign-on through a simulated browser. The specific steps performed by the acquisition module are as follows: starting from the URLs of predefined topic-related web pages, obtain the text information in each page and extract new URLs from the current page into a queue, until the public opinion information that meets the conditions has been collected and the URL queue is empty. The collected page text is stored in the public opinion information database by field, for the extraction module to call.
In step b, the public opinion information extraction module removes irrelevant content from web pages (noise data such as advertisements, navigation information, pictures and copyright notices), extracts the meta-information of the body text that is useful for public opinion analysis, and reconstructs the text so that information representative of the topic is gathered together. The public opinion information preprocessing module takes the collected sources after they have passed through the extraction module and performs Chinese word segmentation, stop-word filtering, named entity recognition, part-of-speech tagging, syntactic parsing and feature word extraction, and builds a forward index and an inverted index. It then builds a text feature semantic network graph: the entities E contained in the text are the nodes of the graph, and the semantic relation between two entities is a directed edge; the semantic relations between entities combined with word frequency information give the node weights, and the weight of a directed edge represents the importance of the entity relation in the text. The entities E include thing entities NE, event entities VE and event relation entities RE. Finally, the module counts word frequency and document frequency and performs feature word extraction, choosing the words that embody the features of the text to represent it.
Text analysis of network public opinion information, such as text mining and natural language processing, must begin with word segmentation. Drawing on domestic research in Chinese word segmentation, the method uses the word segmentation, part-of-speech tagging and named entity recognition functions of ICTCLAS, the Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences, to segment the public opinion text and extract words whose length is greater than two. After segmentation, stop words that are useless for machine understanding of the text are filtered out, and words with parts of speech such as noun, verb, adnoun and verbal adjective are kept to obtain the candidate feature word set; this effectively reduces the size of the index and improves retrieval efficiency and accuracy. A forward index and an inverted index are built over the segmented documents to support interactive user queries. After the text has been segmented, tagged and stripped of stop words, its feature semantic network is built, statistics such as word frequency and document frequency are collected, and weighting and feature extraction follow.
In step c, the public opinion information mining module takes the text set that has been preprocessed (Chinese word segmentation, stop-word filtering and structured tag analysis) and, using the text semantic feature description structure that the extraction module built from the text feature semantic network graph, computes the semantic similarity between texts with a similarity evaluation method and builds a similarity matrix. It then generates clustering results with the improved text clustering analysis method based on semantic similarity. Category description information is generated from the clustering results, and the text information contained in the clustering results is filtered out. Category features are computed with the feature-statistics-based TFIDF word-frequency feature calculation method to obtain candidate category feature words; nouns are selected as candidate category feature words and sorted by weight, the weight value determines which candidates become category keywords, and the semantic relations between category keywords are used to form the classification results. A knowledge base is built from the results, and the knowledge base can also be configured to support text mining functions such as public opinion topic discovery and public opinion sentiment classification.
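Building the similarity matrix can be illustrated with a short sketch. The patent computes similarity from the text feature semantic network graph and does not give the formula in this passage, so the example below swaps in plain cosine similarity over term counts purely as a stand-in; `cosine` and `similarity_matrix` are hypothetical helper names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings (a stand-in measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(docs):
    """Return a symmetric n x n matrix of pairwise similarities.

    docs is a list of token lists, one list per text.
    """
    vectors = [Counter(tokens) for tokens in docs]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # The measure is symmetric, so each pair is computed once.
            matrix[i][j] = matrix[j][i] = cosine(vectors[i], vectors[j])
    return matrix
```

A clustering step would then read similarities from this matrix instead of recomputing them for every comparison.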
In step d, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes public opinion evaluation indices such as topic attention, content sensitivity, degree of propagation and diffusion, and the influence of public opinion release, to help the relevant departments keep track of public opinion dynamics in a timely manner and make correct decisions.
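The multidimensional statistics in step d amount to aggregating a measure over chosen dimensions. The sketch below shows one such roll-up in plain Python; it is only an illustration, and the record fields ("topic", "day", "comments") and the function name `rollup` are hypothetical, not taken from the patent.

```python
from collections import defaultdict


def rollup(records, dims, measure):
    """Sum one measure over the given dimensions, as in an OLAP roll-up.

    records is a list of dicts; dims is a list of keys to group by.
    """
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[d] for d in dims)
        totals[key] += record[measure]
    return dict(totals)
```

Calling `rollup(records, ["topic"], "comments")` gives a per-topic attention figure, while `rollup(records, ["topic", "day"], "comments")` drills down to a daily view of the same measure.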
Compared with the prior art, the present invention has the following advantages:
1. Current network public opinion information is massive, dynamic, incomplete and diverse in form, yet existing analysis methods often ignore the correlations within the text content of public opinion information, which makes their analysis results inaccurate. The present invention builds a text feature semantic network graph model of public opinion texts, introducing into the text description structure the semantic associations between words and the links to their context; combined with the improved text clustering algorithm based on semantic similarity, it mines the contextually and semantically related content in public opinion texts.
2. By building the text feature semantic network graph of a public opinion text, the contextual relations between words are turned into a directed graph made up of feature items and weights. This retains the structure of the words' contextual information while strengthening the contextual semantics of the words, so it better describes the semantic information and topic features implicit in the text and solves the problem of missing word-level semantic information.
3. The improved text clustering algorithm based on semantic similarity suits cluster analysis of dynamic data and discovery of hot public opinion topics in large-scale network environments. By computing text semantic similarity and building a text semantic similarity matrix, it mines in depth the contextually and semantically related content in public opinion texts and detects and tracks new topic events in time. It represents the topic of a class with multiple centers and takes the largest of the similarities to the text centers in a class as the similarity for that class, which effectively improves system efficiency; the clustering effect becomes more evident as the number of texts grows.
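The multi-center representation in advantage 3 can be sketched as follows, assuming a precomputed similarity matrix. The threshold rule for opening a new topic, its default value and the names `cluster_similarity` and `assign` are assumptions for illustration; the text only states that the maximum similarity to the centers in a class is used.

```python
def cluster_similarity(text_index, centers, sim):
    """Similarity between a text and a class: the maximum over the class centers."""
    return max(sim[text_index][c] for c in centers)


def assign(text_index, clusters, sim, threshold=0.5):
    """Return the index of the best matching class, or None for a new topic.

    clusters is a list of classes, each a list of center text indices.
    The threshold is an assumed parameter, not a value from the source.
    """
    best, best_score = None, threshold
    for k, centers in enumerate(clusters):
        score = cluster_similarity(text_index, centers, sim)
        if score >= best_score:
            best, best_score = k, score
    return best
```

Comparing a new text only with a few centers per class, rather than with every member, is what keeps the cost low as the text set grows.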
Brief description of the drawings
Fig. 1 is a workflow diagram of the network public opinion information analysis method based on text semantic relevance according to an embodiment of the present invention.
Detailed description of the invention
The present invention is further described below with reference to the drawing and specific embodiments; embodiments of the present invention are not limited to these.
As shown in Fig. 1, the method of the present invention uses a network public opinion information analysis system comprising a network public opinion information acquisition module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database. Its processing flow is:
(1) Public opinion information acquisition
Network public opinion information sources are collected. Unlike a general web crawler, the module must not only crawl web pages but also format the page content and extract the useful public opinion information, such as the topic and content; the resulting data is saved as txt or html files and written to the raw public opinion information database. The specific steps are as follows. According to a preset acquisition strategy, and starting from the URLs of multiple sub-pages, the module sends HTTP requests (using the GET method) through the relevant ports; the remote server returns an HTML document according to the request. The acquisition module collects all the information in the returned document, saves it first to a cache and then to the database, and thus obtains the text information in the page. While obtaining page text, it keeps extracting newly appearing hyperlink URLs from the current page to visit and discards URLs that have already been visited. This loop repeats until the page text matching the search strategy has been collected and the queue of unvisited URLs is empty. The collected page text is stored in the database by field, for the extraction module to call.
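The acquisition loop described above is a queue-driven crawl that skips visited URLs. A minimal sketch follows; the page fetcher and the relevance check are passed in as functions so that the example stays self-contained, and the names `crawl`, `fetch`, `is_relevant` and `max_pages` are assumptions rather than parts of the patented system.

```python
import re
from collections import deque

# Simplified link pattern for illustration; real pages need a proper HTML parser.
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def crawl(seed_urls, fetch, is_relevant, max_pages=100):
    """Crawl from the seed URLs until the queue of unvisited URLs is empty.

    fetch(url) returns the page HTML or None; is_relevant(html) decides
    whether a page matches the acquisition strategy.
    """
    queue = deque(seed_urls)
    visited = set()
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # discard URLs that were already accessed
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):
            pages[url] = html
        # Queue newly discovered links from the current page.
        for link in HREF.findall(html):
            if link not in visited:
                queue.append(link)
    return pages
```

In a real deployment `fetch` would issue the HTTP GET request and apply the anti-blocking measures, and the returned pages would be written to the database by field.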
The acquisition module generally uses an anti-blocking strategy that combines several techniques, such as time-shared access, periodic IP address changes and single sign-on through a simulated browser. Many websites, such as forums, blogs and microblogs, can be accessed only by logged-in users; here the simulated-browser strategy is easier to implement. The WebBrowser control provided by the Microsoft development tool Visual Studio 2008 wraps the API of Microsoft Internet Explorer; it is used to simulate an SSO single sign-on by submitting a user name and password. Once the login information has loaded, the page jumps to the corresponding URL, keywords are submitted for retrieval, and the source file of the required page is obtained.
The collected page text comprises two parts: Web content information, and Web structure and usage record information. Web content information includes textual content such as news titles, body text and comments; Web structure and usage record information includes statistics such as click counts, page views and comment counts.
(2) Public opinion information extraction
The collected web information contains noise data such as advertisements, navigation information, pictures and copyright notices. What public opinion analysis actually needs is the meta-information of the body text, so this irrelevant content is removed and the meta-information of the body text useful for analysis is extracted, to serve the subsequent mining and analysis of the text. The specific flow is as follows:
(2-1) First normalize the HTML markup of the page text with the Tidy tool, then build an HTML tree with the HTML Parser tool, using HTML tags as the nodes of the tree. This representation makes the HTML code easier to manage and manipulate and better supports structured mining of the code.
(2-2) Extract related information such as title, keywords, body text, length, update time and URL from the collected sources. The title can be taken from the text between the <TITLE> and </TITLE> tags; keywords are contained in the META tags in the head of the HTML file and can be extracted from the META tag information; time information can be extracted through pattern matching and web page analysis.
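Step (2-2) can be illustrated with Python's standard `html.parser` module, used here in place of the Tidy and HTML Parser tools named in the text. The class `MetaExtractor` and the function `extract_meta` are hypothetical names, and the sketch assumes that keywords are comma-separated in a META tag whose name is "keywords".

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the <title> text and the keywords META tag of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return (title, keyword list) for one HTML document."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.keywords
```

Update time is left out on purpose, since the text says it requires pattern matching tailored to each site.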
(2-3) The specific steps of body text extraction are: select suitable keywords and obtain the URLs of the related pages; access the server at each URL and obtain the HTML source of the page; delete the useless markup lines in the page source and keep the body content; replace the paragraph marks in the HTML code (such as </p> and <br>) with special symbols (such as *[/p]* and *[/br]*), and replace carriage returns and line feeds with a line separator, so that the content is stored line by line and the page layout is retained; extract the text between the HTML tags (delimited by "<" and ">") on each line; replace the special symbols (such as *[/p]* and *[/br]*) with carriage returns to keep the original paragraphs of the text; finally, remove the HTML escape sequences (such as &quot; and &lt;) from the resulting string and, with the help of regular expressions, match and extract the final text.
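The placeholder technique in step (2-3) can be sketched with regular expressions. This is a simplified illustration under stated assumptions: "useless markup" is taken to mean script and style blocks, a newline stands in for the carriage return, escape sequences are decoded with `html.unescape` rather than deleted, and `extract_text` is a hypothetical name.

```python
import html
import re


def extract_text(source):
    """Strip tags from HTML source while keeping paragraph breaks."""
    # Delete script and style blocks (assumed to be the useless markup).
    s = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", source)
    # Replace paragraph marks with placeholder symbols.
    s = re.sub(r"(?i)</p\s*>", "*[/p]*", s)
    s = re.sub(r"(?i)<br\s*/?>", "*[/br]*", s)
    # Original line breaks carry no meaning in HTML, so flatten them.
    s = re.sub(r"[\r\n]+", " ", s)
    # Drop every remaining tag, keeping only the text between tags.
    s = re.sub(r"<[^>]+>", "", s)
    # Turn the placeholders back into line breaks.
    s = s.replace("*[/p]*", "\n").replace("*[/br]*", "\n")
    # Decode escape sequences last so decoded "<" is not mistaken for a tag.
    s = html.unescape(s)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in s.split("\n")]
    return "\n".join(line for line in lines if line)
```

The placeholders exist so that paragraph boundaries survive the step that removes all tags; without them the body text would collapse into one line.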
After the title, keywords, body text, length, update time, URL and other related information have been extracted from the collected sources, the extraction module reconstructs the text.
Text reconstruction analyzes the forms in which public opinion information exists (online news, forum posts, microblog posts and so on) and the structural features of the text, then groups the information that represents the topic into a "purport block" and the remaining information into a "content block", in order to improve the clustering effect.
For web news, the title and the first paragraph form the "purport block", and the remaining news description and the comments form the "content block".
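For the web news case, the reconstruction is a simple split of the extracted fields. The sketch below assumes the body text has already been divided into paragraphs; the function name `reconstruct_news` and the dictionary keys are hypothetical.

```python
def reconstruct_news(title, paragraphs, comments):
    """Split a news item into a purport block and a content block.

    The title and the first paragraph carry the topic; the remaining
    paragraphs and the comments form the content block.
    """
    purport = [title] + paragraphs[:1]
    content = paragraphs[1:] + comments
    return {"purport": purport, "content": content}
```

Keeping the two blocks separate lets a later clustering step weight the purport block more heavily than the content block, if desired.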
For forum posts, the title and the main post form the "purport block". Replies and follow-up posts are cleaned by removing posts that contain no Chinese characters and posts that use stock evaluation phrases, and a number of the remaining posts are selected to form the "content block".
(3) Public opinion information preprocessing
After extraction, the public opinion information is preprocessed (Chinese word segmentation, named entity recognition, part-of-speech tagging, syntactic parsing, feature word extraction and so on) and the results are saved in the database. Text analysis of network public opinion information, such as text mining and natural language processing, must begin with word segmentation. Drawing on domestic research in Chinese word segmentation, the method uses ICTCLAS, the Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences, to segment and tag the text, and extracts words whose length is greater than two. ICTCLAS provides Chinese word segmentation, part-of-speech tagging and new word recognition; it performs named entity recognition with a role-model method; and it supports user-defined dictionaries, giving high segmentation precision and good results. The code is as follows:
After segmentation, stop words that are useless for machine understanding of the text are filtered out, and words with parts of speech such as noun, verb, adnoun and verbal adjective are kept to obtain the candidate feature word set. This avoids redundancy in the text, effectively reduces the size of the index, and improves retrieval efficiency and accuracy.
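The filtering step above can be sketched as follows, assuming the segmenter has already produced (word, tag) pairs. This is not the ICTCLAS code that the text refers to; the tag names in `KEEP_POS` and the function name `candidate_features` are assumptions chosen for the example.

```python
# Part-of-speech tags to keep; the exact tag names are an assumption.
KEEP_POS = {"n", "v", "an", "vn"}


def candidate_features(tagged_words, stop_words, keep_pos=KEEP_POS):
    """Drop stop words and keep only words with the selected tags.

    tagged_words is a list of (word, pos_tag) pairs from a segmenter.
    """
    return [word for word, pos in tagged_words
            if word not in stop_words and pos in keep_pos]
```

For example, with the stop-word set {"情况"}, the input [("洪水", "n"), ("的", "u"), ("救援", "v"), ("情况", "n")] yields ["洪水", "救援"].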
The segmented text is used to build a forward index and an inverted index, which support interactive user queries. For the forward index, the top N words by word frequency are selected to represent each text, expressed as a hash table: <file name, keyword list>. Once the forward index is built, each keyword is searched for in the texts to find all the file names that contain it, and these file names are grouped to give the inverted index, expressed as a hash table: <keyword, file name list>.
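The two hash tables can be sketched with plain dictionaries. This is an in-memory illustration only, and the function names `forward_index` and `inverted_index` and the default `top_n` are assumptions.

```python
from collections import Counter, defaultdict


def forward_index(docs, top_n=3):
    """Map each file name to its top_n most frequent words.

    docs maps a file name to the list of words in that file.
    """
    return {name: [word for word, _ in Counter(words).most_common(top_n)]
            for name, words in docs.items()}


def inverted_index(forward):
    """Invert the forward index: map each keyword to the files containing it."""
    inverted = defaultdict(list)
    for name, keywords in forward.items():
        for word in keywords:
            inverted[word].append(name)
    return dict(inverted)
```

A query for a keyword then becomes a single dictionary lookup in the inverted index.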
Index building and index retrieval are implemented with the Apache open source project Lucene, which provides a complete query engine, index engine and text analysis engine; Hadoop is used to store and manage the massive index files.
The index is built as follows:
1. Create the index writer object IndexWriter. A lexical analyzer must be supplied when this object is created, and different analyzers use different dictionaries. ThesaurusAnalyzer is selected, which can extract a synopsis;
2. Create one Document object for each result taken from the database;
3. Create a Field object for each data element in the result and add it to the Document object;
4. Write the Document object.
An index search proceeds as follows: first create a query parser, which takes parameters such as the Field object name and the corresponding lexical analyzer; then obtain a query object from the query parser and the keyword; finally use the query object to retrieve the result set, which is made up of Document objects.
After the text has been segmented, tagged and stripped of stop words, its feature semantic network is built, statistics such as word frequency and document frequency are collected, and weighting and feature extraction follow.
The text feature semantic network graph is a directed graph that expresses public opinion information through entities and their semantic relations. The entities E contained in the text (thing entities NE, event entities VE and event relation entities RE) are the nodes, and the semantic relation between two entities is a directed edge; the semantic relations between entities combined with word frequency information give the node weights, and the weight of a directed edge represents the importance of the entity relation in the text. The graph is built by introducing node weights and then merging and simplifying by concept, which extracts the core semantics of the text: nodes that represent the same word are merged and their weights added; directed edges are then merged and their weights added. The resulting graph describes the semantic information and topic features of the text. The concepts are defined as follows:
C1: things entity NE is defined as NE(id, concept, property, power).Id represents entity identification,
Concept represents entitative concept, and property represents entity attribute, and power represents weight.
C2: event entity VE is defined as VE(id, concept, property, power, isN, subT, objT1,
ObjT2).In addition to the several data item comprising NE, whether isN represents is negative, and subT represents main body entity gauge outfit, objTl
With the gauge outfit that objT2 represents object entity 1 and 2.
C3: event relation entity RE is defined as RE(id, concept, property, power, isN, subT, objT).RE
Just can be fully described with a pair Subjective and Objective entity.
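The three entity records can be written down as simple data classes. This is only a sketch: the field names follow C1 to C3, but the field types, the shared base class `Entity`, and the choice to store subject and object headers as the ids of other entities are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    """Fields shared by all entities E: (id, concept, property, power)."""
    id: int
    concept: str
    property: str
    power: float


@dataclass
class NE(Entity):
    """Thing entity: only the four shared fields."""


@dataclass
class VE(Entity):
    """Event entity: negation flag plus one subject and up to two objects."""
    isN: bool = False
    subT: Optional[int] = None   # assumed: id of the subject entity
    objT1: Optional[int] = None  # assumed: id of object entity 1
    objT2: Optional[int] = None  # assumed: id of object entity 2


@dataclass
class RE(Entity):
    """Event-relation entity: negation flag plus one subject-object pair."""
    isN: bool = False
    subT: Optional[int] = None
    objT: Optional[int] = None


# Hypothetical sentence: "The company did not recall the product."
company = NE(id=1, concept="company", property="organization", power=1.0)
product = NE(id=2, concept="product", property="goods", power=1.0)
recall = VE(id=3, concept="recall", property="action", power=1.0,
            isN=True, subT=company.id, objT1=product.id)
print(recall)
```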
The text feature semantic network graph model is analyzed in the following steps:
S1: When analyzing a text, first build the feature semantic network graph of each sentence, taking the sentence as the unit. Analyze sentence by sentence which NEs are created, and record each NE and its attribute information in the entity information table.
S2: After the NEs are analyzed, analyze the VEs and register each VE's concept, attributes, subject, and object. VEs with identical subject and object are represented as the same VE; otherwise they are given different ids.
S3: Next, analyze the REs. During analysis, REs must be distinguished from NEs and VEs; the concept, attributes, subject, and object of each RE are registered in the entity information table.
S4: When the analysis ends, the entity information table of the sentence is obtained. The entity information table describes the relations between entities and is used to construct the entity relation graph. The relations between NE and VE, between RE and NE or VE, and between an entity E and its attributes T are visualized with different kinds of lines.
S5: Starting from the feature semantic network built for the first sentence, merge in the feature semantic network graphs of the subsequent sentences, merging nodes first and then directed edges.
S6: When merging nodes, nodes whose words are identical or whose semantic similarity meets the threshold condition are merged and their node weights are added; otherwise the node is retained.
S7: When merging directed edges, the directed edges that exist between merged nodes are merged and their weights are added.
S8: Update the weights of the edges adjacent to a merged node to the new weight of that node, strengthening the semantic relations between nodes.
S9: After the feature semantic networks of all sentences have been merged and output, construction of the feature semantic network of the whole text is complete.
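Steps S5 to S7 amount to summing weights over per-sentence graphs. The sketch below makes that concrete under simplifying assumptions: nodes are merged only when their words are identical (the semantic-similarity threshold of S6 and the edge-weight update of S8 are left out), and the graph representation and the name `merge_sentence_graphs` are hypothetical.

```python
from collections import defaultdict


def merge_sentence_graphs(graphs):
    """Merge per-sentence graphs by summing node weights, then edge weights.

    Each graph is a pair (nodes, edges), where nodes maps a word to its
    weight and edges maps a (source_word, target_word) pair to its weight.
    """
    nodes = defaultdict(float)
    edges = defaultdict(float)
    for sentence_nodes, sentence_edges in graphs:
        # S6 (simplified): identical words are merged, weights are added.
        for word, weight in sentence_nodes.items():
            nodes[word] += weight
        # S7: directed edges between merged nodes are merged, weights are added.
        for edge, weight in sentence_edges.items():
            edges[edge] += weight
    return dict(nodes), dict(edges)


# Two hypothetical sentence graphs.
g1 = ({"flood": 1.0, "city": 1.0}, {("flood", "city"): 1.0})
g2 = ({"flood": 2.0, "relief": 1.0},
      {("flood", "city"): 0.5, ("relief", "flood"): 1.0})
nodes, edges = merge_sentence_graphs([g1, g2])
print(nodes)
print(edges)
```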
The next step is to assign weights according to part-of-speech features so that the text is represented accurately. Based on Chinese part-of-speech features and the elements of a complete event description (time, place, person, and event content), and in combination with the Chinese Academy of Sciences Chinese part-of-speech tag set, text feature weights are assigned as follows: the title weight is 3, the subtitle and keyword weight is 2, the abstract weight is 1.5, and the weight of the first and last sentence of a paragraph is 1.3.
After preprocessing, the title, body, and replies of the public opinion information are given different labels. When weights are calculated, the label information of each keyword is read to assign the position weight of the word.
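The position weights listed above fit naturally into a lookup table keyed by label. In the sketch below the label strings, the default weight of 1.0 for positions the text does not mention, and the choice to sum the weights over all occurrences of a word are assumptions; only the numeric weights 3, 2, 1.5, and 1.3 come from the description.

```python
POSITION_WEIGHTS = {
    "title": 3.0,
    "subtitle": 2.0,
    "keyword": 2.0,
    "abstract": 1.5,
    "paragraph_first_sentence": 1.3,
    "paragraph_last_sentence": 1.3,
}
DEFAULT_WEIGHT = 1.0  # assumption: no weight is given for other positions


def weighted_frequency(occurrences):
    """Sum position weights over (word, label) occurrences."""
    totals = {}
    for word, label in occurrences:
        weight = POSITION_WEIGHTS.get(label, DEFAULT_WEIGHT)
        totals[word] = totals.get(word, 0.0) + weight
    return totals


# Hypothetical labelled occurrences produced by preprocessing.
occurrences = [("flood", "title"), ("flood", "body"), ("relief", "abstract")]
print(weighted_frequency(occurrences))
```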
(4) Public opinion information mining
The public opinion information mining module operates after the text set has been preprocessed (Chinese word segmentation, stop-word filtering, and structured label analysis). For the text data set generated by the information extraction module, it uses the text semantic feature description structure built from the text feature semantic network graph, calculates the semantic similarity between texts with the similarity evaluation method, builds a similarity matrix, and generates clustering results with the improved text clustering analysis method based on semantic similarity. Category description information is generated from the clustering results, and the text information contained in the clustering results is filtered out. Category features are counted with the TFIDF word-frequency feature calculation method based on feature statistics to obtain candidate category feature words: nouns are selected as candidate category feature words and sorted by candidate word weight, candidate feature words are chosen as category keywords according to their weight values, and the semantic relations between category keywords are used to form the classification results. A knowledge base is built from the results; the knowledge base can also be configured to support text mining functions such as public opinion topic discovery and public opinion sentiment classification.
First, the similarity between texts is defined and computed, that is, the degree of correlation between the topics the texts discuss. $Sim(D_1, D_2)$ denotes the similarity between text $D_1$ and text $D_2$. The similarity takes values between 0 and 1 and is proportional to how similar $D_1$ and $D_2$ are. The greater the similarity between texts, the more closely related their topics. The semantic similarity between texts is evaluated as follows:
Let the texts obtained after the public opinion information extraction and preprocessing of step b be $D_1(t_{11}, t_{12}, t_{13}, \ldots, t_{1m})$ and $D_2(t_{21}, t_{22}, t_{23}, \ldots, t_{2m})$. Compute the similarity between every keyword $t_{1i}$ of text $D_1$ and every keyword $t_{2j}$ of text $D_2$ to form the similarity matrix $M(D_1, D_2) = (Sim_{ij})$,
where $Sim_{ij}$ ($1 \le i, j \le m$) denotes the similarity between keyword $t_{1i}$ of text $D_1$ and keyword $t_{2j}$ of text $D_2$; $M(D_1, D_2)$ denotes the similarity matrix between text $D_1$ and text $D_2$; $i$ is the number of keywords of text $D_1$; $m$ is the number of keywords of text $D_2$.
The word similarity formula is $S(T_1, T_2) = \max_{i=1,2,\ldots,n;\ j=1,2,\ldots,m} S(y_{1i}, y_{2j})$; that is, the similarity of two words is the maximum similarity over all pairs of their senses (a sense being one of the multiple meanings a word carries).
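The word similarity rule is a maximum over all sense pairs, which can be sketched as follows. The sense inventory and the sense-level similarity function are not specified in the text, so the lookup table `SENSE_SIMILARITY` and its values are purely hypothetical stand-ins.

```python
def word_similarity(senses_1, senses_2, sense_similarity):
    """Return the maximum similarity over all sense pairs of two words."""
    return max(
        (sense_similarity(a, b) for a in senses_1 for b in senses_2),
        default=0.0,  # assumption: a word without senses has similarity 0
    )


# Hypothetical sense-level similarities.
SENSE_SIMILARITY = {
    ("fruit", "fruit"): 1.0,
    ("fruit", "color"): 0.2,
    ("company", "fruit"): 0.1,
    ("company", "color"): 0.0,
}


def lookup(a, b):
    """Toy sense similarity: table lookup, 0.0 for unknown pairs."""
    return SENSE_SIMILARITY.get((a, b), 0.0)


apple_senses = ["fruit", "company"]
orange_senses = ["fruit", "color"]
print(word_similarity(apple_senses, orange_senses, lookup))
```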
Traverse the similarity matrix M, find the keyword pair with the largest similarity value Sim, and delete the corresponding row and column. Then continue traversing M to find the keyword pair with the largest similarity among the remaining entries, and repeat until M is an empty matrix. Finally, use the resulting sequence of maximum-similarity keyword pairs to obtain the semantic similarity of texts D1 and D2 with the following formula:
where max is the maximum of the similarity Sim; i is the number of keywords of text D1; j is the number of keywords of text D2.
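The matrix traversal is a greedy pairing, which the following sketch implements on a list-of-lists matrix. The final normalization is an assumption: the formula itself is not present in this text, so `text_similarity` divides the sum of the selected maxima by the larger keyword count only to keep the result between 0 and 1. The function names are hypothetical.

```python
def greedy_max_pairs(matrix):
    """Repeatedly pick the largest remaining entry, then drop its row and column."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0]))) if matrix else []
    picked = []
    while rows and cols:
        i, j = max(
            ((r, c) for r in rows for c in cols),
            key=lambda rc: matrix[rc[0]][rc[1]],
        )
        picked.append((i, j, matrix[i][j]))
        rows.remove(i)
        cols.remove(j)
    return picked


def text_similarity(matrix):
    """Assumed normalization: sum of picked maxima over the larger keyword count."""
    picked = greedy_max_pairs(matrix)
    if not picked:
        return 0.0
    total = sum(value for _, _, value in picked)
    return total / max(len(matrix), len(matrix[0]))


# Hypothetical keyword similarity matrix: 2 keywords in D1, 3 in D2.
M = [
    [0.75, 0.25, 0.125],
    [0.25, 0.5, 0.375],
]
print(greedy_max_pairs(M))
print(text_similarity(M))
```

Greedy selection does not guarantee the pairing with the largest total similarity; it follows the traversal described in the text rather than an optimal assignment.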
The improved text clustering analysis method based on semantic similarity is described as follows:
1. For all collected texts after preprocessing, apply TFIDF weighting to all category keywords and extract the m best feature keywords to form the original keyword-based feature vector $D_i^*$.
2. According to the knowledge base, preprocess the keywords in the original keyword-based feature vector $D_i^*$: find the vocabulary in the knowledge base that matches each keyword and replace it, forming the new feature vector $D_i$, $D_i = (T_1, T_2, \ldots, T_i)$, $i = 1, 2, 3, \ldots, m$.
3. Form the m feature vectors $D_i$ of the n texts, use the text semantic similarity formula to compute the semantic similarity between the collected texts, form the similarity matrix M of the text set, and obtain the average similarity MA of all feature vectors. The calculation formula is as follows:
4. Set three similarity thresholds: a duplication threshold of 0.9, a topic-center threshold of 0.5, and a new-topic threshold of 0.3;
5. Compare each text with the central topic. If the similarity between the text and the initial center of the central topic is greater than the duplication threshold 0.9, the text is considered to have the same content on the same topic; if the similarity is less than the new-topic threshold 0.3, a new class must be created for the text; if the similarity is in the range 0 to 0.5, the text is a core-content text that discusses the same topic from a different aspect and is marked as a second center. Continuing in this way forms a hierarchical clustering result with multiple centers.
6. For the multi-center topic representation, the maximum of the similarities between a text and each center in a class is taken as the similarity of the text to that class.
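Steps 4 to 6 can be read as a single-pass assignment rule, sketched below. Several points are assumptions rather than statements of the method: the band from 0.3 up to 0.5 is treated as "second center" (the text gives the range as 0 to 0.5, which overlaps the new-topic rule), similarities from 0.5 to 0.9 are treated as ordinary membership, and Jaccard overlap of keyword sets stands in for the semantic similarity measure. All names are hypothetical.

```python
DUPLICATE_T, CENTER_T, NEW_TOPIC_T = 0.9, 0.5, 0.3


def jaccard(a, b):
    """Stand-in similarity: overlap of two keyword sets (not the patent's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, sim):
    """Single-pass multi-center clustering using the three thresholds."""
    clusters = []
    for text in texts:
        best, best_sim = None, -1.0
        for c in clusters:
            # Step 6: similarity to a class is the maximum over its centers.
            s = max(sim(text, center) for center in c["centers"])
            if s > best_sim:
                best, best_sim = c, s
        if best is None or best_sim < NEW_TOPIC_T:
            clusters.append({"centers": [text], "members": [text], "duplicates": []})
        elif best_sim > DUPLICATE_T:
            best["duplicates"].append(text)   # same content, same topic
        elif best_sim < CENTER_T:
            best["centers"].append(text)      # assumed band [0.3, 0.5): new center
            best["members"].append(text)
        else:
            best["members"].append(text)      # assumed band [0.5, 0.9]: plain member
    return clusters


t1 = frozenset({"flood", "city", "relief", "rain"})
t2 = frozenset({"flood", "city", "relief", "rain"})
t3 = frozenset({"flood", "city", "dam", "levee"})
t4 = frozenset({"election", "vote"})
t5 = frozenset({"flood", "city", "relief", "dam"})
result = cluster([t1, t2, t3, t4, t5], jaccard)
print(len(result), [len(c["centers"]) for c in result])
```

Comparing each new text only with the stored centers, rather than with every text seen so far, is what keeps the per-text cost bounded as the collection grows.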
The improved text clustering algorithm based on semantic similarity is suited to clustering analysis of dynamic data and to discovering public opinion topic hotspots in large-scale network environments; it can detect new events in time and detect and track new public opinion topics. Representing a public opinion topic with multiple centers within a class effectively improves system running efficiency, and the effect becomes more pronounced as the number of texts grows.
(5) Public opinion information analysis
The public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database, analyzing public opinion evaluation indicators such as attention to public opinion topic content and the sentiment orientation of public opinion topics. This supports the relevant departments in grasping public opinion dynamics in time, issuing public opinion information at the right moment, and making correct decisions.
The public opinion topics produced by collection, processing, and mining analysis are expressed as $T = (T_1, T_2, \ldots, T_n)$, where $T_i$ denotes a text of the public opinion topic. The attention of a public opinion topic text is expressed as $T_i = \alpha N_p + \beta N_r$; the attention metric formula of the public opinion topic is:
where $\alpha$ and $\beta$ are weights, $N_p$ is the number of clicks on a public opinion topic text, and $N_r$ is the number of comments; $N_{p_i}$ is the number of clicks on the i-th public opinion topic text, and $N_{r_i}$ is the number of comments on the i-th public opinion topic text. Because $N_p > N_r$, statistics give $\alpha = 0.02$ and $\beta = 0.98$.
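As a worked example of the per-text attention $\alpha N_p + \beta N_r$ with $\alpha = 0.02$ and $\beta = 0.98$, a text with 1000 clicks and 50 comments scores $0.02 \times 1000 + 0.98 \times 50 = 69$. The sketch below computes this; the topic-level aggregate is an assumption (a plain sum over the topic's texts), because the topic formula itself is not present in this text, and the function names are hypothetical.

```python
ALPHA, BETA = 0.02, 0.98


def text_attention(clicks, comments, alpha=ALPHA, beta=BETA):
    """Attention of one topic text: alpha * clicks + beta * comments."""
    return alpha * clicks + beta * comments


def topic_attention(texts):
    """Assumed aggregate: sum of per-text attention over (clicks, comments) pairs."""
    return sum(text_attention(clicks, comments) for clicks, comments in texts)


# Hypothetical click and comment counts for two texts of one topic.
print(text_attention(1000, 50))
print(topic_attention([(1000, 50), (200, 10)]))
```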
The sentiment orientation of a public opinion topic is described on the basis of clustering analysis of the topic's text data. First a threshold is set; a text shows polarity (positive or negative) only when its orientation metric value exceeds the threshold. If the orientation metric value of the text is positive, the text is a positive comment; otherwise it is a negative comment.
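The polarity rule can be sketched as a small function. It assumes that the threshold is compared against the magnitude of the orientation metric, so that weakly negative texts are also treated as neutral; the text does not state this explicitly, and the name `polarity` and the example threshold are hypothetical.

```python
def polarity(score, threshold):
    """Return 'positive', 'negative', or None when the score is within the threshold."""
    if abs(score) <= threshold:  # assumption: threshold applies to the magnitude
        return None
    return "positive" if score > 0 else "negative"


for score in (0.8, -0.6, 0.1):
    print(score, polarity(score, threshold=0.3))
```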
Through collection, preprocessing, information extraction, mining, and analysis, detailed data on public opinion topics are obtained and processed according to the established public opinion indicator evaluation system; the processing results support decision-making.
Claims (7)
1. A network public opinion information analysis method based on text semantic relevance, characterized in that it uses a network public opinion information analysis system comprising a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module, and a public opinion information database, and comprises the following steps:
a. the network public opinion information collection module collects various public opinion information from web pages and stores it in the public opinion information database;
b. the public opinion information extraction module and the public opinion information preprocessing module perform preliminary filtering and segmentation on the public opinion information collected in step a and extract the content information contained in the text, providing data services for public opinion information mining;
c. on the basis of step b, the public opinion information mining module uses the improved text clustering analysis method based on semantic similarity to generate category description information and filter out the text information contained in the clustering results; category features are counted with the TFIDF word-frequency feature calculation method based on feature statistics to obtain candidate category feature words; nouns are selected as candidate category feature words and sorted by candidate word weight; candidate feature words with larger weight values are taken as category keywords, and the semantic relations between category keywords are used to form the classification results; new network public opinion topics are identified and established, and content related to existing public opinion topics is detected and tracked;
d. finally, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c, analyzing public opinion evaluation indicators such as attention to public opinion topic content and the sentiment orientation of public opinion topics;
in step a, the public opinion information collection module collects from network public opinion information sources: it crawls web pages, formats the page content, extracts the topic and content of the public opinion, saves the resulting data as txt or html files, and stores them in the public opinion information database; the network public opinion information collection module avoids being blocked by combining three techniques: time-shared access, periodic IP address changes, and single sign-on through a simulated browser.
2. The network public opinion information analysis method based on text semantic relevance according to claim 1, characterized in that the public opinion information collection module performs the following specific steps: starting from the URLs of predefined topic-related web pages, it obtains the text information in each page and extracts new URLs from the current page into a queue, until the public opinion information meeting the conditions has been fully collected and the URL queue is empty; the collected web page text information is stored in the public opinion information database by field category and made available for the public opinion information extraction module to call.
3. The network public opinion information analysis method based on text semantic relevance according to claim 1, characterized in that, in step b, the public opinion information extraction module removes irrelevant content from web pages, extracts the meta-information of the body text that is useful for public opinion analysis, and restructures the text so that information representative of the topic is gathered together; the public opinion information preprocessing module, after the collected public opinion information source has been extracted by the public opinion information extraction module, performs Chinese word segmentation, stop-word filtering, named entity recognition, part-of-speech tagging, syntactic parsing, and feature word extraction, and builds the forward index and the inverted index; a text feature semantic network graph is built in which the entities E contained in the text serve as the nodes of the graph, the semantic relation between two entities serves as a directed edge, the semantic relations between entities combined with word frequency information give the node weights, and the weight of a directed edge represents the importance of the entity relation in the text, the entities E including thing entities NE, event entities VE, and event-relation entities RE; the word frequency and document frequency of the text are counted, feature words are then extracted, and vocabulary that embodies the features of the text is chosen to represent the text.
4. The network public opinion information analysis method based on text semantic relevance according to claim 3, characterized in that, in step c, the public opinion information mining module operates after the text set has been preprocessed, including Chinese word segmentation, stop-word filtering, and structured label analysis; for the text data set generated by the information extraction module, it uses the text semantic feature description structure built from the text feature semantic network graph, calculates the semantic similarity between texts with the similarity evaluation method, builds a similarity matrix, and generates clustering results with the improved text clustering analysis method based on semantic similarity; category description information is generated from the clustering results, and the text information contained in the clustering results is filtered out; category features are counted with the TFIDF word-frequency feature calculation method based on feature statistics to obtain candidate category feature words; nouns are selected as candidate category feature words and sorted by candidate word weight, candidate feature words are chosen as category keywords according to their weight values, and the semantic relations between category keywords are used to form the classification results; a knowledge base is built from the results.
5. The network public opinion information analysis method based on text semantic relevance according to claim 3 or 4, characterized in that the text feature semantic network graph is a directed graph that expresses public opinion information through entities and their semantic relations; the words represented by network nodes are merged and their node weights are added; directed edges are then merged and their weights are added, building the text feature semantic network graph, which describes the semantic information and topic features of the text.
6. The network public opinion information analysis method based on text semantic relevance according to claim 4, characterized in that the semantic similarity between texts is evaluated as follows:
let the texts obtained after the public opinion information extraction and preprocessing of step b be $D_1(t_{11}, t_{12}, t_{13}, \ldots, t_{1m})$ and $D_2(t_{21}, t_{22}, t_{23}, \ldots, t_{2m})$; compute the similarity between every keyword $t_{1i}$ of text $D_1$ and every keyword $t_{2j}$ of text $D_2$ to form the similarity matrix $M(D_1, D_2) = (Sim_{ij})$,
where $Sim_{ij}$ ($1 \le i, j \le m$) denotes the similarity between keyword $t_{1i}$ of text $D_1$ and keyword $t_{2j}$ of text $D_2$; $M(D_1, D_2)$ denotes the similarity matrix between text $D_1$ and text $D_2$; $i$ is the number of keywords of text $D_1$; $m$ is the number of keywords of text $D_2$;
the word similarity formula is $S(T_1, T_2) = \max_{i=1,2,\ldots,n;\ j=1,2,\ldots,m} S(y_{1i}, y_{2j})$, that is, the similarity of two words is the maximum similarity over all pairs of their senses, a sense being one of the multiple meanings a word carries;
traverse the similarity matrix M, find the keyword pair with the largest similarity value Sim, and delete the corresponding row and column; then continue traversing M to find the keyword pair with the largest Sim value among the remaining entries, and repeat until M is an empty matrix; finally, use the resulting sequence of maximum-similarity keyword pairs to obtain the semantic similarity of texts $D_1$ and $D_2$ with the following formula:
where max is the maximum of the similarity Sim; i is the number of keywords of text $D_1$; j is the number of keywords of text $D_2$.
7. The network public opinion information analysis method based on text semantic relevance according to claim 6, characterized in that the improved text clustering analysis method based on semantic similarity is:
1) for all collected texts after preprocessing, apply TFIDF weighting to all category keywords and extract the m best feature keywords to form the original keyword-based feature vector $D_i^*$;
2) according to the knowledge base, preprocess the keywords in the original keyword-based feature vector $D_i^*$: find the vocabulary in the knowledge base that matches each keyword and replace it, forming the new feature vector $D_i$, $D_i = (T_1, T_2, \ldots, T_i)$, $i = 1, 2, 3, \ldots, m$;
3) form the m feature vectors $D_i$ of the n texts, use the text semantic similarity formula to compute the semantic similarity between the collected texts, form the similarity matrix M of the text set, and obtain the average similarity MA of all feature vectors; the calculation formula is as follows:
where n is the number of texts;
4) set three similarity thresholds: a duplication threshold of 0.9, a topic-center threshold of 0.5, and a new-topic threshold of 0.3;
5) compare each text with the central topic; if the similarity between the text and the initial center of the central topic is greater than the duplication threshold 0.9, the text is considered to have the same content on the same topic; if the similarity is less than the new-topic threshold 0.3, a new class must be created for the text; if the similarity is in the range 0 to 0.5, the text is a core-content text that discusses the same topic from a different aspect and is marked as a second center; continuing in this way forms a hierarchical clustering result with multiple centers;
6) for the multi-center topic representation, the maximum of the similarities between a text and each center in a class is taken as the similarity of the text to that class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310482522.5A CN103544255B (en) | 2013-10-15 | 2013-10-15 | Text semantic relativity based network public opinion information analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310482522.5A CN103544255B (en) | 2013-10-15 | 2013-10-15 | Text semantic relativity based network public opinion information analysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103544255A CN103544255A (en) | 2014-01-29 |
CN103544255B true CN103544255B (en) | 2017-01-11 |
Family
ID=49967707
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310482522.5A Active CN103544255B (en) | 2013-10-15 | 2013-10-15 | Text semantic relativity based network public opinion information analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103544255B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109446409A (en) * | 2018-09-19 | 2019-03-08 | 杭州安恒信息技术股份有限公司 | A kind of recognition methods of the target object of doubtful multiple level marketing behavior |
Families Citing this family (151)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103902659B (en) * | 2014-03-04 | 2017-06-27 | 深圳市至高通信技术发展有限公司 | A kind of the analysis of public opinion method and corresponding device |
CN103886051A (en) * | 2014-03-13 | 2014-06-25 | 电子科技大学 | Comment analysis method based on entities and features |
CN104915359B (en) * | 2014-03-14 | 2019-05-28 | 华为技术有限公司 | Theme label recommended method and device |
CN103927545B (en) * | 2014-03-14 | 2017-10-17 | 小米科技有限责任公司 | Clustering method and relevant apparatus |
CN103902674B (en) * | 2014-03-19 | 2017-10-27 | 百度在线网络技术(北京)有限公司 | The acquisition method and device of the comment data of particular topic |
CN103838886A (en) * | 2014-03-31 | 2014-06-04 | 辽宁四维科技发展有限公司 | Text content classification method based on representative word knowledge base |
CN103841216A (en) * | 2014-04-01 | 2014-06-04 | 深圳市科盾科技有限公司 | Network public opinion monitoring system based on cloud platform |
CN104199829B (en) * | 2014-07-25 | 2017-07-04 | 中国科学院自动化研究所 | Affection data sorting technique and system |
CN104346425B (en) * | 2014-07-28 | 2017-10-31 | 中国科学院计算技术研究所 | A kind of method and system of the internet public feelings index system of stratification |
CN104217718B (en) * | 2014-09-03 | 2017-05-17 | 陈飞 | Method and system for voice recognition based on environmental parameter and group trend data |
CA2959651C (en) * | 2014-09-03 | 2021-04-20 | The Dun & Bradstreet Corporation | System and process for analyzing, qualifying and ingesting sources of unstructured data via empirical attribution |
CN104268194A (en) * | 2014-09-19 | 2015-01-07 | 国家电网公司 | Method for dynamically generating public opinion brief report |
CN105574047A (en) * | 2014-10-17 | 2016-05-11 | 任子行网络技术股份有限公司 | Website main page feature analysis based Chinese website sorting method and system |
CN104504150B (en) * | 2015-01-09 | 2017-09-29 | 成都布林特信息技术有限公司 | News public sentiment monitoring system |
CN105992194B (en) * | 2015-01-30 | 2019-10-29 | 阿里巴巴集团控股有限公司 | The acquisition methods and device of network data content |
CN104699763B (en) * | 2015-02-11 | 2017-10-17 | 中国科学院新疆理化技术研究所 | The text similarity gauging system of multiple features fusion |
CN106156041B (en) * | 2015-03-26 | 2019-05-28 | 科大讯飞股份有限公司 | Hot information finds method and system |
CN106156192A (en) * | 2015-04-21 | 2016-11-23 | 北大方正集团有限公司 | Public sentiment data clustering method and public sentiment data clustering system |
CN106294358A (en) * | 2015-05-14 | 2017-01-04 | 北京大学 | The search method of a kind of information and system |
CN104820629B (en) * | 2015-05-14 | 2018-01-30 | 中国电子科技集团公司第五十四研究所 | A kind of intelligent public sentiment accident emergent treatment system and method |
CN104915453A (en) * | 2015-07-01 | 2015-09-16 | 北京奇虎科技有限公司 | Method, device and system for classifying POI information |
CN104899339A (en) * | 2015-07-01 | 2015-09-09 | 北京奇虎科技有限公司 | Method and system for classifying POI (Point of Interest) information |
CN105183803A (en) * | 2015-08-25 | 2015-12-23 | 天津大学 | Personalized search method and search apparatus thereof in social network platform |
CN105183478B (en) * | 2015-09-11 | 2018-11-23 | 中山大学 | A kind of webpage reconstructing method and its device based on color transfer |
CN106528581B (en) * | 2015-09-15 | 2019-05-07 | 阿里巴巴集团控股有限公司 | Method for text detection and device |
CN106649367B (en) * | 2015-10-30 | 2020-03-03 | 北京国双科技有限公司 | Method and device for detecting keyword popularization degree |
US10872103B2 (en) * | 2015-11-03 | 2020-12-22 | Hewlett Packard Enterprise Development Lp | Relevance optimized representative content associated with a data storage system |
CN105279277A (en) * | 2015-11-12 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Knowledge data processing method and device |
CN105389389B (en) * | 2015-12-10 | 2018-09-25 | 安徽博约信息科技股份有限公司 | A kind of network public-opinion propagation situation medium control analysis method |
CN105447202A (en) * | 2015-12-31 | 2016-03-30 | 宁波公众信息产业有限公司 | Internet information collecting system |
CN105677802A (en) * | 2015-12-31 | 2016-06-15 | 宁波公众信息产业有限公司 | Internet information analysis system |
CN105677873B (en) * | 2016-01-11 | 2019-03-26 | 中国电子科技集团公司第十研究所 | Text Intelligence association cluster based on model of the domain knowledge collects processing method |
CN105740238B (en) * | 2016-03-04 | 2019-02-01 | 北京理工大学 | A kind of event relation intensity map construction method merging sentence justice information |
CN105956069A (en) * | 2016-04-28 | 2016-09-21 | 优品财富管理有限公司 | Network information collection and analysis method and network information collection and analysis system |
CN105956070A (en) * | 2016-04-28 | 2016-09-21 | 优品财富管理有限公司 | Method and system for integrating repetitive records |
CN106126558B (en) * | 2016-06-16 | 2019-09-20 | 东软集团股份有限公司 | A kind of public sentiment monitoring method and device |
CN106294542B (en) * | 2016-07-25 | 2018-03-30 | 北京市信访矛盾分析研究中心 | A kind of letters and calls data mining methods of marking and system |
CN106294619A (en) * | 2016-08-01 | 2017-01-04 | 上海交通大学 | Public sentiment intelligent supervision method |
JP2019536137A (en) * | 2016-10-25 | 2019-12-12 | コーニンクレッカ フィリップス エヌ ヴェKoninklijke Philips N.V. | Knowledge diagnosis based clinical diagnosis support |
CN106651696B (en) * | 2016-11-16 | 2020-10-27 | 福建天泉教育科技有限公司 | Approximate question pushing method and system |
CN106776724B (en) * | 2016-11-16 | 2020-09-08 | 福建天泉教育科技有限公司 | Question classification method and system |
CN106599054B (en) * | 2016-11-16 | 2019-12-24 | 福建天泉教育科技有限公司 | Method and system for classifying and pushing questions |
CN108090040B (en) * | 2016-11-23 | 2021-08-17 | 北京国双科技有限公司 | Text information classification method and system |
CN107045524B (en) * | 2016-12-30 | 2019-12-27 | 中央民族大学 | Method and system for classifying network text public sentiments |
CN107016068A (en) * | 2017-03-21 | 2017-08-04 | 深圳前海乘方互联网金融服务有限公司 | Knowledge mapping construction method and device |
CN107918633B (en) * | 2017-03-23 | 2021-07-02 | 广州思涵信息科技有限公司 | Sensitive public opinion content identification method and early warning system based on semantic analysis technology |
CN107145516B (en) * | 2017-04-07 | 2021-03-19 | 北京捷通华声科技股份有限公司 | Text clustering method and system |
CN107066585B (en) * | 2017-04-17 | 2019-10-01 | 济南大学 | A kind of probability topic calculates and matched public sentiment monitoring method and system |
CN107085608A (en) * | 2017-04-21 | 2017-08-22 | 上海喆之信息科技有限公司 | A kind of effective network hotspot monitoring system |
CN107093021A (en) * | 2017-04-21 | 2017-08-25 | 深圳市创艺工业技术有限公司 | Electricity power engineering goods and materials contract is honoured an agreement sincere public sentiment monitoring system |
CN107038156A (en) * | 2017-04-28 | 2017-08-11 | 北京清博大数据科技有限公司 | A kind of hot spot of public opinions Forecasting Methodology based on big data |
CN107291808A (en) * | 2017-05-16 | 2017-10-24 | 南京邮电大学 | It is a kind of that big data sorting technique is manufactured based on semantic cloud |
CN107220236A (en) * | 2017-05-23 | 2017-09-29 | 武汉朱雀闻天科技有限公司 | It is a kind of to determine the doubtful naked method and device for borrowing student |
CN107315778A (en) * | 2017-05-31 | 2017-11-03 | 温州市鹿城区中津先进科技研究院 | A kind of natural language the analysis of public opinion method based on big data sentiment analysis |
CN107292743A (en) * | 2017-06-07 | 2017-10-24 | 前海梧桐(深圳)数据有限公司 | The intelligent decision making method and its system invested and financed for enterprise |
CN107231570A (en) * | 2017-06-13 | 2017-10-03 | 中国传媒大学 | News data content characteristic obtains system and application system |
CN107358344B (en) * | 2017-06-29 | 2021-09-03 | 浙江图讯科技股份有限公司 | Enterprise hidden danger management method and management system thereof, electronic equipment and storage medium |
CN107291697A (en) * | 2017-06-29 | 2017-10-24 | 浙江图讯科技股份有限公司 | A kind of semantic analysis, electronic equipment, storage medium and its diagnostic system |
CN107276854B (en) * | 2017-07-27 | 2021-11-09 | 浩鲸云计算科技股份有限公司 | MOLAP statistical analysis method under big data |
CN107491438A (en) * | 2017-08-25 | 2017-12-19 | 前海梧桐(深圳)数据有限公司 | Business decision elements recognition method and its system based on natural language |
CN107527289B (en) * | 2017-08-25 | 2021-08-06 | 上海优扬新媒信息技术有限公司 | Investment portfolio industry configuration method, device, server and storage medium |
CN107679084B (en) * | 2017-08-31 | 2021-09-28 | 平安科技(深圳)有限公司 | Clustering label generation method, electronic device and computer readable storage medium |
CN107679977A (en) * | 2017-09-06 | 2018-02-09 | 广东中标数据科技股份有限公司 | Tax administration platform and implementation method based on semantic analysis |
CN107918644B (en) * | 2017-10-31 | 2020-12-08 | 北京锐思爱特咨询股份有限公司 | News topic analysis method and implementation system in reputation management framework |
CN107908694A (en) * | 2017-11-01 | 2018-04-13 | 平安科技(深圳)有限公司 | Public sentiment clustering method for internet news, application server and computer-readable storage medium |
CN108052527A (en) * | 2017-11-08 | 2018-05-18 | 中国传媒大学 | Film plot segment analysis and recommendation method based on a label system |
CN108170666A (en) * | 2017-11-29 | 2018-06-15 | 同济大学 | Improved method based on TF-IDF keyword extraction |
CN108197638B (en) * | 2017-12-12 | 2020-03-20 | 阿里巴巴集团控股有限公司 | Method and device for classifying sample to be evaluated |
CN110019720B (en) * | 2017-12-19 | 2022-02-08 | 阿里巴巴(中国)有限公司 | Comment content acquisition method and system |
CN108062306A (en) * | 2017-12-29 | 2018-05-22 | 国信优易数据有限公司 | Index system construction system and method for business environment evaluation |
CN108363784A (en) * | 2018-01-20 | 2018-08-03 | 西北工业大学 | Public sentiment trend prediction method based on text machine learning |
CN108595466B (en) * | 2018-02-09 | 2022-05-10 | 中山大学 | Internet information filtering and internet user information and network card structure analysis method |
CN108287922B (en) * | 2018-02-28 | 2022-03-08 | 福州大学 | Text data viewpoint abstract mining method fusing topic attributes and emotional information |
CN108536762A (en) * | 2018-03-21 | 2018-09-14 | 上海蔚界信息科技有限公司 | Scheme for automatically analyzing high-volume text data |
CN108681977B (en) * | 2018-03-27 | 2022-05-31 | 成都律云科技有限公司 | Lawyer information processing method and system |
CN108550380A (en) * | 2018-04-12 | 2018-09-18 | 北京深度智耀科技有限公司 | Drug safety information monitoring method and device based on public network |
CN108628994A (en) * | 2018-04-28 | 2018-10-09 | 广东亿迅科技有限公司 | Public sentiment data processing system |
CN108932291B (en) * | 2018-05-23 | 2022-08-23 | 福建亿榕信息技术有限公司 | Power grid public opinion evaluation method, storage medium and computer |
CN108804594A (en) * | 2018-05-28 | 2018-11-13 | 国家计算机网络与信息安全管理中心 | Construction method and device for a news content full-text search engine |
CN110633373B (en) * | 2018-06-20 | 2023-06-09 | 上海财经大学 | Automobile public opinion analysis method based on knowledge graph and deep learning |
CN110727794A (en) * | 2018-06-28 | 2020-01-24 | 上海传漾广告有限公司 | System and method for collecting and analyzing network semantics and summarizing and analyzing content |
CN109145085B (en) * | 2018-07-18 | 2020-11-27 | 北京市农林科学院 | Semantic similarity calculation method and system |
CN109376237B (en) * | 2018-09-04 | 2024-05-28 | 中国平安人寿保险股份有限公司 | Client stability prediction method, device, computer equipment and storage medium |
CN109408808B (en) * | 2018-09-12 | 2023-08-22 | 中国传媒大学 | Evaluation method and evaluation system for literature works |
CN109214008A (en) * | 2018-09-28 | 2019-01-15 | 珠海中科先进技术研究院有限公司 | Sentiment analysis method and system based on keyword extraction |
CN109299271B (en) * | 2018-10-30 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Training sample generation method, text data method, public opinion event classification method and related equipment |
CN109558586B (en) * | 2018-11-02 | 2023-04-18 | 中国科学院自动化研究所 | Self-evidence scoring method, equipment and storage medium for statement of information |
CN109582953B (en) * | 2018-11-02 | 2023-04-07 | 中国科学院自动化研究所 | Data support scoring method and equipment for information and storage medium |
CN109635074B (en) * | 2018-11-13 | 2024-05-07 | 平安科技(深圳)有限公司 | Entity relationship analysis method and terminal equipment based on public opinion information |
CN109189934B (en) * | 2018-11-13 | 2024-07-19 | 平安科技(深圳)有限公司 | Public opinion recommendation method, public opinion recommendation device, computer equipment and storage medium |
CN109635107A (en) * | 2018-11-19 | 2019-04-16 | 北京亚鸿世纪科技发展有限公司 | Method and device for intelligent semantic analysis and event scenario restoration from multiple data sources |
CN109526027B (en) * | 2018-11-27 | 2022-07-01 | 中国移动通信集团福建有限公司 | Cell capacity optimization method, device, equipment and computer storage medium |
CN109766438B (en) * | 2018-12-12 | 2024-07-16 | 平安科技(深圳)有限公司 | Resume information extraction method, resume information extraction device, computer equipment and storage medium |
CN110046292B (en) * | 2018-12-13 | 2024-04-23 | 创新先进技术有限公司 | Public opinion data processing method, device, equipment and storage medium |
CN111435594A (en) * | 2019-01-14 | 2020-07-21 | 珠海格力电器股份有限公司 | Method and device for acquiring cooking parameters of cooking appliance and cooking appliance |
CN109977995A (en) * | 2019-02-11 | 2019-07-05 | 平安科技(深圳)有限公司 | Text template recognition method, device and computer-readable storage medium |
CN110134844A (en) * | 2019-04-04 | 2019-08-16 | 平安科技(深圳)有限公司 | Subdivision field public sentiment monitoring method, device, computer equipment and storage medium |
CN110110156A (en) * | 2019-04-04 | 2019-08-09 | 平安科技(深圳)有限公司 | Industry public sentiment monitoring method, device, computer equipment and storage medium |
CN110188196B (en) * | 2019-04-29 | 2021-10-08 | 同济大学 | Random forest based text increment dimension reduction method |
CN110222172B (en) * | 2019-05-15 | 2021-03-16 | 北京邮电大学 | Multi-source network public opinion theme mining method based on improved hierarchical clustering |
CN110119416A (en) * | 2019-05-16 | 2019-08-13 | 重庆八戒传媒有限公司 | Service data analysis system and method |
CN110188168B (en) * | 2019-05-24 | 2021-09-03 | 北京邮电大学 | Semantic relation recognition method and device |
CN110348539B (en) * | 2019-07-19 | 2021-05-07 | 知者信息技术服务成都有限公司 | Short text relevance judging method |
CN112348421A (en) * | 2019-08-08 | 2021-02-09 | 北京国双科技有限公司 | Data processing method and device |
CN110472055B (en) * | 2019-08-21 | 2021-09-14 | 北京百度网讯科技有限公司 | Method and device for marking data |
CN110532492A (en) * | 2019-08-27 | 2019-12-03 | 东北大学 | Forum data management classification system and method |
CN112541105A (en) * | 2019-09-20 | 2021-03-23 | 福建师范大学地理研究所 | Keyword generation method, public opinion monitoring method, device, equipment and medium |
CN110705288A (en) * | 2019-09-29 | 2020-01-17 | 武汉海昌信息技术有限公司 | Big data-based public opinion analysis system |
CN110852090B (en) * | 2019-11-07 | 2024-03-19 | 中科天玑数据科技股份有限公司 | Mechanism characteristic vocabulary expansion system and method for public opinion crawling |
CN110991190B (en) * | 2019-11-29 | 2021-06-29 | 华中科技大学 | Document theme enhancement system, text emotion prediction system and method |
CN110968668B (en) * | 2019-11-29 | 2023-03-14 | 中国农业科学院农业信息研究所 | Method and device for calculating similarity of network public sentiment topics based on hyper-network |
CN110990389A (en) * | 2019-11-29 | 2020-04-10 | 上海易点时空网络有限公司 | Method and device for simplifying question bank and computer readable storage medium |
CN111158973B (en) * | 2019-12-05 | 2021-06-18 | 北京大学 | Web application dynamic evolution monitoring method |
CN111144575B (en) * | 2019-12-05 | 2022-08-12 | 支付宝(杭州)信息技术有限公司 | Public opinion early warning model training method, early warning method, device, equipment and medium |
CN111160019B (en) * | 2019-12-30 | 2023-08-15 | 中国联合网络通信集团有限公司 | Public opinion monitoring method, device and system |
CN111241077B (en) * | 2020-01-03 | 2023-06-09 | 四川新网银行股份有限公司 | Identification method of financial fraud based on internet data |
CN111259635A (en) * | 2020-01-09 | 2020-06-09 | 智业软件股份有限公司 | Method and system for completing and predicting medical record written text |
CN111291186B (en) * | 2020-01-21 | 2024-01-09 | 北京捷通华声科技股份有限公司 | Context mining method and device based on clustering algorithm and electronic equipment |
CN111291162B (en) * | 2020-02-26 | 2024-04-09 | 深圳前海微众银行股份有限公司 | Quality inspection example sentence mining method, device, equipment and computer readable storage medium |
CN111401074A (en) * | 2020-04-03 | 2020-07-10 | 山东爱城市网信息技术有限公司 | Short text emotion tendency analysis method, system and device based on Hadoop |
CN111563190B (en) * | 2020-04-07 | 2023-03-14 | 中国电子科技集团公司第二十九研究所 | Multi-dimensional analysis and supervision method and system for user behaviors of regional network |
CN111797333B (en) * | 2020-06-04 | 2021-04-20 | 南京擎盾信息科技有限公司 | Public opinion spreading task display method and device |
CN111708886A (en) * | 2020-06-11 | 2020-09-25 | 国网天津市电力公司 | Public opinion analysis terminal and public opinion text analysis method based on data driving |
CN111914096B (en) * | 2020-07-06 | 2024-02-02 | 同济大学 | Public opinion knowledge graph-based public transportation passenger satisfaction evaluation method and system |
CN111831922B (en) * | 2020-07-14 | 2021-02-05 | 深圳市众创达企业咨询策划有限公司 | Recommendation system and method based on internet information |
CN111914141B (en) * | 2020-07-30 | 2023-01-10 | 广州城市信息研究所有限公司 | Public opinion knowledge base construction method and public opinion knowledge base |
CN112084298A (en) * | 2020-07-31 | 2020-12-15 | 北京明略昭辉科技有限公司 | Public opinion theme processing method and device based on rapid BTM |
CN112214576B (en) * | 2020-09-10 | 2024-02-06 | 深圳价值在线信息科技股份有限公司 | Public opinion analysis method, public opinion analysis device, terminal equipment and computer readable storage medium |
CN112184323A (en) * | 2020-10-13 | 2021-01-05 | 上海风秩科技有限公司 | Evaluation label generation method and device, storage medium and electronic equipment |
CN112528197B (en) * | 2020-11-20 | 2023-07-07 | 四川新网银行股份有限公司 | System and method for monitoring network public opinion in real time based on artificial intelligence |
CN112464653A (en) * | 2020-12-03 | 2021-03-09 | 合肥天源迪科信息技术有限公司 | Real-time event identification and matching method based on communication short message |
CN112650848A (en) * | 2020-12-30 | 2021-04-13 | 交控科技股份有限公司 | Urban railway public opinion information analysis method based on text semantic related passenger evaluation |
CN113282702B (en) * | 2021-03-16 | 2023-12-19 | 广东医通软件有限公司 | Intelligent retrieval method and retrieval system |
CN113032653A (en) * | 2021-04-02 | 2021-06-25 | 盐城师范学院 | Big data-based public opinion monitoring platform |
CN113822038B (en) * | 2021-06-03 | 2024-06-25 | 腾讯科技(深圳)有限公司 | Abstract generation method and related device |
CN113468333B (en) * | 2021-09-02 | 2021-11-19 | 华东交通大学 | Event detection method and system fusing hierarchical category information |
CN113836307B (en) * | 2021-10-15 | 2024-02-20 | 国网北京市电力公司 | Power supply service work order hot spot discovery method, system, device and storage medium |
CN114281994B (en) * | 2021-12-27 | 2022-06-03 | 盐城工学院 | Text clustering integration method and system based on three-layer weighting model |
CN114386422B (en) * | 2022-01-14 | 2023-09-15 | 淮安市创新创业科技服务中心 | Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction |
CN114491207A (en) * | 2022-01-18 | 2022-05-13 | 平安普惠企业管理有限公司 | Public opinion analysis method and related product |
CN114692593B (en) * | 2022-03-21 | 2023-04-07 | 中国刑事警察学院 | Network information safety monitoring and early warning method |
CN114385890B (en) * | 2022-03-22 | 2022-05-20 | 深圳市世纪联想广告有限公司 | Internet public opinion monitoring system |
CN114462393A (en) * | 2022-04-12 | 2022-05-10 | 安徽数智建造研究院有限公司 | Webpage text information extraction method and device, terminal equipment and storage medium |
CN115082947B (en) * | 2022-07-12 | 2023-08-15 | 江苏楚淮软件科技开发有限公司 | Paper letter quick collecting, sorting and reading system |
CN115757793B (en) * | 2022-11-29 | 2023-09-05 | 海南达润丰企业管理合伙企业(有限合伙) | Topic analysis early warning method and system based on artificial intelligence and cloud platform |
CN116521858B (en) * | 2023-04-20 | 2024-04-30 | 浙江浙里信征信有限公司 | Context semantic sequence comparison method based on dynamic clustering and visualization |
CN117786249A (en) * | 2023-12-27 | 2024-03-29 | 王冰 | Network real-time hot topic mining analysis and public opinion extraction system |
CN117743376B (en) * | 2024-02-19 | 2024-05-03 | 蓝色火焰科技成都有限公司 | Big data mining method, device and storage medium for digital financial service |
CN117910467B (en) * | 2024-03-15 | 2024-05-10 | 成都启英泰伦科技有限公司 | Word segmentation processing method in offline voice recognition process |
CN118520174B (en) * | 2024-07-19 | 2024-09-27 | 西安银信博锐信息科技有限公司 | Customer behavior feature extraction method based on data analysis |
CN118656495A (en) * | 2024-08-20 | 2024-09-17 | 湖南数据产业集团有限公司 | Public opinion publishing traceability method, device, equipment and storage medium thereof |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101529418A (en) * | 2006-01-19 | 2009-09-09 | 维里德克斯有限责任公司 | Systems and methods for acquiring, analyzing, and mining data and information |
CN101788988A (en) * | 2009-01-22 | 2010-07-28 | 蔡亮华 | Information extraction method |
CN102708096A (en) * | 2012-05-29 | 2012-10-03 | 代松 | Network intelligence public sentiment monitoring system based on semantics and work method thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8874581B2 (en) * | 2010-07-29 | 2014-10-28 | Microsoft Corporation | Employing topic models for semantic class mining |
2013-10-15: CN application CN201310482522.5A, patent CN103544255B, status Active
Non-Patent Citations (1)
Title |
---|
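Research on a text clustering algorithm based on semantic similarity; Sun Shuang; China Master's Theses Full-text Database, Information Science and Technology; 2008-01-15 (No. 01); I140-15 * |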
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109446409A (en) * | 2018-09-19 | 2019-03-08 | 杭州安恒信息技术股份有限公司 | Method for identifying target objects suspected of multi-level marketing behavior |
Also Published As
Publication number | Publication date |
---|---|
CN103544255A (en) | 2014-01-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN106649260B (en) | Product characteristic structure tree construction method based on comment text mining | |
CN106874378B (en) | Method for constructing knowledge graph based on entity extraction and relation mining of rule model | |
Chen et al. | Websrc: A dataset for web-based structural reading comprehension | |
CN103514183B (en) | Information search method and system based on interactive document clustering | |
CN107229668B (en) | Text extraction method based on keyword matching | |
CN103914478B (en) | Webpage training method and system, webpage prediction method and system | |
CN102662952B (en) | Chinese text parallel data mining method based on hierarchy | |
CN105512687A (en) | Emotion classification model training and textual emotion polarity analysis method and system | |
CN112650848A (en) | Urban railway public opinion information analysis method based on text semantic related passenger evaluation | |
CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
CN105068991A (en) | Big data based public sentiment discovery method | |
CN103226578A (en) | Method for identifying websites and finely classifying web pages in medical field | |
CN103955529A (en) | Internet information searching and aggregating presentation method | |
CN111899089A (en) | Enterprise risk early warning method and system based on knowledge graph | |
CN105653668A (en) | Webpage content analysis and extraction optimization method based on DOM Tree in cloud environment | |
CN103324666A (en) | Topic tracing method and device based on micro-blog data | |
CN101231661A (en) | Method and system for digging object grade knowledge | |
CN104268148A (en) | Forum page information auto-extraction method and system based on time strings | |
CN104965823A (en) | Big data based opinion extraction method | |
CN106776672A (en) | Method for determining a technology development context map | |
CN103309862A (en) | Webpage type recognition method and system | |
CN102929902A (en) | Character splitting method and device based on Chinese retrieval | |
CN105183765A (en) | Big data-based topic extraction method | |
CN104346382B (en) | Text analysis system and method using language queries |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220425. Address after: Room 1505, No. 9-1, Taihu East Road, Xinbei District, Changzhou City, Jiangsu Province, 213000. Patentee after: CHANGZHOU HUALONG NETWORK TECHNOLOGY CO.,LTD. Address before: No. 1 Gehu Lake Road, Wujin District, Changzhou City, Jiangsu Province, 213164. Patentee before: CHANGZHOU University |