CN105068991A - Big data based public sentiment discovery method - Google Patents

Publication number
CN105068991A
Authority
CN
China
Prior art keywords
topic
data
blog
talked
much
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510458540.9A
Other languages
Chinese (zh)
Inventor
肖会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Original Assignee
CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd filed Critical CHENGDU DINGZHIHUI SCIENCE AND TECHNOLOGY Co Ltd
Priority to CN201510458540.9A priority Critical patent/CN105068991A/en
Publication of CN105068991A publication Critical patent/CN105068991A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention proposes a big data based public sentiment discovery method. The method comprises: step S100, data acquisition: performing data acquisition on network data based on a distributed cloud computing mode; step S200, data preprocessing: performing preprocessing on the network data acquired in the step S100, performing word segmentation and part-of-speech tagging processing on the acquired network data first, then performing punctuation mark processing, expression character processing and stop word processing and finally obtaining a feature item used for representing a text; step S300, topic extraction: based on the preprocessed network data, extracting a topic from the network data; and step S400, public sentiment analysis: performing public sentiment analysis based on the topic extracted in the step S300. The method adopts the distributed cloud computing mode and can perform mining and analysis on various network data acquired on a large scale.

Description

A big data based public sentiment discovery method
Technical field
The present invention relates to the field of data processing, and in particular to a big data based public sentiment discovery method.
Background
With the development of Web 2.0 technology, the internet has changed profoundly. It has been transformed from a collection of static web pages and information into a platform that displays the "collective intelligence" in which everyone participates. Through blogs, microblogs, BBS forums, social networking sites (SNS), news comments and similar channels, internet users can freely publish their own views and comment on any event. In a networked era, any opinion or idea may influence a large number of people and form online public opinion. A growing number of events show that online public opinion now shapes the direction of public opinion at large and has even become one of its main forms. The popular sentiment expressed online, whether consciously or unconsciously, together with the concerns about social hot spots and the value orientations that internet users express, is of increasing value for research and reference.
In recent years, the detection and extraction of hot topics and hot events has become a branch of topic detection and tracking research. Hot feature groups or behavior phrases are obtained from internet data, mainly news, blogs, forums, social networking sites and search logs, and are then abstracted to yield hot topics. Because the topics and data on blogs, microblogs and social networking sites grow day by day, hot topics can be offered to users as reference search keywords; for example, the hot topic recommendations in the Baidu search engine present the hot topics obtained and are updated continuously. This helps hot news become the focus of public opinion, widely influences public attention, guides popular opinion, and reflects to some extent the social and political attitudes of the public. Given the diverse demand for expression in online public opinion, negative online public opinion that is left unguided can cause a degree of harm to public safety.
In summary, analyzing online public opinion, tracking the public sentiment situation in real time, and fostering a positive online public opinion environment are of practical significance for building a harmonious socialist society. However, there are currently few methods for text opinion mining and classification aimed at the internet, and they still fall short of the requirements of discovering hot spots promptly and of analyzing, processing and assessing public opinion. It is therefore necessary to provide a method based on text opinion mining and classification. Applied to the timely discovery and monitoring of hot topics in online public sentiment, such a method helps the public follow social hot spot information in real time and provides auxiliary support for policy making by government and relevant departments. Organizing information more thoroughly and keeping public sentiment information timely, accurate and comprehensive can effectively prevent the spread of harmful information.
In addition, with the rapid development of applications such as the mobile internet and the Internet of Things, the global volume of data has grown explosively, which indicates that the big data era has arrived. In the prior art, big data is processed on platforms based on Hadoop. Hadoop is an open-source distributed computing platform whose core includes HDFS (Hadoop Distributed File System). The advantages of HDFS, chiefly high fault tolerance and high scalability, allow users to deploy Hadoop on inexpensive hardware and build distributed clusters that form a distributed system. HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed database system with real-time reads and writes, built on top of the distributed file system HDFS, and is mainly used to store unstructured and semi-structured data.
Summary of the invention
To solve the problems in the prior art, the present invention proposes a big data based public sentiment discovery method.
The big data based public sentiment discovery method proposed by the present invention comprises:
Step S100, data acquisition: network data is acquired based on a distributed cloud computing mode, the acquisition being implemented by a web crawler; the acquired network data is stored in a distributed storage device, which is implemented on HDFS;
Step S200, data preprocessing: the network data acquired in step S100 is preprocessed; word segmentation and part-of-speech tagging are performed first, followed by punctuation mark processing, emoticon (expression character) processing and stop word processing, finally yielding the feature items used to represent a text;
Step S300, topic extraction: topics are extracted from the preprocessed network data;
Step S400, public sentiment analysis: public sentiment analysis is performed on the topics extracted in step S300.
Wherein step S200 further comprises:
High-quality word extraction: each feature item obtained in step S200 carries an implicit quality value that reflects the contribution of the feature item to the text. The quality Q(t) of a feature item t is expressed as:
$$Q(t) = l_t^{2}\left(\sum_{i=1}^{N} f_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} f_i\right)^{2}\right),$$
where $N$ is the total number of documents, $f_i$ is the number of times feature item t occurs in document i, and $l_t$ is the length of feature item t.
A threshold Q is set; feature items with Q(t) > Q are retained and all others are deleted.
Wherein step S300 comprises: performing text clustering on the texts obtained by the preprocessing of step S200. A hierarchical clustering algorithm is used to compute the mean of the text objects in each class, which gives the initial cluster centers for the k-means algorithm. The k-means algorithm then recomputes the distance between each text object and the cluster centers and corrects the class assignment of the text objects in the hierarchical clustering result. The steps of the algorithm are as follows:
(1) determine the number k of cluster centers;
(2) perform hierarchical clustering on the data set to obtain the means of k classes, and use them as the initial cluster centers for k-means;
(3) compute the distance between each text object and each cluster center, and assign each text object to the cluster represented by its nearest cluster center;
(4) recompute the center of each cluster from the resulting assignment;
(5) repeat (3) and (4) until the class of every text object no longer changes.
The classes obtained from the clustering are defined as topics.
Preferably, the present invention further comprises:
Determining hot topics: the heat of each topic obtained in step S300 is calculated by the following formula:
$$R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,$$
where $R_i$ is the heat of topic i; $RF_i$ is the report frequency of topic i; $RT_i$ is the ratio of the number of days on which topic i was reported to the total number of days within a predetermined period of N days; $CN_i$ is the number of times internet users clicked and read topic i within the predetermined number of days; $DN_i$ is the number of comments internet users made on topic i within the predetermined number of days; and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are weight coefficients. When $R_i$ is greater than a given threshold R, topic i is defined as a hot topic.
Preferably, the network data of the present invention comprises data in several categories: blogs, microblogs, forums and news report web pages. Topic extraction is performed independently on the data of each category. Let the hot topic sets extracted from the blog, microblog, forum and news report web page data be BLOG, M-BLOG, BBS and NEWS respectively. The intersection of BLOG, M-BLOG, BBS and NEWS is defined as the first hot topic set. The intersection of every three of the sets BLOG, M-BLOG, BBS and NEWS is computed, and the union of all the results minus the first hot topic set is defined as the second hot topic set. The intersection of every two of the sets is computed, and the union of all the results minus the first and second hot topic sets is defined as the third hot topic set. The union of BLOG, M-BLOG, BBS and NEWS minus the first, second and third hot topic sets is defined as the fourth hot topic set.
The present invention adopts a distributed cloud computing mode and can mine and analyze the various kinds of network data acquired on a large scale. By analyzing the data of different data sources separately, it obtains the hot topics of each data source and then further determines the heat of each topic, so that current hot topics can be obtained more objectively. The invention provides automated, systematic and scientific information support for units and organizations such as Party and government organs and large enterprises to discover sensitive online information promptly, grasp the hot spots of online public sentiment, follow its trends and respond to online public sentiment crises. It effectively improves the accuracy of the judgments made by the online public sentiment monitoring system and provides a more truthful and accurate basis for the subsequent processing of online WeChat public sentiment information.
Brief description of the drawings
Fig. 1 is a flow chart of the public sentiment analysis of the present invention.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
Referring to Fig. 1, the present invention proposes a big data based public sentiment discovery method. The following embodiment is described mainly for the case in which the data source is microblogs. For data sources such as blogs, news report web pages and forums, those skilled in the art can implement the method according to the method given here and the known techniques of the field.
Step S100, data acquisition
Network data is acquired based on a distributed cloud computing mode. The network data comprises data in several categories (blogs, microblogs, forums and news report web pages); it is labeled by category and stored separately by category. Here, news report web pages are the web pages of news provided by portal sites such as Tencent News and Sina News and by news media websites such as People's Daily.
The data acquisition is implemented by a web crawler. The acquired network data is stored in a distributed storage device, which is implemented on HDFS.
Step S200, data preprocessing: the network data acquired in step S100 is preprocessed.
Word segmentation and part-of-speech tagging are first performed on the acquired network data.
Because microblog text is free in format and its content is brief and fragmented, the microblog content needs to be normalized to make analysis easier.
(1) Punctuation mark processing
Microblog posts often combine punctuation marks or repeat them in succession. Although this usage does not follow grammatical rules, it is often used in microblogs to express emotions that a single mark cannot convey. For example, "~" often expresses a fluctuation or prolongation of emotion, as in "very glad~~~!". To reflect the tone and intensity of the writer in microblog text processing, combined and repeated punctuation marks need to be labeled.
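A minimal Python sketch of how combined or repeated punctuation might be located for labeling. The function name `mark_punctuation_runs` and the set of punctuation characters are illustrative assumptions; the patent does not specify either.

```python
import re

# Runs of two or more punctuation marks, covering ASCII and common
# full-width forms. The character set is an assumption for illustration.
PUNCT_RUN = re.compile(r"[~!?.,;:～！？。，；：…]{2,}")

def mark_punctuation_runs(text):
    """Return (start, end, run) for each run of combined or repeated punctuation."""
    return [(m.start(), m.end(), m.group()) for m in PUNCT_RUN.finditer(text)]
```

Single marks, such as an ordinary sentence-final period, are left unlabeled; only runs are reported, so a later step can attach a tone or intensity label to each run.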
(2) Emoticon (expression character) processing
Microblog posts often contain characters or character combinations that represent facial expressions; for example, "--!" represents embarrassment. Such symbols often carry the emotion or attitude of the microblog user. These characters are labeled by building a mapping table that covers the main emoticons. Because emoticons fall into a few classes with few exceptions, mappings for most of them are established statistically, then evaluated and labeled manually, and each is assigned a sentiment polarity weight.
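The mapping table described above could be sketched as a dictionary lookup. The table entries, labels and polarity weights below are hypothetical values chosen for illustration; in the patent they come from statistics followed by manual evaluation.

```python
# Hypothetical mapping table: emoticon -> (label, sentiment polarity weight).
# Entries and weights are illustrative assumptions, not values from the patent.
EMOTICON_TABLE = {
    "--!": ("embarrassed", -0.4),
    ":)": ("happy", 0.6),
    ":(": ("sad", -0.6),
}

def tag_emoticons(text, table=EMOTICON_TABLE):
    """Return (position, emoticon, label, weight) tuples ordered by position."""
    hits = []
    for emoticon, (label, weight) in table.items():
        start = text.find(emoticon)
        while start != -1:
            hits.append((start, emoticon, label, weight))
            start = text.find(emoticon, start + len(emoticon))
    return sorted(hits)
```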
(3) Stop word processing
In SVM-based automatic classification, stop words are function words without substantive meaning and neutral words with weak class discrimination, and such words occur relatively frequently. Handling them separately and effectively speeds up word segmentation and improves the quality of subsequent analysis. These words still need to be labeled during word segmentation. To handle stop words correctly, a stop word list is needed; correct construction of the stop word list and identification of stop words are the key to stop word processing. At present, stop word lists are built either manually or by statistics-based machine learning. The machine learning approach counts high-frequency words in a corpus and constructs the stop word list automatically, or obtains stop words from a preliminary segmentation and keeps updating and verifying them in subsequent segmentation.
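The statistical route to a stop word list can be sketched as follows. This is a simplification under stated assumptions: documents are already segmented into tokens, "high frequency" is approximated by document frequency, and the `min_doc_ratio` cutoff and the manual `seed` list are hypothetical parameters.

```python
from collections import Counter

def build_stop_words(tokenized_docs, min_doc_ratio=0.8, seed=()):
    """Words present in at least min_doc_ratio of documents, plus a manual seed list."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(tokenized_docs)
    learned = {w for w, df in doc_freq.items() if df / n_docs >= min_doc_ratio}
    return learned | set(seed)

def remove_stop_words(tokens, stop_words):
    """Drop stop words while keeping the original token order."""
    return [t for t in tokens if t not in stop_words]
```

Combining a learned set with a manual seed mirrors the two construction routes the text mentions; rerunning `build_stop_words` on later batches is one way to keep the list updated.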
Step S300, topic extraction: hot topics are extracted from the preprocessed network data.
Microblog posts are short: a standard post is limited to 140 characters, so the meaning must be stated clearly within this limit. The document set here consists of individual microblog posts, each treated as a document; since posts number in the thousands upon thousands, the document set is also very large. Hot topic discovery means classifying this huge volume of microblog information, finding the focus of public discussion, examining the emotions held by the public, and keeping track of public demands.
Text clustering is the basis of topic discovery and tracking; good clustering yields accurate classes and reduces the occurrence of oversized classes. Text clustering mainly comprises four steps: feature extraction, text representation, text clustering and opinion mining.
The features of a text are the words and characters in the text that convey its information and semantics. The features of unprocessed text are not obvious and must be extracted from the text by some method, with manual intervention. Feature extraction aims to make the dimension of the text vector as small as possible so that the computer processes the information faster and more efficiently. It comprises two main steps: preprocessing and high-quality word extraction. Document preprocessing has three main parts: stop word removal, part-of-speech filtering and invalid word filtering. Stop word removal first builds a stop word dictionary and then matches against it to remove words that occur very frequently but carry no practical meaning. Part-of-speech filtering builds on part-of-speech tagging and treats words of different kinds separately. Experiments show that adjectives and adverbs do little to improve clustering, so they are removed and only nouns, verbs and abbreviations are retained.
The vocabulary remaining after preprocessing is still very large, so a second step, high-quality word extraction, is needed. Each feature item in a document carries an implicit quality value, which is based mainly on the term frequency of the feature item and reflects its contribution to the text. The greater the quality, the greater the contribution, and the feature item is kept for text clustering; otherwise it is rejected.
The quality Q(t) of a feature item t is expressed as:
$$Q(t) = l_t^{2}\left(\sum_{i=1}^{N} f_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} f_i\right)^{2}\right),$$
where $N$ is the total number of documents, $f_i$ is the number of times feature item t occurs in document i, and $l_t$ is the length of feature item t.
A threshold Q is set; feature items with Q(t) > Q are retained and all others are deleted.
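A minimal Python sketch of the quality filter, assuming the formula reads Q(t) = l_t^2 (Σ f_i^2 − (1/N)(Σ f_i)^2), that l_t is the character length of the term, and that each document is given as a dictionary of term counts. The function names `term_quality` and `select_terms` are hypothetical.

```python
def term_quality(term, doc_term_counts):
    """Q(t) = l_t^2 * (sum_i f_i^2 - (1/N) * (sum_i f_i)^2) over N documents."""
    n_docs = len(doc_term_counts)
    freqs = [counts.get(term, 0) for counts in doc_term_counts]
    sum_of_squares = sum(f * f for f in freqs)
    square_of_sum = sum(freqs) ** 2
    return len(term) ** 2 * (sum_of_squares - square_of_sum / n_docs)

def select_terms(doc_term_counts, threshold):
    """Keep the terms whose quality exceeds the threshold; drop the rest."""
    vocabulary = set().union(*doc_term_counts)
    return {t for t in vocabulary if term_quality(t, doc_term_counts) > threshold}
```

The bracketed factor is N times the variance of the term's frequency across documents, so a term that occurs equally often in every document scores zero and is filtered out, while a term concentrated in a few documents scores high.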
The vector space model represents a text as a multidimensional space made up of feature items, with each feature item as one dimension of the space. A text can then be expressed as follows:
$D = \{t_1, t_2, \ldots, t_n\}$, where n is the number of feature items.
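As a small illustration of the vector space model, the sketch below maps a segmented text onto an ordered feature list. Using raw term frequency as the coordinate and cosine similarity for comparison are assumptions; the patent does not state a weighting scheme or a similarity measure.

```python
import math

def to_vector(tokens, features):
    """Term-frequency vector of a token list over an ordered feature list."""
    return [tokens.count(f) for f in features]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```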
The k-means clustering algorithm is widely used. It takes the mean of all samples in each cluster subset as the cluster center and iteratively partitions the data into different classes so that the function evaluating clustering performance reaches its optimum. The algorithm clusters long documents with continuous attributes well, but it is very sensitive to noise and outliers: a small amount of such data can have a large effect on the mean. For data such as microblog text, which has discrete attributes and is sparse, its clustering results are unsatisfactory.
The hierarchical clustering algorithm is another widely used clustering algorithm. It decomposes a given set of data objects hierarchically and, depending on how the hierarchy is formed, is divided into agglomerative and divisive hierarchical clustering. Hierarchical clustering handles noise effectively and reduces the influence of outliers on the clustering result. It also has a defect, however: once a split or merge has been performed during clustering, it cannot be revised. If a step goes wrong, the error propagates, and the level-by-level iteration lets errors accumulate, leading to poor clustering results.
To overcome the defects of these two algorithms while making full use of their respective advantages in text clustering, the two algorithms are combined for text clustering. A hierarchical clustering algorithm is used to compute the mean of the text objects in each class, which gives the initial cluster centers for the k-means algorithm. The k-means algorithm then recomputes the distance between each text object and the cluster centers and corrects the class assignment of the text objects in the hierarchical clustering result. The steps of the algorithm are as follows:
(1) determine the number k of cluster centers;
(2) perform hierarchical clustering on the data set to obtain the means of k classes, and use them as the initial cluster centers for k-means;
(3) compute the distance between each text object and each cluster center, and assign each text object to the cluster represented by its nearest cluster center;
(4) recompute the center of each cluster from the resulting assignment;
(5) repeat (3) and (4) until the class of every text object no longer changes.
The text objects in the above algorithm are the documents (texts d) in the network data.
The classes obtained from the above clustering are defined as topics.
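Steps (1) to (5) can be sketched in plain Python as below. Several choices are assumptions rather than details from the patent: documents are already numeric vectors, distance is Euclidean, the hierarchical stage is agglomerative with centroid linkage, and the names `hierarchical_means`, `kmeans` and `extract_topics` are hypothetical. The quadratic pair search is for illustration only and would not scale to a large document set.

```python
import math

def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def hierarchical_means(points, k):
    """Step (2): merge the two closest clusters until k remain; return their means."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]

def kmeans(points, centers, max_iter=100):
    """Steps (3) to (5): reassign and recompute until the labels stop changing."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return labels, centers

def extract_topics(points, k):
    """Step (1) is the caller's choice of k; each resulting label is one topic."""
    return kmeans(points, hierarchical_means(points, k))
```

Seeding k-means from the hierarchical means replaces the usual random initialization, which is the point of the combination: the k-means pass can then revise assignments that the irreversible merges got wrong.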
A hot topic is a body of information that uses the internet as its medium, attracts wide public attention, can spread widely within a short time and persist for a period, and reflects online public sentiment; it also includes a semantic description of the internet hot topic and the ways in which the topic spreads. An internet hot topic is usually information that many internet users follow, and related information appears in the network with high frequency. Hot words describe the general theme of a hot topic fairly directly. A hot topic must be described by several characteristic quantities, and these characteristic quantities are similar to one another to some degree.
The characteristic quantities of a hot topic are:
The report frequency of the topic. For an important topic, the media publish more related reports than usual, so the report frequency affects the attention a hot topic receives. It is the ratio of the number of reports on a given topic to the total number of reports within a period of time; the larger the ratio, the more attention the topic receives.
The duration of the topic. If the media report on a hot topic for a long time and internet users discuss it for a long time, the topic receives more attention. Each topic has its own attention time attribute: the start time of a topic is defined as the time at which the topic first appears, and its extinction time as the moment at which the number of reports on the topic falls below a certain threshold. The time span of the topic is therefore defined as the difference between its start time and its extinction time.
The reading volume of the topic. Because reports on a hot topic mostly come from web sites, the more internet users click and read reports related to the topic, the more attention the topic receives; the number of clicks on related reports can therefore be used to measure the attention the topic receives.
The number of comments on the topic. The more comments internet users make on a hot topic, the more attention it receives, so the number of comments is also a factor affecting the attention the topic receives.
Each text in the present application is labeled with its report time (for example the publication time of a news report, blog post, microblog post or forum post), its number of clicks and its number of comments. The report frequency and duration of a topic obtained by clustering can be determined from the report times of its texts, its reading volume from the numbers of clicks on its texts, and its number of comments from the numbers of comments on its texts.
$RF_i$: the report frequency of topic i;
$RT_i$: the ratio of the number of effective report days for topic i to the total number of days within a predetermined period of N days; a day counts as an effective report day for topic i when the number of reports on topic i that day exceeds a certain threshold;
$CN_i$: the number of times internet users clicked and read topic i within the predetermined number of days;
$DN_i$: the number of comments internet users made on topic i within the predetermined number of days;
The topic heat formula is:
$$R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,$$
where $R_i$ is the heat of topic i and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are weight coefficients. When $R_i$ is greater than a given threshold R, topic i is defined as a hot topic.
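The heat calculation can be sketched as a weighted sum, assuming the formula reads R_i = α1·RF_i + α2·RT_i + α3·CN_i + α4·DN_i. The `TopicStats` container, the function names, and the idea that click and comment counts are scaled to a comparable range before weighting are illustrative assumptions; the patent gives no weight values or normalization.

```python
from dataclasses import dataclass

@dataclass
class TopicStats:
    rf: float  # RF_i: report frequency of the topic
    rt: float  # RT_i: effective report days divided by total days
    cn: float  # CN_i: click-and-read count (assumed pre-scaled here)
    dn: float  # DN_i: comment count (assumed pre-scaled here)

def effective_day_ratio(daily_report_counts, min_reports):
    """RT_i: share of days whose report count exceeds the threshold."""
    effective = sum(1 for c in daily_report_counts if c > min_reports)
    return effective / len(daily_report_counts)

def topic_heat(stats, weights):
    """R_i as the weighted sum of the four characteristic quantities."""
    a1, a2, a3, a4 = weights
    return a1 * stats.rf + a2 * stats.rt + a3 * stats.cn + a4 * stats.dn

def hot_topics(stats_by_topic, weights, threshold):
    """Topics whose heat R_i is greater than the threshold R."""
    return {name for name, s in stats_by_topic.items()
            if topic_heat(s, weights) > threshold}
```

Because RF_i and RT_i are ratios while CN_i and DN_i are raw counts, the weights (or a prior scaling step) have to absorb the difference in magnitude, otherwise clicks and comments would dominate the sum.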
The present invention performs topic extraction independently on the data of blogs, microblogs, forums and news report web pages. Let the hot topic sets extracted from the blog, microblog, forum and news report web page data be BLOG, M-BLOG, BBS and NEWS respectively. The intersection of BLOG, M-BLOG, BBS and NEWS is defined as the first hot topic set. The intersection of every three of the sets BLOG, M-BLOG, BBS and NEWS is computed, and the union of all the results minus the first hot topic set is defined as the second hot topic set. The intersection of every two of the sets is computed, and the union of all the results minus the first and second hot topic sets is defined as the third hot topic set. The union of BLOG, M-BLOG, BBS and NEWS minus the first, second and third hot topic sets is defined as the fourth hot topic set.
Because the hot spots reflected by blogs, microblogs, forums and news report web pages may differ, content that receives attention in all four of them at the same time should have the highest heat; content that receives attention in three of them at the same time comes next; content that receives attention in two of them comes after that; and content that receives attention in only one of them has relatively the lowest heat.
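The four-tier split can be sketched with set operations. The function name `tier_hot_topics` is hypothetical, and the sketch assumes topics from different sources have already been matched to a shared identifier, which the patent does not describe.

```python
from itertools import combinations

def tier_hot_topics(sources):
    """Split hot topics from exactly four sources into four tiers.

    sources maps a source name (e.g. "BLOG") to its set of hot topics.
    """
    sets = list(sources.values())
    if len(sets) != 4:
        raise ValueError("expected exactly four sources")

    def union_of_intersections(r):
        # Union of the intersections of every r-sized combination of sources.
        result = set()
        for combo in combinations(sets, r):
            result |= set.intersection(*combo)
        return result

    first = union_of_intersections(4)
    second = union_of_intersections(3) - first
    third = union_of_intersections(2) - first - second
    fourth = set().union(*sets) - first - second - third
    return first, second, third, fourth
```

In effect each topic lands in the tier given by the number of sources that report it: four, three, two or one.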
Various kinds of public sentiment analysis can then be carried out on the basis of the topics extracted.
The present invention adopts a distributed cloud computing mode and can mine and analyze the various kinds of network data acquired on a large scale. By analyzing the data of different data sources separately, it obtains the hot topics of each data source and then further determines the heat of each topic, so that current hot topics can be obtained more objectively. The invention provides automated, systematic and scientific information support for units and organizations such as Party and government organs and large enterprises to discover sensitive online information promptly, grasp the hot spots of online public sentiment, follow its trends and respond to online public sentiment crises. It effectively improves the accuracy of the judgments made by the online public sentiment monitoring system and provides a more truthful and accurate basis for the subsequent processing of online WeChat public sentiment information.
Those skilled in the art will readily conceive of other embodiments of the invention after considering the specification and practicing the invention disclosed here. The present application is intended to cover any variations, uses or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed by the invention.
It should be understood that the present invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (5)

1. A big data based public sentiment discovery method, comprising:
Step S100, data acquisition: network data is acquired in a distributed cloud computing mode, the data acquisition being implemented by web crawlers; the acquired network data is stored in a distributed storage device, the distributed storage device being implemented on HDFS;
Step S200, data preprocessing: the network data acquired in step S100 is preprocessed by first performing word segmentation and part-of-speech tagging, then performing punctuation processing, emoticon processing, and stop-word processing, and finally obtaining the feature items used to represent the text;
Step S300, topic extraction: topics are extracted from the preprocessed network data;
Step S400, public opinion analysis: public opinion analysis is carried out on the topics extracted in step S300.
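The cleaning part of step S200 can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the text has already been segmented into whitespace-separated tokens (a real system would use a Chinese word segmenter and part-of-speech tagger for that), and the stop-word list, the emoticon pattern, and the function name `extract_feature_items` are hypothetical. Emoticons are stripped before punctuation here only because they are themselves built from punctuation characters.

```python
import re
import string

# Hypothetical stop-word list and emoticon pattern, for illustration only.
STOP_WORDS = {"the", "a", "of", "is", "to"}
EMOTICON_RE = re.compile(r"\[[^\[\]\s]+\]|[:;]-?[()DPp]")
# Translation table that deletes ASCII punctuation and common Chinese punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "，。！？、；：")


def extract_feature_items(segmented_text):
    """Return feature items from text that is already split on whitespace."""
    # Emoticon processing: drop bracketed codes such as [doge] and ASCII smileys.
    without_emoticons = EMOTICON_RE.sub(" ", segmented_text)
    items = []
    for token in without_emoticons.split():
        # Punctuation processing: strip punctuation characters from the token.
        token = token.translate(PUNCT_TABLE).lower()
        # Stop-word processing: skip empty tokens and stop words.
        if token and token not in STOP_WORDS:
            items.append(token)
    return items


features = extract_feature_items("The flood :-) is a hot topic [doge] of the week !")
```
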
2. The big data based public sentiment discovery method as claimed in claim 1, wherein step S200 further comprises:
High-quality word retrieval: each feature item obtained in step S200 carries an implicit quality value that reflects the feature item's contribution to the text, and the quality $Q(t)$ of feature item $t$ is expressed as:
$$Q(t) = l_t^{2}\left(\sum_{i=1}^{N} f_i^{2} - \frac{1}{N}\left(\sum_{i=1}^{N} f_i\right)^{2}\right),$$
where $N$ is the total number of documents, $f_i$ is the number of times feature item $t$ occurs in document $i$, and $l_t$ is the length of feature item $t$.
A threshold $Q$ is set; feature items with $Q(t) > Q$ are retained, and the others are deleted.
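The quality filter of claim 2 can be illustrated with a short sketch. It assumes that the length of a feature item is its character count and that per-document occurrence counts are already available; the function names and the sample data are hypothetical.

```python
def feature_quality(frequencies, term_length):
    """Q(t) = l_t^2 * (sum(f_i^2) - (1/N) * (sum(f_i))^2).

    frequencies: occurrence counts of the feature item in each of the N documents.
    term_length: length l_t of the feature item (assumed: number of characters).
    """
    n = len(frequencies)
    sum_of_squares = sum(f * f for f in frequencies)
    total = sum(frequencies)
    return term_length ** 2 * (sum_of_squares - total * total / n)


def retain_features(doc_frequencies, threshold):
    """Keep only the feature items whose quality exceeds the threshold."""
    return {
        term
        for term, freqs in doc_frequencies.items()
        if feature_quality(freqs, len(term)) > threshold
    }


# Hypothetical counts over N = 4 documents.
doc_frequencies = {
    "flood": [5, 0, 0, 1],  # concentrated in a few documents
    "news": [1, 1, 1, 1],   # spread evenly, so the bracketed term is zero
}
kept = retain_features(doc_frequencies, threshold=10)
```

A feature item that occurs equally often in every document scores zero, so the filter favors items whose counts vary across documents, weighted by the square of the item's length.
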
3. The big data based public sentiment discovery method as claimed in claim 1, wherein step S300 comprises: performing text clustering on the text obtained by the preprocessing of step S200, in which a hierarchical clustering algorithm is used to compute the mean of the text objects in each class, giving the initial cluster centers for the k-means algorithm. The k-means algorithm then recomputes the distance between each text object and the cluster centers and corrects the class assignments of the text objects in the hierarchical clustering result. The algorithm steps are as follows:
(1) determine the number k of cluster centers;
(2) perform hierarchical cluster analysis on the data set to obtain the means of k classes, and use them as the initial cluster centers for k-means;
(3) compute the distance between each text object and each cluster center, and assign each text object to the cluster represented by its nearest cluster center;
(4) use the assignments obtained in (3) to recompute the cluster center of each cluster;
(5) repeat (3) and (4) until the class to which each text object belongs no longer changes;
The classes obtained as the clustering result are defined as topics.
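Steps (1) to (5) can be sketched in plain Python as follows. The claim does not fix the text representation, the distance measure, or the linkage rule, so this sketch assumes that each text is already a numeric feature vector, uses Euclidean distance, and merges the two clusters whose means are closest; all function names are hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def agglomerate(vectors, k):
    """Step (2): merge the two clusters with the closest means until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(mean(clusters[a]), mean(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return [mean(cluster) for cluster in clusters]


def kmeans_with_hierarchical_seed(vectors, k, max_iter=100):
    """Steps (1)-(5): k-means seeded with hierarchical cluster means."""
    centers = agglomerate(vectors, k)
    labels = None
    for _ in range(max_iter):
        # Step (3): assign each text to its nearest cluster center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(v, centers[j])) for v in vectors
        ]
        # Step (5): stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step (4): recompute each center from its current members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels, centers


# Hypothetical 2-dimensional text vectors forming two obvious groups.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = kmeans_with_hierarchical_seed(docs, k=2)
```

Seeding k-means with the hierarchical means removes the dependence on randomly chosen initial centers, at the cost of the pairwise comparisons in the merge loop, which grow quickly with the number of texts.
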
4. The big data based public sentiment discovery method as claimed in claim 3, further comprising:
Determining hot topics: the heat of each topic obtained in step S300 is computed by the following formula,
$$R_i = \alpha_1 \cdot RF_i + \alpha_2 \cdot RT_i + \alpha_3 \cdot CN_i + \alpha_4 \cdot DN_i,$$
where $R_i$ is the heat of topic $i$; $RF_i$ is the report frequency of topic $i$; $RT_i$ is the ratio of the number of days on which topic $i$ was reported to the total number of days within a predetermined period of $N$ days; $CN_i$ is the number of clicks (reads) by netizens on topic $i$ within the predetermined number of days; $DN_i$ is the number of comments by netizens on topic $i$ within the predetermined number of days; and $\alpha_1, \alpha_2, \alpha_3, \alpha_4$ are weight coefficients. When $R_i$ is greater than a given threshold $R$, topic $i$ is defined as a hot topic.
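The heat score of claim 4 is a weighted sum, which can be sketched as below. The claim does not say how the four quantities are scaled, so this sketch assumes they have already been normalized to a comparable range; the class name, function names, weights, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    report_frequency: float  # RF_i
    report_day_ratio: float  # RT_i: days with reports / total days in the window
    clicks: float            # CN_i (assumed normalized)
    comments: float          # DN_i (assumed normalized)


def topic_heat(stats, weights):
    """R_i = a1*RF_i + a2*RT_i + a3*CN_i + a4*DN_i."""
    a1, a2, a3, a4 = weights
    return (
        a1 * stats.report_frequency
        + a2 * stats.report_day_ratio
        + a3 * stats.clicks
        + a4 * stats.comments
    )


def hot_topics(topics, weights, threshold):
    """Return the names of the topics whose heat exceeds the threshold R."""
    return [name for name, s in topics.items() if topic_heat(s, weights) > threshold]


# Hypothetical topics and equal weights.
topics = {
    "A": TopicStats(0.8, 0.5, 0.9, 0.6),
    "B": TopicStats(0.1, 0.2, 0.1, 0.0),
}
weights = (0.25, 0.25, 0.25, 0.25)
hot = hot_topics(topics, weights, threshold=0.5)
```
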
5. The big data based public sentiment discovery method as claimed in claim 1, wherein the network data comprises data of several categories, namely blogs, microblogs, forums, and news report webpages, and topic extraction is carried out independently on the data of each category; assuming that the hot topic sets extracted from the blog, microblog, forum, and news report webpage data are BLOG, M-BLOG, BBS, and NEWS respectively:
the intersection of BLOG, M-BLOG, BBS, and NEWS is computed, and the result is defined as the first hot topic set;
the intersection of every three of the sets BLOG, M-BLOG, BBS, and NEWS is computed, and the union of all the results, minus the first hot topic set, is defined as the second hot topic set;
the intersection of every two of the sets BLOG, M-BLOG, BBS, and NEWS is computed, and the union of all the results, minus the first hot topic set and the second hot topic set, is defined as the third hot topic set;
the union of the sets BLOG, M-BLOG, BBS, and NEWS, minus the first hot topic set, the second hot topic set, and the third hot topic set, is defined as the fourth hot topic set.
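The four hot topic sets of claim 5 can be computed with ordinary set operations. The following sketch generalizes the rule to any number of sources: for each group size, from all sources down to one, it takes the union of the intersections of every group of that size and subtracts the topics already placed in a hotter tier. The function name and the sample topics are hypothetical, and topics are assumed to be comparable across sources (for example, by a shared label).

```python
from itertools import combinations


def tier_hot_topics(sources):
    """Return a list of tiers, hottest first.

    sources: mapping from source name to the set of hot topics found in it.
    Tier 1 holds topics present in all sources, tier 2 those present in all
    but one, and so on down to topics present in a single source.
    """
    sets = list(sources.values())
    tiers = []
    assigned = set()
    for size in range(len(sets), 0, -1):
        covered = set()
        for group in combinations(sets, size):
            covered |= set.intersection(*group)
        tiers.append(covered - assigned)
        assigned |= covered
    return tiers


# Hypothetical hot topic sets for the four source types.
sources = {
    "BLOG": {"flood", "election", "concert"},
    "M-BLOG": {"flood", "election", "meme"},
    "BBS": {"flood", "election", "recall"},
    "NEWS": {"flood", "concert"},
}
first, second, third, fourth = tier_hot_topics(sources)
```

With these sample sets, "flood" appears in all four sources, "election" in three, "concert" in two, and "meme" and "recall" in one each, so each topic lands in exactly one tier.
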
CN201510458540.9A 2015-07-30 2015-07-30 Big data based public sentiment discovery method Pending CN105068991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510458540.9A CN105068991A (en) 2015-07-30 2015-07-30 Big data based public sentiment discovery method


Publications (1)

Publication Number Publication Date
CN105068991A true CN105068991A (en) 2015-11-18

Family

ID=54498365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510458540.9A Pending CN105068991A (en) 2015-07-30 2015-07-30 Big data based public sentiment discovery method

Country Status (1)

Country Link
CN (1) CN105068991A (en)







Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151118
