Disclosure of Invention
The present invention aims to overcome at least one of the above-mentioned shortcomings of the prior art by providing a public opinion analysis method that addresses the lack of a sound method for mining network public opinion data.
The technical scheme adopted by the invention is as follows:
A public opinion analysis method, comprising: collecting public opinion data; preprocessing the public opinion data to obtain structured public opinion information; generating public opinion objects from the public opinion information, wherein each public opinion object comprises an object identification, an object category, spatio-temporal information, semantic information, emotion information and relationship information; performing similarity matching between the public opinion object and the public opinion cases in a public opinion case library to obtain the most similar public opinion case; and analyzing the most similar public opinion case to obtain a public opinion control scheme.
Under the multi-granularity public opinion spatio-temporal object description attribute framework, the structured public opinion information is generated into a public opinion object comprising an object identification and five types of attributes, so that the temporal characteristics, spatial characteristics, evolution and propagation characteristics of public opinion entities can be described abstractly. Massive multi-source public opinion data can thus be processed rapidly and valuable public opinion information mined, providing data support for the subsequent similarity matching with public opinion cases and the analysis of public opinion control schemes, and a scientific decision reference for the supervision of network public opinion.
Further, performing similarity matching on the public opinion object and the public opinion cases in the public opinion case library to obtain the most similar public opinion case comprises: performing structural similarity matching between the attribute structure of the public opinion object and the attribute structures of the public opinion cases in the public opinion case library to obtain the event type to which the public opinion object belongs; and performing attribute similarity matching between the attribute values of the public opinion object and the attribute values of the public opinion cases in the library that belong to that event type to obtain the most similar public opinion case.
Different public opinion cases have different case structures, and both the case information and the description of the case control scheme may be incomplete. Analyzing the structural similarity between the public opinion object and the public opinion cases handles such missing attributes well. Structural similarity matching is first performed on the attribute structures to determine the event type to which the public opinion object belongs, and attribute similarity matching is then performed on the attribute values within the matched event type, which can greatly improve matching accuracy.
Further, performing structural similarity matching on the attribute structure of the public opinion object and the attribute structure of the public opinion case in the public opinion case library to obtain an event type to which the public opinion object belongs, including:
denoting a given public opinion object as X, and considering the i-th event type in the public opinion case library, each event type consisting of a plurality of public opinion cases;
calculating the structural similarity between public opinion object X and the i-th event type according to the following formula:
[formula not recoverable from the source text];
where the formula involves an empirical factor, the number of attributes of public opinion object X, the number of attributes that public opinion object X and the i-th event type have in common, and the number of attributes of the i-th event type;
determining, according to the structural similarity, the event type to which public opinion object X belongs.
Further, performing attribute similarity matching on the attribute of the public opinion object and the attribute value of the public opinion case belonging to the event type in the public opinion case library to obtain the most similar public opinion case, including:
denoting a given public opinion object as X, the event type to which public opinion object X belongs having been determined, and considering the public opinion cases in the public opinion case library that belong to that event type;
calculating the conditional probability according to the following formulas:
[three formulas not recoverable from the source text];
where the formulas involve: the number of public opinion cases in the public opinion case library that belong to the event type and whose attributes match those of public opinion object X; the total number of public opinion cases in the public opinion case library; each attribute of public opinion object X; the number of public opinion cases of the event type that have that attribute; and the attribute weights of the event type;
determining the most similar public opinion case according to the conditional probability.
Further, preprocessing public opinion data includes:
identifying and removing useless characters from the public opinion data, and/or performing word segmentation on the public opinion data and removing its stop words, and/or extracting keywords of the public opinion data based on a word frequency statistics method, and/or extracting entity names from the public opinion data based on a named entity recognition method, and/or clustering the public opinion data based on a topic clustering method, and/or extracting topics from the public opinion data based on a co-word analysis method, and/or extracting text with emotional tendency from the public opinion data based on a text mining technology.
By removing useless characters and stop words while extracting keywords, entity names, public opinion topics and text with emotional tendency, the public opinion data can be preliminarily classified to form structured public opinion information.
Further, the object category comprises a text category, a topic category and a theme category, wherein a public opinion object of the text category is generated from the text expressed in the public opinion information, a public opinion object of the topic category is generated from the topic expressed in the public opinion information, and a public opinion object of the theme category is generated from the theme expressed in the public opinion information.
The object category indicates the type of object on which public opinion data processing and analysis operates, and different object categories involve different public opinion information expression models. Dividing public opinion objects into the three categories of text, topic and theme, in order of increasing semantic abstraction, facilitates further mining of the public opinion information.
Further, the semantic information includes semantic granularity and semantic content.
The semantic information of each public opinion object can be divided into a plurality of semantic granularities, each semantic granularity can have a plurality of semantic records, and each semantic record has a respective number and semantic content.
Further, the emotion information comprises emotion subjects, emotion objects, emotion categories and emotion intensities;
generating emotion information in the public opinion object according to the public opinion information comprises the following steps:
extracting emotion subjects and emotion objects from public opinion information based on a named entity recognition method;
identifying emotion words from emotion object contexts based on an association rule mining method, and determining emotion categories according to the emotion words;
and judging the emotion strength according to the emotion words based on the emotion tendency judging method.
The emotion information quadruple structure, namely the emotion subject, the emotion object, the emotion category and the emotion strength, can realize the mining of public opinion and emotion, further analyze the changes generated by the opinion and emotion along with the development of the public opinion situation, and provide data support for the subsequent analysis of the public opinion object.
Further, the emotion vocabulary includes emotion words, negation words, degree adverbs and emoticons.
Emotion words, negation words, degree adverbs and emoticons are all significant for distinguishing the emotion category and the emotion intensity in the emotion information.
Further, the relationship information includes a relationship category and other public opinion objects having a relationship with the public opinion object, and the relationship category includes an association relationship, an aggregation relationship and a dependency relationship.
The relationship information expresses the relationships among public opinion objects. Describing the relationship category as an association, aggregation or dependency relationship makes the relationships among public opinion objects easy to sort out.
Compared with the prior art, the invention has the following beneficial effects: by constructing a multi-granularity public opinion spatio-temporal object description attribute framework and generating public opinion objects from the public opinion information under that framework, the temporal characteristics, spatial characteristics, evolution and propagation characteristics of public opinion entities can be described abstractly, so that massive multi-source public opinion data can be processed rapidly and valuable public opinion information obtained. This provides data support for the subsequent similarity matching with public opinion cases and the analysis of public opinion control schemes, so that decisions on public opinion supervision and control can be made promptly and correctly.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the invention. For better illustration of the following embodiments, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the actual product dimensions; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
As shown in fig. 1, the embodiment provides a public opinion analysis method, which includes:
S1, collecting public opinion data, wherein the public opinion data may comprise public comments submitted under project publicity data, media reports on the project publicity data, and public discussion of the project publicity data;
S2, preprocessing the public opinion data to obtain structured public opinion information;
S3, generating a public opinion object from the public opinion information, wherein the public opinion object comprises an object identification, an object category, spatio-temporal information, semantic information, emotion information and relationship information;
S4, performing similarity matching between the public opinion object and the public opinion cases in the public opinion case library to obtain the most similar public opinion case;
S5, analyzing the most similar public opinion case to obtain a public opinion control scheme.
Public opinion is the collection of individual opinions that people express as social events and problems arise, develop and change; it covers the expression, transmission and interaction of individual emotions and opinions and their influence. The emergence and development of each social event or problem has its own life cycle. In the different time phases of that life cycle, individual opinions change continuously and have a specific spatial range and propagation law; this is referred to as a public opinion entity. Using multi-source heterogeneous public opinion data as the data source to obtain the content of public opinion events and the public's emotions and views, and analyzing the spatio-temporal law of public opinion propagation and diffusion, can provide decision support for public opinion guidance.
In step S1, the collected project publicity data may include the project case number, the construction project name, the construction project location, and the like. The public can generally submit feedback comments under the project publicity, thereby forming public comments on the project publicity; the collected public comments may include the feedback comments, the contact address of the person giving feedback, and the like. The collected media reports on the project publicity may include the article title, the publication time, the name of the publishing medium, the article content, and the like. The collected public discussion of the project publicity data may include the forum board title, the posting time, the poster, the commenter, the comment time, the comment content, and the like.
For the network media platforms on which network public opinion appears, such as microblogs, news websites and forums, network public opinion data can be collected through the API interfaces provided by social media and through HTML parsing. A high-performance web crawler strategy can be designed, for example a multi-threaded web crawler in which multiple machines crawl data concurrently, so as to improve data capture efficiency and achieve real-time acquisition and automatic updating of network public opinion data.
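The multi-threaded collection described above might be sketched as follows. This is a minimal illustration rather than the method's actual crawler: the `crawl` function and the caller-supplied `fetch` callback are hypothetical names, and distribution across multiple machines is not shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def crawl(urls: Iterable[str],
          fetch: Callable[[str], str],
          max_workers: int = 8) -> Dict[str, str]:
    """Fetch every URL on a thread pool and return {url: content}.

    `fetch` is a hypothetical, caller-supplied function (for example a
    wrapper around an HTTP client or a platform API). URLs whose fetch
    raises an exception are skipped.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the URL it is fetching.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # A failed page should not stop the rest of the crawl.
                continue
    return results
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library or social media API.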
Public opinion data comprises not only text data from network media but also heterogeneous social network data such as forwarding counts and forwarding relationships, and is characterized by large volume, short timeliness, abundant sources, complex form and lack of structure. Therefore, in step S2, the public opinion data collected in step S1 is preprocessed so that it is converted into structured public opinion information.
In step S2, preprocessing the public opinion data may specifically include: identifying and removing useless characters from the public opinion data, and/or performing word segmentation on the public opinion data and removing its stop words, and/or extracting keywords of the public opinion data based on a word frequency statistics method, and/or extracting entity names from the public opinion data based on a named entity recognition method, and/or clustering the public opinion data based on a topic clustering method, and/or extracting topics from the public opinion data based on a co-word analysis method, and/or extracting text with emotional tendency from the public opinion data based on a text mining technology.
The useless characters are punctuation marks or emoticons that express no emotion, such as commas and periods. The stop words are words that contribute nothing to information extraction, such as "on".
The word frequency statistical method is a weighting technique commonly used in information retrieval and text mining to evaluate how important a word is to a document within a document set or corpus. The importance of a word increases in proportion to the number of times it occurs in the document, so words with higher word frequency can be extracted as keywords of the public opinion data.
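The removal of useless characters and stop words followed by word-frequency keyword extraction might look like the sketch below. The stop-word list, the regular-expression tokenizer (which stands in for a real word segmenter) and the function names are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "on", "of", "is", "to", "and"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep only alphanumeric word tokens,
    which drops punctuation and other non-word characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(text: str,
                     stop_words: Iterable[str] = STOP_WORDS,
                     top_n: int = 3) -> List[Tuple[str, int]]:
    """Return the top_n most frequent non-stop-word tokens with counts."""
    stop = set(stop_words)
    counts = Counter(tok for tok in tokenize(text) if tok not in stop)
    return counts.most_common(top_n)
```

Raw term frequency is used here; an inverse-document-frequency weight could be added when a corpus is available.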
Named entity recognition (NER) refers to recognizing special objects in text whose semantic categories are usually predefined before recognition, for example persons, addresses and organizations. The named entity recognition method can extract entities such as place names, organization names, times, person names and events from the public opinion data.
The topic clustering method may adopt a k-means clustering algorithm, an agglomerative hierarchical clustering algorithm, a neural network clustering algorithm, or the like. The topic clustering method can be used to group public opinion data with similar topics.
The co-word analysis method counts the number of times each pair of words in a group appears together in the same document and performs cluster analysis on the words based on these counts, thereby reflecting how closely the words are related and, in turn, the structure and evolution of the subjects that the words represent. Co-word analysis can be performed on either the subject words or the keywords of a document. Public opinion topics can be extracted based on the co-word analysis method, enabling topic association analysis and hot spot detection.
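The pairwise counting step of co-word analysis can be illustrated with a short sketch. The function name `co_word_counts` is hypothetical, and the subsequent cluster analysis on the counts is not shown.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def co_word_counts(documents: Iterable[List[str]]) -> Counter:
    """Count, for each unordered word pair, how many documents contain both.

    Each document is given as a list of its keywords (or subject words).
    Pairs are stored as alphabetically sorted tuples.
    """
    counts: Counter = Counter()
    for words in documents:
        unique = sorted(set(words))  # ignore repeats within one document
        counts.update(combinations(unique, 2))
    return counts
```

The resulting pair counts form a co-occurrence matrix on which a clustering algorithm can then be run.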
Text mining refers to computer processing techniques that extract valuable information and knowledge from text data. Text with emotional tendency in the public opinion data is extracted by text mining technology.
By removing useless characters and stop words while extracting keywords, entity names, public opinion topics and text with emotional tendency, the public opinion data can be preliminarily classified to form structured public opinion information.
In step S3, to obtain a structural model suited to parallel processing of public opinion big data while taking the relevance characteristics of public opinion information into account, a multi-granularity public opinion spatio-temporal object description attribute framework is constructed, and the structured public opinion information is generated into public opinion objects under this framework. The temporal characteristics, spatial characteristics, evolution and propagation characteristics of public opinion entities can thus be described abstractly, so that massive multi-source public opinion data can be processed rapidly and valuable public opinion information obtained.
Specifically, the definition of the public opinion object may be as follows: public opinion object= { object unique Identification (ID), object category, spatiotemporal information, semantic information, affective information, relationship information }. The public opinion information such as the space-time attribute, the text topic, the subject content, the emotion tendency, the relation with other objects and the like extracted from the public opinion information can be expressed through five types of attributes of the multi-granularity public opinion object, namely object category, space-time information, semantic information, emotion information and relation information.
The object category may comprise a text category, a topic category and a theme category, wherein a public opinion object of the text category is generated from the text expressed in the public opinion information, a public opinion object of the topic category is generated from the topic expressed in the public opinion information, and a public opinion object of the theme category is generated from the theme expressed in the public opinion information.
The object category indicates the type of object on which public opinion data processing and analysis operates, and different object categories involve different public opinion information expression models. In order of increasing semantic abstraction, public opinion objects are divided into the three categories of text, topic and theme. A text-category public opinion object is a description model constructed for a public opinion text and expresses the public opinion information contained in the text. A topic-category public opinion object is a description model constructed for a public opinion topic and expresses the public opinion information contained in the topic. A theme-category public opinion object is a description model constructed for a public opinion theme and expresses the public opinion information contained in the theme.
The spatiotemporal information may include temporal information and spatial information for expressing the time and space of occurrence and termination of public opinion, evolution and propagation in time and space, etc.
Specifically, the definition of the spatio-temporal information may be as follows:
spatio-temporal information= { temporal information, spatial information };
time information= { time granularity, [ start time, end time ] };
spatial information = { spatial granularity, spatial position }.
The semantic information may include semantic granularity and semantic content for expressing content of the public opinion text, and the semantic information of each public opinion object may be divided into a plurality of semantic granularities, each semantic granularity may have a plurality of semantic records, and each semantic record has a respective number and semantic content.
Specifically, the definition of the semantic information may be as follows:
semantic information = {
(semantic granularity 1, ([ number 1.1, semantic content ], [ number 1.2, semantic content ], … …)),
(semantic granularity 2, ([ number 2.1, semantic content ], [ number 2.2, semantic content ], … …)),
(semantic granularity 3, ([ number 3.1, semantic content ], [ number 3.2, semantic content ], … …)), … … }.
The emotion information may include an emotion subject, an emotion object, an emotion category and an emotion intensity, and is used to express the emotional content of public opinion. One public opinion object may have a plurality of emotion records, and the emotion content is expressed structurally by these four members. The "emotion subject" is the party expressing the emotion, typically an individual netizen or a network media platform. The "emotion object" is the target of the emotion, such as an attribute of a commodity or the content of a rumor. The "emotion category" is the kind of emotion, including happiness, sadness, anger, approval, objection, suspicion, and the like. The "emotion intensity" is a score of how strong the emotion is; it can be represented by a number and quantified by an emotion classification method or the like. Based on this quadruple structure, opinions and emotions in public opinion can be mined, so that the changes in opinions and emotions as the public opinion situation develops can be analyzed, providing data support for the subsequent analysis of public opinion objects.
Specifically, the definition of emotion information may be as follows:
emotion information = { [ Emotion record 1, emotion content ], [ Emotion record 2, emotion content ], [ Emotion record i, emotion content ], … … } (i > = 1);
emotion content= { emotion subject, emotion object, emotion category, emotion intensity }.
The emotion information may be generated from public opinion information by: extracting emotion subjects and emotion objects from public opinion information based on a named entity recognition method; identifying emotion words from emotion object contexts based on an association rule mining method, and determining emotion categories according to the emotion words; and judging the emotion strength according to the emotion words based on the emotion tendency judging method.
The emotion tendency judging method may be based on word frequency statistics: the emotion intensity is judged by calculating the co-occurrence frequency between the emotion vocabulary and the emotion reference words corresponding to the determined emotion category, and is expressed through the emotion polarity of the emotion vocabulary. Specifically, the emotion polarity can be judged as follows: if the number of positive emotion words is greater than the number of negative emotion words, the emotion polarity is positive; if the two numbers are equal, the emotion polarity is neutral; if the number of positive emotion words is less than the number of negative emotion words, the emotion polarity is negative.
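The polarity rule above reduces to a comparison of two counts, as the following sketch shows. The function name and the tiny word lists used in the usage example are hypothetical; negation words and degree adverbs are deliberately not handled here.

```python
from typing import Iterable, Set


def emotion_polarity(tokens: Iterable[str],
                     positive_words: Set[str],
                     negative_words: Set[str]) -> str:
    """Compare the counts of positive and negative emotion words.

    Returns "positive", "negative" or "neutral" following the rule:
    more positive words -> positive, equal -> neutral, fewer -> negative.
    """
    token_list = list(tokens)  # allow generators; we iterate twice
    pos = sum(1 for tok in token_list if tok in positive_words)
    neg = sum(1 for tok in token_list if tok in negative_words)
    if pos > neg:
        return "positive"
    if pos < neg:
        return "negative"
    return "neutral"
```

For example, with the illustrative lexicons `{"good", "great"}` and `{"bad", "noisy"}`, the tokens `["great", "bridge", "noisy", "bad"]` contain one positive and two negative words and are therefore judged negative.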
The emotion vocabulary may include emotion words, negatives, degree adverbs and symbolic expressions.
The emotion words can be taken from an emotion dictionary such as HowNet or ANTUSD, where HowNet contains 4566 positive emotion words and 4370 negative emotion words, and ANTUSD contains 2810 positive emotion words and 8276 negative emotion words.
Negation words are words that can reverse the emotion polarity of a text, turning positive emotion into negative emotion or negative emotion into positive emotion. Besides ordinary negation there are also double and multiple negations, so recognizing negation words is critical for the subsequent determination of the emotion category and judgment of the emotion intensity.
The degree adverbs are mainly modification of corresponding adjectives or adverbs in the text, and usually appear in front of the adjectives or the adverbs, and in emotion analysis, the degree adverbs can weaken or strengthen the emotion intensity of the emotion words, so that the degree adverbs are considered when judging the emotion intensity, and the judgment accuracy of the emotion intensity can be improved.
People often use emoticons to convey emotion on social platforms; emoticons can add humor to a text and can also resolve its ambiguity. Emoticons are typically used in the following cases: (1) when the text alone does not express the emotion well, as in "how can there be such a person [anger]", where the emoticon [anger] conveys the user's anger; (2) to disambiguate the text, as in "life is truly meaningful [tear]", where analysis of the text alone would judge the polarity positive because of "meaningful", whereas the sentence is clearly negative and the emoticon [tear] conveys the correct emotion category; (3) to strengthen the emotion of the text, as in "the movie is so good [too happy]", where the emoticon [too happy] strengthens the emotion of the text.
The relationship information may include relationship categories and other public opinion objects having relationships with the public opinion objects, wherein the relationship categories may be association relationships, aggregation relationships, dependency relationships, and the like. The relationship information is used to express the relationship between the public opinion objects.
Specifically, the definition of the relationship information may be as follows:
relation information = {
(relationship category 1, ([ number 1.1, object ID ], [ number 1.2, object ID ], … …)),
(relationship category 2, ([ number 2.1, object ID ], [ number 2.2, object ID ], … …)),
(relationship category 3, ([ number 3.1, object ID ], [ number 3.2, object ID ], … …)) … … }.
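The public opinion object and its five attribute types defined above could be represented with data classes along the following lines. All class and field names are assumptions introduced for illustration; the source only fixes the conceptual structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TimeInfo:
    granularity: str   # time granularity, e.g. "day"
    start: str         # start time (a plain string in this sketch)
    end: str           # end time


@dataclass
class SpaceInfo:
    granularity: str   # spatial granularity, e.g. "city"
    position: str      # spatial position


@dataclass
class EmotionContent:
    subject: str       # who expresses the emotion
    target: str        # the "emotion object" the emotion is aimed at
    category: str      # e.g. "anger", "approval"
    intensity: float   # numeric score of emotion strength


@dataclass
class PublicOpinionObject:
    object_id: str
    category: str      # "text", "topic" or "theme"
    time: TimeInfo
    space: SpaceInfo
    # semantic granularity -> [(number, semantic content), ...]
    semantics: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)
    # one entry per emotion record
    emotions: List[EmotionContent] = field(default_factory=list)
    # relationship category -> [related object ID, ...]
    relations: Dict[str, List[str]] = field(default_factory=dict)
```

Using `default_factory` gives every object its own empty containers, so records appended to one object never leak into another.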
In step S4, the public opinion case base may store historical public opinion cases, and the historical public opinion cases may be stored in the public opinion case base in the form of event objects, where the event objects may include event identification, start time, end time, event topic, event keywords, and event profile. An object-oriented technology is introduced to model the historical public opinion control cases as objects and serve as an independent knowledge unit in a public opinion case base, so that more complex knowledge identification and knowledge reasoning can be performed.
When similarity matching is performed between the public opinion object and the public opinion cases in the public opinion case library, the five types of attributes of the public opinion object under the multi-granularity public opinion spatio-temporal object description attribute framework can be analyzed in detail, with the characteristics of each attribute fully considered and different attributes handled differently, so that the similarity between attributes is mined thoroughly. The most similar public opinion case is thereby matched more scientifically and reasonably, and the public opinion control scheme obtained in step S5 by analyzing the most similar public opinion case is more practical.
Different public opinion cases have different case structures, and both the case information and the description of the case control scheme may be incomplete. Analyzing the structural similarity between the public opinion object and the public opinion cases handles such missing attributes well. Thus, as shown in fig. 2, step S4 may specifically include:
S41, performing structural similarity matching between the attribute structure of the public opinion object and the attribute structures of the public opinion cases in the public opinion case library to obtain the event type to which the public opinion object belongs;
S42, performing attribute similarity matching between the attribute values of the public opinion object and the attribute values of the public opinion cases in the library that belong to that event type to obtain the most similar public opinion case.
Firstly, in step S41, structural similarity matching is performed on the attribute structure, event types to which public opinion objects belong are matched, and then in step S42, attribute similarity matching is performed on attribute values according to the matched event types, so that matching accuracy can be greatly improved.
Specifically, according to the diversity of the attribute values of the public opinion object and by combining the characteristics of the public opinion cases, the following three types of attribute similarity matching can be flexibly adopted:
(1) Numeric attribute similarity matching: numeric attributes are typically represented by definite numbers, either continuous or discrete. The similarity of two values is calculated with a distance measure such as the Hamming distance or the Euclidean distance;
(2) Symbolic attribute similarity matching: symbolic attributes are usually represented by explicit symbols, such as the case publication time or the event location. Symbolic attribute values have no quantitative relationship, only a same (or contained) or different relationship, so the similarity of two symbols can be judged directly by whether their values are the same;
(3) Fuzzy attribute similarity matching: fuzzy attributes include fuzzy semantic attributes, fuzzy number attributes, fuzzy interval attributes, and the like. The similarity of two fuzzy attributes is calculated through a membership function such as a trapezoidal, triangular or Gaussian function.
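The three kinds of attribute similarity listed above can be sketched as follows. The source names the distance measures and membership functions but not how a distance is turned into a similarity or how two memberships are compared, so the `1 / (1 + d)` mapping and the `1 - |difference|` comparison are assumptions, as are all function names.

```python
import math
from typing import Sequence


def numeric_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity from Euclidean distance d, mapped to (0, 1] as 1 / (1 + d).

    The mapping is an assumption; the source only names the distance.
    """
    return 1.0 / (1.0 + math.dist(a, b))


def symbol_similarity(a: str, b: str) -> float:
    """Symbolic attributes are either the same (1.0) or different (0.0)."""
    return 1.0 if a == b else 0.0


def triangular_membership(x: float, low: float, peak: float, high: float) -> float:
    """Triangular membership function; requires low < peak < high."""
    if not low < peak < high:
        raise ValueError("expected low < peak < high")
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


def fuzzy_similarity(x: float, y: float,
                     low: float, peak: float, high: float) -> float:
    """Assumed comparison: 1 minus the gap between the two memberships."""
    mu_x = triangular_membership(x, low, peak, high)
    mu_y = triangular_membership(y, low, peak, high)
    return 1.0 - abs(mu_x - mu_y)
```

A trapezoidal or Gaussian membership function could be substituted for the triangular one without changing the comparison step.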
Treating a matched case as the most similar public opinion case with similarity as the sole criterion lacks credibility, so confidence analysis can be combined with similarity matching. Step S4 may further include:
S43, presetting confidence indexes and establishing a confidence decision tree;
S44, analyzing, according to the confidence decision tree, whether the attributes (such as spatio-temporal information, semantic information and emotion information) of the public opinion object are credible.
Although structural similarity matching and attribute similarity matching yield a more accurate similarity, their time cost is relatively high, and the required time grows in proportion as the public opinion case library grows. Therefore, similarity matching based on a Bayesian probability model can be adopted to reduce the time cost of matching.
In step S41, a given public opinion object is denoted as X, and the i-th event type in the public opinion case library is considered, each event type consisting of a plurality of public opinion cases;
the structural similarity between public opinion object X and the i-th event type is calculated according to the following formula:
[formula not recoverable from the source text];
where the formula involves an empirical factor, the number of attributes of public opinion object X, the number of attributes that public opinion object X and the i-th event type have in common, and the number of attributes of the i-th event type.
According to the structural similarity, the event type to which public opinion object X belongs can be determined.
Specifically, a threshold may be preset; when the calculated maximum structural similarity is greater than the preset threshold, public opinion object X can be considered to belong to the corresponding event type.
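The thresholded selection of an event type can be sketched as below. The source formula for the structural similarity did not survive extraction, so the weighted-overlap form `alpha * k / m + (1 - alpha) * k / n` used here is purely an assumption, chosen only because it uses the same four quantities the text names (an empirical factor, the object's attribute count m, the shared attribute count k, and the event type's attribute count n). Function names are hypothetical.

```python
from typing import Dict, Optional, Set


def structural_similarity(object_attrs: Set[str],
                          type_attrs: Set[str],
                          alpha: float = 0.5) -> float:
    """ASSUMED formula: alpha * k / m + (1 - alpha) * k / n.

    m = number of attributes of the object, n = number of attributes of
    the event type, k = number of attributes they share, alpha = an
    empirical factor in [0, 1]. Not the source's own formula.
    """
    m, n = len(object_attrs), len(type_attrs)
    if m == 0 or n == 0:
        return 0.0
    k = len(object_attrs & type_attrs)
    return alpha * k / m + (1 - alpha) * k / n


def match_event_type(object_attrs: Set[str],
                     event_types: Dict[str, Set[str]],
                     threshold: float,
                     alpha: float = 0.5) -> Optional[str]:
    """Return the event type with the maximum structural similarity,
    or None when that maximum does not exceed the preset threshold."""
    best_name, best_sim = None, 0.0
    for name, attrs in event_types.items():
        sim = structural_similarity(object_attrs, attrs, alpha)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim > threshold else None
```

Returning `None` below the threshold makes the "no event type matches" outcome explicit for the caller.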
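In step S42, a given public opinion object is denoted as X, the event type to which public opinion object X belongs is the event type determined in step S41, and the public opinion cases in the public opinion case library that belong to that event type are considered;
Because the attributes are independent of one another, that is, no dependencies exist between the condition attributes, the conditional probability can be calculated as:
[two formulas not recoverable from the source text];
then there are:
[two formulas not recoverable from the source text].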
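For orientation, the textbook naive Bayes factorization that follows from attribute independence is sketched below. The symbols $C_i$ (a candidate class of cases), $x_j$ (the j-th attribute of X) and $d$ (the number of attributes) are assumptions; this is the standard form, not a reproduction of the source's formulas, which may differ, for example by including attribute weights.

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}, \qquad
P(X \mid C_i) = \prod_{j=1}^{d} P(x_j \mid C_i)
\quad\Longrightarrow\quad
P(C_i \mid X) \propto P(C_i) \prod_{j=1}^{d} P(x_j \mid C_i)
```

Since $P(X)$ is the same for every candidate, it can be dropped when only the ranking of candidates matters.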
In summary, the conditional probability can be calculated as follows:
[three formulas not recoverable from the source text];
where the formulas involve: the number of public opinion cases in the public opinion case library that belong to the event type and whose attributes match those of public opinion object X; the total number of public opinion cases in the public opinion case library; each attribute of public opinion object X; the number of public opinion cases of the event type that have that attribute; and the attribute weights of the event type.
According to the conditional probability, the most similar public opinion case can be determined.
Specifically, the one or more public opinion cases with the greatest conditional probability may be regarded as the most similar public opinion cases of public opinion object X. The public opinion case corresponding to the maximum conditional probability may be calculated according to the following equation:
[formula not recoverable from the source text].
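Selecting the candidate with the maximum conditional probability might be sketched as a weighted naive Bayes score computed in log space. This is an illustration under stated assumptions, not the source's formula: the grouping of cases into named candidates, the use of attribute weights as exponents, and the add-one smoothing are all assumptions, as are the function names.

```python
import math
from typing import Dict, List, Optional

Case = Dict[str, str]  # attribute name -> attribute value


def log_score(obj: Case, cases: List[Case], total_cases: int,
              weights: Dict[str, float]) -> float:
    """Weighted naive Bayes log-score of one candidate group of cases.

    score = log P(group) + sum_j w_j * log P(x_j | group), where
    P(group) = len(cases) / total_cases and P(x_j | group) is the
    smoothed share of cases in the group whose attribute j equals x_j.
    """
    n = len(cases)
    score = math.log(n / total_cases)  # prior of the group
    for attr, value in obj.items():
        matches = sum(1 for case in cases if case.get(attr) == value)
        # Add-one smoothing (an assumption) avoids log(0) for unseen values.
        likelihood = (matches + 1) / (n + 2)
        score += weights.get(attr, 1.0) * math.log(likelihood)
    return score


def most_similar_group(obj: Case, groups: Dict[str, List[Case]],
                       weights: Dict[str, float]) -> Optional[str]:
    """Return the name of the group with the highest score (the argmax)."""
    total = sum(len(cases) for cases in groups.values())
    best, best_score = None, -math.inf
    for name, cases in groups.items():
        if not cases:
            continue  # an empty group has zero prior probability
        score = log_score(obj, cases, total, weights)
        if score > best_score:
            best, best_score = name, score
    return best
```

Working with log-probabilities keeps the product of many small likelihoods numerically stable, and the counts can be precomputed once per case library, which is where the time saving over pairwise similarity matching would come from.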
Step S5 may specifically be: deriving, in light of the actual situation, a solution from that of the most similar public opinion case, thereby obtaining the public opinion control scheme for the current public opinion event.
Alternatively, step S5 may specifically be: analyzing the differences and similarities between the current public opinion event and the most similar public opinion case, and adjusting the handling strategy of the most similar public opinion case accordingly to obtain the public opinion control scheme for the current public opinion event.
Through the steps S1 to S5, the collected public opinion data can be fully mined, data support is provided for subsequent similarity matching with public opinion cases and public opinion control scheme analysis, and scientific decision reference is provided for supervision of network public opinion.
It should be understood that the foregoing examples of the present invention are merely illustrative of the present invention and are not intended to limit the present invention to the specific embodiments thereof. Any modification, equivalent replacement, improvement, etc. that comes within the spirit and principle of the claims of the present invention should be included in the protection scope of the claims of the present invention.