CN112035658B - Enterprise public opinion monitoring method based on deep learning - Google Patents

Enterprise public opinion monitoring method based on deep learning

Info

Publication number
CN112035658B
Authority
CN
China
Prior art keywords
information
topic
word
value
sentence
Prior art date
Legal status
Active
Application number
CN202010784664.7A
Other languages
Chinese (zh)
Other versions
CN112035658A (en
Inventor
钟贞炎
林三吉
陈丰生
Current Assignee
Haina Zhiyuan Digital Technology Shanghai Co ltd
Original Assignee
Haina Zhiyuan Digital Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Haina Zhiyuan Digital Technology Shanghai Co ltd
Priority to CN202010784664.7A
Publication of CN112035658A
Application granted
Publication of CN112035658B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an enterprise public opinion monitoring method based on deep learning, comprising the following steps: topic classification, theme and hotword extraction, information tonality analysis, topic tonality analysis, sound volume and life value analysis of topics and information, associated information recommendation, and information abstract extraction. The method groups massive volumes of public opinion information into topics and deeply mines the focal public opinion within each topic. Its lightweight model design includes a method for adjusting the classification probabilities of information comments under sample imbalance, which greatly improves the accuracy and F1 value of the model. Emotion words are expanded using the near-synonym function of the Tencent word vectors, which saves a great deal of labor cost. Multidimensional indexes are developed to track and display the propagation status and development trend of public opinion from all angles, and the method provides abstracts of long-form information together with clean recommendations of related information.

Description

Enterprise public opinion monitoring method based on deep learning
Technical Field
The invention relates to an enterprise public opinion information monitoring technology and a related public opinion analysis method thereof, in particular to an enterprise public opinion monitoring method based on deep learning.
Background
Public opinion monitoring is an important research direction in natural language processing. It is used to survey opinion broadly, to identify what the public is paying attention to, and to explore public views on those focal points in depth. An enterprise public opinion monitoring system intelligently extracts and analyzes information related to an enterprise, helping users learn of enterprise public opinion in time, follow market and industry dynamics, and improve risk management and control. At present, most enterprise public opinion monitoring relies on manual monitoring and processing, so enterprise public opinion information is missed or tracked late. Moreover, when the propagation and development trend of enterprise public opinion is followed only by hand, the enterprise cannot keep public opinion under control or monitor its changes in real time, and it is difficult to lay a foundation for the enterprise's marketing strategy. Although public opinion monitoring products are already on the market, most of them merely integrate simple data acquisition, storage, and query. They lack deep mining of the data, such as topic classification, information emotion judgment, and information tonality analysis, so enterprises find it difficult to grasp the development trend of public opinion and anticipate public relations crises.
Disclosure of Invention
In view of the problems and shortcomings of the prior art, the invention provides an enterprise public opinion monitoring method based on deep learning that tracks enterprise information automatically, comprehensively, rapidly, and accurately. Its main functions are the automatic collection and organization of online public opinion information, the indexing and retrieval of that information, and its intelligent analysis and emotion judgment.
The invention solves the above technical problems through the following technical solution:
The invention provides an enterprise public opinion monitoring method based on deep learning, characterized by comprising the following steps:
Topic classification: acquire the title and content of each piece of information and calculate the Jaccard similarity of the titles of any two pieces of information. If the Jaccard similarity is greater than a first set threshold, classify the two pieces of information into the same topic; otherwise, perform Simhash coding on the contents of the two pieces of information and calculate the Hamming distance between the two Simhash codes. If the Hamming distance is greater than a second set threshold, classify the two pieces of information into the same topic; otherwise, treat them as belonging to different topics. After classification, the title of the most recently published information in a topic is displayed as the title of the topic;
Theme and hotword extraction: using a predefined theme-hotword dictionary, search for the hotwords contained in all information under the same topic, and obtain the set of themes related to that topic through the mapping between themes and hotwords; extract keywords from all information content under the same topic as a candidate lexicon of theme hotwords, load the Tencent word vectors, calculate the similarity between each theme and each keyword, and assign keywords whose similarity is greater than a third set threshold to the hotwords of the corresponding theme, thereby constructing the correspondence between keywords and themes;
Information tonality analysis: train a deep network model on labeled information comment data, the network architecture comprising an input layer, an embedding layer, a bidirectional GRU layer, a global pooling layer, and a Softmax layer;
Topic tonality analysis: for each piece of information under the same topic, extract keywords and screen out those with an emotional tendency, expand them with near-synonyms to form the emotion words of that piece of information, take a weighted sum of the emotion words to judge its emotional tendency, and finally analyze the proportions of positive and negative information under the same topic;
Sound volume and life value analysis of topics and information: when counting the sound volume of information and of the corresponding topics, dimensions such as forwarding count, comment count, like count, netizen participation, quality of the publishing user, exposure, and heat are considered, and statistics are compiled in real time by the hour and by the day, so that the propagation of public opinion is tracked from all angles; changes in the life value reflect the trend of public opinion development, and the life value is derived from the heat value within the sound volume;
Associated information recommendation: two thresholds are set to define an interval; under the same topic, information whose similarity lies between the two thresholds is selected as candidate associated information; the similarity between each candidate and the pieces of information already in the associated information set is calculated, and whether the minimum of these similarities is smaller than the larger of the two thresholds determines whether the candidate can be added to the associated information set;
Information abstract extraction: the importance of every sentence in the information is evaluated along dimensions that include the position of the sentence, its length, and how well it summarizes the whole text; the values of these dimensions are quantized and fused by weighting, and the several sentences with the highest evaluation values are selected to form an automatic abstract.
Preferably, the step of theme and hotword extraction includes:
Step S1: using a predefined theme-hotword dictionary, search for the hotwords contained in all information under the same topic;
Step S2: obtain the set of themes related to that topic through the mapping between themes and hotwords;
Step S3: segment all information content under the same topic with the word segmentation tool jieba and remove stop words according to a predefined stop-word list, so that noise words do not affect keyword retrieval;
Step S4: from the segmented words with stop words removed, extract as keywords the words whose term frequency-inverse document frequency (TF-IDF) is greater than a threshold μ, where TF-IDF is calculated as:
$$\text{TF-IDF}(w_i) = \mathrm{TF}(w_i) \cdot \mathrm{IDF}(w_i) \tag{2}$$
In formula (2), $\mathrm{TF}(w_i)$ is the frequency of the word $w_i$ in the current sentence and $\mathrm{IDF}(w_i)$ is the inverse document frequency of $w_i$, with value $\log[(N+1)/(N(w_i)+1)]+1$, where $N$ is the number of sentences in the information and $N(w_i)$ is the number of sentences containing the word $w_i$;
Step S5: query the 200-dimensional Tencent word vectors, stored in the database in advance, that correspond to the keywords and the themes; if no word vector is found for an extracted keyword, generate a random 200-dimensional vector in its place.
Step S6: calculate the cosine similarity between each theme and every keyword:
$$\mathrm{sim}(T_i, w_j) = \frac{T_i \cdot w_j}{\lVert T_i \rVert \, \lVert w_j \rVert} \tag{3}$$
In formula (3), $T_i$ is the word vector of the i-th theme, $w_j$ is the word vector of the j-th keyword, $\cdot$ denotes the vector inner product, and $\lVert T_i \rVert$ and $\lVert w_j \rVert$ are the norms of the theme and keyword word vectors respectively; determine whether the similarity obtained from formula (3) is greater than a given threshold δ;
Step S7: if the similarity obtained from formula (3) is greater than the given threshold δ, add the keyword to the hotwords of the corresponding theme.
Preferably, in the step of information tonality analysis, a Dropout layer is added before the bidirectional GRU layer.
Preferably, in the step of topic tonality analysis, the title and the content of each piece of information are analyzed in a tiered manner. A weighted sum is first taken over the emotion words in the title and the polarity is judged; if the title has a definite emotional tendency, the title polarity is used directly as the emotional tendency of the information. If the title polarity is neutral, a weighted sum is taken over the emotion words in the body text to obtain the information emotion value Senti, which serves as the basis for judging the emotional polarity of the information:
Finally, the proportions of positive and negative information are analyzed according to the emotion polarity results SENTIMENT of the information under the same topic.
Preferably, the step of sound volume and life value analysis of topics and information includes:
Step S1: track the forwarding count, comment count, and like count of the information every hour, and calculate the exposure, heat, life value, netizen participation, and publishing-user quality of the information;
The exposure of the information, exposure_value, is defined as a weighted sum of the forwarding count forward_times, the comment count reply_times, and the like count positive_times:
$$\text{exposure\_value} = \alpha \cdot \text{forward\_times} + \beta \cdot \text{reply\_times} + \gamma \cdot \text{positive\_times} \tag{6}$$
In formula (6), α, β, and γ are weight coefficients;
The heat of the information, heat_value, is defined as the ratio of the exposure to the time interval:
$$\text{heat\_value} = \frac{\text{exposure\_value}}{\text{date\_now} - \text{public\_time}} \tag{7}$$
In formula (7), date_now is the current date and public_time is the release date of the information;
The participation of netizens in the information, participation_level, is defined as:
In formula (8), θ is a constant;
The quality of the publishing user, user_quality, is defined as:
In formula (9), the numerator is the number of likes on the comments of the information;
The life value of the information, life_value, is defined as:
In formula (10), θ is the same as in formula (8);
Step S2: after the sound volume and life value of the information have been computed in step S1, calculate the sound volume and life value of the topic. The exposure of a topic is defined as the sum of the exposures of all information under the topic; the participation of a topic is defined as in formula (8), except that exposure_value denotes the exposure of the topic; the user quality and heat of a topic are the averages of the publishing-user quality and heat of all information under the topic, respectively; the life value of a topic is defined as in formula (10), except that heat_value denotes the heat of the topic;
Step S3: after the hourly statistics of the sound volume and life value of the information for the day have been completed in step S1, extract the result for the last hour of the day as the daily sound volume and life value of the information;
Step S4: after the statistics of the sound volume and life value of the topic for the 24 hours of the day have been completed in step S2, extract the data for 23:00 of that day as the daily sound volume and life value of the topic.
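The exposure and heat definitions and the daily roll-up in steps S1 to S4 can be illustrated with a minimal Python sketch. It covers only formulas (6) and (7) and the topic-level aggregation of exposure and heat; formulas (8) to (10) are not implemented. The default weights, the use of hours as the time unit, and the function names topic_exposure, topic_heat, and daily_snapshot are assumptions made for illustration.

```python
from datetime import datetime


def exposure_value(forward_times, reply_times, positive_times,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Formula (6): weighted sum of forwards, comments and likes.
    The default weights are placeholders, not values from the source."""
    return alpha * forward_times + beta * reply_times + gamma * positive_times


def heat_value(exposure, date_now, public_time):
    """Formula (7): exposure divided by the time since release.
    Measuring the interval in hours is an assumption."""
    hours = (date_now - public_time).total_seconds() / 3600.0
    if hours <= 0:
        return 0.0  # guard against a zero or negative interval
    return exposure / hours


def topic_exposure(info_exposures):
    """Topic exposure: sum of the exposures of all information in the topic."""
    return sum(info_exposures)


def topic_heat(info_heats):
    """Topic heat: average heat of all information in the topic."""
    return sum(info_heats) / len(info_heats) if info_heats else 0.0


def daily_snapshot(hourly_metrics):
    """Daily figure: the metrics recorded for the latest hour of the day.
    hourly_metrics maps an hour (0-23) to that hour's metrics."""
    return hourly_metrics[max(hourly_metrics)]
```

For a complete day, daily_snapshot returns the entry recorded for hour 23, which matches the 23:00 extraction described in step S4.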
Preferably, the step of associated information recommendation includes:
Step S1: acquire from a database the information documents under the same topic that contain the name of the entity enterprise;
Step S2: perform word segmentation and stop-word removal on each document in the selected set of information containing the enterprise name under the same topic;
Step S3: traverse each piece of segmented information in the set and calculate its BM25 similarity with every remaining piece of information in the set, where the BM25 similarity is:
$$\mathrm{Score}(Q, D) = \sum_{i=1}^{m} w_i \cdot R(q_i, D) \tag{11}$$
In formula (11), $m$ is the number of words in the target information $Q$ after word segmentation, $w_i$ is the IDF value of the i-th word, and $R(q_i, D)$ is the relevance between each word $q_i$ in $Q$ and the remaining segmented information $D$, defined as:
$$R(q_i, D) = \frac{f_i\,(k_1 + 1)}{f_i + K} \cdot \frac{qf_i\,(k_2 + 1)}{qf_i + k_2}, \qquad K = k_1\left(1 - b + b \cdot \frac{dl}{avgdl}\right) \tag{12}$$
In formula (12), $k_1$, $k_2$, and $b$ are adjusting factors, $f_i$ is the frequency of the word $q_i$ in $D$, $qf_i$ is the frequency of the word $q_i$ in $Q$, $dl$ is the length of $D$, and $avgdl$ is the average length of all the segmented information screened in step S1;
Step S4: determine whether the BM25 similarity between the target information Q and the candidate information D lies within the threshold interval (α, β); if so, add the information D to the associated information set of the target information.
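Steps S3 and S4 can be sketched in Python as follows. This is a minimal sketch of the standard BM25 form with the query-frequency factor, assuming already segmented token lists. The IDF formula is not given in the text, so a common smoothed variant is used as an assumption, and the default values of k1, k2, and b as well as the function names are illustrative only.

```python
import math
from collections import Counter


def bm25(query, doc, corpus, k1=1.5, k2=1.0, b=0.75):
    """BM25 score of token list `doc` against token list `query`.
    `corpus` is the list of all segmented documents (used for IDF and avgdl)."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    f = Counter(doc)      # f_i: frequency of each word in D
    qf = Counter(query)   # qf_i: frequency of each word in Q
    K = k1 * (1 - b + b * len(doc) / avgdl)
    score = 0.0
    for term, q_count in qf.items():
        if f[term] == 0:
            continue  # a word absent from D contributes nothing
        n_t = sum(1 for d in corpus if term in d)
        # Assumed IDF variant (smoothed, always positive).
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
        r = (f[term] * (k1 + 1) / (f[term] + K)) * (q_count * (k2 + 1) / (q_count + k2))
        score += idf * r
    return score


def associated(target_idx, corpus, low, high):
    """Indices of documents whose BM25 score against the target lies
    strictly inside the interval (low, high)."""
    target = corpus[target_idx]
    return [j for j, doc in enumerate(corpus)
            if j != target_idx and low < bm25(target, doc, corpus) < high]
```

The upper bound of the interval excludes documents that are near-duplicates of the target, while the lower bound excludes unrelated ones.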
Preferably, the step of information abstract extraction includes:
Step S1: clean and filter the information content in the topic according to set rules;
Step S2: split the information text into a set of sentences at periods, question marks, and exclamation marks, and perform word segmentation and stop-word removal on each sentence;
Step S3: train word vectors with a deep network, and form a sentence vector as the weighted sum of the vectors of the words in the sentence;
Step S4: calculate the cosine similarity of each sentence in the sentence set with the information title, with the subset of remaining sentences, and with the current abstract result set, where the vector of a sentence subset is obtained by summing the vectors of all its sentences and taking the average;
Step S5: first calculate the score of each sentence $v_i$:
$$\mathrm{score}(v_i) = \alpha \cdot \mathrm{sim}(v_i, S) + \beta \cdot \mathrm{sim}(v_i, T) + \gamma \cdot \mathrm{loc}(v_i) \tag{13}$$
In formula (13), α, β, and γ are weight coefficients, $\mathrm{sim}(v_i, S)$ is the cosine similarity between the sentence $v_i$ and the set $S$ of remaining sentences other than $v_i$, $\mathrm{sim}(v_i, T)$ is the cosine similarity between the sentence $v_i$ and the title $T$, and $\mathrm{loc}(v_i)$ is the position value of the sentence $v_i$. Typically, $\mathrm{loc}(v_i)$ takes the value 1, 0.8, or 0.5 depending on the position of the sentence, the value 1 being assigned to the first sentence, and zero for all remaining sentences. Then, the several sentences with the highest scores under the dual criteria of relevance and diversity are selected to form the information abstract, that is, the objective function is:
In formula (14), λ is a weight coefficient and $\mathrm{sim}(v_i, R)$ is the cosine similarity between the sentence $v_i$ and the currently obtained abstract result set $R$.
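The scoring in formula (13) and a relevance-plus-diversity selection can be sketched in Python as follows. Formula (14) is not reproduced in the text, so the selection rule here is an assumed greedy form in the style of maximal marginal relevance: λ times the sentence score minus (1 - λ) times the similarity to the sentences already chosen. Sentence vectors and position values are taken as inputs, and the default weights and function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Average of a non-empty list of vectors (vector of a sentence subset)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sentence_scores(sent_vecs, title_vec, locs, alpha=0.5, beta=0.3, gamma=0.2):
    """Formula (13) for every sentence; locs[i] is the position value loc(v_i)."""
    scores = []
    for i, v in enumerate(sent_vecs):
        rest = [u for j, u in enumerate(sent_vecs) if j != i]
        sim_rest = cosine(v, mean_vector(rest)) if rest else 0.0
        scores.append(alpha * sim_rest + beta * cosine(v, title_vec) + gamma * locs[i])
    return scores


def select_summary(sent_vecs, scores, k, lam=0.7):
    """Greedy selection of k sentences (assumed objective, see lead-in).
    Returns the chosen sentence indices in document order."""
    chosen = []
    while len(chosen) < min(k, len(sent_vecs)):
        best, best_val = None, -math.inf
        for i, v in enumerate(sent_vecs):
            if i in chosen:
                continue
            if chosen:
                redundancy = cosine(v, mean_vector([sent_vecs[j] for j in chosen]))
            else:
                redundancy = 0.0
            val = lam * scores[i] - (1 - lam) * redundancy
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return sorted(chosen)
```

With λ equal to 1 the selection reduces to taking the top-scoring sentences; smaller values of λ penalize sentences that repeat what the abstract already contains.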
Consistent with common knowledge in the field, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the invention.
The positive effects of the invention are as follows:
The method groups massive volumes of public opinion information into topics and deeply mines the focal public opinion within each topic. Its lightweight model design includes a method for adjusting the classification probabilities of information comments under sample imbalance, which greatly improves the accuracy and F1 value of the model. Emotion words are expanded using the near-synonym function of the Tencent word vectors, which saves a great deal of labor cost. Multidimensional indexes are developed to track and display the propagation status and development trend of public opinion from all angles, and the method provides abstracts of long-form information together with clean recommendations of related information.
Drawings
FIG. 1 is a flowchart of topic classification according to a preferred embodiment of the invention.
FIG. 2 is a flowchart of theme and hotword extraction according to a preferred embodiment of the invention.
FIG. 3 is a schematic diagram of the emotion classification network for information comments according to a preferred embodiment of the invention.
FIG. 4 is a flowchart of topic tonality analysis according to a preferred embodiment of the invention.
FIG. 5 is a flowchart of the sound volume and life value calculation for topics and information according to a preferred embodiment of the invention.
FIG. 6 is a flowchart of associated information acquisition according to a preferred embodiment of the invention.
FIG. 7 is a flowchart of abstract extraction according to a preferred embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment provides an enterprise public opinion monitoring method based on deep learning, which comprises the following steps:
1. Topic classification:
Topics are classified quickly and accurately through a multi-level discrimination method. Specifically, as shown in FIG. 1, the title and content of each piece of information are obtained and the Jaccard similarity of the titles of any two pieces of information is calculated. If the Jaccard similarity is greater than a first set threshold, the two pieces of information are classified into the same topic; otherwise, the contents of the two pieces of information are Simhash-coded and the Hamming distance between the two Simhash codes is calculated. If the Hamming distance is greater than a second set threshold, the two pieces of information are classified into the same topic; otherwise they are treated as belonging to different topics. After classification, the title of the most recently published information in a topic is displayed as the title of the topic.
For example, let A1 and B1 denote the titles of two pieces of information A and B. Their Jaccard similarity is calculated as:
$$J(A_1, B_1) = \frac{|A_1 \cap B_1|}{|A_1 \cup B_1|} \tag{1}$$
In formula (1), $J(A_1, B_1)$ is the Jaccard similarity of the titles of information A and B, $|A_1 \cap B_1|$ is the number of words the two titles have in common, and $|A_1 \cup B_1|$ is the number of elements in the set of all words appearing in the two titles. If the Jaccard similarity is greater than a given threshold α, information A and B are considered to belong to the same topic; otherwise, the contents of A and B are Simhash-coded, the two codes are XORed, and the number of 1 bits in the binary result is counted to obtain the Hamming distance H(A, B). When H(A, B) is greater than the threshold β, information A and B are considered to belong to the same topic; otherwise they are considered to belong to different topics.
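The two measures used in this step can be sketched in Python as follows. This is a minimal sketch that assumes titles and contents are already segmented into word lists; the unweighted 64-bit Simhash and the use of MD5 as the per-word hash are assumptions, since the text does not specify them. Comparing the results with the thresholds α and β is left to the caller.

```python
import hashlib

BITS = 64  # fingerprint width (assumption)


def jaccard(words_a, words_b):
    """Formula (1): |A ∩ B| / |A ∪ B| over the two word sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def simhash(words):
    """Unweighted Simhash fingerprint of a word list."""
    counts = [0] * BITS
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(code_a, code_b):
    """XOR the two codes and count the 1 bits of the result."""
    return bin(code_a ^ code_b).count("1")
```

Identical contents yield a Hamming distance of 0, and the distance grows as the contents diverge.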
2. Theme and hotword extraction:
Step S1: using a predefined theme-hotword dictionary, search for the hotwords contained in all information under the same topic.
Step S2: obtain the set of themes related to that topic through the mapping between themes and hotwords.
Step S3: segment all information content under the same topic with the word segmentation tool jieba and remove stop words according to a predefined stop-word list, so that noise words do not affect keyword retrieval.
Step S4: from the segmented words with stop words removed, extract as keywords the words whose term frequency-inverse document frequency (TF-IDF) is greater than a threshold μ, where TF-IDF is calculated as:
$$\text{TF-IDF}(w_i) = \mathrm{TF}(w_i) \cdot \mathrm{IDF}(w_i) \tag{2}$$
In formula (2), $\mathrm{TF}(w_i)$ is the frequency of the word $w_i$ in the current sentence and $\mathrm{IDF}(w_i)$ is the inverse document frequency of $w_i$, with value $\log[(N+1)/(N(w_i)+1)]+1$, where $N$ is the number of sentences in the information and $N(w_i)$ is the number of sentences containing the word $w_i$.
Step S5: query the 200-dimensional Tencent word vectors, stored in the database in advance, that correspond to the keywords and the themes; if no word vector is found for an extracted keyword, generate a random 200-dimensional vector in its place.
Step S6: calculate the cosine similarity between each theme and every keyword:
$$\mathrm{sim}(T_i, w_j) = \frac{T_i \cdot w_j}{\lVert T_i \rVert \, \lVert w_j \rVert} \tag{3}$$
In formula (3), $T_i$ is the word vector of the i-th theme, $w_j$ is the word vector of the j-th keyword, $\cdot$ denotes the vector inner product, and $\lVert T_i \rVert$ and $\lVert w_j \rVert$ are the norms of the theme and keyword word vectors respectively; determine whether the similarity obtained from formula (3) is greater than a given threshold δ.
Step S7: if the similarity obtained from formula (3) is greater than the given threshold δ, add the keyword to the hotwords of the corresponding theme.
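Steps S4, S6, and S7 can be sketched in Python as follows. This is a minimal sketch that assumes the sentences are already segmented with stop words removed and that word vectors are supplied as plain lists; segmentation with jieba and the vector lookup of step S5 are outside its scope. Using the raw count as the term frequency, and the function names, are assumptions.

```python
import math
from collections import Counter


def sentence_tfidf(sentences):
    """Formula (2) per sentence. `sentences` is a list of token lists.
    Returns one dict per sentence mapping word -> TF-IDF value."""
    n = len(sentences)
    df = Counter()
    for tokens in sentences:
        df.update(set(tokens))  # number of sentences containing each word
    idf = {w: math.log((n + 1) / (df[w] + 1)) + 1 for w in df}
    result = []
    for tokens in sentences:
        tf = Counter(tokens)  # raw count within the sentence (assumption)
        result.append({w: tf[w] * idf[w] for w in tf})
    return result


def extract_keywords(sentences, mu):
    """Step S4: words whose TF-IDF exceeds mu in at least one sentence."""
    keywords = set()
    for scores in sentence_tfidf(sentences):
        keywords.update(w for w, s in scores.items() if s > mu)
    return keywords


def cosine(u, v):
    """Formula (3): inner product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign_hotwords(theme_vecs, keyword_vecs, delta):
    """Step S7: map each theme to the keywords with similarity above delta."""
    return {
        theme: [kw for kw, kv in keyword_vecs.items() if cosine(tv, kv) > delta]
        for theme, tv in theme_vecs.items()
    }
```

A word that appears in every sentence receives an IDF of exactly 1 under this formula, so the threshold μ mainly filters out such widely shared words.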
3. Information tonality analysis:
The tonality of information is mainly reflected in the emotional direction of its comments. A deep network model is trained on labeled information comment data; the network architecture mainly stacks an input layer, an embedding layer, a bidirectional GRU layer, a global pooling layer, and a Softmax layer. To increase robustness, a Dropout layer is added before the bidirectional GRU layer. In addition, the emotion classification probabilities output by the Softmax layer are adjusted to account for class imbalance in the comment sample data, which improves the accuracy and F1 value of the model.
As shown in FIG. 3, the backbone network for emotion classification of information comments consists mainly of an input layer (Input), an embedding layer (Embedding), a Dropout layer, a bidirectional GRU layer (BiGRU), a global pooling layer (GlobalMaxPooling), and a Softmax layer. The embedding layer uses a weight matrix W trained in advance with Word2vec, and these weights are not updated during training. To improve robustness, a Dropout layer (rate 0.2) is added before the input to the BiGRU (units: 200). The outputs of the two GRU directions are concatenated and globally pooled, and Softmax then yields the class probabilities of the comment sample for classification. To handle imbalance among comment sample classes, the class probabilities output by the Softmax layer are adjusted as follows:
In formula (4), prob is the class probability of the comment sample output by the Softmax layer, N is the total number of samples, n is the number of samples in the class, and m is an adjustable factor taking values such as 1, 2, or 3.
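The head of the network, that is, the concatenation of the two GRU directions, global max pooling, and Softmax, can be illustrated with a small pure-Python sketch. It is not the trained model: the GRU outputs are taken as given inputs, the dense weights are hypothetical, and the probability adjustment of formula (4) is not implemented.

```python
import math


def concat_bidirectional(forward_seq, backward_seq):
    """Concatenate forward and backward GRU outputs at each time step."""
    return [f + b for f, b in zip(forward_seq, backward_seq)]


def global_max_pool(seq):
    """Take the maximum over time for every feature dimension."""
    return [max(column) for column in zip(*seq)]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(forward_seq, backward_seq, weights, bias):
    """Pooled features -> dense layer -> class probabilities.
    `weights` holds one row per class; `bias` holds one value per class."""
    pooled = global_max_pool(concat_bidirectional(forward_seq, backward_seq))
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

Global max pooling keeps, for each feature, the strongest response anywhere in the comment, so the classifier input has a fixed size regardless of comment length.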
4. Topic tonality analysis:
For each piece of information under the same topic, keywords are extracted and those with an emotional tendency are screened out; these are expanded with near-synonyms to form the emotion words of that piece of information, a weighted sum of the emotion words is taken to judge its emotional tendency, and finally the proportions of positive and negative information under the same topic are analyzed.
FIG. 4 shows the main steps of topic tonality analysis, which is essentially a statistical analysis of the emotion polarities of the information under the same topic. First, keywords are extracted from each piece of information under the topic, those with an emotional tendency are screened out, and the emotion words are expanded through the near-synonym function of the Tencent word vectors. To judge the polarity of the information more accurately, the title and the content of the information are analyzed in a tiered manner. A weighted sum is first taken over the emotion words in the title and the polarity is judged; if the title has a definite emotional tendency, the title polarity is used directly as the emotional tendency of the information. If the title polarity is neutral, a weighted sum is taken over the emotion words in the body text to obtain the information emotion value Senti, which serves as the basis for judging the emotional polarity of the information:
Finally, the proportions of positive and negative information are analyzed according to the emotion polarity results SENTIMENT of the information under the same topic.
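The title-first judgment and the final proportion analysis can be sketched in Python as follows. The mapping from the emotion value to a polarity is not reproduced in the text, so the sign-based rule used here (positive above zero, negative below zero, neutral at zero) is an assumption, as are the lexicon format and the function names.

```python
def weighted_sentiment(words, lexicon):
    """Sum of the lexicon weights of the emotion words found in `words`.
    `lexicon` maps an emotion word to a signed weight."""
    return sum(lexicon.get(w, 0.0) for w in words)


def polarity(score):
    """Assumed rule: the sign of the weighted sum decides the polarity."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def info_polarity(title_words, content_words, lexicon):
    """Use the title polarity if it is definite; otherwise fall back to
    the weighted sum over the body text."""
    title_result = polarity(weighted_sentiment(title_words, lexicon))
    if title_result != "neutral":
        return title_result
    return polarity(weighted_sentiment(content_words, lexicon))


def polarity_shares(items, lexicon):
    """Share of each polarity among the (title_words, content_words) pairs
    of one topic."""
    results = [info_polarity(t, c, lexicon) for t, c in items]
    total = len(results)
    if total == 0:
        return {}
    return {label: results.count(label) / total
            for label in ("positive", "negative", "neutral")}
```

Checking the title first means a clearly worded headline settles the polarity without the body text being scored at all.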
5. Sound volume and life value analysis of topics and information:
The propagation status of public opinion and the development trend thereof play a vital role in providing corresponding strategic guidelines for enterprises, and indexes such as sound volume, vital value and the like of information and corresponding topics are directly reflected by the propagation status and development trend of public opinion. When counting information and sound volume of corresponding topics, dimensions such as forwarding volume, comment volume (follow-up volume), praise volume, netizen participation, information release user quality, exposure degree, heat degree and the like are considered, and real-time statistics is respectively carried out according to the units of hours and days, so that the aim of carrying out omnibearing tracking on public opinion propagation conditions is achieved. In addition, the change of the life value reflects the trend of public opinion development, and the index is derived from the heat value in sound volume.
As shown in Fig. 5, step S1: the forwarding count, comment count and praise count of the information are tracked every hour, and the exposure, heat, life value, netizen participation degree and publishing-user quality of the information are calculated. The information exposure exposure_value is defined as the weighted sum of the forwarding count forward_times, the comment count reply_times and the praise count positive_times:
exposure_value = α*forward_times + β*reply_times + γ*positive_times (6)
α, β and γ in formula (6) are weight coefficients;
the heat value of the information is defined as the ratio of the exposure to the time interval:
date_now in formula (7) is the current date, public_time is the information release date;
the netizen participation degree participation_level of the information is defined as:
θ in formula (8) is a constant;
in addition, the quality user_quality of the user who published the information is defined as:
in formula (9), the numerator represents the praise count of the comments on the information; the life value life_value of the information is defined as:
θ in the formula (10) is the same as that in the formula (8).
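Formula (6) and the verbal definition of the heat value (exposure divided by the time elapsed since release) can be sketched as below. The weight values, the choice of hours as the time unit, and the `min_hours` floor that avoids division by zero are assumptions; formulas (8) to (10) are not reproduced in the text and are therefore left out.

```python
from datetime import datetime

def exposure_value(forward_times, reply_times, positive_times,
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Formula (6): weighted sum of forwarding, comment and praise counts."""
    return alpha * forward_times + beta * reply_times + gamma * positive_times

def heat_value(exposure, date_now, public_time, min_hours=1.0):
    """Exposure per elapsed hour since release (unit and floor are assumptions)."""
    hours = (date_now - public_time).total_seconds() / 3600.0
    return exposure / max(hours, min_hours)

# Hypothetical counts for one piece of information, ten hours after release.
exposure = exposure_value(10, 20, 30, alpha=0.5, beta=0.3, gamma=0.2)
heat = heat_value(exposure,
                  date_now=datetime(2020, 8, 5, 12),
                  public_time=datetime(2020, 8, 5, 2))
```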
Step S2: after the sound volume and life value of the information have been computed in step S1, the sound volume and life value of the topic are calculated. The exposure of a topic is defined as the sum of the exposures of all information under the topic. The participation degree of a topic is defined as in formula (8), except that exposure_value denotes the exposure of the topic. The user quality and the heat of a topic are the averages of the publishing-user quality and the heat of all information under the topic, respectively. The life value of a topic is defined as in formula (10), except that heat_value denotes the heat of the topic.
Step S3: after the hourly statistics of the sound volume and life value of the information for the current day are completed in step S1, the result for the last hour of the day is extracted as the daily sound volume and life value of the information.
Step S4: after the statistics of the sound volume and life value of the topic for the 24 hours of the current day are completed in step S2, the data at 23:00 of that day is extracted as the daily sound volume and life value index of the topic.
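Steps S2 to S4 amount to an aggregation over the information under a topic followed by picking the last hourly record of the day. A minimal sketch, assuming each piece of information is a dict with the hypothetical keys shown and that hourly results are stored in a dict keyed by hour (0 to 23); topic participation and life value are omitted because formulas (8) and (10) are not reproduced in the text.

```python
from statistics import mean

def topic_metrics(infos):
    """Aggregate per-information metrics into topic-level metrics (step S2)."""
    return {
        # Topic exposure is the sum over all information under the topic.
        "exposure_value": sum(i["exposure_value"] for i in infos),
        # Topic heat and user quality are averages over the information.
        "heat_value": mean(i["heat_value"] for i in infos),
        "user_quality": mean(i["user_quality"] for i in infos),
    }

def daily_snapshot(hourly):
    """Return the record of the latest hour, i.e. 23:00 for a complete day (steps S3, S4)."""
    return hourly[max(hourly)]

# Hypothetical data for one topic with two pieces of information.
infos = [
    {"exposure_value": 10, "heat_value": 2.0, "user_quality": 0.4},
    {"exposure_value": 30, "heat_value": 4.0, "user_quality": 0.8},
]
topic = topic_metrics(infos)
day = daily_snapshot({0: {"heat_value": 5.0}, 12: {"heat_value": 3.0}, 23: {"heat_value": 1.0}})
```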
6. Associated information recommendation:
The outbreak of an event is often accompanied by the release of many pieces of information. So that a user viewing the detail page of the current information can conveniently jump to information with a higher association strength under the same topic, the associated information within the common topic needs to be recommended. Measuring the association strength between pieces of information is critical. Considering that some information is copied directly with only a small part modified, the association strength is measured by calculating the similarity between pieces of information, and two thresholds are set for segmentation: information whose similarity lies between the two thresholds is selected as candidate associated information under the same topic. The similarity between a candidate and the pieces of information already in the associated information set is then calculated, and whether the candidate can be added to the set is determined by judging whether the minimum of the calculated similarities is smaller than the larger of the two given thresholds.
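The two-threshold selection described above can be sketched with an arbitrary similarity function. The word-set Jaccard function used here is only a stand-in for illustration (the method itself uses BM25), and accepting the first candidate when the associated set is still empty is an assumption.

```python
def jaccard(a, b):
    """Toy similarity on whitespace-separated words (stand-in for illustration)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def select_associated(target, others, sim, low, high):
    """Keep candidates whose similarity to the target lies strictly between low and high."""
    associated = []
    for cand in others:
        if not (low < sim(target, cand) < high):
            continue  # too dissimilar, or a near copy of the target
        # Add the candidate when its minimum similarity to the current
        # members is below the larger threshold (empty set: accept).
        if not associated or min(sim(cand, a) for a in associated) < high:
            associated.append(cand)
    return associated

target = "acme recalls faulty batteries"
others = [
    "acme recalls faulty batteries",       # verbatim copy, similarity 1.0
    "acme recalls batteries after fires",  # related report, similarity 0.5
    "weather forecast for monday",         # unrelated, similarity 0.0
]
result = select_associated(target, others, jaccard, low=0.2, high=0.9)
```

The upper threshold filters out copies that would add nothing to the recommendation list, while the lower threshold filters out unrelated information.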
As shown in Fig. 6, step S1: the information documents containing the entity (enterprise name) under the same topic are obtained from the database.
Step S2: word segmentation and stop-word removal are performed on each piece of information in the selected set that contains the enterprise name under the same topic.
Step S3: each segmented piece of information in the set is traversed, and its BM25 similarity with every remaining piece of information in the set is calculated, where the BM25 similarity is calculated as follows:
M in formula (11) is the number of words in the target information Q after word segmentation, w_i is the IDF value of the i-th word, and R(Q_i, D) is the relevance between each word Q_i in Q and a remaining segmented piece of information D, which is defined as:
In formula (12), k_1, k_2 and b are adjustment factors, f_i is the frequency of the word Q_i in D, qf_i is the frequency of the word Q_i in Q, dl is the length of D (i.e. its number of words after word segmentation), and avgdl is the average length, after word segmentation, of all the information screened in step S1.
Step S4: whether the BM25 similarity between the target information Q and the candidate information D lies within the threshold interval (α, β) is judged, and if so, the information D is added to the associated information set corresponding to the target information.
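Steps S3 and S4 can be illustrated with the textbook form of BM25 that uses the same symbols (k_1, k_2, b, f_i, qf_i, dl, avgdl). Formulas (11) and (12) are not reproduced in the text, so this sketch assumes the standard definitions; the IDF variant, the default parameter values, and the iteration over distinct query terms are likewise assumptions.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, k2=100.0, b=0.75):
    """Textbook BM25 between a segmented query and a segmented document."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    doc_tf = Counter(doc)
    query_tf = Counter(query)
    big_k = k1 * (1 - b + b * len(doc) / avgdl)  # length normalisation term
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # a term absent from the document contributes nothing
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # assumed IDF variant
        relevance = (f * (k1 + 1) / (f + big_k)) * (qf * (k2 + 1) / (qf + k2))
        score += idf * relevance
    return score

def associated_set(target, candidates, corpus, low, high):
    """Step S4: keep candidates whose score lies in the open interval (low, high)."""
    return [d for d in candidates if low < bm25_score(target, d, corpus) < high]

# Hypothetical segmented information under one topic.
corpus = [
    ["acme", "recall", "battery"],
    ["acme", "profit", "rise"],
    ["weather", "rain"],
]
related = associated_set(corpus[0], corpus[1:], corpus, low=0.0, high=10.0)
```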
7. Extracting the information abstract:
The lowest-level page of the mobile enterprise public opinion monitoring system is the information detail page, at the top of which a summary is generated. This module evaluates the importance of all sentences in the information; the evaluated dimensions include the position of the sentence, the sentence length and how well the sentence summarizes the full text. The values of these dimensions are quantized, weighted and fused, and the several sentences with the highest evaluation values are selected to form an automatic summary. Sentence scoring is the key step. If only the similarity of each sentence to the title and to the full text is used as the score, the selected sentences tend to be highly similar to one another and the diversity is poor. Therefore the similarity between the current sentence and the already selected sentences must also be considered, which is the idea of the maximal marginal relevance (MMR) algorithm. The finally selected sentence set is the one ranked highest under the combined relevance and diversity criteria.
As shown in Fig. 7, step S1: the information content in the topic is cleaned and filtered by preset rules.
Step S2: the information text is split into a sentence set at periods, question marks and exclamation marks, and word segmentation and stop-word removal are performed on each sentence.
Step S3: a deep network is built to train word vectors, and the vectors of the words in a sentence are summed with weights to form the sentence vector.
Step S4: the cosine similarity of each sentence in the sentence set with the information title, with the subset of remaining sentences and with the current summary result set is calculated, where the vector of a sentence subset is obtained by summing the vectors of all its sentences and taking the average.
Step S5: the score of each sentence v_i is first calculated:
score(v_i) = α*sim(v_i, S) + β*sim(v_i, T) + γ*loc(v_i) (13)
In formula (13), α, β and γ are weight coefficients, sim(v_i, S) is the cosine similarity between the sentence v_i and the set S of the remaining sentences other than v_i, sim(v_i, T) is the cosine similarity between the sentence v_i and the title T, and loc(v_i) is the position value of the sentence v_i: usually, if the sentence is the first sentence of the first paragraph, the value is 1; if the sentence is the first sentence of the second paragraph, the value is 0.8; if the sentence is the first sentence of the second paragraph, the value is 0.5; and the value is zero in all other cases. Second, the several sentences with the highest scores under the dual criteria of relevance and diversity are selected to form the information summary, that is, the objective function is:
In formula (14), λ is a weight coefficient, and sim(v_i, R) is the cosine similarity between the sentence v_i and the currently obtained summary result set R.
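Steps S4 and S5 can be sketched in plain Python as follows. Formula (13) is implemented as written; formula (14) is not reproduced in the text, so the greedy objective λ*score(v_i) − (1 − λ)*sim(v_i, R) used here is the usual maximal-marginal-relevance form and is an assumption, as are the default weights. Position values are passed in as `loc` rather than derived from the paragraph rules.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_vector(vectors):
    """Vector of a sentence subset: element-wise average of its sentence vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sentence_scores(vectors, title_vec, loc, alpha=0.4, beta=0.4, gamma=0.2):
    """Formula (13): relevance to the other sentences, to the title, plus position."""
    scores = []
    for i, v in enumerate(vectors):
        rest = vectors[:i] + vectors[i + 1:]
        sim_s = cosine(v, mean_vector(rest)) if rest else 0.0
        scores.append(alpha * sim_s + beta * cosine(v, title_vec) + gamma * loc[i])
    return scores

def mmr_select(vectors, scores, k, lam=0.7):
    """Greedy selection trading relevance against similarity to the summary so far."""
    selected, remaining = [], list(range(len(vectors)))
    while remaining and len(selected) < k:
        def objective(i):
            if not selected:
                return lam * scores[i]
            summary_vec = mean_vector([vectors[j] for j in selected])
            return lam * scores[i] - (1 - lam) * cosine(vectors[i], summary_vec)
        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical 2-d sentence vectors: sentences 0 and 1 are near duplicates.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = sentence_scores(vectors, title_vec=[1.0, 0.0], loc=[1.0, 0.0, 0.0])
diverse = mmr_select(vectors, [0.9, 0.85, 0.5], k=2, lam=0.5)
greedy = mmr_select(vectors, [0.9, 0.85, 0.5], k=2, lam=1.0)
```

With `lam=1.0` the diversity term vanishes and the two duplicate sentences are both chosen; lowering `lam` replaces the duplicate with the dissimilar sentence.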
The invention first classifies the collected public opinion data into topics based on the Jaccard similarity and Hamming distance between pieces of information. Second, it provides a method based on a bidirectional GRU, global pooling and probability adjustment for judging whether information comments are positive or negative, and infers the sentiment tendency of the information within a topic by a calculation that assigns weights to keywords. In addition, so that customers can quickly understand the information content, the invention provides an information summary extraction method based on the maximal marginal relevance algorithm and rules.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (6)

1. The enterprise public opinion monitoring method based on deep learning is characterized by comprising the following steps of:
Topic classification: acquiring the titles and the contents of the information, calculating the Jaccard similarity of the titles of any two pieces of information, classifying the two pieces of information into the same topic if the Jaccard similarity is larger than a first set threshold, otherwise, performing Simhash coding on the contents of the two pieces of information, calculating the Hamming distance of the two Simhash codes, classifying the two pieces of information into the same topic if the Hamming distance is larger than a second set threshold, otherwise, considering the information as not being the same topic information, and displaying the title of the information newly issued in the same topic after classification as the title of the topic;
Theme and hotword extraction: according to a predefined topic-hotword dictionary, hotwords contained in all information under the same topic are searched, and a topic set related to the same topic is obtained through a mapping relation between the topic and the hotwords; extracting keywords in all information contents under the same topic as a standby word stock of the topic hotwords, calling the Tencent word vectors, and classifying the keywords with similarity larger than a third set threshold value into the corresponding topic hotwords by calculating the similarity of the topic and the keywords, so as to construct the corresponding relation between the keywords and the topic;
Information tonality analysis: training a deep network model according to the marked information comment data, wherein the network model framework comprises an input layer, an embedded layer, a bidirectional GRU layer, a global pooling layer and a Softmax layer;
Topic tonality analysis: extracting and screening keywords with emotion tendencies for each piece of information under the same topic, performing similar meaning word expansion on the keywords to serve as emotion words of the piece of information, performing weighted summation on the emotion words of the piece of information to judge the emotion tendencies of the piece of information, and finally analyzing the proportion of positive and negative information under the same topic;
Sound volume and life value analysis of topics and information: when counting information and sound volume of corresponding topics, dimensions such as forwarding volume, comment volume, praise volume, netizen participation, information release user quality, exposure, heat and the like are considered, real-time statistics is respectively carried out according to the units of hours and days, and therefore the aim of carrying out omnibearing tracking on public opinion propagation conditions is achieved; the change of the life value reflects the trend of public opinion development, and the life value is derived by the heat value in sound volume;
Associated information recommendation: setting two thresholds for segmentation, selecting information with similarity between the two thresholds as alternative associated information under the same topic, calculating the similarity between the alternative associated information and a plurality of pieces of information in an associated information set, and judging whether the minimum value in the calculated similarity is smaller than the larger threshold value in the given two thresholds to determine whether the alternative associated information can be added into the associated information set;
Extracting the information abstract: carrying out importance evaluation on all sentences in the information, wherein the evaluated dimensions comprise position information of the sentences, sentence length and summarization capability of the sentences on the whole text, quantizing and weighting fusion are carried out on the numerical values of the dimensions to obtain a plurality of sentences with highest evaluation values, and forming an automatic abstract;
The steps of analyzing the sound volume and the life value of topics and information comprise:
step S1: tracking information forwarding quantity, comment quantity and praise quantity of each hour, and calculating exposure degree, heat degree, life value, netizen participation degree and information release user quality of the information;
The information exposure exposure_value is defined as the weighted sum of the forwarding count forward_times, the comment count reply_times and the praise count positive_times:
exposure_value = α*forward_times + β*reply_times + γ*positive_times (6)
α, β and γ in the formula (6) are weight coefficients;
the heat value of the information is defined as the ratio of the exposure to the time interval:
date_now in formula (7) is the current date, public_time is the information release date;
the participation degree participation_level of the netizen in the information is defined as:
θ in the formula (8) is a constant, and e is a natural constant;
the quality user_quality of the information distribution user is defined as:
In the formula (9), the numerator represents the praise amount of the information comment;
the life value life_value of the information is defined as:
θ in formula (10) is the same as formula (8);
Step S2: after the statistics of the sound volume and the life value of the information are completed in step S1, the sound volume and the life value of the topic are calculated; the exposure of the topic is defined as the sum of the exposures of all the information under the topic; the participation degree of the topic is defined as in formula (8), except that exposure_value represents the exposure of the topic; the user quality and the heat of the topic are the average values of the publishing-user quality and the heat of all the information under the topic, respectively; the life value of the topic is defined as in formula (10), except that heat_value represents the heat of the topic;
step S3: after the hourly statistics of the sound volume and the life value of the information for the current day are completed in step S1, the result data of the last hour of the day is extracted as the daily sound volume and life value of the information;
Step S4: after the statistics of the sound volume and the life value of the topic for the 24 hours of the current day are completed in step S2, the data at 23:00 of that day is extracted as the daily sound volume and life value index result of the topic.
2. The deep learning-based enterprise public opinion monitoring method of claim 1, wherein the step of topic and hotword extraction comprises:
Step S1: searching hot words contained in all information under the same topic by utilizing a predefined topic-hot word dictionary;
step S2: obtaining a theme set related to the same topic through the mapping relation between the theme and the hotword;
Step S3: according to the word segmentation tool jieba, segmenting all information contents under the same topic and removing stop words according to a predefined stop word bank, so that the influence of certain noise words on the retrieval of key words is avoided;
Step S4: extracting, from the segmented words with stop words removed, the words whose term frequency-inverse document frequency TF-IDF is greater than a threshold μ as keywords, wherein the TF-IDF is calculated as follows:
TF-IDF(w_i) = TF(w_i) * IDF(w_i) (2)
TF(w_i) in the formula (2) represents the word frequency of the word w_i in the current sentence, IDF(w_i) represents the inverse document frequency of the word w_i, whose value is log[(N+1)/(N(w_i)+1)]+1, N is the number of sentences in the information, and N(w_i) is the number of sentences containing the word w_i;
Step S5: querying the 200-dimensional Tencent word vectors, stored in a database in advance, that correspond to the keywords and themes, and if no corresponding word vector is found for an extracted keyword, randomly generating a 200-dimensional vector to replace the word vector;
step S6: and calculating cosine similarity of each theme and all keywords:
T_i in the formula (3) represents the word vector of the i-th theme, w_j represents the word vector of the j-th keyword, T_i · w_j denotes the inner product of the two vectors, |T_i| and |w_j| denote the moduli of the theme and keyword word vectors respectively, and whether the similarity obtained by the formula (3) is larger than a given threshold δ is judged;
Step S7: if the similarity obtained by the formula (3) is larger than the given threshold δ, adding the keyword into the hotwords of the corresponding theme.
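Steps S4 to S7 of this claim can be illustrated with a short sketch. The IDF expression follows the value stated for formula (2); treating TF as the within-sentence relative frequency, using the natural logarithm, and the toy 2-dimensional vectors (in place of 200-dimensional word vectors) are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(sentences, mu):
    """Words whose TF-IDF exceeds mu in at least one segmented sentence."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))  # sentences containing w
    keywords = set()
    for s in sentences:
        for w, c in Counter(s).items():
            tf = c / len(s)                              # assumed: relative frequency
            idf = math.log((n + 1) / (df[w] + 1)) + 1    # value stated for formula (2)
            if tf * idf > mu:
                keywords.add(w)
    return keywords

def cosine(u, v):
    """Inner product divided by the product of the vector moduli."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_keywords(theme_vecs, keyword_vecs, delta):
    """Attach each keyword to every theme whose cosine similarity exceeds delta."""
    return {
        theme: [k for k, kv in keyword_vecs.items() if cosine(tv, kv) > delta]
        for theme, tv in theme_vecs.items()
    }

# Hypothetical segmented sentences and toy word vectors.
sentences = [["acme", "recall", "battery"], ["acme", "profit"]]
keywords = tfidf_keywords(sentences, mu=0.6)
hotwords = assign_keywords({"finance": [1.0, 0.0]},
                           {"profit": [0.9, 0.1], "battery": [0.0, 1.0]},
                           delta=0.8)
```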
3. The deep learning-based enterprise public opinion monitoring method of claim 1, wherein a Dropout layer is added before the bi-directional GRU layer in the step of information tonality analysis.
4. The method for monitoring public opinion of enterprise based on deep learning according to claim 1, wherein in the step of topic tonality analysis, the title and the content of the information are analyzed according to gradient, weighted summation is performed on emotion words related to the title of the information, polarity judgment is further performed, if the title of the information has fixed emotion tendency, the title polarity result is directly used as emotion tendency of the information, if the title polarity result of the information is neutral, weighted summation is performed on emotion words of the text content of the information to obtain an information emotion value Senti, and the result is used as judgment basis of information emotion polarity:
Finally, the proportions of positive and negative information under the same topic are analyzed according to the information emotion polarity result SENTIMENT.
5. The deep learning-based enterprise public opinion monitoring method of claim 1, wherein the step of associating information recommendations comprises:
step S1: acquiring information documents containing entity enterprise names under the same topic from a database;
step S2: respectively performing word segmentation and de-stop word processing on the information set selected to contain the enterprise name in the same topic;
step S3: traversing each piece of information after word segmentation processing in the information set, and respectively performing BM25 similarity calculation with all pieces of information remaining in the information set, wherein a BM25 similarity calculation formula is as follows:
M in the formula (11) is the number of words in the target information Q after word segmentation, w_i is the IDF value of the i-th word, and R(Q_i, D) is the relevance between each word Q_i in Q and the rest of the information D after word segmentation, which is defined as:
In the formula (12), k_1, k_2 and b are adjusting factors, f_i is the frequency of occurrence of the word Q_i in D, qf_i is the frequency of occurrence of the word Q_i in Q, dl is the length of D, and avgdl is the average length, after word segmentation, of all the information screened in the step S1;
step S4: judging whether the BM25 similarity between the target information Q and the candidate information D is within the threshold interval (alpha, beta), and if the condition is satisfied, adding the information D to the associated information set corresponding to the target information.
6. The deep learning-based enterprise public opinion monitoring method of claim 1, wherein the information summary extraction step comprises:
step S1: cleaning and filtering information content in topics through a set rule;
step S2: dividing the information text into sentence sets according to periods, question marks and exclamation marks, and performing word segmentation and stop-word removal on each sentence;
step S3: constructing deep network training word vectors, and carrying out weighted summation on vectors corresponding to words in sentences to form sentence vectors;
step S4: calculating cosine similarity of each sentence in the sentence set, the information title, the remaining sentence subset and the current abstract result set, wherein the vector of the sentence subset is obtained by summing the vectors of all sentences and solving the average value;
Step S5: the score of each sentence v_i is first calculated:
score(v_i) = α*sim(v_i, S) + β*sim(v_i, T) + γ*loc(v_i) (13)
α, β and γ in the formula (13) are weight coefficients, sim(v_i, S) is the cosine similarity of the sentence v_i and the rest of the sentence set S except the sentence v_i, sim(v_i, T) is the cosine similarity of the sentence v_i and the title T, and loc(v_i) is the position information value of the sentence v_i; usually, if the sentence is the first sentence, the value is 1, if the sentence is the first sentence, the value is 0.8, if the sentence is the first sentence, the value is 0.5, and the values of the rest are zero; then, a plurality of sentences with highest scores under the dual evaluation standards of relevance and diversity are selected to form an information abstract, namely, the objective function is:
λ in the formula (14) is a weight coefficient, and sim(v_i, R) is the cosine similarity between the sentence v_i and the currently obtained summary result set R.
CN202010784664.7A 2020-08-05 2020-08-05 Enterprise public opinion monitoring method based on deep learning Active CN112035658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010784664.7A CN112035658B (en) 2020-08-05 2020-08-05 Enterprise public opinion monitoring method based on deep learning

Publications (2)

Publication Number Publication Date
CN112035658A CN112035658A (en) 2020-12-04
CN112035658B true CN112035658B (en) 2024-04-30

Family

ID=73582701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010784664.7A Active CN112035658B (en) 2020-08-05 2020-08-05 Enterprise public opinion monitoring method based on deep learning

Country Status (1)

Country Link
CN (1) CN112035658B (en)

Also Published As

Publication number Publication date
CN112035658A (en) 2020-12-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant