CN113051455A - Water affair public opinion identification method based on network text data - Google Patents

Publication number
CN113051455A
Authority
CN
China
Prior art keywords
topic
network text
text data
water
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110346900.1A
Other languages
Chinese (zh)
Other versions
CN113051455B (en)
Inventor
朱波
穆利
姜元春
吴铭
钱洋
王亚琦
熊迎秋
郝瀚
丁磊
隆云飞
阚道升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ustc Sinovate Software Co ltd
Hefei Water Group Co ltd
Hefei University of Technology
Original Assignee
Ustc Sinovate Software Co ltd
Hefei Water Group Co ltd
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ustc Sinovate Software Co ltd, Hefei Water Group Co ltd, Hefei University of Technology
Priority to CN202110346900.1A
Publication of CN113051455A
Application granted
Publication of CN113051455B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a water affair public opinion identification method based on network text data, comprising the following steps: 1. acquiring network text data related to water affairs; 2. preprocessing the network text data related to water affairs; 3. analyzing the network text data related to water affairs and finding the points of concern of water affair public opinion. The method determines the webpage search strategy according to the type of website, rapidly acquires text data related to water affairs from massive network data, and finds the points of concern of water affair public opinion through topic analysis. It thereby identifies water affair public opinion, improves the efficiency and accuracy of the identification, and produces results that are readily interpretable.

Description

Water affair public opinion identification method based on network text data
Technical Field
The invention relates to the technical field of data mining, in particular to a water affair public opinion identification method based on network text data.
Background
With the rapid development of the internet and the continual change in people's way of life, network data related to every industry has grown explosively. Most of this data is generated by people and reflects the direction of public opinion, so enterprises attach great importance to it. In the water affairs industry, water affair public opinion can likewise be identified by acquiring water-affair-related network texts from massive network data and finding the points of concern of public opinion in them.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a water affair public opinion identification method based on network text data, so as to solve the difficulty of identifying water affair public opinion: text data related to water affairs can be acquired quickly from massive network data and the points of concern in water affair network texts can be analyzed accurately, which improves the efficiency and accuracy of water affair public opinion identification.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a water affair public opinion identification method based on network text data, which is characterized by comprising the following steps:
step 1, acquiring network text data related to water affairs;
step 1.1, adopting different webpage searching strategies to collect the network text data of the target webpage according to the type of the website:
if the website type is the official website of the water supply group, adopting a breadth-first strategy;
if the website type is a government portal website, adopting a depth-first strategy;
if the website type is a network community or forum website, acquiring the type of the user publishing topics, comments or messages related to water affairs according to the network text data of the target webpage, and determining the webpage searching strategy accordingly;
if the type of the user is an official user, adopting a depth-first strategy;
if the type of the user is a personal user, adopting a depth-first strategy and a breadth-first strategy;
step 1.2, acquiring the grades of all participating users of a first topic to which the network text data belongs according to the network text data of the target webpage;
if the level of the participating user belonging to the first topic meets the preset level requirement, collecting all network text data published by the corresponding participating user under the belonging first topic;
step 1.3, acquiring topic participation times of all participating users of a second topic to which the network text data belongs according to the network text data of the target webpage;
if the topic participation frequency of the participating user belonging to the second topic exceeds a preset participation frequency threshold, acquiring network text data issued by the corresponding participating user in the life cycle of the belonging second topic;
step 1.4, acquiring all participating users and grades thereof of a third topic to which the published network text data belongs according to the network text data published by the participating users in the life cycle of the second topic to which the participating users belong;
if the grade of the participating user belonging to the third topic meets the preset grade requirement, collecting all network text data published by the corresponding participating user under the third topic;
step 2, preprocessing the network text data related to the water affairs;
step 2.1, performing word segmentation processing on the network text data related to the water affairs so as to convert the text into word vectors;
step 2.2, constructing a stop word list for the network text data related to the water affairs, and performing stop word removal on the word vector to obtain the word vector without stop words;
step 3, analyzing the network text data related to the water affairs and finding out the public opinion of the water affairs;
step 3.1, constructing a corpus from the preprocessed network text data; assuming the corpus contains M water affair network texts, all word vectors and the corresponding topics in the corpus are expressed as
$$\vec{w} = (\vec{w}_1, \ldots, \vec{w}_M)$$
$$\vec{z} = (\vec{z}_1, \ldots, \vec{z}_M)$$
where $\vec{w}_m$ represents the word vector of the m-th water affair network text, and $\vec{z}_m$ represents the topic numbers corresponding to the words in $\vec{w}_m$;
step 3.2, calculating the topic generation probability of the water affair network texts in the corpus:
step 3.2.1, obtaining the topic generation probability $p(\vec{z}_m \mid \vec{\alpha})$ of the m-th water affair network text by using formula (1):
$$p(\vec{z}_m \mid \vec{\alpha}) = \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{1}$$
In formula (1), $\vec{n}_m = (n_m^{(1)}, \ldots, n_m^{(K)})$ is the vector of word counts of the m-th water affair network text tallied by topic, $n_m^{(k)}$ is the number of words generated by the k-th topic in the m-th water affair network text, $\vec{\alpha}$ is a hyperparameter, and $\Delta(\cdot)$ is a normalization function;
step 3.2.2, obtaining the topic generation probability $p(\vec{z} \mid \vec{\alpha})$ of the water affair network texts in the corpus by using formula (2):
$$p(\vec{z} \mid \vec{\alpha}) = \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{2}$$
step 3.3, calculating the word generation probability of the water affair network texts in the corpus:
step 3.3.1, obtaining the probability $p(\vec{w}_{(k)} \mid \vec{\beta})$ of the words generated by the k-th topic by using formula (3):
$$p(\vec{w}_{(k)} \mid \vec{\beta}) = \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \tag{3}$$
In formula (3), $\vec{w}_{(k)}$ is the vector of words generated by the k-th topic, $\vec{n}_k = (n_k^{(1)}, \ldots, n_k^{(V)})$ is the vector of counts of the words generated by the k-th topic, $n_k^{(t)}$ is the number of occurrences of the t-th word generated by the k-th topic, and $\vec{\beta}$ is a hyperparameter;
step 3.3.2, obtaining the word generation probability $p(\vec{w} \mid \vec{z}, \vec{\beta})$ of the water affair network texts in the corpus by using formula (4):
$$p(\vec{w} \mid \vec{z}, \vec{\beta}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \tag{4}$$
step 3.4, calculating the joint probability $p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta})$ of generating the water affair network texts in the corpus by using formula (5):
$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{5}$$
step 3.5, updating the topic of each word in the corpus by using formula (6):
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left(n_{m,\neg i}^{(k')} + \alpha_{k'}\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left(n_{k,\neg i}^{(t')} + \beta_{t'}\right)} \tag{6}$$
In formula (6), $z_i$ is the topic corresponding to the i-th word, k is the topic number, $\vec{z}_{\neg i}$ is the topics of the remaining words after the i-th word is excluded, $\vec{w}$ is the word vector, $n_{m,\neg i}^{(k)}$ is the number of words corresponding to the k-th topic in the m-th water affair network text after the i-th word is excluded, $\alpha_k$ is the k-th dimension of the hyperparameter $\vec{\alpha}$, $\beta_t$ is the t-th dimension of the hyperparameter $\vec{\beta}$, $n_{k,\neg i}^{(t)}$ is the number of occurrences of the t-th word generated by the k-th topic after the i-th word is excluded, and V is the number of distinct words (vocabulary size) of the whole water affair network text corpus;
step 3.6, calculating the word distribution $\varphi_{k,t}$ of the t-th word of the k-th topic by using formula (7):
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left(n_k^{(t')} + \beta_{t'}\right)} \tag{7}$$
step 3.7, calculating the distribution $\theta_{m,k}$ of the k-th topic in the m-th water affair network text by using formula (8):
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left(n_m^{(k')} + \alpha_{k'}\right)} \tag{8}$$
step 3.8, according to the word distribution $\vec{\varphi}_k$ under the k-th topic, selecting the top N words of the current k-th topic as the keywords of the k-th topic, and describing and interpreting the k-th topic in terms of its actual meaning for water affair public opinion according to the semantics of the keywords, so that the points of concern of public opinion and mainstream media regarding water affairs are found and the water affair public opinion is identified.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention uses water affair network text data together with topic analysis to find the points of concern of water affair public opinion, thereby identifying water affair public opinion and improving the efficiency and accuracy of the identification.
2. The invention determines the webpage search strategy according to the type of website and uses that strategy to collect the water affair network text data of the target webpage, so that text data related to water affairs can be acquired quickly from massive network data and the acquisition efficiency is improved.
3. The invention analyzes text data with LDA topic modeling, which is suitable for large-scale text data sets; by analyzing a large amount of water affair network text data it finds the points of concern of water affair public opinion and completes the identification, and the results are readily interpretable.
Drawings
FIG. 1 is a flow chart of a water service web text data acquisition method of the present invention;
FIG. 2 is a schematic diagram of a structure of a web page network node according to an embodiment of the present invention;
FIG. 3 is a flow chart of the water affair public opinion identification of the present invention;
FIG. 4 is a topological structure diagram of the topic model of the present invention;
FIG. 5 is a schematic diagram of the probabilistic graphical representation (directed graph) of the topic model.
Detailed Description
In this embodiment, as shown in Fig. 3, a water affair public opinion identification method based on network text data is performed according to the following steps.
In practical applications, the water affair network text data may be messages on the websites of water supply groups and government portals, topics in network communities, news reports on media platforms, and similar data. For convenience of description, the following takes message data from a government portal website as the example of water affair network text data.
Step 1, as shown in fig. 1, acquiring network text data related to water affairs;
S10, comparing the website of the target webpage with the reference website to determine the type of the website;
S20, determining a webpage searching strategy according to the type of the website;
S30, acquiring the webpage data of the target webpage according to the webpage searching strategy.
Step 1.1, adopting different webpage searching strategies to collect the network text data of the target webpage according to the type of the website:
In this method, the webpage search strategy may be a breadth-first strategy, a depth-first strategy, or a combination of the two; the combination may apply the depth-first strategy before the breadth-first strategy or the breadth-first strategy before the depth-first strategy. In the following, determining that the webpage search strategy is "a depth-first strategy and a breadth-first strategy" means that a combination of the depth-first strategy and the breadth-first strategy is used.
Fig. 2 shows the webpage node structure of a website according to one embodiment. The first-layer webpage node is A (the root node), the second-layer nodes are B, C and D, and the third-layer nodes are E, F, G, H and I. If the webpage search strategy is the breadth-first strategy, the crawl traverses the nodes in the order A -> B -> C -> D -> E -> F -> G -> H -> I; if it is the depth-first strategy, the order is A -> B -> E -> F -> C -> G -> D -> H -> I.
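The two traversal orders can be reproduced with a short Python sketch. The parent-child links in `SITE` are an assumption: the text lists only the layers, and the links below are inferred from the depth-first path (B links to E and F, C to G, D to H and I). The function names are illustrative only.

```python
from collections import deque

# Hypothetical link structure for the node layout of Fig. 2, inferred from
# the depth-first path given in the text.
SITE = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G"],
    "D": ["H", "I"],
}


def breadth_first(graph, root):
    """Visit pages layer by layer, using a FIFO queue."""
    order, queue, seen = [], deque([root]), {root}
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def depth_first(graph, root):
    """Follow each branch to its end before backtracking, using a stack."""
    order, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(graph.get(node, [])))
    return order


print(" -> ".join(breadth_first(SITE, "A")))
print(" -> ".join(depth_first(SITE, "A")))
```

Under the assumed link structure, the two printed paths match the breadth-first and depth-first orders stated for Fig. 2.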
Before the webpage search strategy is determined according to the website type in step S20, each website type may be bound or mapped to a webpage search strategy in advance. The website type is then determined by comparing the website of the target webpage with preset reference websites, and the webpage search strategy follows from the type. Step S20 may include: adding a website type label to each reference website in a pre-established database; comparing the website of the target webpage with the reference websites; determining the reference website with the highest similarity to the website of the target webpage; reading the website type label of that reference website; and determining the website type of the target webpage from the label read. The website types may include official websites of water supply groups, network community websites and government information portal websites. Collecting webpage data with a different webpage search strategy for each website type improves the efficiency and accuracy of webpage data acquisition.
If the website type is the official website of the water supply group, adopting a breadth-first strategy;
if the website type is a government portal website, adopting a depth-first strategy;
In one embodiment, the acquired website of the target webpage is compared with the reference websites, the reference website most similar to it is determined, and the website type label of that reference website is read, which gives the website type of the target webpage as a government portal website. Such a website is the platform on which the public most directly leaves feedback messages, including messages related to keywords such as water supply, water use, water fees and water quality. According to the pre-established binding, the webpage search strategy is determined to be the depth-first strategy. The water affair network text data is then the water-affair-related feedback messages in the message section of the government portal website.
If the website type is a network community or forum website, acquiring the type of the user publishing topics, comments or messages related to water affairs according to the network text data of the target webpage (generally the community or forum home page), thereby determining the webpage search strategy;
if the type of the user is an official user of a water affair group or news media, adopting a depth-first strategy;
if the type of the user is a personal user, adopting a depth-first strategy and a breadth-first strategy;
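The selection logic of steps S10 and S20 can be sketched as follows. This is a minimal illustration, not the patented implementation: the reference URLs and type labels are hypothetical, and `difflib.SequenceMatcher` stands in for the similarity measure, which the text does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical reference websites with their type labels; the text stores
# these in a pre-established database.
REFERENCE_SITES = {
    "https://water-group.example.com/": "water_group_official",
    "https://gov-portal.example.gov/": "government_portal",
    "https://forum.example.net/": "community_forum",
}


def website_type(url):
    """Return the type label of the reference website most similar to url."""
    best = max(
        REFERENCE_SITES,
        key=lambda ref: SequenceMatcher(None, url, ref).ratio(),
    )
    return REFERENCE_SITES[best]


def search_strategy(url, user_type=None):
    """Map website type (and, for forums, user type) to search strategies."""
    site = website_type(url)
    if site == "water_group_official":
        return ["breadth_first"]
    if site == "government_portal":
        return ["depth_first"]
    # Network community or forum: the strategy depends on who published.
    if user_type == "official":
        return ["depth_first"]
    if user_type == "personal":
        return ["depth_first", "breadth_first"]
    raise ValueError("forum pages need user_type 'official' or 'personal'")


print(search_strategy("https://gov-portal.example.gov/messages/123"))
print(search_strategy("https://forum.example.net/thread/9", "personal"))
```

A lookup table keyed by type label keeps the binding between website type and strategy in one place, which mirrors the pre-binding described above.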
step 1.2, acquiring the grades of all participating users (namely first participating users) of a first topic (a first topic or a keyword defined according to needs) to which the network text data belongs according to the network text data of the target webpage;
if the grade of the first participating user meets the preset grade requirement, acquiring all network text data which are published under the first topic of the first participating user and are related to the water affairs;
step 1.3, acquiring topic participation times of all participating users (namely second participating users) of a second topic (a second topic or a keyword defined according to needs) to which the network text data belongs according to the network text data of the target webpage;
if the topic participation frequency of the second participating user exceeds a preset participation frequency threshold, acquiring network text data issued by the second participating user in the life cycle (a time period defined according to needs) of the second topic;
step 1.4, acquiring a third topic (a third topic or a keyword defined as required) to which the published network text data belongs, all participating users of the third topic (namely third participating users) and the levels thereof according to the network text data published by the second participating users in the life cycle of the second topic;
if the grade of the third participating user meets the preset grade requirement, collecting all network text data published by the third participating user under the third topic;
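Steps 1.2 to 1.4 amount to filtering posts by user level and by participation count. The sketch below illustrates this under stated assumptions: the `Post` record, the integer day index, and the reading of "meets the preset level requirement" as "level at or above a minimum" are all hypothetical choices, not taken from the source.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Post:
    user: str
    topic: str
    text: str
    day: int  # hypothetical day index used to express the topic life cycle


def by_level(posts, topic, levels, min_level):
    """Steps 1.2 and 1.4: posts under topic by users whose level qualifies."""
    return [
        p for p in posts
        if p.topic == topic and levels.get(p.user, 0) >= min_level
    ]


def by_participation(posts, topic, min_count, lifecycle):
    """Step 1.3: posts published during the topic life cycle by users who
    took part in the topic more than min_count times."""
    counts = Counter(p.user for p in posts if p.topic == topic)
    active = {user for user, n in counts.items() if n > min_count}
    start, end = lifecycle
    return [p for p in posts if p.user in active and start <= p.day <= end]


posts = [
    Post("u1", "outage", "a", 1),
    Post("u1", "outage", "b", 2),
    Post("u1", "quality", "c", 3),
    Post("u2", "outage", "d", 2),
    Post("u3", "quality", "e", 9),
]
levels = {"u1": 5, "u2": 1, "u3": 4}

first = by_level(posts, "outage", levels, min_level=3)
second = by_participation(posts, "outage", min_count=1, lifecycle=(1, 5))
# Step 1.4: topics reached through the step 1.3 posts become third topics.
third_topics = {p.topic for p in second} - {"outage"}
third = [p for t in sorted(third_topics) for p in by_level(posts, t, levels, 3)]
```

Here the posts collected in step 1.3 lead from the "outage" topic to the "quality" topic, whose qualifying users are then collected in step 1.4.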
In step S30 in Fig. 1, when the water affair network text data of the target webpage is acquired according to the webpage search strategy, a data acquisition technique must be chosen. This may be the Beautiful Soup library for Python together with regular expression matching, or a distributed parallel automatic acquisition technique. First, an initial URL queue is constructed, and the HTML content of each webpage is obtained through requests.get(url). The content is then parsed with Beautiful Soup, bSoup = BeautifulSoup(responseHtml.text, 'html.parser'), and the find_all method, bSoup.find_all('a', href=re.compile(regex)), returns all links on the page whose URLs have the specified form; these URLs are then added to the queue one by one.
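The queue-driven acquisition loop can be sketched as below. To keep the example runnable without third-party packages, it swaps in the standard library `html.parser` for Beautiful Soup and takes the page-fetching function as a parameter; with the libraries named in the text, `fetch` could be something like `lambda url: requests.get(url).text`. The class and function names and the `max_pages` limit are assumptions for illustration.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url, pattern):
    """Return absolute URLs of links whose href matches the regex pattern."""
    parser = LinkParser()
    parser.feed(html)
    regex = re.compile(pattern)
    return [urljoin(base_url, h) for h in parser.links if regex.search(h)]


def crawl(start_url, fetch, pattern, max_pages=100):
    """Fetch pages from a URL queue, enqueueing matching links one by one."""
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html, url, pattern):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory site used in place of real HTTP requests.
FAKE_SITE = {
    "http://site.example/": '<a href="/msg/1">m1</a><a href="/about">x</a>',
    "http://site.example/msg/1": '<a href="/msg/2">m2</a><a href="/msg/1">m1</a>',
    "http://site.example/msg/2": "<p>no links</p>",
}
pages = crawl("http://site.example/", lambda u: FAKE_SITE.get(u, ""), r"/msg/\d+")
```

Because the queue is FIFO, this loop visits pages breadth-first; replacing `popleft()` with `pop()` would give a depth-first order instead.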
After the search strategy and the acquisition technique are determined, the water affair network text data can be crawled. Crawling consists of three parts: splicing the target address, acquiring the HTML or JSON code, and parsing it to obtain the text data. On the basis of a successful login, for some sites the Chinese keywords are MD5-encoded and spliced onto a known address; for others the page number or text number and the encoded keyword are spliced onto the address, yielding the target URL. A browser is then simulated to access the webpage and obtain the HTML or JSON code. Finally, the HTML or JSON code is parsed and the required text data related to water affairs is extracted from it.
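The URL-splicing step might look like the following sketch. The base address and the query parameter names `kw` and `page` are hypothetical, and using the hexadecimal MD5 digest of the UTF-8 keyword is an assumption; the text says only that the keyword is MD5-encoded before splicing.

```python
import hashlib
from urllib.parse import urlencode


def build_target_url(base_url, keyword, page):
    """Splice an MD5-encoded keyword and a page number onto a known address."""
    # Assumption: the site expects the hex digest of the UTF-8 keyword.
    digest = hashlib.md5(keyword.encode("utf-8")).hexdigest()
    return f"{base_url}?{urlencode({'kw': digest, 'page': page})}"


url = build_target_url("http://site.example/search", "水质", 2)
print(url)
```

Each target site defines its own address format, so in practice one such builder would be written per site.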
Finally, the acquired water affair network text data is stored: a loop writes the list of text data obtained in the previous step into a MySQL database or a txt file, so that the data can subsequently be analyzed and mined.
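For the txt-file option, the storage loop can be as simple as the sketch below. Writing one text per line and flattening embedded newlines are assumptions of this example; the function names are illustrative.

```python
def save_texts(texts, path):
    """Write each collected text on its own line of a UTF-8 txt file."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # Flatten embedded newlines so that one line holds one text.
            f.write(text.replace("\n", " ") + "\n")


def load_texts(path):
    """Read the texts back, one per line, for later analysis."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```

The file produced this way can be passed directly to the preprocessing stage, one document per line.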
In the data acquisition process, Python can be used as the development language, PyCharm as the development environment, and MySQL or local files for data storage. Combining web crawling strategies with Python web crawler techniques, code is written to acquire each kind of water affair network text data; running the program then acquires the data in real time, which improves the efficiency of data acquisition.
In one embodiment, the water affair network text data is stored in a local file and then preprocessed and mined, including word segmentation, stop word removal and topic analysis.
Step 2, preprocessing the network text data related to the water affairs;
step 2.1, performing word segmentation processing on the network text data related to the water affairs so as to convert the text into a word set;
in one embodiment, the water affair text corpus consists of the acquired water-affair-related message text from the government portal website, and it is segmented into words using the jieba word segmentation toolkit for Python.
Step 2.2, constructing a stop word list for the network text data related to the water affairs, and performing stop word removal on the word set to obtain the word set without stop words;
in one embodiment, stop words are removed from the segmented water affair text corpus. Punctuation marks, special characters, modal particles and set phrases, such as "hello" and "leader", are added to the stop word list. The words in the water affair text are matched against the stop word list by string matching, and every word that matches a stop word is removed from the text, which reduces noise and effectively limits the influence of irrelevant words on the topic descriptions.
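Steps 2.1 and 2.2 can be sketched as follows. `jieba.lcut` is the list-returning segmentation call of the third-party jieba package, which is assumed to be installed; it is imported inside `segment` only so that the stop word filter can be tried without it. The stop word list and the sample tokens are hypothetical.

```python
def segment(text):
    """Step 2.1: split Chinese text into a list of words with jieba."""
    import jieba  # third-party package, assumed installed (pip install jieba)
    return jieba.lcut(text)


def remove_stop_words(tokens, stop_words):
    """Step 2.2: drop every token that exactly matches a stop word."""
    return [tok for tok in tokens if tok not in stop_words]


# Hypothetical stop word list: punctuation plus polite set phrases.
STOP_WORDS = {"，", "。", "！", "你好", "领导"}

# Tokens as they might come out of segment() for a short feedback message.
tokens = ["你好", "领导", "，", "小区", "停水", "，", "水质", "浑浊"]
print(remove_stop_words(tokens, STOP_WORDS))
```

Storing the stop words in a set makes each membership check constant time, which matters when the corpus is large.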
Step 3, analyzing the network text data related to the water affairs and finding out the public opinion of the water affairs;
step 3.1, constructing a corpus from the preprocessed network text data; assuming the corpus contains M water affair network texts, all word vectors and the corresponding topics in the corpus are expressed as
$$\vec{w} = (\vec{w}_1, \ldots, \vec{w}_M)$$
$$\vec{z} = (\vec{z}_1, \ldots, \vec{z}_M)$$
where $\vec{w}_m$ represents the word vector of the m-th water affair network text, and $\vec{z}_m$ represents the topic numbers corresponding to the words in $\vec{w}_m$;
step 3.2, calculating the topic generation probability of the water affair network texts in the corpus:
step 3.2.1, obtaining the topic generation probability $p(\vec{z}_m \mid \vec{\alpha})$ of the m-th water affair network text by using formula (1):
$$p(\vec{z}_m \mid \vec{\alpha}) = \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{1}$$
In formula (1), $\vec{n}_m = (n_m^{(1)}, \ldots, n_m^{(K)})$ is the vector of word counts of the m-th water affair network text tallied by topic, $n_m^{(k)}$ is the number of words generated by the k-th topic in the m-th water affair network text, and $\vec{\alpha}$ is a hyperparameter. $\Delta(\cdot)$ is a normalization function which, for a K-dimensional vector $\vec{X} = (x_1, \ldots, x_K)$, is
$$\Delta(\vec{X}) = \frac{\prod_{k=1}^{K} \Gamma(x_k)}{\Gamma\left(\sum_{k=1}^{K} x_k\right)}$$
where $\Gamma(x)$ is the gamma function;
step 3.2.2, obtaining the topic generation probability $p(\vec{z} \mid \vec{\alpha})$ of the water affair network texts in the corpus by using formula (2):
$$p(\vec{z} \mid \vec{\alpha}) = \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{2}$$
Step 3.3, calculating the word generation probability of the water affair network texts in the corpus:
step 3.3.1, obtaining the probability $p(\vec{w}_{(k)} \mid \vec{\beta})$ of the words generated by the k-th topic by using formula (3):
$$p(\vec{w}_{(k)} \mid \vec{\beta}) = \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \tag{3}$$
In formula (3), $\vec{w}_{(k)}$ is the vector of words generated by the k-th topic, $\vec{n}_k = (n_k^{(1)}, \ldots, n_k^{(V)})$ is the vector of counts of the words generated by the k-th topic, $n_k^{(t)}$ is the number of occurrences of the t-th word generated by the k-th topic, and $\vec{\beta}$ is a hyperparameter;
step 3.3.2, obtaining the word generation probability $p(\vec{w} \mid \vec{z}, \vec{\beta})$ of the water affair network texts in the corpus by using formula (4):
$$p(\vec{w} \mid \vec{z}, \vec{\beta}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \tag{4}$$
Step 3.4, calculating the joint probability $p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta})$ of generating the water affair network texts in the corpus by using formula (5):
$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta})\, p(\vec{z} \mid \vec{\alpha}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \tag{5}$$
Step 3.5, updating the topic of each word in the corpus by using formula (6):
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left(n_{m,\neg i}^{(k')} + \alpha_{k'}\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left(n_{k,\neg i}^{(t')} + \beta_{t'}\right)} \tag{6}$$
In formula (6), $z_i$ is the topic corresponding to the i-th word, k is the topic number, $\vec{z}_{\neg i}$ is the topics of the remaining words after the i-th word is excluded, $\vec{w}$ is the word vector, $n_{m,\neg i}^{(k)}$ is the number of words corresponding to the k-th topic in the m-th water affair network text after the i-th word is excluded, $\alpha_k$ is the k-th dimension of the hyperparameter $\vec{\alpha}$, $\beta_t$ is the t-th dimension of the hyperparameter $\vec{\beta}$, $n_{k,\neg i}^{(t)}$ is the number of occurrences of the t-th word generated by the k-th topic after the i-th word is excluded, and V is the number of distinct words (vocabulary size) of the whole water affair network text corpus;
In one embodiment, topic analysis may be performed on the preprocessed water affair text corpus using the LDA topic modeling method. Fig. 4 is a topological structure diagram of the topic model of one embodiment, in which C1 is the document layer, C2 is the topic layer and C3 is the word layer; Fig. 5 is a schematic diagram of the probabilistic graphical representation (directed graph) of the topic model of one embodiment. The topic modeling method includes inferring the topics, with the topic of each word updated by using formula (6).
Step 3.6, calculating the word distribution $\varphi_{k,t}$ of the t-th word of the k-th topic by using formula (7):
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left(n_k^{(t')} + \beta_{t'}\right)} \tag{7}$$
Step 3.7, calculating the distribution $\theta_{m,k}$ of the k-th topic in the m-th water affair network text by using formula (8):
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left(n_m^{(k')} + \alpha_{k'}\right)} \tag{8}$$
Step 3.8, according to the word distribution $\vec{\varphi}_k$ under the k-th topic, selecting the top N words of the current k-th topic as the keywords of the k-th topic, and describing and interpreting the k-th topic in terms of its actual meaning for water affair public opinion according to the semantics of the keywords, so that the points of concern of public opinion and mainstream media regarding water affairs are found and the water affair public opinion is identified.
The table shows the topic vocabulary obtained by topic analysis of one embodiment:
[Table image: topic vocabulary of the embodiment]
For the k-th topic, the top N words are selected from the current topic as its keywords and main descriptive words; the actual meaning of the topic in water affair public opinion is then interpreted from the semantics of these descriptive words and the word distribution $\varphi_{k,t}$, and the points of concern of the public opinion are mined by combining several topics.
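Steps 3.5 to 3.8 correspond to collapsed Gibbs sampling for LDA, which can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the patented implementation: the hyperparameters are taken as symmetric scalars `alpha` and `beta`, the iteration count and random seed are arbitrary, and the function names are hypothetical.

```python
import random


def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric scalar priors."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: t for t, w in enumerate(vocab)}
    V = len(vocab)
    n_mk = [[0] * K for _ in docs]      # words of topic k in document m
    n_kt = [[0] * V for _ in range(K)]  # occurrences of word t under topic k
    n_k = [0] * K                       # total words under topic k
    z = []                              # topic assignment of every word
    for m, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(K)        # random initial topic
            assignments.append(k)
            n_mk[m][k] += 1
            n_kt[k][index[w]] += 1
            n_k[k] += 1
        z.append(assignments)
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, k = index[w], z[m][i]
                # Exclude word i from all counts (the "not i" terms).
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Formula (6); the document-side denominator is the same
                # for every topic, so it is dropped from the weights.
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta)
                    / (n_k[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    # Formula (7): word distribution of each topic.
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
           for k in range(K)]
    # Formula (8): topic distribution of each document.
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for m, doc in enumerate(docs)]
    return vocab, phi, theta


def top_words(vocab, phi_k, n):
    """Step 3.8: the n most probable words of one topic."""
    ranked = sorted(range(len(vocab)), key=lambda t: phi_k[t], reverse=True)
    return [vocab[t] for t in ranked[:n]]
```

With a list of token lists such as `[["water", "outage"], ["water", "fee"]]`, `lda_gibbs(docs, K=2)` returns the vocabulary together with the matrices of formulas (7) and (8), and `top_words(vocab, phi[k], N)` gives the keywords used to interpret topic k.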

Claims (1)

1. A water affair public opinion identification method based on network text data is characterized by comprising the following steps:
step 1, acquiring network text data related to water affairs;
step 1.1, adopting different webpage searching strategies to collect the network text data of the target webpage according to the type of the website:
if the website type is the official website of the water supply group, adopting a breadth-first strategy;
if the website type is a government portal website, adopting a depth-first strategy;
if the website type is a network community or forum website, acquiring the type of the user publishing topics, comments or messages related to water affairs according to the network text data of the target webpage, and determining the webpage searching strategy accordingly;
if the type of the user is an official user, adopting a depth-first strategy;
if the type of the user is a personal user, adopting a depth-first strategy and a breadth-first strategy;
step 1.2, acquiring, from the network text data of the target webpage, the levels of all participating users of a first topic to which the network text data belongs;
if the level of a participating user of the first topic meets a preset level requirement, collecting all network text data published by that participating user under the first topic;
step 1.3, acquiring, from the network text data of the target webpage, the number of topic participations of all participating users of a second topic to which the network text data belongs;
if the number of topic participations of a participating user of the second topic exceeds a preset participation threshold, collecting the network text data published by that participating user during the life cycle of the second topic;
step 1.4, acquiring, from the network text data published by the participating users during the life cycle of the second topic, all participating users of a third topic to which the published network text data belongs, together with their levels;
if the level of a participating user of the third topic meets the preset level requirement, collecting all network text data published by that participating user under the third topic;
step 2, preprocessing the network text data related to the water affairs;
step 2.1, performing word segmentation on the network text data related to water affairs so as to convert the text into word vectors;
step 2.2, constructing a stop word list for the network text data related to water affairs, and removing stop words from the word vectors to obtain word vectors without stop words;
step 3, analyzing the network text data related to water affairs and discovering water affair public opinion;
step 3.1, constructing a corpus from the preprocessed network text data and, assuming that the corpus contains M water affair network texts, expressing all word vectors and corresponding topics in the corpus as
Figure FDA0003001009250000011
Figure FDA0003001009250000012
where
Figure FDA0003001009250000013
denotes the word vector of the m-th water affair network text, and
Figure FDA0003001009250000014
denotes, for the word vector
Figure FDA0003001009250000015
the corresponding topic numbers;
step 3.2, calculating the topic generation probability of the water affair network texts in the corpus:
step 3.2.1, obtaining the topic generation probability of the m-th water affair network text by using formula (1)
Figure FDA0003001009250000021
Figure FDA0003001009250000022
In formula (1),
Figure FDA0003001009250000023
denotes the vector of per-topic word counts of the m-th water affair network text, and
Figure FDA0003001009250000024
Figure FDA0003001009250000025
denotes the number of words generated by the k-th topic in the m-th water affair network text,
Figure FDA0003001009250000026
is a hyperparameter, and Δ(·) denotes a normalization function;
step 3.2.2, obtaining the topic generation probability of the water affair network texts in the corpus by using formula (2)
Figure FDA0003001009250000027
Figure FDA0003001009250000028
step 3.3, calculating the word generation probability of the water affair network texts in the corpus:
step 3.3.1, obtaining the generation probability of the words generated by the k-th topic by using formula (3)
Figure FDA0003001009250000029
Figure FDA00030010092500000210
In formula (3),
Figure FDA00030010092500000211
denotes the word vector generated by the k-th topic,
Figure FDA00030010092500000212
denotes the vector of word counts generated by the k-th topic, and
Figure FDA00030010092500000213
Figure FDA00030010092500000214
denotes the number of times the t-th word is generated by the k-th topic, and
Figure FDA00030010092500000215
is a hyperparameter;
step 3.3.2, obtaining the word generation probability of the water affair network texts in the corpus by using formula (4)
Figure FDA00030010092500000216
Figure FDA00030010092500000217
step 3.4, calculating the joint generation probability of the water affair network texts in the corpus by using formula (5)
Figure FDA00030010092500000218
Figure FDA00030010092500000219
step 3.5, updating the topic of each word in the corpus by using formula (6):
Figure FDA00030010092500000220
In formula (6), $z_i$ denotes the topic corresponding to the i-th word, k denotes the topic number,
Figure FDA00030010092500000221
denotes the topics of the remaining words after the i-th word is excluded,
Figure FDA0003001009250000031
denotes the word vector,
Figure FDA0003001009250000032
denotes the number of words corresponding to the k-th topic in the m-th water affair network text after the i-th word is excluded, $\alpha_k$ is the k-th dimension of the hyperparameter
Figure FDA0003001009250000033
, $\beta_t$ is the t-th dimension of the hyperparameter
Figure FDA0003001009250000034
,
Figure FDA0003001009250000035
denotes the number of times the t-th word is generated by the k-th topic after the i-th word is excluded, and V denotes the length of the whole water affair network text corpus;
step 3.6, calculating the word distribution of the t-th word under the k-th topic by using formula (7)
Figure FDA0003001009250000036
Figure FDA0003001009250000037
step 3.7, calculating the distribution of the k-th topic in the m-th water affair network text by using formula (8)
Figure FDA0003001009250000038
Figure FDA0003001009250000039
step 3.8, according to the word distribution under the k-th topic
Figure FDA00030010092500000310
selecting the top N words of the current k-th topic as the keywords of the k-th topic, and describing and interpreting the k-th topic, in terms of its actual meaning for water affair public opinion, according to the semantics of the keywords, thereby discovering the concerns of the public and of mainstream media regarding water affairs and identifying water affair public opinion.
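Steps 3.5 to 3.7 of the claim can be illustrated with a short Python sketch. The formula images are not reproduced in this text, so the sketch assumes that formulas (6) to (8) take the standard collapsed Gibbs sampling form for a latent Dirichlet allocation topic model, which is what the symbol definitions in the claim suggest; it should not be read as the exact patented computation. The function name `gibbs_lda`, the use of scalar (symmetric) hyperparameters `alpha` and `beta` in place of the vector hyperparameters of the claim, the iteration count, and the toy corpus are all assumptions introduced for illustration, and V is treated here as the vocabulary size.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """docs: list of lists of word ids in [0, V). Returns (phi, theta)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_mk = np.zeros((M, K))  # words of text m assigned to topic k
    n_kt = np.zeros((K, V))  # times word t is assigned to topic k
    n_k = np.zeros(K)        # total words assigned to topic k
    z = []
    # Random initial topic for every word, with matching counts.
    for m, doc in enumerate(docs):
        zm = rng.integers(K, size=len(doc))
        z.append(zm)
        for t, k in zip(doc, zm):
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                # Exclude the i-th word from the counts.
                k = z[m][i]
                n_mk[m, k] -= 1
                n_kt[k, t] -= 1
                n_k[k] -= 1
                # Assumed form of formula (6): conditional over topics.
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new topic and restore the counts.
                z[m][i] = k
                n_mk[m, k] += 1
                n_kt[k, t] += 1
                n_k[k] += 1
    # Assumed forms of formulas (7) and (8): smoothed, normalized counts.
    phi = (n_kt + beta) / (n_k[:, None] + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# Hypothetical toy corpus: 3 texts over a 4-word vocabulary.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
phi, theta = gibbs_lda(docs, V=4, K=2)
print(phi.round(3))
print(theta.round(3))
```

Each row of `phi` is a word distribution for one topic and each row of `theta` is a topic distribution for one text, so both sum to 1 by construction. The top N entries of a row of `phi` would then supply the keywords used in step 3.8.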
CN202110346900.1A 2021-03-31 2021-03-31 Water affair public opinion identification method based on network text data Active CN113051455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110346900.1A CN113051455B (en) 2021-03-31 2021-03-31 Water affair public opinion identification method based on network text data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110346900.1A CN113051455B (en) 2021-03-31 2021-03-31 Water affair public opinion identification method based on network text data

Publications (2)

Publication Number Publication Date
CN113051455A true CN113051455A (en) 2021-06-29
CN113051455B CN113051455B (en) 2022-04-26

Family

ID=76516631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110346900.1A Active CN113051455B (en) 2021-03-31 2021-03-31 Water affair public opinion identification method based on network text data

Country Status (1)

Country Link
CN (1) CN113051455B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996450A (en) * 2022-05-27 2022-09-02 华中科技大学 Water public opinion big data analysis method based on double-layer fastText model

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809252A (en) * 2015-05-20 2015-07-29 成都布林特信息技术有限公司 Internet data extraction system
CN107229735A (en) * 2017-06-13 2017-10-03 成都布林特信息技术有限公司 Public feelings information analysis and early warning method based on natural language processing
CN107256263A (en) * 2017-06-13 2017-10-17 成都布林特信息技术有限公司 Internet hot spots information automatic monitoring method
CN107291778A (en) * 2016-04-11 2017-10-24 中兴通讯股份有限公司 The collection method and device of data
CN109145215A (en) * 2018-08-29 2019-01-04 中国平安保险(集团)股份有限公司 Internet public opinion analysis method, apparatus and storage medium
CN109471965A (en) * 2018-10-26 2019-03-15 四川才子软件信息网络有限公司 A kind of network public-opinion data sampling and processing method and monitoring platform based on big data
EP3499508A1 (en) * 2017-12-14 2019-06-19 Koninklijke Philips N.V. Computer-implemented method and apparatus for generating information
CN110163688A (en) * 2019-05-30 2019-08-23 复旦大学 Commodity network public sentiment detection system
WO2019227710A1 (en) * 2018-05-31 2019-12-05 平安科技(深圳)有限公司 Network public opinion analysis method and apparatus, and computer-readable storage medium
US20190377763A1 (en) * 2009-09-28 2019-12-12 Ebay Inc. System and method for topic extraction and opinion mining
CN112395539A (en) * 2020-11-26 2021-02-23 格美安(北京)信息技术有限公司 Public opinion risk monitoring method and system based on natural language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QINJUAN YANG等: "Segment-level joint topic-sentiment model for online review analysis", 《IEEE INTELLIGENT SYSTEMS》 *
吴卿凤等: "江苏水利网络舆情年度数据分析及思考", 《江苏水利》 *

Also Published As

Publication number Publication date
CN113051455B (en) 2022-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant