CN113051455A

CN113051455A - Water affair public opinion identification method based on network text data

Info

Publication number: CN113051455A
Application number: CN202110346900.1A
Authority: CN
Inventors: 朱波; 穆利; 姜元春; 吴铭; 钱洋; 王亚琦; 熊迎秋; 郝瀚; 丁磊; 隆云飞; 阚道升
Original assignee: Ustc Sinovate Software Co ltd; Hefei Water Group Co ltd; Hefei University of Technology
Current assignee: Ustc Sinovate Software Co ltd; Hefei Water Group Co ltd; Hefei University of Technology
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2021-06-29
Anticipated expiration: 2041-03-31
Also published as: CN113051455B

Abstract

The invention discloses a water affair public opinion identification method based on network text data, which comprises the following steps: 1. acquiring network text data related to water affairs; 2. preprocessing the network text data related to water affairs; 3. analyzing the network text data related to water affairs and finding the points of concern of water affair public opinion. The method determines the webpage searching strategy according to the type of the website, so that text data related to water affairs can be rapidly acquired from massive network data, and finds the points of concern of water affair public opinion by means of topic analysis so as to identify water affair public opinion, thereby improving the efficiency and accuracy of water affair public opinion identification; the result also has good interpretability.

Description

Water affair public opinion identification method based on network text data

Technical Field

The invention relates to the technical field of data mining, in particular to a water affair public opinion identification method based on network text data.

Background

With the rapid development of the internet and the continuous change of people's lifestyles, network data related to various industries is growing explosively. Most of this data is generated by people and reflects the direction of social public opinion, so enterprises attach great importance to it. The water affairs industry likewise identifies water affair public opinion by acquiring network texts related to water affairs from massive network data and finding the points of concern of social public opinion in those texts.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a water affair public opinion identification method based on network text data, which addresses the difficulty of identifying water affair public opinion: text data related to water affairs can be quickly acquired from massive network data and the points of concern in the water affair network texts can be accurately analyzed, thereby improving the efficiency and accuracy of water affair public opinion identification.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention discloses a water affair public opinion identification method based on network text data, which is characterized by comprising the following steps:

step 1, acquiring network text data related to water affairs;

step 1.1, adopting different webpage searching strategies to collect the network text data of the target webpage according to the type of the website:

if the website type is the official website of the water supply group, adopting a breadth-first strategy;

if the website type is a government portal website, adopting a depth-first strategy;

if the website type is a network community or forum website, acquiring the type of the user who publishes topics, comments or messages related to water affairs according to the network text data of the target webpage, and determining the webpage searching strategy accordingly:

if the type of the user is an official user, adopting a depth-first strategy;

if the type of the user is a personal user, adopting a combination of the depth-first strategy and the breadth-first strategy;

step 1.2, acquiring the levels of all participating users of a first topic to which the network text data belongs according to the network text data of the target webpage;

if the level of a participating user of the first topic meets the preset level requirement, collecting all network text data published by that participating user under the first topic;

step 1.3, acquiring the topic participation counts of all participating users of a second topic to which the network text data belongs according to the network text data of the target webpage;

if the topic participation count of a participating user of the second topic exceeds a preset participation threshold, acquiring the network text data published by that participating user within the life cycle of the second topic;

step 1.4, according to the network text data published by the participating users within the life cycle of the second topic, acquiring a third topic to which the published network text data belongs, together with all participating users of the third topic and their levels;

if the level of a participating user of the third topic meets the preset level requirement, collecting all network text data published by that participating user under the third topic;

step 2, preprocessing the network text data related to the water affairs;

step 2.1, performing word segmentation processing on the network text data related to the water affairs so as to convert the text into word vectors;

step 2.2, constructing a stop word list for the network text data related to the water affairs, and removing stop words from the word vector to obtain the word vector without stop words;

step 3, analyzing the network text data related to the water affairs and finding out the public opinion of the water affairs;

step 3.1, constructing a corpus by utilizing the preprocessed network text data, and assuming that M water affair network texts exist in the corpus, expressing all word vectors and corresponding topics in the corpus as

$$\vec{w} = (\vec{w}_1, \ldots, \vec{w}_M), \qquad \vec{z} = (\vec{z}_1, \ldots, \vec{z}_M)$$

wherein $\vec{w}_m$ represents the word vector in the mth water affair network text, and $\vec{z}_m$ represents the topic numbers corresponding to the word vector $\vec{w}_m$;

step 3.2, calculating the topic generation probability of the water affair network text in the corpus:

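step 3.2.1, obtaining the topic generation probability $p(\vec{z}_m \mid \vec{\alpha})$ of the mth water affair network text by using formula (1):

$$p(\vec{z}_m \mid \vec{\alpha}) = \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \qquad (1)$$

in formula (1), $\vec{n}_m = (n_m^{(1)}, \ldots, n_m^{(K)})$ represents the vector of word counts of the mth water affair network text tallied by topic, K being the number of topics, $n_m^{(k)}$ represents the number of words generated by the kth topic in the mth water affair network text, $\vec{\alpha}$ is a hyperparameter, and Δ(·) represents a normalization function;

step 3.2.2, obtaining the topic generation probability $p(\vec{z} \mid \vec{\alpha})$ of the water affair network texts in the corpus by using formula (2):

$$p(\vec{z} \mid \vec{\alpha}) = \prod_{m=1}^{M} p(\vec{z}_m \mid \vec{\alpha}) = \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \qquad (2)$$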

Step 3.3, calculating the word generation probability of the water affair network text in the corpus:

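step 3.3.1, obtaining the generation probability $p(\vec{w}_{(k)} \mid \vec{\beta})$ of the words generated by the kth topic by using formula (3):

$$p(\vec{w}_{(k)} \mid \vec{\beta}) = \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \qquad (3)$$

in formula (3), $\vec{w}_{(k)}$ represents the word vector generated by the kth topic, $\vec{n}_k = (n_k^{(1)}, \ldots, n_k^{(V)})$ represents the vector of counts of the words generated by the kth topic, $n_k^{(t)}$ indicates the number of occurrences of the tth word generated by the kth topic, and $\vec{\beta}$ is a hyperparameter;

step 3.3.2, obtaining the word generation probability $p(\vec{w} \mid \vec{z}, \vec{\beta})$ of the water affair network texts in the corpus by using formula (4):

$$p(\vec{w} \mid \vec{z}, \vec{\beta}) = \prod_{k=1}^{K} p(\vec{w}_{(k)} \mid \vec{\beta}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \qquad (4)$$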

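Step 3.4, calculating the joint probability $p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta})$ of generating the water affair network texts in the corpus by using formula (5):

$$p(\vec{w}, \vec{z} \mid \vec{\alpha}, \vec{\beta}) = p(\vec{w} \mid \vec{z}, \vec{\beta}) \, p(\vec{z} \mid \vec{\alpha}) = \prod_{k=1}^{K} \frac{\Delta(\vec{n}_k + \vec{\beta})}{\Delta(\vec{\beta})} \prod_{m=1}^{M} \frac{\Delta(\vec{n}_m + \vec{\alpha})}{\Delta(\vec{\alpha})} \qquad (5)$$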

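step 3.5, updating the topic of each word in the corpus by using formula (6):

$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto (n_{m,\neg i}^{(k)} + \alpha_k) \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} (n_{k,\neg i}^{(t)} + \beta_t)} \qquad (6)$$

in formula (6), $z_i$ indicates the topic corresponding to the ith word, which is the tth word of the vocabulary and belongs to the mth water affair network text, k indicates the topic number, $\vec{z}_{\neg i}$ indicates the topics of the remaining words after the ith word is excluded, $\vec{w}$ represents the word vector, $n_{m,\neg i}^{(k)}$ indicates the number of words corresponding to the kth topic in the mth water affair network text after the ith word is excluded, $\alpha_k$ is the kth dimension of the hyperparameter $\vec{\alpha}$, $\beta_t$ is the tth dimension of the hyperparameter $\vec{\beta}$, $n_{k,\neg i}^{(t)}$ indicates the number of occurrences of the tth word generated by the kth topic after the ith word is excluded, and V indicates the length of the vocabulary of the whole water affair network text corpus, that is, the number of distinct words;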

step 3.6, calculating the word distribution $\varphi_{k,t}$ of the tth word under the kth topic by using formula (7):

$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V} (n_k^{(t)} + \beta_t)} \qquad (7)$$

step 3.7, calculating the distribution $\theta_{m,k}$ of the kth topic in the mth water affair network text by using formula (8):

$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K} (n_m^{(k)} + \alpha_k)} \qquad (8)$$

step 3.8, according to the word distribution $\vec{\varphi}_k$ under the kth topic, selecting the top N words of the current kth topic as the keywords of the kth topic, and describing and analyzing the kth topic according to the semantics of the keywords and in line with the actual meaning of water affair public opinion, so that the points of concern of social public opinion and mainstream media regarding water affairs are found and the water affair public opinion is identified.
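Steps 3.6 to 3.8 can be illustrated with a minimal Python sketch. It assumes formulas (7) and (8) take the usual smoothed-count form of LDA, namely φ_{k,t} = (n_k^(t) + β_t) / Σ_t (n_k^(t) + β_t) and θ_{m,k} = (n_m^(k) + α_k) / Σ_k (n_m^(k) + α_k). The vocabulary, the count matrices and the hyperparameter values below are hypothetical and serve only as an illustration.

```python
def word_distribution(n_kt, beta):
    """Assumed formula (7): phi[k][t] = (n_k^(t) + beta_t) / sum_t (n_k^(t) + beta_t)."""
    phi = []
    for row in n_kt:
        total = sum(n + b for n, b in zip(row, beta))
        phi.append([(n + b) / total for n, b in zip(row, beta)])
    return phi


def topic_distribution(n_mk, alpha):
    """Assumed formula (8): theta[m][k] = (n_m^(k) + alpha_k) / sum_k (n_m^(k) + alpha_k)."""
    theta = []
    for row in n_mk:
        total = sum(n + a for n, a in zip(row, alpha))
        theta.append([(n + a) / total for n, a in zip(row, alpha)])
    return theta


def top_words(phi_k, vocab, n):
    """Step 3.8: the n words with the largest probability under one topic."""
    ranked = sorted(range(len(vocab)), key=lambda t: phi_k[t], reverse=True)
    return [vocab[t] for t in ranked[:n]]


# Hypothetical counts for K = 2 topics, V = 4 words and M = 2 texts.
vocab = ["water quality", "turbid", "water fee", "bill"]
n_kt = [[8, 5, 1, 0], [0, 1, 9, 6]]  # n_k^(t): count of word t under topic k
n_mk = [[10, 2], [4, 14]]            # n_m^(k): words of text m assigned to topic k
beta = [0.1] * len(vocab)
alpha = [0.5] * 2

phi = word_distribution(n_kt, beta)
theta = topic_distribution(n_mk, alpha)
for k in range(2):
    print(k, top_words(phi[k], vocab, 2))
```

With these counts the first topic is described by water-quality words and the second by billing words, which is the kind of keyword list an analyst would then interpret as a point of public concern.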

Compared with the prior art, the invention has the beneficial effects that:

1. The invention finds the points of concern of water affair public opinion by combining water affair network text data with topic analysis, thereby identifying water affair public opinion and improving the efficiency and accuracy of that identification.

2. The invention provides a webpage searching strategy determined according to the type of the website, and the webpage searching strategy is adopted to collect the water service network text data of the target webpage, so that the text data related to the water service can be quickly acquired from massive network data, and the acquisition efficiency of the water service network text data is improved.

3. The text data analysis method using LDA topic modeling is suitable for processing large-scale text data sets, realizes the discovery of the water affair public opinion concern and completes the water affair public opinion identification through the analysis of a large amount of water affair network text data, and the result has good interpretability.

Drawings

FIG. 1 is a flow chart of a water service web text data acquisition method of the present invention;

FIG. 2 is a schematic diagram of a structure of a web page network node according to an embodiment of the present invention;

FIG. 3 is a flow chart of the water public opinion identification of the present invention;

FIG. 4 is a topology diagram of the topic model of the present invention;

FIG. 5 is a schematic diagram of the probabilistic graphical (directed graph) representation of the topic model.

Detailed Description

In this embodiment, as shown in FIG. 3, a water affair public opinion identification method based on network text data is performed according to the following steps:

In practical applications, the water affair network text data may be messages on the websites of water supply groups and government portals, topics in network communities, news reports on media platforms, and similar data. For convenience of description, the following takes message data from a government portal website as the example.

Step 1, as shown in fig. 1, acquiring network text data related to water affairs;

S10, comparing the URL of the target webpage with the reference URLs to determine the type of the website;

S20, determining a webpage searching strategy according to the type of the website;

S30, acquiring the webpage data of the target webpage according to the webpage searching strategy.

In the method, the webpage searching strategy may be a breadth-first strategy, a depth-first strategy, or a combination of the two. In the following, a strategy described as using both the depth-first strategy and the breadth-first strategy refers to such a combination.

FIG. 2 shows the webpage node structure of a website according to an embodiment. The first-layer webpage node is A (the root node), the second-layer nodes are B, C and D, and the third-layer nodes are E, F, G, H and I. If the webpage searching strategy is determined to be the breadth-first strategy, the crawl path is A -> B -> C -> D -> E -> F -> G -> H -> I; if it is determined to be the depth-first strategy, the crawl path is A -> B -> E -> F -> C -> G -> D -> H -> I.
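The two traversal orders can be reproduced with a short Python sketch. The parent/child links used below are an assumption inferred from the depth-first path given above (B links to E and F, C to G, D to H and I), since the figure itself is not reproduced here.

```python
from collections import deque

# Page tree of FIG. 2. The parent/child links are an assumption
# inferred from the depth-first path stated in the text.
SITE = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G"],
    "D": ["H", "I"],
}


def breadth_first(root, links):
    """Visit pages layer by layer using a FIFO queue."""
    order, queue, seen = [], deque([root]), {root}
    while queue:
        page = queue.popleft()
        order.append(page)
        for child in links.get(page, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def depth_first(root, links):
    """Follow each branch to its end before backtracking, using a LIFO stack."""
    order, stack, seen = [], [root], set()
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push children in reverse so that the leftmost child is visited first.
        stack.extend(reversed(links.get(page, [])))
    return order


print(" -> ".join(breadth_first("A", SITE)))
print(" -> ".join(depth_first("A", SITE)))
```

The only difference between the two strategies is the container holding the pending pages: a queue yields the layer-by-layer order, a stack yields the branch-by-branch order.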

Before the webpage searching strategy is determined according to the website type in step S20, each website type may be bound or mapped to a webpage searching strategy in advance; the website type is determined by comparing the URL of the target webpage with preset reference URLs, and the webpage searching strategy is determined from it. Step S20 may include: adding a website type label to each reference URL in a pre-established database, comparing the URL of the target webpage with the reference URLs, determining the reference URL with the highest similarity to the URL of the target webpage, reading the website type label of that reference URL, and determining the website type of the target webpage from the label. The website types may include official websites of water supply groups, network community websites and government information portal websites. Collecting webpage data with a different webpage searching strategy for each website type improves the efficiency and accuracy of webpage data acquisition.
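A minimal sketch of this lookup is given below. The text does not say how similarity between URLs is measured, so the sketch assumes a plain string similarity (`difflib.SequenceMatcher` from the Python standard library); the reference URLs, the type labels and the strategy names are hypothetical placeholders.

```python
from difflib import SequenceMatcher

# Hypothetical reference URLs with website type labels (placeholders only).
REFERENCE_URLS = {
    "https://www.water-group.example/news": "water_supply_group",
    "https://www.city-gov.example/messages": "government_portal",
    "https://bbs.community.example/forum": "community_forum",
}

# Pre-bound mapping from website type to webpage searching strategy (step 1.1).
STRATEGY_BY_TYPE = {
    "water_supply_group": "breadth_first",
    "government_portal": "depth_first",
    "community_forum": "decide_by_user_type",
}


def classify_url(target_url):
    """Return the type label of the reference URL most similar to target_url."""
    best = max(
        REFERENCE_URLS,
        key=lambda ref: SequenceMatcher(None, target_url, ref).ratio(),
    )
    return REFERENCE_URLS[best]


def choose_strategy(target_url):
    """Look up the strategy bound to the website type of target_url."""
    return STRATEGY_BY_TYPE[classify_url(target_url)]


print(choose_strategy("https://www.city-gov.example/messages/12345"))
```

For forum websites the lookup returns only a marker, because the final strategy depends on the type of the publishing user, as described in the steps that follow.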

In one embodiment, after the acquired URL of the target webpage is compared with the reference URLs, the reference URL most similar to it is determined and its website type label is read, giving the website type of the target webpage as a government portal website. Such a website is the platform on which the public most directly leaves feedback messages, including messages related to keywords such as water supply, water consumption, water fee and water quality, and the webpage searching strategy is determined to be the depth-first strategy according to the pre-binding. The water affair network text data is then the water-affair-related feedback data in the message section of the government portal website.

If the website type is a network community or forum website, the type of the user who publishes topics, comments or messages related to water affairs is acquired according to the network text data of the target webpage (generally the home page of the community or forum), and the webpage searching strategy is determined accordingly;

if the type of the user is an official user of a water affair group or of the news media, a depth-first strategy is adopted;

if the type of the user is a personal user, a combination of the depth-first strategy and the breadth-first strategy is adopted;

step 1.2, acquiring the levels of all participating users (namely first participating users) of a first topic (a first topic or keyword defined as required) to which the network text data belongs according to the network text data of the target webpage;

if the level of a first participating user meets the preset level requirement, acquiring all water-affair-related network text data published by that first participating user under the first topic;

step 1.3, acquiring the topic participation counts of all participating users (namely second participating users) of a second topic (a second topic or keyword defined as required) to which the network text data belongs according to the network text data of the target webpage;

if the topic participation count of a second participating user exceeds a preset participation threshold, acquiring the network text data published by that second participating user within the life cycle (a time period defined as required) of the second topic;

step 1.4, according to the network text data published by the second participating users within the life cycle of the second topic, acquiring a third topic (a third topic or keyword defined as required) to which the published network text data belongs, together with all participating users of the third topic (namely third participating users) and their levels;

if the level of a third participating user meets the preset level requirement, collecting all network text data published by that third participating user under the third topic;
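The user-based filters of steps 1.2 to 1.4 can be sketched in Python as follows. The record layout, the level scale and both thresholds are illustrative assumptions, and the restriction to the life cycle of a topic is omitted for brevity.

```python
from collections import Counter

# Hypothetical posts; field names, levels and thresholds are illustrative only.
POSTS = [
    {"user": "u1", "level": 5, "topic": "water quality", "text": "Tap water is cloudy."},
    {"user": "u2", "level": 1, "topic": "water quality", "text": "Same here."},
    {"user": "u3", "level": 2, "topic": "water fee", "text": "Bill doubled."},
    {"user": "u3", "level": 2, "topic": "water fee", "text": "Still no reply."},
    {"user": "u3", "level": 2, "topic": "water fee", "text": "Meter was misread."},
    {"user": "u4", "level": 4, "topic": "water fee", "text": "Check the tiered price."},
]

MIN_LEVEL = 3          # assumed preset level requirement
MIN_PARTICIPATION = 2  # assumed participation threshold (must be exceeded)


def texts_by_level(posts, topic):
    """Steps 1.2 and 1.4: keep texts of users in `topic` whose level meets the requirement."""
    return [p["text"] for p in posts if p["topic"] == topic and p["level"] >= MIN_LEVEL]


def texts_by_participation(posts, topic):
    """Step 1.3: keep texts of users whose post count in `topic` exceeds the threshold."""
    counts = Counter(p["user"] for p in posts if p["topic"] == topic)
    return [
        p["text"]
        for p in posts
        if p["topic"] == topic and counts[p["user"]] > MIN_PARTICIPATION
    ]


print(texts_by_level(POSTS, "water quality"))
print(texts_by_participation(POSTS, "water fee"))
```

Step 1.4 would apply `texts_by_level` again, this time to the third topics found in the texts returned by `texts_by_participation`.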

In step S30 of FIG. 1, when the water affair network text data of the target webpage is collected according to the webpage searching strategy, a data acquisition technique needs to be determined. The text data may be acquired by using the Beautiful Soup library of Python together with regular expression matching, or a distributed parallel automatic acquisition technique may be adopted. First, an initial URL queue is constructed, and the HTML content of each webpage is obtained through requests.get(url). The content is then parsed with Beautiful Soup, bSoup = BeautifulSoup(responseHtml.text, 'html.parser'), and all required URL addresses in the page are obtained with the find_all method, bSoup.find_all('a', href=re.compile(regex)), which returns the URLs of the specified form; these URLs are then added to the queue one by one.
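The queue-driven loop described above can be sketched as follows. To keep the sketch free of third-party dependencies, it substitutes the standard-library `html.parser` for Beautiful Soup and takes the page-fetching function as a parameter (in practice this could be something like `lambda url: requests.get(url).text`); the URLs and the in-memory site are hypothetical.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect href values of <a> tags that match a regular expression."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and self.pattern.search(href):
                self.links.append(href)


def crawl(start_urls, fetch, pattern, max_pages=100):
    """Pop a URL from the queue, fetch it, and queue matching links not seen before."""
    queue, seen, pages = deque(start_urls), set(start_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser(pattern)
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Tiny in-memory site standing in for real HTTP responses (hypothetical URLs).
FAKE_SITE = {
    "http://site.example/index.html": '<a href="/msg/1.html">1</a><a href="/about.html">x</a>',
    "http://site.example/msg/1.html": '<a href="/msg/2.html">2</a>',
    "http://site.example/msg/2.html": "<p>no links</p>",
}
pages = crawl(["http://site.example/index.html"], FAKE_SITE.__getitem__, r"^/msg/\d+\.html$")
print(sorted(pages))
```

Because the pending URLs are held in a FIFO queue, this sketch crawls breadth-first; replacing `popleft()` with `pop()` turns it into a depth-first crawl.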

After the search strategy and the acquisition technique are determined, the water affair network text data can be crawled. Crawling mainly comprises three parts: splicing the target address, acquiring and parsing the HTML or JSON code, and extracting the text data. On the basis of a successful login, some Chinese keywords are MD5-encoded and then spliced with a known address, and others are spliced with the address according to the page number or text number together with the encoded keyword, so that the target URL address is obtained; a browser is then simulated to access the webpage and acquire the HTML or JSON code; finally, the HTML or JSON code is parsed and the required text data related to water affairs is extracted from it.

Finally, the acquired water affair network text data is stored: the list of text data obtained in the previous step is written, in a loop, into a MySQL database or a txt file so that the data can later be analyzed and mined.

In the data acquisition process, Python can be used as the development language, PyCharm as the development environment, and MySQL or local files as data storage. By combining web crawler strategies with Python web crawler techniques, code is designed to collect each kind of water affair network text data; the data can then be collected by running the program in real time, which improves data acquisition efficiency.

In one embodiment, the water affair network text data is stored in a local file, and the text data is then preprocessed and mined, including word segmentation, stop word removal and topic analysis.
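The address-splicing step can be illustrated with a small Python sketch. The base address, the query parameter names and the use of percent-encoding in the second variant are assumptions made for illustration; the actual URL scheme depends on the target website.

```python
import hashlib
from urllib.parse import quote

BASE_URL = "http://site.example/search"  # hypothetical known address


def md5_hex(keyword):
    """MD5 digest of the UTF-8 encoded keyword, as 32 hexadecimal characters."""
    return hashlib.md5(keyword.encode("utf-8")).hexdigest()


def url_from_md5(keyword):
    """Variant 1: splice the MD5-encoded keyword onto the known address."""
    return f"{BASE_URL}/{md5_hex(keyword)}"


def url_from_page(keyword, page):
    """Variant 2: splice the encoded keyword and a page number onto the address."""
    return f"{BASE_URL}?kw={quote(keyword)}&page={page}"


print(url_from_md5("水质"))      # keyword "water quality"
print(url_from_page("水费", 2))  # keyword "water fee", second result page
```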

Step 2, preprocessing the network text data related to the water affairs;

step 2.1, performing word segmentation processing on the network text data related to the water affairs so as to convert the text into a word set;

In one embodiment, the water affair text corpus formed by the acquired water-affair-related message text data of the government portal website is segmented into words by using the jieba word segmentation toolkit in Python.

Step 2.2, constructing a stop word list for the network text data related to the water affairs, and removing stop words from the word set to obtain the word set without stop words;

In one embodiment, stop words are removed from the segmented water affair text corpus. Various punctuation marks, special characters, modal particles and stock phrases, such as 'hello' and 'leader', are added to the stop word list. The words in the water affair text are matched against the stop word list by string matching, and the words that match are removed from the text, which reduces noise and the influence of irrelevant words on topic description.
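A minimal sketch of the stop word removal is shown below. The stop word list and the sample tokens are hypothetical; the tokens are written as a word segmenter might return them, so that the sketch does not depend on any segmentation library.

```python
# Hypothetical stop word list: punctuation, particles and greeting phrases
# ("你好" = hello, "领导" = leader), as described in the text.
STOP_WORDS = {"，", "。", "！", "你好", "领导", "的", "了"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens and every token that exactly matches a stop word."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Tokens for the message "领导你好，小区停水了！"
# ("Hello leader, the water supply in our neighbourhood has been cut off!")
tokens = ["领导", "你好", "，", "小区", "停水", "了", "！"]
print(remove_stop_words(tokens))
```

Holding the stop words in a set makes each membership test a constant-time exact string match, which matters when the corpus contains a large number of messages.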


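In formula (1) of step 3.2.1, $\vec{\alpha}$ is a hyperparameter and Δ(·) represents a normalization function; for a K-dimensional vector $\vec{X} = (x_1, \ldots, x_K)$,

$$\Delta(\vec{X}) = \frac{\prod_{k=1}^{K} \Gamma(x_k)}{\Gamma\left(\sum_{k=1}^{K} x_k\right)}$$

where Γ(x) is the gamma function;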
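The normalization function can be evaluated numerically as sketched below, assuming it has the usual Dirichlet form Δ(X) = Π_k Γ(x_k) / Γ(Σ_k x_k) and that formula (1) is the ratio Δ(n_m + α) / Δ(α). Working with logarithms through `math.lgamma` avoids overflow for large word counts; the function names are illustrative.

```python
import math


def log_delta(x):
    """log Δ(X) = Σ_k log Γ(x_k) - log Γ(Σ_k x_k)."""
    return sum(math.lgamma(v) for v in x) - math.lgamma(sum(x))


def log_topic_probability(n_m, alpha):
    """Logarithm of the assumed formula (1): log Δ(n_m + α) - log Δ(α)."""
    shifted = [n + a for n, a in zip(n_m, alpha)]
    return log_delta(shifted) - log_delta(alpha)


# Mathematically, Δ((1, 1, 1)) = Γ(1)^3 / Γ(3) = 1/2.
print(math.exp(log_delta([1.0, 1.0, 1.0])))
# One word assigned to the first of two topics under a uniform prior.
print(math.exp(log_topic_probability([1, 0], [1.0, 1.0])))
```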


In one embodiment, topic analysis may be performed on the preprocessed water affair text corpus by using the LDA topic modeling method. FIG. 4 is a topology diagram of the topic model of an embodiment, in which C1 is the document layer, C2 is the topic layer and C3 is the word layer; FIG. 5 is a schematic diagram of the probabilistic graphical (directed graph) representation of the topic model of an embodiment. The topic modeling method includes inference of the topics, in which the topic of each word is updated by using formula (6).
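The topic update can be illustrated with a compact collapsed Gibbs sampler. This is a sketch under stated assumptions rather than the implementation of the embodiment: it assumes symmetric hyperparameters (every α_k equal to `alpha`, every β_t equal to `beta`) and that formula (6) has the usual form, in which the probability of assigning topic k to a word is proportional to (n_m^(k) + α_k) · (n_k^(t) + β_t) / Σ_t (n_k^(t) + β_t), with the current word excluded from the counts. All names and default values are illustrative.

```python
import random


def gibbs_lda(docs, K, V, alpha=0.5, beta=0.1, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of texts, each a list of word ids in range(V).
    Returns (n_mk, n_kt, z): topic counts per text, word counts per topic,
    and the topic assigned to every word.
    """
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]      # n_m^(k)
    n_kt = [[0] * V for _ in range(K)]  # n_k^(t)
    n_k = [0] * K                       # total number of words per topic
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random start

    def count(m, t, k, step):
        """Add (step=+1) or remove (step=-1) one word from all counters."""
        n_mk[m][k] += step
        n_kt[k][t] += step
        n_k[k] += step

    for m, doc in enumerate(docs):
        for i, t in enumerate(doc):
            count(m, t, z[m][i], +1)

    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                # Exclude the current word, giving the "not i" counts.
                count(m, t, z[m][i], -1)
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_mk[m][k] + alpha) * (n_kt[k][t] + beta) / (n_k[k] + V * beta)
                    for k in range(K)
                ]
                z[m][i] = rng.choices(range(K), weights=weights)[0]
                count(m, t, z[m][i], +1)
    return n_mk, n_kt, z


# Three tiny texts over a vocabulary of four word ids (hypothetical data).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
n_mk, n_kt, z = gibbs_lda(docs, K=2, V=4)
print(n_mk)
print(n_kt)
```

The returned count matrices are exactly the quantities from which the word distribution of each topic and the topic distribution of each text are computed in the subsequent steps.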

Step 3.8, the keywords of each topic are selected according to the word distribution $\vec{\varphi}_k$ under the kth topic.

The table shows the topic vocabulary obtained by topic analysis of one embodiment:

For the kth topic, the top N words are selected from the current topic as its keywords and main descriptive words; the actual meaning of the current topic in public opinion is then explained according to the semantics of the descriptive words and the word distribution $\varphi_{k,t}$, and the points of concern of public opinion are mined by combining multiple topics.