CN112148936A - Business and travel public opinion analysis method based on script crawler framework and text analysis

Info

Publication number: CN112148936A
Application number: CN202011076411.0A
Authority: CN
Inventors: 苏如春; 孙少峰; 练镜锋
Original assignee: Guangzhou Hantele Communication Co ltd
Current assignee: Guangzhou Hantele Communication Co ltd
Priority date: 2020-10-10
Filing date: 2020-10-10
Publication date: 2020-12-29

Abstract

The invention relates to a business travel public opinion analysis method based on script crawler architecture and text analysis, which comprises the following steps: acquiring a Chinese text, and processing and segmenting it to obtain text features; preprocessing the text features and judging whether to send them to a word bank; obtaining, by the word bank, a text category according to the text features; and carrying out public opinion analysis on the text. Because the text category is obtained by analyzing the text features through the word bank, complicated steps are omitted, and public opinion analysis is carried out rapidly while a certain accuracy is ensured.

Description

Business and travel public opinion analysis method based on script crawler framework and text analysis

Technical Field

The invention relates to the technical field of public opinion analysis, in particular to a business public opinion analysis method based on script crawler architecture and text analysis.

Background

Public opinion analysis of business travel is an important way of understanding users. After watching a video or using a product, a user may express feelings and opinions in various ways, for example about the content of a television variety program, fondness for actors, views on a local bank, or opinions on a product; mining and analyzing these public opinions displays the users' points of attention and subjective feelings more intuitively and clearly.

The content subject to public opinion analysis includes text, pictures, audio and other forms, and the data sources mainly include web page data, client data, forum data and other network data. Comprehensive, in-depth analysis is carried out along multiple dimensions; combined with a large amount of technical knowledge and experience, NLP technology converts this information into structured, usable information and extracts the user's opinions and emotional expressions about a given evaluation object. Lexical and syntactic analysis techniques are mainly used to extract features such as the user's viewpoint (including the comment object and the related evaluation words), emotion and focus, which reflect the user's points of attention and subjective feelings. Existing business travel public opinion analysis technology is complex and slow.

Disclosure of Invention

The invention aims to provide a method for analyzing business and travel public sentiments based on script crawler architecture and text analysis, which aims to solve the problems that the business and travel public sentiment analysis technology is complex and the analysis speed is slow in the prior art.

The technical purpose of the invention is realized by the following technical scheme:

a business public opinion analysis method based on script crawler architecture and text analysis comprises the following steps:

acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;

preprocessing the text features and judging whether the text features are sent to a word bank or not;

the word bank obtains a text category according to the text characteristics;

and carrying out public opinion analysis on the text.

In one embodiment, the Chinese text comprises long text and short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.

In one embodiment, acquiring the Chinese text and processing and segmenting it to obtain the text features specifically includes:

initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic numbers corresponding to the keywords; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic numbers of the key sentence patterns; and establishing a mapping table from topic number to topic attribute and topic weight;

reading each keyword to be matched from the keyword table to be matched and adding each word to the prefix tree (trie) of the AC automaton, thereby completing the construction of the word tree;

reading a regular expression corresponding to each sentence pattern from a key sentence pattern table to be matched;

reading in an analysis object page, and extracting a text part of the analysis object page;

scanning the text, matching the keywords appearing in the text, counting the occurrences of each keyword, and looking up the topic number corresponding to each keyword in the keyword table to be matched;

dividing the content of the text part into several sentences according to punctuation or spaces, deleting the sentences whose number of characters is less than a set length (a preset minimum sentence length threshold), and performing key sentence pattern matching on the remaining sentences;

and determining the combination of the subjects of the text part according to the matching result to obtain the text characteristics.
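The keyword-matching part of the steps above can be illustrated with a minimal sketch of an Aho-Corasick automaton. It builds a trie from a keyword table, adds failure links, and counts keyword and topic occurrences in one pass over the text. The keyword table, the topic numbers and all function names are hypothetical and chosen only for illustration; the text does not specify an implementation.

```python
from collections import Counter, deque

def build_automaton(keyword_topics):
    """Build an Aho-Corasick automaton from a {keyword: topic_number} table."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognised when a state is reached
    for word in keyword_topics:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_keywords(text, automaton):
    """Scan the text once and count every keyword occurrence."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(out[state])
    return counts

def topic_frequencies(keyword_counts, keyword_topics):
    """Aggregate keyword counts into counts per topic number."""
    topics = Counter()
    for word, n in keyword_counts.items():
        topics[keyword_topics[word]] += n
    return topics

# Hypothetical keyword table: keyword -> topic number.
KEYWORDS = {"延误": 1, "航班延误": 1, "退票": 2}
automaton = build_automaton(KEYWORDS)
keyword_counts = count_keywords("航班延误导致退票,再次延误", automaton)
topic_counts = topic_frequencies(keyword_counts, KEYWORDS)
```

Overlapping keywords such as "延误" inside "航班延误" are both counted, because each state inherits the outputs of the state its failure link points to.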

In one embodiment, the extracting the text part of the analysis object page specifically includes:

judging the type of the page according to the original website of the page and key codes contained in HTML codes of the page by using a regular expression matching method;

if the page belongs to news or blogs, extracting all page paragraphs, with the page title counted as a separate paragraph of the text; if the page belongs to a forum, for each discussion thread, merging the original post and those replies by the original poster whose word count exceeds a first set word count into one text, and analyzing the other follow-up posts whose word count exceeds a second set word count as separate texts; if the page belongs to a microblog, analyzing each microblog post whose word count exceeds a set word count as a separate text.

In one embodiment, the key sentence pattern matching specifically includes:

reading out a regular expression in a key sentence pattern table to be matched, and matching the sentence with the regular expression;

if the regular expression matches, the sentence is identified as the key sentence pattern corresponding to that regular expression, the topic number corresponding to the sentence pattern is recorded, and the occurrence count of the sentence pattern is increased by 1; if the match fails, the next regular expression is read from the key sentence pattern table to be matched and matched against the sentence, until all regular expressions have been tried.
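As an illustration of sentence splitting followed by key sentence pattern matching, the following minimal sketch uses Python's `re` module. The pattern table, the topic numbers and the minimum sentence length of 4 characters are hypothetical values; the text does not fix them.

```python
import re
from collections import Counter

MIN_SENTENCE_LEN = 4  # hypothetical minimum sentence length threshold

# Hypothetical key sentence pattern table: (regular expression, topic number).
PATTERNS = [
    (re.compile(r"航班.*(延误|取消)"), 1),
    (re.compile(r"(拒绝|不给).*退款"), 2),
]

def split_sentences(text):
    """Split on punctuation or whitespace and drop sentences that are too short."""
    parts = re.split(r"[。！？；，,.!?;\s]+", text)
    return [p for p in parts if len(p) >= MIN_SENTENCE_LEN]

def match_sentence_patterns(text):
    """Return a Counter mapping topic number -> number of matching sentences."""
    counts = Counter()
    for sentence in split_sentences(text):
        for pattern, topic in PATTERNS:
            if pattern.search(sentence):
                counts[topic] += 1
                break  # a sentence is identified by the first pattern it matches
    return counts

pattern_counts = match_sentence_patterns("航班又延误了。客服拒绝给我退款!好的。")
```

Stopping at the first matching pattern follows the description, in which further regular expressions are tried only when the current one fails.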

In one embodiment, the combination of the topics for determining the text part is specifically:

for a long text, if the occurrence frequency of a subject word or a key sentence contained in a subject in the text is not less than a first set frequency, the text part is considered to be related to the subject; for short text, if the number of keywords or sentences contained in a topic appearing in the text is not less than a second set number of times, the text is considered to be related to the topic.
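The threshold rule above can be sketched in a few lines. The default thresholds (3 for long text, 1 for short text) are hypothetical stand-ins for the first and second set numbers of times, which the text leaves open.

```python
def related_topics(topic_counts, is_long_text, long_min=3, short_min=1):
    """Return the set of topic numbers the text is considered related to.

    topic_counts maps topic number -> combined count of matched keywords
    and key sentence patterns for that topic.
    """
    threshold = long_min if is_long_text else short_min
    return {topic for topic, n in topic_counts.items() if n >= threshold}

counts = {1: 3, 2: 1}
long_topics = related_topics(counts, is_long_text=True)
short_topics = related_topics(counts, is_long_text=False)
```

Using a higher threshold for long text reflects that a single incidental keyword in a long article is weaker evidence than the same keyword in a short reply.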

In one embodiment, the preprocessing the text feature and determining whether to send the text feature to the thesaurus specifically includes:

reading the text features and the word bank of the classifier to perform maximum forward matching and maximum reverse matching;

if the maximum forward matching result is consistent with the maximum reverse matching result, determining that the word segmentation is correct, judging a character string formed by segmenting continuous single characters, if the word is a new word, marking the word as the new word, and placing the word in a new word bank; after the segmentation is finished, one text D is represented as a vector:

D＝{W₁，W₂，…，W_n}

wherein W₁，W₂，…，W_n each represent a word, that is, a meaningful character string consisting of one or more Chinese characters;

and if the maximum forward matching result is inconsistent with the maximum reverse matching result, segmenting the text by adopting an improved hidden Markov segmentation method.
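The comparison of maximum forward matching and maximum reverse matching can be sketched as follows. The lexicon and the maximum word length of 4 are hypothetical, and the hidden Markov fallback is not implemented here; the sketch only reports whether the two segmentations agree.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation from left to right, longest lexicon word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always pass
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, lexicon, max_len=4):
    """Greedy segmentation from right to left, longest lexicon word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def segment(text, lexicon):
    """Return the forward segmentation and whether both directions agree."""
    fwd = forward_max_match(text, lexicon)
    rev = reverse_max_match(text, lexicon)
    return fwd, fwd == rev

LEXICON = {"研究", "研究生", "生命", "起源", "酒店", "服务"}  # hypothetical
words_ok, agree_ok = segment("酒店服务", LEXICON)
words_amb, agree_amb = segment("研究生命起源", LEXICON)
```

For "研究生命起源" the forward pass yields 研究生 / 命 / 起源 while the reverse pass yields 研究 / 生命 / 起源, so the two results disagree and the text would be handed to the fallback segmenter.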

In one embodiment, judging the character strings formed by consecutive single characters left over after segmentation, marking a string as a new word if it is one, and placing it in a new word bank specifically includes:

judging the character strings formed by consecutive single characters left over after segmentation: if two or more consecutive characters appear on their own in the same short sentence, the consecutive characters are combined into a word and put into the new word bank; the new word satisfies the following condition:

wherein N_new is the number of occurrences of the new word, N_similar represents the number of similar articles, and P is the new word identification threshold; if the number of occurrences of a word group satisfies the condition of the formula, the word group is considered a new word and is placed into the new-word matching word bank.

In one embodiment, the word stock obtains the text type according to the text feature specifically as follows:

the word bank comprises a Bayesian classifier, and the text features are classified by adopting the Bayesian classifier:

given a text set: D^*＝(D₁，D₂，…，D_{|D^*|})；

wherein |D^*| is the number of texts in the given text set, and D_i (i＝1，2，…，|D^*|) corresponds to each text, respectively;

given a category set: C＝(c₁，c₂，…，c_{|C|})；

wherein |C| is the number of given text categories, and c_i (i＝1，2，…，|C|) corresponds to each text category, respectively;

firstly, generating a conversion function F for text classification, through which a mapping result is obtained for any text in the given text set: F: D → C;

wherein F represents the conversion function and D represents any text of the given text set;

representing document D as a vector: D＝(x₁，x₂，…，x_n)

wherein the feature component x_i (i＝1，2，…，n) denotes the weight of the word W_i in text D, calculated as:

in the formula, tf_i(D) is the frequency of occurrence of W_i in document D; N is the total number of documents; N_i is the number of documents in which W_i appears;

is a normalization factor;
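The weight formula itself is not given in this text. The quantities it defines (term frequency, total number of documents N, number of documents N_i containing the word, and a normalization factor) match a length-normalized TF-IDF weight, so the sketch below assumes the common form x_i = tf_i(D)·log(N/N_i) divided by the Euclidean norm of the document's unnormalized weights. Both this form and the function name are assumptions, not the formula of the method.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of word lists. Returns one {word: weight} dict per document.

    Assumed weight: tf * log(N / N_i), divided by the Euclidean norm of the
    document's unnormalized weights (the normalization factor).
    """
    n_docs = len(documents)
    # N_i: number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        raw = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({w: (v / norm if norm else 0.0) for w, v in raw.items()})
    return vectors

docs = [["航班", "延误", "延误"], ["酒店", "服务"], ["航班", "服务"]]
vectors = tfidf_vectors(docs)
```

In the first document the word "延误" receives a higher weight than "航班", because it occurs twice there and appears in no other document.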

the training set d of the Bayesian classifier is as follows:

d^*＝(d₁，d₂，…，d_s)

wherein each training sample d_i (i＝1，2，…，s) is an n+1 dimensional vector, written as:

D＝(x₁，x₂，…，x_n，c_i)(c_i∈C)

classification means predicting the class of a text D＝(x₁，x₂，…，x_n) whose class is unknown;

for new text, the class conditional probability of its belonging to c is noted as:

the categories of text are:
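As an illustration of the Bayesian classification step, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing, which picks the category with the highest posterior score. The exact probability formulas of the method are not given in this text, so this standard form, the function names and the sample data are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (word_list, category) pairs. Returns the fitted model."""
    class_docs = Counter()             # number of training texts per category
    word_counts = defaultdict(Counter) # word occurrences per category
    vocab = set()
    for words, category in samples:
        class_docs[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(words, model):
    """Return the category maximizing log P(c) + sum of log P(w | c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_category, best_score = None, -math.inf
    for category in class_docs:
        total_words = sum(word_counts[category].values())
        score = math.log(class_docs[category] / total_docs)
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing the product.
            p = (word_counts[category][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical training samples: (segmented words, category).
SAMPLES = [
    (["航班", "延误"], "flight"),
    (["航班", "取消"], "flight"),
    (["酒店", "服务"], "hotel"),
    (["酒店", "卫生"], "hotel"),
]
model = train(SAMPLES)
predicted = classify(["航班", "延误", "退票"], model)
```

Working in log space avoids numeric underflow when a text contains many words.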

in one embodiment, the performing public opinion analysis on the text specifically includes:

calculating a public opinion index AR of the text from the mapping table and the topic combination of the text part obtained above, as a weighted combination of the terms A1·Ss, A2·Sn, A3·Sp, A4·Sl and A5·Sf;

wherein Ss is the weight sum of the sensitive issues appearing in the text, Sn is the weight sum of the negative emotion topics, Sp is the weight sum of the positive emotion topics, Sl is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;

and if the public opinion index is greater than Tr, the text does not contain filtering keywords set by the user, and the area related in the text description content is consistent with the attention area set by the user, the text is regarded as the public opinion information concerned by the user, wherein Tr is a preset minimum threshold value for identifying a certain page as the public opinion.
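The index calculation and the final decision can be sketched as below. The signs of the individual terms cannot be read reliably from this text, so the sketch takes the coefficients A1 to A5 as caller-supplied signed values. The mapping table, coefficient values, threshold and region names are all hypothetical.

```python
def topic_sums(topics, mapping):
    """Sum topic weights per attribute.

    mapping: topic number -> (attribute, weight), where attribute is one of
    "Ss", "Sn", "Sp", "Sl", "Sf".
    """
    sums = {"Ss": 0.0, "Sn": 0.0, "Sp": 0.0, "Sl": 0.0, "Sf": 0.0}
    for topic in topics:
        attribute, weight = mapping[topic]
        sums[attribute] += weight
    return sums

def opinion_index(sums, coeffs):
    """Weighted combination of the five sums; coeffs carry their own signs."""
    return sum(coeffs[name] * sums[name] for name in sums)

def is_public_opinion(text, topics, mapping, coeffs, tr,
                      filter_words, region, watched_regions):
    """Apply the three conditions: index above Tr, no filter keyword, region match."""
    ar = opinion_index(topic_sums(topics, mapping), coeffs)
    return (ar > tr
            and not any(word in text for word in filter_words)
            and region in watched_regions)

# Hypothetical mapping table and signed coefficients (assumed, not from the text).
MAPPING = {1: ("Sn", 2.0), 2: ("Ss", 3.0), 3: ("Sp", 1.0)}
COEFFS = {"Ss": 1.0, "Sn": 0.8, "Sp": -0.5, "Sl": -1.0, "Sf": -1.0}
flagged = is_public_opinion("航班延误", {1, 2, 3}, MAPPING, COEFFS,
                            tr=3.0, filter_words=["广告"],
                            region="广州", watched_regions={"广州"})
```

Passing signed coefficients keeps the sketch neutral about which topic classes raise and which lower the index.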

The invention has the beneficial effects that: the method comprises the steps of processing and segmenting Chinese texts to obtain text characteristics, processing the text characteristics, analyzing the text characteristics through a word bank to obtain text categories, and performing public opinion analysis on the texts; and complicated steps are omitted, and under the condition of ensuring certain accuracy, public opinion analysis is carried out, so that rapid analysis is realized.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram illustrating steps of a method for business public opinion analysis based on script crawler architecture and text analysis;

fig. 2 is a schematic flow chart illustrating a method for analyzing business public sentiment based on script crawler architecture and text analysis.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.

Referring to fig. 1 and fig. 2, a method for analyzing business public sentiment based on script crawler architecture and text analysis according to the present invention is shown, the method comprising the following steps:

100. acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;

In the embodiment of the invention, the Chinese text is obtained through the Scrapy crawler framework, specifically: the scheduler receives the requests sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when the engine asks for them; the downloader downloads all requests sent by the Scrapy engine and returns the obtained responses to the engine, which hands them to the Spider for processing; the items obtained by the Spider are handled by the item pipeline, which performs post-processing (detailed analysis, filtering, storage and the like).
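The request flow described above (engine, scheduler, downloader, spider, pipeline) can be illustrated with a small self-contained simulation. It deliberately does not use the API of any crawler library; the component names mirror the description only, and the in-memory pages and URLs are hypothetical so that the sketch runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> page body.
FAKE_WEB = {
    "http://example.com/list": "LINK http://example.com/post1\nLINK http://example.com/post2",
    "http://example.com/post1": "TEXT 航班延误三小时",
    "http://example.com/post2": "TEXT 酒店服务很好",
}

def downloader(url):
    """Stand-in for the downloader: return the response body for a request."""
    return FAKE_WEB.get(url, "")

def spider(url, body):
    """Stand-in for the spider: yield follow-up requests and extracted items."""
    for line in body.splitlines():
        kind, _, value = line.partition(" ")
        if kind == "LINK":
            yield ("request", value)
        elif kind == "TEXT":
            yield ("item", {"url": url, "text": value})

def pipeline(item, store):
    """Stand-in for the item pipeline: filter empty texts, then store."""
    if item["text"].strip():
        store.append(item)

def crawl(start_url):
    """Engine loop tying the components together."""
    scheduler, seen, store = deque([start_url]), {start_url}, []
    while scheduler:
        url = scheduler.popleft()               # scheduler returns the next request
        body = downloader(url)                  # downloader fetches the response
        for kind, value in spider(url, body):   # spider processes the response
            if kind == "request" and value not in seen:
                seen.add(value)
                scheduler.append(value)         # new requests are queued
            elif kind == "item":
                pipeline(value, store)          # items go to post-processing
    return store

items = crawl("http://example.com/list")
```

The `seen` set plays the role of request de-duplication, so a page linked from several places is downloaded only once.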

In the embodiment of the invention, the Chinese text comprises a long text and a short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.

In the embodiment of the present invention, the processing and segmentation are performed on the text of the Chinese, and the obtained text features specifically are as follows:

In the embodiment of the present invention, the extracting the text part of the analysis object page specifically includes:

In the embodiment of the present invention, the key sentence pattern matching specifically includes:

In the embodiment of the present invention, the combination of the topics for determining the text part specifically includes:

200. Preprocessing the text features and judging whether the text features are sent to a word bank or not;

specifically, reading a word stock of text features and a classifier to perform maximum forward matching and maximum reverse matching;

D＝{W₁，W₂，…，W_n}

In the embodiment of the present invention, judging the character strings formed by consecutive single characters left over after segmentation, marking a string as a new word if it is one, and placing it in a new word bank specifically includes:

judging the character strings formed by consecutive single characters left over after segmentation: if two or more consecutive characters are found to appear on their own in the same short sentence, the consecutive characters are combined into a word and put into the new word bank; the new word satisfies the following condition:

300. The word bank obtains a text category according to the text characteristics;

specifically, the lexicon comprises a Bayesian classifier, and the text features are classified by adopting the Bayesian classifier:

given a text set: D^*＝(D₁，D₂，…，D_{|D^*|})；

given a category set: C＝(c₁，c₂，…，c_{|C|})；

representing document D as a vector: D＝(x₁，x₂，…，x_n)

is a normalization factor;

the training set d of the Bayesian classifier is as follows:

d^*＝(d₁，d₂，…，d_s)

D＝(x₁，x₂，…，x_n，c_i)(c_i∈C)

the categories of text are:

400. and carrying out public opinion analysis on the text.

Specifically, the public opinion index AR of the text is calculated from the mapping table and the topic combination of the text part obtained above, as a weighted combination of the terms A1·Ss, A2·Sn, A3·Sp, A4·Sl and A5·Sf;

The method matches keywords and key sentence patterns in a text by using an Aho-Corasick (AC) automaton and regular expressions, and represents an article as a number of topics according to the matching result; by setting a weight value for each topic and calculating the weight sum of the page, whether the page is public opinion can be judged quickly and accurately.

The invention replaces word matching in simple public opinion analysis with theme matching, omits complex steps such as clustering, classification and the like, performs public opinion analysis under the condition of ensuring certain accuracy, and realizes rapid analysis.

As a preferred embodiment: S1, it is assumed that the activity coefficient of a text is stable within a certain period T and is proportional to the forwarding amount of the text in that period. First, the time period T is divided into N small time units t₁，t₂，…，t_N, each time unit having a length of T/N.

Assuming that the online activity probability of a text conforms to a binomial distribution, the activity coefficient of the text m in the time period T can be expressed as

wherein y_k is a binary variable indicating whether the text published any information in time unit t_k: if the text m was forwarded in that time unit, y_k＝1. n|y_k＝1| denotes the number of time units for which y_k＝1.
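The expression for the activity coefficient is not given in this text. Under the stated binomial assumption, a natural estimate is the fraction of time units in which the text was active, that is, the number of units with y_k＝1 divided by the number of units. The sketch below computes that fraction; this form, the function name and the sample timestamps are assumptions.

```python
def activity_coefficient(event_times, period_start, period_length, n_units):
    """Fraction of equal-length time units that contain at least one event.

    event_times are forwarding timestamps in the same unit as period_length.
    """
    unit = period_length / n_units
    y = [0] * n_units                  # y[k] = 1 if unit k saw any activity
    for t in event_times:
        k = int((t - period_start) // unit)
        if 0 <= k < n_units:           # ignore events outside the period
            y[k] = 1
    return sum(y) / n_units

# Hypothetical forwarding times within a period of length 10 split into 10 units.
coefficient = activity_coefficient([0.5, 0.7, 3.2, 9.9], 0.0, 10.0, 10)
```

Two events falling in the same unit count once, since y_k only records whether any activity occurred in that unit.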

S2, the forwarding coefficient of the text is taken as another very important factor in the model. The correlation degree between the forwarding coefficient of the text and the number of the published messages and the number of friends of the text is assumed to be small. Similar to the estimation process of the text activity degree, assuming that the forwarding probability of a text in a certain time period conforms to the binomial distribution, the forwarding coefficient of the text m in the time period T can be expressed as:

S3, it is easy to observe that the activity coefficient and the forwarding coefficient of a text change over time. In view of this, time is divided into a plurality of time periods T of length α₁, which form a set T_N. It is assumed that the activity coefficient and forwarding coefficient of the text are relatively stable within each time period. In the calculation, the activity coefficient and the forwarding coefficient of each text in each small time period are first calculated separately; the influence between texts can then be obtained by maximum likelihood estimation.

The influence coefficient of the text can be obtained by solving the following optimization problem:

Solving for the influence coefficient γ_mn gives the following calculation formula:

F_mn＝1 denotes that n is a friend of the business travel public opinion propagator m, and M_n is the set of all information published by user n; a MATLAB program can thus be written to solve for the influence coefficient γ_mn.

The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims

1. A business public opinion analysis method based on script crawler architecture and text analysis is characterized in that: the method comprises the following steps:

the word bank obtains a text category according to the text characteristics;

and carrying out public opinion analysis on the text.

2. The method of claim 1, wherein the method comprises the following steps: the Chinese text comprises a long text and a short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.

3. The method of claim 2, wherein the method comprises the following steps: the Chinese text is obtained, processed and segmented, and the obtained text features are specifically as follows:

4. The method of claim 3, wherein the business public opinion analysis method based on script crawler architecture and text analysis comprises: the text part of the extracted analysis object page specifically comprises the following steps:

5. The method of claim 3, wherein the business public opinion analysis method based on script crawler architecture and text analysis comprises: the key sentence pattern matching specifically comprises:

6. The method of claim 3, wherein the business public opinion analysis method based on script crawler architecture and text analysis comprises: the combination of the topics for determining the text part specifically comprises:

7. The method of claim 1, wherein the method comprises the following steps: the preprocessing the text features and judging whether to send the text features to the word bank specifically comprises the following steps:

if the maximum forward matching result is consistent with the maximum reverse matching result, determining that the word segmentation is correct, judging a character string formed by segmenting continuous single characters, if the word is a new word, marking the word as the new word, and placing the word in a new word bank; after the segmentation is finished, one text is represented as a vector:

D＝{W₁，W₂，…，W_n}

wherein D is a text, and W₁，W₂，…，W_n each represent a word, that is, a meaningful character string consisting of one or more Chinese characters;

8. The method of claim 7, wherein the business public opinion analysis method based on script crawler architecture and text analysis comprises: the method comprises the following steps of judging a character string formed by dividing continuous single characters, marking a new word if the character string is the new word, and placing the new word in a new word bank:

wherein N_new is the number of occurrences of the new word, N_similar represents the number of similar articles, and P is the new word identification threshold; if the number of occurrences of a character group satisfies the condition of the formula, the character group is a new word and is put into the new-word matching word bank.

9. The method of claim 1, wherein the method comprises the following steps: the word bank obtains text categories according to text features, and the text categories are specifically as follows:

given a text set: D^*＝(D₁，D₂，…，D_{|D^*|})；

given a category set: C＝(c₁，c₂，…，c_{|C|})；

representing document D as a vector: D＝(x₁，x₂，…，x_n)

is a normalization factor;

the training set d of the Bayesian classifier is as follows:

d^*＝(d₁，d₂，…，d_s)

D＝(x₁，x₂，…，x_n，c_i)(c_i∈C)

p(X'|c_i):

the categories of text are:

10. the method for public opinion analysis based on script crawler architecture and text analysis as claimed in any one of claims 1 to 9, wherein: the public sentiment analysis is carried out on the text, and the method specifically comprises the following steps: