CN112148936A - Business and travel public opinion analysis method based on script crawler framework and text analysis - Google Patents
- Publication number
- CN112148936A (application number CN202011076411.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- word
- analysis
- public opinion
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The invention relates to a business travel public opinion analysis method based on script crawler architecture and text analysis, comprising the following steps: acquiring a Chinese text, and processing and segmenting the Chinese text to obtain text features; preprocessing the text features and determining whether to send them to a word bank; obtaining, by the word bank, a text category according to the text features; and performing public opinion analysis on the text. In this way, complicated steps are omitted and public opinion analysis is performed quickly while a certain accuracy is maintained.
Description
Technical Field
The invention relates to the technical field of public opinion analysis, in particular to a business public opinion analysis method based on script crawler architecture and text analysis.
Background
Business travel public opinion analysis is an important way of understanding users. After watching a video or using a product, a user may express feelings and opinions in various ways, for example about the content of a television variety show, fondness for an actor, views on a local bank, or opinions on a product; mining and analyzing these public opinions shows users' points of interest and subjective feelings more intuitively and clearly.
The content subject to public opinion analysis takes the form of text, pictures, audio and so on, and the data sources mainly include web page data, client data, forum data and other network data. With comprehensive, in-depth analysis along multiple dimensions, combined with extensive technical knowledge and experience, NLP technology can convert this information into structured, usable information and extract the user's opinions and emotional expression about a given evaluation object. Such analysis mainly uses lexical and syntactic analysis to extract features that reflect the user's points of interest and subjective feelings, such as viewpoints (including the object commented on and the related evaluation words), emotions and focus. The existing public opinion analysis technology for business travel is complex and slow.
Disclosure of Invention
The invention aims to provide a business travel public opinion analysis method based on script crawler architecture and text analysis, so as to solve the problem that business travel public opinion analysis technology in the prior art is complex and slow.
The technical purpose of the invention is realized by the following technical scheme:
a business public opinion analysis method based on script crawler architecture and text analysis comprises the following steps:
acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;
preprocessing the text features and judging whether the text features are sent to a word bank or not;
the word bank obtains a text category according to the text characteristics;
and carrying out public opinion analysis on the text.
In one embodiment, the Chinese text comprises long text and short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.
In one embodiment, acquiring the Chinese text, and processing and segmenting it to obtain text features, specifically comprises:
initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic number corresponding to each keyword; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic number of each key sentence pattern; and establishing a mapping table from topic number to topic attribute and topic weight;
reading each keyword from the keyword table to be matched, and adding each keyword to the prefix tree (trie) of the AC (Aho-Corasick) automaton to complete construction of the trie;
reading the regular expression corresponding to each sentence pattern from the key sentence pattern table to be matched;
reading in a page to be analyzed, and extracting the text part of the page;
scanning the text part, matching the keywords appearing in it, counting the occurrences of each keyword, and looking up the topic number of each matched keyword in the keyword table to be matched;
dividing the content of the text part into sentences at punctuation marks or spaces, deleting sentences whose character count is below a preset minimum sentence length threshold, and performing key sentence pattern matching on the remaining sentences;
and determining the combination of topics of the text part according to the matching results, to obtain the text features.
In one embodiment, the extracting the text part of the analysis object page specifically includes:
determining the type of the page by regular expression matching against the page's source website and the key code contained in its HTML;
if the page is news or a blog, extracting all paragraphs of the page and counting the page title as a separate paragraph of the text; if the page belongs to a forum, for each discussion thread, merging the original post and those replies by the original poster whose word count exceeds a first set word count into one text, and analyzing each other follow-up post whose word count exceeds a second set word count as a separate text; if the page belongs to a microblog, analyzing each post whose word count exceeds a set word count as a separate text.
In one embodiment, the key sentence pattern matching specifically includes:
reading a regular expression from the key sentence pattern table to be matched, and matching the sentence against it;
if the match succeeds, the sentence is identified as the key sentence pattern corresponding to that regular expression, the topic number of the sentence pattern is recorded, and the occurrence count of the sentence pattern is increased by 1; if the match fails, the next regular expression is read from the key sentence pattern table and matched against the sentence, until all regular expressions have been tried.
In one embodiment, determining the combination of topics of the text part specifically comprises:
for a long text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a first set number of times, the text is considered related to that topic; for a short text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a second set number of times, the text is considered related to that topic.
In one embodiment, preprocessing the text features and determining whether to send them to the word bank specifically comprises:
reading the text features and the word bank of the classifier, and performing forward maximum matching and reverse maximum matching;
if the forward maximum matching result is consistent with the reverse maximum matching result, the word segmentation is deemed correct; character strings formed by consecutive single characters left over from segmentation are then examined, and any string judged to be a new word is marked as such and placed in a new word bank; after segmentation, a text D is represented as a vector:
$D = \{W_1, W_2, \ldots, W_n\}$
where $W_1, W_2, \ldots, W_n$ each denote a word, that is, a meaningful character string consisting of one or more Chinese characters;
and if the forward maximum matching result is inconsistent with the reverse maximum matching result, the text is segmented using an improved hidden Markov segmentation method.
In one embodiment, examining the character strings formed by consecutive single characters, marking a string as a new word if it is one, and placing it in the new word bank specifically comprises:
examining the character strings formed by consecutive single characters left over from segmentation: if two or more consecutive characters appear on their own within the same short sentence, they are combined into a word and placed in the new word bank; the new word satisfies the following condition:
[formula not reproduced in the source text]
where $N_{new}$ is the number of occurrences of the new word, $N_{similar}$ is the number of similar articles, and P is the new word recognition threshold; if the number of occurrences of a phrase satisfies the condition, the phrase is considered a new word and placed in the new word matching word bank.
In one embodiment, obtaining the text category from the word bank according to the text features specifically comprises:
the word bank comprises a Bayesian classifier, and the text features are classified using the Bayesian classifier:
given a text set: $D^* = (D_1, D_2, \ldots, D_{|D^*|})$;
where $|D^*|$ is the number of texts in the given text set, and $D_i$ ($i = 1, 2, \ldots, |D^*|$) denotes each text;
given a category set: $C = (c_1, c_2, \ldots, c_{|C|})$;
where $|C|$ is the number of text categories, and $c_i$ ($i = 1, 2, \ldots, |C|$) denotes each text category;
first, a conversion function F for text classification is generated, which maps any text in the given text set to a category: $F: D \to c_F$;
where F is the conversion function and D is any text in the given text set;
document D is represented as a vector: $D = (x_1, x_2, \ldots, x_n)$
where the feature component $x_i$ ($i = 1, 2, \ldots, n$) denotes the weight of the word $W_i$ in text D, calculated as:
[formula not reproduced in the source text]
in the formula, $tf_{W_i}(D)$ is the frequency of $W_i$ in document D; N is the total number of documents; $N_i$ is the number of documents containing $W_i$;
the training set $d^*$ of the Bayesian classifier is:
$d^* = (d_1, d_2, \ldots, d_s)$
where each training sample $d_i$ ($i = 1, 2, \ldots, s$) is an (n+1)-dimensional vector, written as:
$d_i = (x_1, x_2, \ldots, x_n, c_i)$, $c_i \in C$
classification means predicting, for a text of unknown class $D = (x_1, x_2, \ldots, x_n)$, the class of D;
for a new text, the class conditional probability of its belonging to c is written as:
[formula not reproduced in the source text]
In one embodiment, performing public opinion analysis on the text specifically comprises:
calculating the public opinion index AR of the text from the mapping table and the combination of topics of the text part obtained above, as a combination of the terms A1·Ss, A2·Sn, A3·Sp, A4·Sl and A5·Sf (the operators joining the terms are not legible in the source text);
where Ss is the weight sum of the sensitive topics appearing in the text, Sn is the weight sum of the negative emotional topics, Sp is the weight sum of the positive emotional topics, Sl is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;
and if the public opinion index is greater than Tr, the text does not contain filtering keywords set by the user, and the area related in the text description content is consistent with the attention area set by the user, the text is regarded as the public opinion information concerned by the user, wherein Tr is a preset minimum threshold value for identifying a certain page as the public opinion.
The invention has the beneficial effects that: the method comprises the steps of processing and segmenting Chinese texts to obtain text characteristics, processing the text characteristics, analyzing the text characteristics through a word bank to obtain text categories, and performing public opinion analysis on the texts; and complicated steps are omitted, and under the condition of ensuring certain accuracy, public opinion analysis is carried out, so that rapid analysis is realized.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram illustrating steps of a method for business public opinion analysis based on script crawler architecture and text analysis;
fig. 2 is a schematic flow chart illustrating a method for analyzing business public sentiment based on script crawler architecture and text analysis.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
Referring to fig. 1 and fig. 2, a method for analyzing business public sentiment based on script crawler architecture and text analysis according to the present invention is shown, the method comprising the following steps:
100. acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;
In the embodiment of the invention, the Chinese text is obtained through the Scrapy crawler framework (the "script crawler" framework referred to above), specifically as follows: the scheduler receives the requests sent by the engine, queues them in a certain order, and returns them to the engine when the engine asks for them; the downloader downloads all requests sent by the Scrapy engine, and the responses obtained are returned to the Scrapy engine and handed to the Spider for processing; the items produced by the Spider are processed through the item pipeline, where post-processing (detailed analysis, filtering, storage and the like) is performed.
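The request flow between engine, scheduler, downloader, Spider and pipeline can be illustrated with a toy in-memory model. This is only a sketch and does not use the crawler framework's real API: `FAKE_WEB`, `download`, `parse`, `pipeline` and `crawl` are hypothetical names, and the "downloader" reads from a dictionary instead of the network.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.test/a": ("hotel review: great stay ", ["http://example.test/b"]),
    "http://example.test/b": ("flight delayed again", ["http://example.test/a"]),
}

def download(url):
    """Downloader: fetch the response for a request (here, a dict lookup)."""
    return FAKE_WEB.get(url)

def parse(url, response):
    """Spider: turn a response into one item plus follow-up requests."""
    text, links = response
    return {"url": url, "text": text}, links

def pipeline(item, store):
    """Item pipeline: post-process (strip whitespace) and store the item."""
    item["text"] = item["text"].strip()
    store.append(item)

def crawl(start_urls):
    """Engine: pull requests from the scheduler queue and route them."""
    scheduler = deque(start_urls)   # scheduler: FIFO queue of pending requests
    seen = set(start_urls)          # avoid queuing the same request twice
    store = []
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        if response is None:
            continue
        item, links = parse(url, response)
        pipeline(item, store)
        for link in links:
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return store

items = crawl(["http://example.test/a"])
```

In a real deployment the framework supplies the engine, scheduler and downloader, and only the Spider's parsing logic and the pipeline's post-processing would be written by hand.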
In the embodiment of the invention, the Chinese text comprises a long text and a short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.
In the embodiment of the present invention, acquiring the Chinese text, and processing and segmenting it to obtain text features, specifically comprises:
initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic number corresponding to each keyword; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic number of each key sentence pattern; and establishing a mapping table from topic number to topic attribute and topic weight;
reading each keyword from the keyword table to be matched, and adding each keyword to the prefix tree (trie) of the AC (Aho-Corasick) automaton to complete construction of the trie;
reading the regular expression corresponding to each sentence pattern from the key sentence pattern table to be matched;
reading in a page to be analyzed, and extracting the text part of the page;
scanning the text part, matching the keywords appearing in it, counting the occurrences of each keyword, and looking up the topic number of each matched keyword in the keyword table to be matched;
dividing the content of the text part into sentences at punctuation marks or spaces, deleting sentences whose character count is below a preset minimum sentence length threshold, and performing key sentence pattern matching on the remaining sentences;
and determining the combination of topics of the text part according to the matching results, to obtain the text features.
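The keyword-matching steps can be sketched with a small Aho-Corasick automaton: a trie of keywords, failure links, and a single scan of the text. This is a minimal illustration, not the patent's implementation; the keyword table `KEYWORD_TOPICS` and its topic numbers are hypothetical.

```python
from collections import Counter, deque

def build_automaton(keywords):
    """Build an Aho-Corasick automaton: trie, failure links and outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # keywords recognized on reaching each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass fills in the failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_keywords(text, automaton):
    """Scan the text once and count every keyword occurrence."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(out[state])
    return counts

# Hypothetical keyword table: keyword -> topic number.
KEYWORD_TOPICS = {"延误": 1, "航班延误": 1, "退款": 2}
automaton = build_automaton(KEYWORD_TOPICS)
counts = count_keywords("航班延误两小时，要求退款，再次延误", automaton)

# Aggregate keyword counts into per-topic counts.
topic_counts = Counter()
for word, n in counts.items():
    topic_counts[KEYWORD_TOPICS[word]] += n
```

The failure links are what let overlapping keywords such as 延误 inside 航班延误 be counted in the same pass without rescanning the text.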
In the embodiment of the present invention, the extracting the text part of the analysis object page specifically includes:
determining the type of the page by regular expression matching against the page's source website and the key code contained in its HTML;
if the page is news or a blog, extracting all paragraphs of the page and counting the page title as a separate paragraph of the text; if the page belongs to a forum, for each discussion thread, merging the original post and those replies by the original poster whose word count exceeds a first set word count into one text, and analyzing each other follow-up post whose word count exceeds a second set word count as a separate text; if the page belongs to a microblog, analyzing each post whose word count exceeds a set word count as a separate text.
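Page-type detection by regular expression might look like the following sketch. The URL and HTML patterns in `PAGE_RULES` are invented for illustration, and falling back to "news" when nothing matches is an assumption the source does not state.

```python
import re

# Hypothetical rules: (page type, URL pattern, pattern for marker code in the HTML).
PAGE_RULES = [
    ("microblog", re.compile(r"weibo\."), re.compile(r'class="weibo-text"')),
    ("forum", re.compile(r"(bbs|forum)\."), re.compile(r'class="post(-body)?"')),
    ("blog", re.compile(r"blog\."), re.compile(r'class="blog-entry"')),
]

def page_type(url, html):
    """Return the first type whose URL or HTML pattern matches, else 'news'."""
    for name, url_re, html_re in PAGE_RULES:
        if url_re.search(url) or html_re.search(html):
            return name
    return "news"  # assumed default when no rule matches
```

For example, `page_type("https://bbs.example.test/t/1", "")` returns `"forum"` because the URL rule matches even though the HTML is empty.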
In the embodiment of the present invention, the key sentence pattern matching specifically includes:
reading a regular expression from the key sentence pattern table to be matched, and matching the sentence against it;
if the match succeeds, the sentence is identified as the key sentence pattern corresponding to that regular expression, the topic number of the sentence pattern is recorded, and the occurrence count of the sentence pattern is increased by 1; if the match fails, the next regular expression is read from the key sentence pattern table and matched against the sentence, until all regular expressions have been tried.
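Sentence splitting and key sentence pattern matching could be sketched as below. The pattern table `SENTENCE_PATTERNS`, its topic numbers and the threshold `MIN_SENTENCE_LEN` are hypothetical values chosen for the example.

```python
import re
from collections import Counter

# Hypothetical key sentence pattern table: (regular expression, topic number).
SENTENCE_PATTERNS = [
    (re.compile(r"(航班|列车).*(取消|延误)"), 1),
    (re.compile(r"(拒绝|不给).*退款"), 2),
]
MIN_SENTENCE_LEN = 4  # assumed minimum sentence length threshold

def match_key_sentences(text):
    """Split text into sentences and count matches per key sentence pattern."""
    sentences = [s for s in re.split(r"[，。！？；,.!?;\s]+", text)
                 if len(s) >= MIN_SENTENCE_LEN]
    pattern_counts = Counter()  # pattern index -> number of matching sentences
    topics = set()              # topic numbers of the matched patterns
    for sentence in sentences:
        for index, (regex, topic) in enumerate(SENTENCE_PATTERNS):
            if regex.search(sentence):
                pattern_counts[index] += 1
                topics.add(topic)
                break  # sentence identified; continue with the next sentence
    return pattern_counts, topics

pattern_counts, topics = match_key_sentences("航班临时取消了。好的。酒店拒绝给我退款！")
```

The short sentence 好的 is dropped by the length filter before any regular expression is tried, which is where the threshold saves matching work.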
In the embodiment of the present invention, determining the combination of topics of the text part specifically comprises:
for a long text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a first set number of times, the text is considered related to that topic; for a short text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a second set number of times, the text is considered related to that topic.
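The threshold rule for long and short texts reduces to a small filter. In this sketch the default thresholds `long_min` and `short_min` are placeholder values, since the source only calls them the first and second set number of times.

```python
def related_topics(topic_counts, is_long_text, long_min=3, short_min=1):
    """Return the topics whose match count reaches the threshold for the text type."""
    threshold = long_min if is_long_text else short_min
    return {topic for topic, n in topic_counts.items() if n >= threshold}
```

With counts `{1: 3, 2: 1}`, a long text is related only to topic 1 under these placeholder thresholds, while a short text is related to both topics.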
200. Preprocessing the text features and judging whether the text features are sent to a word bank or not;
specifically, reading the text features and the word bank of the classifier, and performing forward maximum matching and reverse maximum matching;
if the forward maximum matching result is consistent with the reverse maximum matching result, the word segmentation is deemed correct; character strings formed by consecutive single characters left over from segmentation are then examined, and any string judged to be a new word is marked as such and placed in a new word bank; after segmentation, a text D is represented as a vector:
$D = \{W_1, W_2, \ldots, W_n\}$
where $W_1, W_2, \ldots, W_n$ each denote a word, that is, a meaningful character string consisting of one or more Chinese characters;
and if the forward maximum matching result is inconsistent with the reverse maximum matching result, the text is segmented using an improved hidden Markov segmentation method.
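Forward and reverse maximum matching, and the consistency check between them, can be sketched as follows. The word bank `LEXICON` and the maximum word length `max_len` are assumptions for the example; the hidden Markov fallback is not shown.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation from the left, preferring the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always pass
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, lexicon, max_len=4):
    """Greedy segmentation from the right, preferring the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def segment(text, lexicon):
    """Return both segmentations and whether they agree."""
    fwd = forward_max_match(text, lexicon)
    rev = reverse_max_match(text, lexicon)
    return fwd, rev, fwd == rev

# Hypothetical word bank.
LEXICON = {"研究", "研究生", "生命", "起源", "酒店", "退款"}
fwd, rev, agree = segment("研究生命起源", LEXICON)
```

Here the two directions disagree (研究生 / 命 / 起源 versus 研究 / 生命 / 起源), which is the case the method hands over to the hidden Markov segmenter.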
In the embodiment of the present invention, examining the character strings formed by consecutive single characters, marking a string as a new word if it is one, and placing it in the new word bank specifically comprises:
examining the character strings formed by consecutive single characters left over from segmentation: if two or more consecutive characters appear on their own within the same short sentence, they are combined into a word and placed in the new word bank; the new word satisfies the following condition:
[formula not reproduced in the source text]
where $N_{new}$ is the number of occurrences of the new word, $N_{similar}$ is the number of similar articles, and P is the new word recognition threshold; if the number of occurrences of a phrase satisfies the condition, the phrase is considered a new word and placed in the new word matching word bank.
300. The word bank obtains a text category according to the text characteristics;
specifically, the word bank comprises a Bayesian classifier, and the text features are classified using the Bayesian classifier:
given a text set: $D^* = (D_1, D_2, \ldots, D_{|D^*|})$;
where $|D^*|$ is the number of texts in the given text set, and $D_i$ ($i = 1, 2, \ldots, |D^*|$) denotes each text;
given a category set: $C = (c_1, c_2, \ldots, c_{|C|})$;
where $|C|$ is the number of text categories, and $c_i$ ($i = 1, 2, \ldots, |C|$) denotes each text category;
first, a conversion function F for text classification is generated, which maps any text in the given text set to a category: $F: D \to c_F$;
where F is the conversion function and D is any text in the given text set;
document D is represented as a vector: $D = (x_1, x_2, \ldots, x_n)$
where the feature component $x_i$ ($i = 1, 2, \ldots, n$) denotes the weight of the word $W_i$ in text D, calculated as:
[formula not reproduced in the source text]
in the formula, $tf_{W_i}(D)$ is the frequency of $W_i$ in document D; N is the total number of documents; $N_i$ is the number of documents containing $W_i$;
the training set $d^*$ of the Bayesian classifier is:
$d^* = (d_1, d_2, \ldots, d_s)$
where each training sample $d_i$ ($i = 1, 2, \ldots, s$) is an (n+1)-dimensional vector, written as:
$d_i = (x_1, x_2, \ldots, x_n, c_i)$, $c_i \in C$
classification means predicting, for a text of unknown class $D = (x_1, x_2, \ldots, x_n)$, the class of D;
for a new text, the class conditional probability of its belonging to c is written as:
[formula not reproduced in the source text]
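Because the weight and probability formulas are not reproduced in this text, the following sketch substitutes common textbook forms and should not be read as the patent's exact method: the weight is assumed to be tf × log(N / N_i), and the classifier is assumed to be a multinomial naive Bayes with add-one smoothing. The training samples and labels are invented.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)                                              # N
    doc_freq = Counter(word for doc in docs for word in set(doc))   # N_i
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                           # tf of W_i in D
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors

def train_naive_bayes(samples):
    """samples: list of (token list, class label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in samples:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict(tokens, model):
    """Return the class maximizing log P(c) + sum of log P(w | c)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for w in tokens:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training samples: (segmented words, category).
samples = [
    (["航班", "延误", "投诉"], "negative"),
    (["酒店", "退款", "投诉"], "negative"),
    (["酒店", "服务", "满意"], "positive"),
    (["航班", "准点", "满意"], "positive"),
]
model = train_naive_bayes(samples)
label = predict(["延误", "投诉"], model)
vectors = tfidf_vectors([tokens for tokens, _ in samples])
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied for a long text.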
400. and carrying out public opinion analysis on the text.
Specifically, the public opinion index AR of the text is calculated from the mapping table and the combination of topics of the text part obtained above, as a combination of the terms A1·Ss, A2·Sn, A3·Sp, A4·Sl and A5·Sf (the operators joining the terms are not legible in the source text);
where Ss is the weight sum of the sensitive topics appearing in the text, Sn is the weight sum of the negative emotional topics, Sp is the weight sum of the positive emotional topics, Sl is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;
and if the public opinion index is greater than Tr, the text does not contain filtering keywords set by the user, and the area related in the text description content is consistent with the attention area set by the user, the text is regarded as the public opinion information concerned by the user, wherein Tr is a preset minimum threshold value for identifying a certain page as the public opinion.
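The index and the three-part decision rule can be sketched as below. The signs of the terms are an assumption, since they cannot be read from the source formula: here sensitive, negative and positive topics raise the index, while non-public-opinion and overseas topics lower it. The topic table, weights and coefficients are all hypothetical.

```python
# Hypothetical mapping table: topic number -> (attribute, weight).
TOPIC_TABLE = {
    1: ("sensitive", 5.0),
    2: ("negative", 3.0),
    3: ("positive", 2.0),
    4: ("non_public_opinion", 4.0),
    5: ("overseas", 4.0),
}
# Assumed signed coefficients A1..A5; the signs are NOT confirmed by the source.
COEFFICIENTS = {
    "sensitive": 1.0,
    "negative": 0.8,            # A2 > A3, as the text requires
    "positive": 0.5,
    "non_public_opinion": -1.0,
    "overseas": -1.0,
}

def public_opinion_index(topics):
    """AR: coefficient-weighted combination of per-attribute weight sums."""
    sums = {attr: 0.0 for attr in COEFFICIENTS}   # Ss, Sn, Sp, Sl, Sf
    for topic in topics:
        attr, weight = TOPIC_TABLE[topic]
        sums[attr] += weight
    return sum(COEFFICIENTS[attr] * s for attr, s in sums.items())

def is_public_opinion(text, topics, region, tr, filter_words, watched_regions):
    """Apply the three conditions: AR > Tr, no filter keyword, region of interest."""
    if public_opinion_index(topics) <= tr:
        return False
    if any(word in text for word in filter_words):
        return False
    return region in watched_regions
```

Because the index is a plain weighted sum over already-matched topics, the decision costs a handful of additions per page, which is the source of the claimed speed.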
The method matches keywords and key sentence patterns in the text using an AC (Aho-Corasick) automaton and regular expressions, and represents an article as a set of topics according to the matching results; by assigning a weight to each topic and computing the weight sum for the page, it can quickly and accurately determine whether the page constitutes public opinion.
The invention replaces word matching in simple public opinion analysis with theme matching, omits complex steps such as clustering, classification and the like, performs public opinion analysis under the condition of ensuring certain accuracy, and realizes rapid analysis.
As a preferred embodiment, S1: it is assumed that the activity coefficient of a text is stable within a given time period T and is proportional to the number of times the text is forwarded in that period. First, the time period T is divided into n small time units $t_1, t_2, \ldots, t_n$, each time unit having a length of [expression not reproduced in the source text]. Assuming that the probability of a text being active online follows a binomial distribution, the activity coefficient of text m in the time period T can be expressed as:
[formula not reproduced in the source text]
where $y_k$ is a binary variable indicating whether the text published any information in $t_k$. If text m was forwarded in the time unit, $y_k = 1$. $n_{|y_k = 1|}$ denotes the number of time units for which $y_k = 1$.
S2: the forwarding coefficient of the text is another very important factor in the model. The forwarding coefficient of a text is assumed to be only weakly correlated with the number of messages published and the number of friends. As in the estimation of the activity coefficient, assuming that the forwarding probability of a text in a given time period follows a binomial distribution, the forwarding coefficient of text m in the time period T can be expressed as:
[formula not reproduced in the source text]
S3: it is easy to observe that the activity coefficient and the forwarding coefficient of a text change over time. In view of this, time is divided into a number of time periods T of length $\alpha_1$, which form a set $T_N$. It is assumed that the activity coefficient and the forwarding coefficient of the text are relatively stable within each time period. In the calculation, the activity coefficient and the forwarding coefficient of each text in each time period are first computed separately. The influence between texts can then be obtained by maximum likelihood estimation.
The influence coefficient of the text can be obtained by solving the following optimization problem:
[formula not reproduced in the source text]
Solving for the influence coefficient $\gamma_{mn}$ gives the following formula:
[formula not reproduced in the source text]
where $F_{mn} = 1$ denotes that n is a friend of the business travel public opinion propagator m, and $M_n$ is the set of all information published by user n; a MATLAB program can thus be written to solve for the influence coefficient $\gamma_{mn}$.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.
Claims (10)
1. A business public opinion analysis method based on script crawler architecture and text analysis is characterized in that: the method comprises the following steps:
acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;
preprocessing the text features and judging whether the text features are sent to a word bank or not;
the word bank obtains a text category according to the text characteristics;
and carrying out public opinion analysis on the text.
2. The method of claim 1, wherein the method comprises the following steps: the Chinese text comprises a long text and a short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.
3. The method of claim 2, wherein acquiring the Chinese text and processing and segmenting it to obtain the text features specifically comprises:
initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic number corresponding to each keyword; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic number of each key sentence pattern; and establishing a mapping table from topic number to topic attribute and topic weight;
reading each keyword to be matched from the keyword table to be matched, and adding each keyword to the prefix tree (trie) of an AC automaton, thereby completing construction of the trie;
reading the regular expression corresponding to each sentence pattern from the key sentence pattern table to be matched;
reading in a page to be analyzed, and extracting the text part of the page;
scanning the text part, matching the keywords that appear in it, counting the number of occurrences of each keyword, and looking up the topic number of each matched keyword in the keyword table to be matched;
dividing the content of the text part into sentences at punctuation marks or spaces, deleting sentences whose character count is below a preset minimum sentence length threshold, and performing key sentence pattern matching on the remaining sentences;
and determining the combination of topics of the text part according to the matching results, to obtain the text features.
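The keyword-scanning step of claim 3 can be illustrated with a small Aho-Corasick automaton. The following Python sketch is only an illustration under stated assumptions: the function names `build_automaton` and `count_keywords`, the sample keywords, and the topic numbers are hypothetical and do not come from the patent text.

```python
from collections import Counter, deque

def build_automaton(keywords):
    """Build a trie with failure links (Aho-Corasick) from the keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognised when a state is reached
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: children of the root keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_keywords(text, automaton):
    """Scan the text once and count every keyword occurrence."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        counts.update(out[state])
    return counts

# Hypothetical keyword table: keyword -> topic number.
keyword_topics = {"航班延误": 1, "延误": 1, "退票": 2}
automaton = build_automaton(keyword_topics)
hits = count_keywords("航班延误了,我要退票,延误太久", automaton)

# Aggregate keyword hits per topic number.
topic_hits = Counter()
for word, n in hits.items():
    topic_hits[keyword_topics[word]] += n
```

A single pass over the text finds overlapping keywords as well, which is why "延误" is also counted inside "航班延误" in this sketch.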
4. The method of claim 3, wherein extracting the text part of the page to be analyzed specifically comprises:
judging the type of the page, by regular expression matching, according to the original website of the page and key codes contained in the HTML code of the page;
if the page belongs to news or a blog, extracting all page paragraphs and treating the page title as a separate paragraph of the text; if the page belongs to a forum, for each discussion thread, merging the original post and those replies by the original poster whose word count exceeds a first set word count into one text, and analyzing each other follow-up post whose word count exceeds a second set word count as a separate text; if the page belongs to a microblog, analyzing each post whose word count exceeds a set word count as a separate text.
5. The method of claim 3, wherein the key sentence pattern matching specifically comprises:
reading a regular expression from the key sentence pattern table to be matched, and matching the sentence against the regular expression;
if the match succeeds, identifying the sentence as the key sentence pattern corresponding to the regular expression, recording the topic number corresponding to that sentence pattern, and increasing the occurrence count of that sentence pattern by 1; if the match fails, reading the next regular expression from the key sentence pattern table to be matched and matching the sentence against it, until all regular expressions have been tried.
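The sentence splitting of claim 3 and the pattern matching loop of claim 5 might look roughly as follows in Python. This is a minimal sketch, not the claimed implementation: the names `split_sentences` and `match_sentence_patterns`, the threshold value `MIN_SENTENCE_LEN`, and the two sample patterns are assumptions chosen for illustration.

```python
import re
from collections import Counter

MIN_SENTENCE_LEN = 4  # assumed minimum sentence length threshold

def split_sentences(text, min_len=MIN_SENTENCE_LEN):
    """Split at punctuation or whitespace and drop sentences that are too short."""
    parts = re.split(r"[。!?;,,.!?;\s]+", text)
    return [p for p in parts if len(p) >= min_len]

def match_sentence_patterns(sentences, pattern_table):
    """pattern_table is a list of (compiled regex, topic number) pairs.

    Each sentence is tried against the patterns in order; the first
    pattern that matches records its topic and ends the search.
    """
    topic_counts = Counter()
    for sentence in sentences:
        for regex, topic in pattern_table:
            if regex.search(sentence):
                topic_counts[topic] += 1
                break
    return topic_counts

# Hypothetical key sentence pattern table.
pattern_table = [
    (re.compile(r"航班.*取消"), 1),
    (re.compile(r"态度.*差"), 2),
]
sentences = split_sentences("航班取消了没有任何通知。好的。客服态度非常差!")
pattern_hits = match_sentence_patterns(sentences, pattern_table)
```

Here the two-character sentence "好的" falls below the assumed threshold and is never matched against the patterns.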
6. The method of claim 3, wherein determining the combination of topics of the text part specifically comprises:
for a long text, if the number of occurrences in the text of the keywords or key sentence patterns belonging to a topic is not less than a first set number of times, the text is considered to be related to that topic; for a short text, if the number of occurrences in the text of the keywords or key sentence patterns belonging to a topic is not less than a second set number of times, the text is considered to be related to that topic.
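The thresholding in claim 6 can be sketched as below. The claim does not say whether keyword hits and sentence pattern hits are counted together or separately; this sketch assumes they are summed per topic. The function name `related_topics` and the default thresholds are hypothetical.

```python
from collections import Counter

def related_topics(keyword_hits, pattern_hits, is_long_text,
                   long_min=3, short_min=1):
    """Return the topic numbers whose hit count reaches the threshold.

    keyword_hits and pattern_hits map topic number -> occurrence count.
    long_min and short_min stand in for the first and second set numbers
    of times; their values here are placeholders.
    """
    totals = Counter(keyword_hits) + Counter(pattern_hits)
    threshold = long_min if is_long_text else short_min
    return {topic for topic, n in totals.items() if n >= threshold}

long_result = related_topics({1: 3, 2: 1}, {2: 1}, is_long_text=True)
short_result = related_topics({1: 3, 2: 1}, {2: 1}, is_long_text=False)
```

Using a lower threshold for short texts reflects that a forum reply or microblog post has far fewer words in which a topic can recur.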
7. The method of claim 1, wherein preprocessing the text features and judging whether to send them to the word bank specifically comprises:
reading the text features and the word bank of the classifier, and performing forward maximum matching and reverse maximum matching;
if the forward maximum matching result is consistent with the reverse maximum matching result, determining that the word segmentation is correct, and examining each character string formed by consecutive single characters left over by the segmentation; if the string is a new word, marking it as a new word and placing it in a new word bank; after segmentation is finished, a text is represented as a vector:
$D = \{W_1, W_2, \ldots, W_n\}$
where $D$ is a text and each of $W_1, W_2, \ldots, W_n$ is a word, that is, a meaningful character string consisting of one or more Chinese characters;
and if the forward maximum matching result is inconsistent with the reverse maximum matching result, segmenting the text using an improved hidden Markov segmentation method.
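The agreement check between forward and reverse maximum matching in claim 7 can be sketched as follows. The function names, the sample lexicon, and the window size `max_len` are assumptions for illustration; the improved hidden Markov fallback used when the two results disagree is not sketched here.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation, longest lexicon word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation, longest lexicon word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def segment(text, lexicon):
    """Return the forward result and whether both directions agree."""
    fwd = forward_max_match(text, lexicon)
    rev = reverse_max_match(text, lexicon)
    return fwd, fwd == rev

lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical word bank
words_a, agree_a = segment("生命起源", lexicon)
words_b, agree_b = segment("研究生命起源", lexicon)
```

In the second call the two directions disagree (the forward pass takes "研究生" first), which is the case the claim hands over to the hidden Markov segmentation.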
8. The method of claim 7, wherein examining each character string formed by consecutive single characters left over by the segmentation, marking it as a new word if it is one, and placing it in the new word bank specifically comprises:
examining the character strings formed by consecutive single characters; if two or more consecutive single characters appear independently in the same short sentence, combining them into a word and placing the word in the new word bank; the new word satisfies the following condition:
where $N_{new}$ is the number of occurrences of the new word, $N_{similar}$ is the number of similar articles, and $P$ is the new word identification threshold; if the number of occurrences of a character group satisfies the condition, the character group is a new word and is placed in the new word matching word bank.
9. The method of claim 1, wherein obtaining, by the word bank, the text category according to the text features specifically comprises:
the word bank comprises a Bayesian classifier, and the text features are classified using the Bayesian classifier:
given a text set: $D^* = (D_1, D_2, \ldots, D_{|D^*|})$;
where $|D^*|$ is the number of texts in the given text set, and $D_i$ $(i = 1, 2, \ldots, |D^*|)$ corresponds to each text;
given a category set: $C = (c_1, c_2, \ldots, c_{|C|})$;
where $|C|$ is the number of given text categories, and $c_i$ $(i = 1, 2, \ldots, |C|)$ corresponds to each text category;
first, a conversion function $F$ for text classification is generated, and a mapping result is obtained through the conversion function for any text in the given text set: $F: D \rightarrow c_F$;
where $F$ denotes the conversion function and $D$ denotes any text of the given text set;
a document $D$ is represented as a vector: $D = (x_1, x_2, \ldots, x_n)$
where the feature component $x_i$ $(i = 1, 2, \ldots, n)$ denotes the weight of the word $W_i$ in text $D$, calculated as:
in the formula, $tf_{W_i}(D)$ is the frequency of occurrence of $W_i$ in document $D$; $N$ is the total number of documents; $N_i$ is the number of documents in which $W_i$ occurs;
the training set $d^*$ of the Bayesian classifier is:
$d^* = (d_1, d_2, \ldots, d_s)$
where each training sample $d_i$ $(i = 1, 2, \ldots, s)$ is an $(n+1)$-dimensional vector, written as:
$d_i = (x_1, x_2, \ldots, x_n, c_i)$, with $c_i \in C$
classification consists of predicting the class of a text $D = (x_1, x_2, \ldots, x_n)$ whose class is unknown;
for a new text $X'$, the class-conditional probability given category $c_i$ is denoted:
$p(X' \mid c_i)$:
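The classification of claim 9 can be illustrated with a small Python sketch. Two points are assumptions, since the corresponding formulas did not survive in the text above: the weight is taken to be the common TF-IDF form $x_i = tf_{W_i}(D) \cdot \log(N / N_i)$, and the class-conditional probability is estimated with a multinomial naive Bayes model using add-one smoothing on raw word counts rather than on the weights. All function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vector(doc, corpus):
    """Assumed weighting: tf(W_i, D) * log(N / N_i) for each word of doc."""
    n_docs = len(corpus)
    vector = {}
    for word, tf in Counter(doc).items():
        n_i = sum(1 for d in corpus if word in d)
        if n_i:  # words unseen in the corpus get no weight
            vector[word] = tf * math.log(n_docs / n_i)
    return vector

def train_naive_bayes(samples):
    """samples is a list of (word list, category) training pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, category in samples:
        class_docs[category] += 1
        word_counts[category].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(words, model):
    """Pick the category with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for category in class_docs:
        score = math.log(class_docs[category] / total_docs)
        denom = sum(word_counts[category].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[category][w] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

samples = [
    (["航班", "延误", "投诉"], "negative"),
    (["航班", "取消", "投诉"], "negative"),
    (["酒店", "干净", "满意"], "positive"),
]
model = train_naive_bayes(samples)
corpus = [words for words, _ in samples]
weights = tfidf_vector(corpus[0], corpus)
```

Working in log space avoids numerical underflow when a text contains many words, and the smoothing keeps a single unseen word from zeroing out a category.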
10. The method of any one of claims 1 to 9, wherein performing public opinion analysis on the text specifically comprises:
calculating a public opinion index of the text by the following formula, based on the mapping table and the combination of topics of the text part obtained above: AR = A1Ss _ A2Sn _ A3Sp _ A4S1 _ A5Sf;
where Ss is the weight sum of the sensitive topics appearing in the text, Sn is the weight sum of the negative emotional topics, Sp is the weight sum of the positive emotional topics, S1 is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;
and if the public opinion index is greater than Tr, the text does not contain any filtering keyword set by the user, and the region involved in the content described by the text is consistent with the region of interest set by the user, the text is regarded as public opinion information of concern to the user, where Tr is a preset minimum threshold for identifying a page as public opinion.
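The index calculation and the final filter of claim 10 can be sketched as below. The operators joining the five terms are not legible in the formula above, so the sketch lets the caller pass signed coefficients; the example values (sensitive and negative terms raising the index, the others lowering it) are purely an assumption, as are the attribute labels, the function names, and the reading of "consistent region" as a non-empty overlap.

```python
ATTRIBUTES = ("sensitive", "negative", "positive", "non_opinion", "overseas")

def public_opinion_index(topics, topic_table, signed_coeffs):
    """Weighted sum over the five topic attributes.

    topic_table maps topic number -> (attribute, weight), standing in for
    the mapping table of claim 3; signed_coeffs maps attribute -> coefficient
    with its sign already applied.
    """
    sums = dict.fromkeys(ATTRIBUTES, 0.0)
    for topic in topics:
        attribute, weight = topic_table[topic]
        sums[attribute] += weight
    return sum(signed_coeffs[a] * sums[a] for a in ATTRIBUTES)

def is_of_interest(text, topics, regions, topic_table, signed_coeffs,
                   threshold, filter_words, watched_regions):
    """Apply the three conditions: index, filter keywords, region."""
    index = public_opinion_index(topics, topic_table, signed_coeffs)
    has_filtered_word = any(word in text for word in filter_words)
    region_ok = bool(set(regions) & set(watched_regions))
    return index > threshold and not has_filtered_word and region_ok

# Hypothetical mapping table and coefficients.
topic_table = {1: ("sensitive", 2.0), 2: ("negative", 1.5), 3: ("positive", 1.0)}
signed_coeffs = {"sensitive": 1.0, "negative": 0.8, "positive": -0.5,
                 "non_opinion": -0.3, "overseas": -0.2}
index = public_opinion_index({1, 2, 3}, topic_table, signed_coeffs)
flagged = is_of_interest("航班延误引发投诉", {1, 2, 3}, ["上海"], topic_table,
                         signed_coeffs, threshold=2.0,
                         filter_words=["广告"], watched_regions=["上海", "北京"])
```

Keeping the coefficients outside the function makes it straightforward to tune them, or to correct the signs, without touching the scoring logic.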
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011076411.0A CN112148936A (en) | 2020-10-10 | 2020-10-10 | Business and travel public opinion analysis method based on script crawler framework and text analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112148936A (en) | 2020-12-29 |
Family
ID=73952811
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011076411.0A Pending CN112148936A (en) | 2020-10-10 | 2020-10-10 | Business and travel public opinion analysis method based on script crawler framework and text analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112148936A (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150432A (en) * | 2013-03-07 | 2013-06-12 | 宁波成电泰克电子信息技术发展有限公司 | Method for internet public opinion analysis |
CN107045524A (en) * | 2016-12-30 | 2017-08-15 | 中央民族大学 | A kind of method and system of network text public sentiment classification |
CN107066585A (en) * | 2017-04-17 | 2017-08-18 | 济南大学 | A kind of probability topic calculates the public sentiment monitoring method and system with matching |
CN107908694A (en) * | 2017-11-01 | 2018-04-13 | 平安科技(深圳)有限公司 | Public sentiment clustering method, application server and the computer-readable recording medium of internet news |
CN108536667A (en) * | 2017-03-06 | 2018-09-14 | 中国移动通信集团广东有限公司 | Chinese text recognition methods and device |
CN108563667A (en) * | 2018-01-05 | 2018-09-21 | 武汉虹旭信息技术有限责任公司 | Hot issue acquisition system based on new word identification and its method |
CN109871443A (en) * | 2018-12-25 | 2019-06-11 | 杭州茂财网络技术有限公司 | A kind of short text classification method and device based on book keeping operation scene |
CN110990565A (en) * | 2019-11-20 | 2020-04-10 | 广州商品清算中心股份有限公司 | Extensible text analysis system and method for public sentiment analysis |
CN111221962A (en) * | 2019-11-18 | 2020-06-02 | 重庆邮电大学 | Text emotion analysis method based on new word expansion and complex sentence pattern expansion |
Non-Patent Citations (1)
Title |
---|
Bai Shukui et al.: "一种舆情分析中的文本分类方法" (A text classification method in public opinion analysis), 信息技术 (Information Technology), no. 03, 31 December 2013 (2013-12-31), pages 9 - 12 * |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN107291780B (en) | User comment information display method and device | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
CN110232149B (en) | Hot event detection method and system | |
CN109933664B (en) | Fine-grained emotion analysis improvement method based on emotion word embedding | |
CN108628833B (en) | Method and device for determining summary of original content and method and device for recommending original content | |
CN111368038B (en) | Keyword extraction method and device, computer equipment and storage medium | |
CN110888990B (en) | Text recommendation method, device, equipment and medium | |
CN107273348B (en) | Topic and emotion combined detection method and device for text | |
JP2012027845A (en) | Information processor, relevant sentence providing method, and program | |
CN107506472B (en) | Method for classifying browsed webpages of students | |
CN111160019B (en) | Public opinion monitoring method, device and system | |
CN110287314B (en) | Long text reliability assessment method and system based on unsupervised clustering | |
CN112989208B (en) | Information recommendation method and device, electronic equipment and storage medium | |
CN111538828A (en) | Text emotion analysis method and device, computer device and readable storage medium | |
CN114238573A (en) | Information pushing method and device based on text countermeasure sample | |
CN115017303A (en) | Method, computing device and medium for enterprise risk assessment based on news text | |
Peng et al. | High quality information extraction and query-oriented summarization for automatic query-reply in social network | |
CN116362811A (en) | Automatic advertisement delivery management system based on big data | |
CN112307336A (en) | Hotspot information mining and previewing method and device, computer equipment and storage medium | |
CN115329085A (en) | Social robot classification method and system | |
CN115017302A (en) | Public opinion monitoring method and public opinion monitoring system | |
CN112966103B (en) | Mixed attention mechanism text title matching method based on multi-task learning | |
KR101652433B1 (en) | Behavioral advertising method according to the emotion that are acquired based on the extracted topics from SNS document | |
CN107291686B (en) | Method and system for identifying emotion identification | |
CN108427769B (en) | Character interest tag extraction method based on social network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||