CN112148936A - Business and travel public opinion analysis method based on script crawler framework and text analysis - Google Patents


Info

Publication number
CN112148936A
Authority
CN
China
Prior art keywords
text
word
analysis
public opinion
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011076411.0A
Other languages
Chinese (zh)
Inventor
苏如春
孙少峰
练镜锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Hantele Communication Co ltd
Original Assignee
Guangzhou Hantele Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Hantele Communication Co ltd
Priority to CN202011076411.0A
Publication of CN112148936A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention relates to a business travel public opinion analysis method based on a script crawler architecture and text analysis, comprising the following steps: acquiring a Chinese text, and processing and segmenting it to obtain text features; preprocessing the text features and determining whether to send them to a word bank; obtaining, by the word bank, a text category according to the text features; and performing public opinion analysis on the text. The method omits complicated steps and performs public opinion analysis while maintaining a certain accuracy, thereby achieving rapid analysis.

Description

Business and travel public opinion analysis method based on script crawler framework and text analysis
Technical Field
The invention relates to the technical field of public opinion analysis, in particular to a business public opinion analysis method based on script crawler architecture and text analysis.
Background
Public opinion analysis for business travel is an important way of understanding users. After watching a video or using a product, users express their feelings and opinions in various ways, for example about the content of a television variety show, their fondness for an actor, their view of a local bank, or their opinion of a product. Mining and analyzing these opinions displays users' points of interest and subjective feelings more intuitively and clearly.
The content subject to public opinion analysis includes text, pictures, audio, and other forms, and the data sources mainly include web page data, client data, forum data, and other network data. Through comprehensive, in-depth analysis along multiple dimensions, combined with extensive technical knowledge and experience, NLP techniques can convert this information into structured, usable information and extract users' opinions and emotional expressions about a given evaluation object. Such analysis mainly uses lexical and syntactic analysis to extract features that reflect users' points of interest and subjective feelings, such as opinions (including the object commented on and the associated evaluation words), emotions, and points of focus. Existing public opinion analysis technology for business travel is complex and slow.
Disclosure of Invention
The invention aims to provide a method for analyzing business and travel public sentiments based on script crawler architecture and text analysis, which aims to solve the problems that the business and travel public sentiment analysis technology is complex and the analysis speed is slow in the prior art.
The technical purpose of the invention is realized by the following technical scheme:
a business public opinion analysis method based on script crawler architecture and text analysis comprises the following steps:
acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;
preprocessing the text features and judging whether the text features are sent to a word bank or not;
the word bank obtains a text category according to the text characteristics;
and carrying out public opinion analysis on the text.
In one embodiment, the Chinese text comprises long text and short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.
In one embodiment, acquiring the Chinese text and processing and segmenting it to obtain the text features specifically includes:
initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic number corresponding to each keyword; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic number of each key sentence pattern; and establishing a mapping table from topic number to topic attribute and topic weight;
reading each keyword to be matched from the keyword table to be matched, and adding each keyword to the prefix tree (trie) of the Aho-Corasick (AC) automaton to complete construction of the trie;
reading the regular expression corresponding to each sentence pattern from the key sentence pattern table to be matched;
reading in a page to be analyzed, and extracting the text part of the page;
scanning the text, matching the keywords that appear in it, counting the occurrences of each keyword, and looking up the topic number of each matched keyword in the keyword table;
dividing the content of the text part into sentences at punctuation marks or spaces, deleting sentences whose character count is below a preset minimum sentence length threshold, and performing key sentence pattern matching on the remaining sentences;
and determining the combination of topics of the text part according to the matching results to obtain the text features.
In one embodiment, the extracting the text part of the analysis object page specifically includes:
determining the type of the page by regular expression matching against the page's source URL and the key markup contained in its HTML code;
if the page is a news page or a blog, extracting all paragraphs of the page and treating the page title as a separate paragraph of the text; if the page is a forum page, for each discussion thread, merging the original poster's post with those of the original poster's replies in the thread whose word count exceeds a first set word count into one text, and analyzing each other follow-up post whose word count exceeds a second set word count as a separate text; if the page is a microblog page, analyzing each microblog post whose word count exceeds a set word count as a separate text.
In one embodiment, the key sentence pattern matching specifically includes:
reading a regular expression from the key sentence pattern table to be matched, and matching the sentence against it;
if the match succeeds, the sentence is identified as the key sentence pattern corresponding to that regular expression, the topic number corresponding to the sentence pattern is recorded, and the occurrence count of the sentence pattern is increased by 1; if the match fails, the next regular expression is read from the key sentence pattern table and matched against the sentence, until all regular expressions have been tried.
In one embodiment, determining the combination of topics of the text part specifically includes:
for a long text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a first set number of times, the text part is considered to be related to that topic; for a short text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a second set number of times, the text is considered to be related to that topic.
In one embodiment, preprocessing the text features and determining whether to send them to the word bank specifically includes:
reading the text features and the word bank of the classifier, and performing forward maximum matching and reverse maximum matching;
if the forward maximum matching result is consistent with the reverse maximum matching result, the word segmentation is deemed correct; the character strings formed by consecutive single characters left by the segmentation are then examined, and any string that is a new word is marked as a new word and placed in a new-word bank; after segmentation is finished, a text D is represented as a vector:
D = {W_1, W_2, …, W_n}
where W_1, W_2, …, W_n each denote a word, that is, a meaningful character string consisting of one or more Chinese characters;
and if the forward maximum matching result is inconsistent with the reverse maximum matching result, the text is segmented by an improved hidden Markov segmentation method.
In one embodiment, examining the character strings formed by consecutive single characters left by the segmentation, and marking a string as a new word and placing it in the new-word bank if it is a new word, specifically includes:
examining the character strings formed by consecutive single characters left by the segmentation; if two or more consecutive characters appear on their own in the same short sentence, the consecutive characters are combined into a word and placed in the new-word bank; a new word satisfies the following condition:
Figure BDA0002716924410000041
where N_new is the number of occurrences of the new word, N_similar is the number of similar articles, and P is the new-word identification threshold; if the number of occurrences of a word group satisfies the condition of the formula, the word group is regarded as a new word and placed in the new-word matching word bank.
In one embodiment, obtaining, by the word bank, the text category according to the text features specifically includes:
the word bank comprises a Bayesian classifier, and the text features are classified by the Bayesian classifier:
given a text set: D* = (D_1, D_2, …, D_|D*|);
where |D*| is the number of texts in the given text set, and D_i (i = 1, 2, …, |D*|) denotes each text;
given a category set: C = (c_1, c_2, …, c_|C|);
where |C| is the number of given text categories, and c_i (i = 1, 2, …, |C|) denotes each text category;
first, a conversion function F for text classification is generated, which yields a mapping result for any text in the given text set: F: D → c_F;
where F denotes the conversion function and D denotes any text in the given text set;
document D is represented as a vector: D = (x_1, x_2, …, x_n)
where the feature component x_i (i = 1, 2, …, n) denotes the weight of word W_i in text D, calculated as:
Figure BDA0002716924410000051
in the formula, tf(W_i, D) is the frequency of occurrence of W_i in document D; N is the total number of documents; N_i is the number of documents in which W_i occurs;
Figure BDA0002716924410000052
is a normalization factor;
the training set d* of the Bayesian classifier is:
d* = (d_1, d_2, …, d_s)
where each training sample d_i (i = 1, 2, …, s) is an (n+1)-dimensional vector, written as:
d_i = (x_1, x_2, …, x_n, c_i) (c_i ∈ C)
classification means predicting the category of a text D = (x_1, x_2, …, x_n) whose category is unknown;
for a new text, the class-conditional probability that it belongs to category c is written as:
Figure BDA0002716924410000053
the category of the text is:
Figure BDA0002716924410000061
In one embodiment, performing public opinion analysis on the text specifically includes:
calculating the public opinion index of the text from the mapping table and the combination of topics of the text part obtained above, by the following formula: AR = A1Ss _ A2Sn _ A3Sp _ A4S1 _ A5Sf;
where Ss is the weight sum of the sensitive-issue topics appearing in the text, Sn is the weight sum of the negative emotional topics, Sp is the weight sum of the positive emotional topics, S1 is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;
and if the public opinion index is greater than Tr, the text contains no filtering keywords set by the user, and the region involved in the content described by the text matches the region of interest set by the user, the text is regarded as public opinion information of interest to the user, where Tr is a preset minimum threshold for identifying a page as public opinion.
The invention has the following beneficial effects: Chinese text is processed and segmented to obtain text features, the text features are preprocessed and then analyzed through a word bank to obtain a text category, and public opinion analysis is performed on the text; complicated steps are omitted, and public opinion analysis is performed while maintaining a certain accuracy, thereby achieving rapid analysis.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram illustrating steps of a method for business public opinion analysis based on script crawler architecture and text analysis;
FIG. 2 is a schematic flow chart illustrating a method for analyzing business public sentiment based on script crawler architecture and text analysis.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
Referring to fig. 1 and fig. 2, a method for analyzing business public sentiment based on script crawler architecture and text analysis according to the present invention is shown, the method comprising the following steps:
100. acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;
In the embodiment of the invention, the Chinese text is obtained through the Scrapy crawler framework (the "script crawler" framework of the title), specifically as follows: the scheduler receives the requests sent by the engine, arranges and enqueues them in a certain order, and returns them to the engine when the engine asks for them; the downloader downloads all requests sent by the Scrapy engine and returns the resulting responses to the Scrapy engine, which hands them to the Spider for processing; the items obtained by the Spider are passed to the pipeline for post-processing (detailed analysis, filtering, storage, and the like).
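The request flow described above (scheduler, downloader, Spider, pipeline) can be illustrated with a toy loop in plain Python. This is only a sketch of the component roles and does not use the crawler framework itself; the in-memory `FAKE_WEB` table, the URLs, and all function names are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.test/a": (" news text A ", ["http://example.test/b"]),
    "http://example.test/b": (" forum text B ", []),
}

def download(url):
    """Downloader: fetch a response for a request (simulated here)."""
    body, links = FAKE_WEB[url]
    return {"url": url, "body": body, "links": links}

def spider_parse(response):
    """Spider: turn a response into items plus follow-up requests."""
    item = {"url": response["url"], "text": response["body"]}
    return [item], list(response["links"])

def pipeline(item, store):
    """Pipeline: post-process each item (here: strip whitespace and store)."""
    item["text"] = item["text"].strip()
    store.append(item)

def crawl(start_urls):
    scheduler = deque(start_urls)      # queue of pending requests
    seen = set(start_urls)
    store = []
    while scheduler:                   # engine loop
        url = scheduler.popleft()      # engine asks the scheduler for a request
        response = download(url)       # downloader returns a response
        items, new_urls = spider_parse(response)
        for item in items:
            pipeline(item, store)
        for new_url in new_urls:       # new requests go back to the scheduler
            if new_url not in seen:
                seen.add(new_url)
                scheduler.append(new_url)
    return store
```

Calling `crawl(["http://example.test/a"])` visits both simulated pages in queue order and returns the stored items.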
In the embodiment of the invention, the Chinese text comprises a long text and a short text; the long text comprises news, blog, and forum text; the short text includes forum replies and micro-blogs.
In the embodiment of the present invention, processing and segmenting the Chinese text to obtain the text features specifically includes:
initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic number corresponding to each keyword; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic number of each key sentence pattern; and establishing a mapping table from topic number to topic attribute and topic weight;
reading each keyword to be matched from the keyword table to be matched, and adding each keyword to the prefix tree (trie) of the Aho-Corasick (AC) automaton to complete construction of the trie;
reading the regular expression corresponding to each sentence pattern from the key sentence pattern table to be matched;
reading in a page to be analyzed, and extracting the text part of the page;
scanning the text, matching the keywords that appear in it, counting the occurrences of each keyword, and looking up the topic number of each matched keyword in the keyword table;
dividing the content of the text part into sentences at punctuation marks or spaces, deleting sentences whose character count is below a preset minimum sentence length threshold, and performing key sentence pattern matching on the remaining sentences;
and determining the combination of topics of the text part according to the matching results to obtain the text features.
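The keyword-scanning step above can be sketched with a small Aho-Corasick automaton in plain Python. This is a minimal illustration, not the patented implementation: the function names, the example keywords, and the topic numbers are all hypothetical.

```python
from collections import Counter, deque

def build_automaton(keyword_topics):
    """Build an Aho-Corasick automaton from a {keyword: topic_number} table."""
    goto = [{}]   # goto[state][char] -> next state (the trie)
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognized when a state is reached
    for word in keyword_topics:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())   # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def count_keywords(text, automaton):
    """Scan the text once and count every keyword occurrence."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] += 1
    return counts

def topic_hits(text, keyword_topics):
    """Return per-keyword counts and per-topic counts for one text."""
    counts = count_keywords(text, build_automaton(keyword_topics))
    topics = Counter()
    for word, n in counts.items():
        topics[keyword_topics[word]] += n
    return counts, topics
```

The automaton reports overlapping keywords as well, so a keyword that is a suffix of a longer keyword is counted each time the longer one occurs.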
In the embodiment of the present invention, the extracting the text part of the analysis object page specifically includes:
determining the type of the page by regular expression matching against the page's source URL and the key markup contained in its HTML code;
if the page is a news page or a blog, extracting all paragraphs of the page and treating the page title as a separate paragraph of the text; if the page is a forum page, for each discussion thread, merging the original poster's post with those of the original poster's replies in the thread whose word count exceeds a first set word count into one text, and analyzing each other follow-up post whose word count exceeds a second set word count as a separate text; if the page is a microblog page, analyzing each microblog post whose word count exceeds a set word count as a separate text.
In the embodiment of the present invention, the key sentence pattern matching specifically includes:
reading a regular expression from the key sentence pattern table to be matched, and matching the sentence against it;
if the match succeeds, the sentence is identified as the key sentence pattern corresponding to that regular expression, the topic number corresponding to the sentence pattern is recorded, and the occurrence count of the sentence pattern is increased by 1; if the match fails, the next regular expression is read from the key sentence pattern table and matched against the sentence, until all regular expressions have been tried.
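The sentence splitting, length filtering, and first-match pattern loop described above might look like the following sketch. The punctuation set, the default minimum length, and the example patterns are assumptions made for illustration only.

```python
import re
from collections import Counter

def match_key_sentences(text, patterns, min_len=4):
    """Match sentences against key sentence patterns.

    patterns: list of (regex_string, topic_number); for each sentence the
    first pattern that matches is recorded and the remaining ones are skipped.
    """
    compiled = [(re.compile(p), topic) for p, topic in patterns]
    # Split at punctuation or whitespace, then drop sentences that are too short.
    pieces = re.split(r"[。！？；，,.!?;\s]+", text)
    sentences = [s for s in pieces if len(s) >= min_len]
    pattern_counts = Counter()
    topic_counts = Counter()
    for sentence in sentences:
        for regex, topic in compiled:
            if regex.search(sentence):
                pattern_counts[regex.pattern] += 1
                topic_counts[topic] += 1
                break
    return pattern_counts, topic_counts
```

Stopping at the first matching pattern follows the description, in which further regular expressions are tried only when the current one fails.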
In the embodiment of the present invention, determining the combination of topics of the text part specifically includes:
for a long text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a first set number of times, the text part is considered to be related to that topic; for a short text, if the keywords or key sentence patterns belonging to a topic occur in the text no fewer than a second set number of times, the text is considered to be related to that topic.
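As a small illustration of this thresholding rule, the sketch below selects the topics whose combined hit count reaches the threshold for the text's length class. The threshold values 3 and 1 are hypothetical placeholders for the first and second set numbers of times, which the text does not specify.

```python
def related_topics(topic_counts, is_long_text, long_min=3, short_min=1):
    """Return the sorted topic numbers whose hit count reaches the threshold.

    topic_counts: {topic_number: keyword plus key-sentence hits in the text}.
    long_min / short_min: assumed values for the two set thresholds.
    """
    threshold = long_min if is_long_text else short_min
    return sorted(t for t, n in topic_counts.items() if n >= threshold)
```

Using a higher threshold for long texts reflects the idea that a single keyword hit in a long article is weaker evidence than one in a short post.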
200. Preprocessing the text features and judging whether the text features are sent to a word bank or not;
specifically, reading the text features and the word bank of the classifier, and performing forward maximum matching and reverse maximum matching;
if the forward maximum matching result is consistent with the reverse maximum matching result, the word segmentation is deemed correct; the character strings formed by consecutive single characters left by the segmentation are then examined, and any string that is a new word is marked as a new word and placed in a new-word bank; after segmentation is finished, a text D is represented as a vector:
D = {W_1, W_2, …, W_n}
where W_1, W_2, …, W_n each denote a word, that is, a meaningful character string consisting of one or more Chinese characters;
and if the forward maximum matching result is inconsistent with the reverse maximum matching result, the text is segmented by an improved hidden Markov segmentation method.
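The consistency check between forward and reverse maximum matching can be sketched as follows. The vocabulary, the maximum word length of 4, and the function names are assumptions; the fallback to the improved hidden Markov method is not implemented here and is only signalled by the returned flag.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation, longest vocabulary word first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:   # single characters always pass
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, vocab, max_len=4):
    """Greedy right-to-left segmentation, longest vocabulary word first."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def segment(text, vocab, max_len=4):
    """Return both segmentations and whether they agree."""
    fwd = forward_max_match(text, vocab, max_len)
    rev = reverse_max_match(text, vocab, max_len)
    return fwd, rev, fwd == rev
```

When the flag is false, the two directions disagree on word boundaries, which is the case the description hands over to the statistical segmenter.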
In the embodiment of the present invention, examining the character strings formed by consecutive single characters left by the segmentation, and marking a string as a new word and placing it in the new-word bank if it is a new word, specifically includes:
examining the character strings formed by consecutive single characters left by the segmentation; if two or more consecutive characters are found to appear on their own in the same short sentence, the consecutive characters are combined into a word and placed in the new-word bank; a new word satisfies the following condition:
Figure BDA0002716924410000091
where N_new is the number of occurrences of the new word, N_similar is the number of similar articles, and P is the new-word identification threshold; if the number of occurrences of a word group satisfies the condition of the formula, the word group is regarded as a new word and placed in the new-word matching word bank.
300. The word bank obtains a text category according to the text characteristics;
specifically, the word bank comprises a Bayesian classifier, and the text features are classified by the Bayesian classifier:
given a text set: D* = (D_1, D_2, …, D_|D*|);
where |D*| is the number of texts in the given text set, and D_i (i = 1, 2, …, |D*|) denotes each text;
given a category set: C = (c_1, c_2, …, c_|C|);
where |C| is the number of given text categories, and c_i (i = 1, 2, …, |C|) denotes each text category;
first, a conversion function F for text classification is generated, which yields a mapping result for any text in the given text set: F: D → c_F;
where F denotes the conversion function and D denotes any text in the given text set;
document D is represented as a vector: D = (x_1, x_2, …, x_n)
where the feature component x_i (i = 1, 2, …, n) denotes the weight of word W_i in text D, calculated as:
Figure BDA0002716924410000101
in the formula, tf(W_i, D) is the frequency of occurrence of W_i in document D; N is the total number of documents; N_i is the number of documents in which W_i occurs;
Figure BDA0002716924410000102
is a normalization factor;
the training set d* of the Bayesian classifier is:
d* = (d_1, d_2, …, d_s)
where each training sample d_i (i = 1, 2, …, s) is an (n+1)-dimensional vector, written as:
d_i = (x_1, x_2, …, x_n, c_i) (c_i ∈ C)
classification means predicting the category of a text D = (x_1, x_2, …, x_n) whose category is unknown;
for a new text, the class-conditional probability that it belongs to category c is written as:
Figure BDA0002716924410000111
the category of the text is:
Figure BDA0002716924410000112
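The weighting and classification steps can be illustrated with the sketch below. The exact formulas are not reproduced in this text, so both parts rest on assumptions: `tfidf_vector` uses a common length-normalized TF-IDF form built from the quantities tf(W_i, D), N, and N_i defined above, and `classify` is a standard multinomial naive Bayes with add-one smoothing over raw word counts rather than the weights. All function names and example data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vector(doc_words, corpus):
    """Assumed weight: tf * log(N / N_i), divided by the vector's Euclidean norm."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for words in corpus:
        doc_freq.update(set(words))          # N_i: documents containing the word
    tf = Counter(doc_words)
    raw = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf if doc_freq[w]}
    norm = math.sqrt(sum(v * v for v in raw.values()))   # normalization factor
    return {w: (v / norm if norm else 0.0) for w, v in raw.items()}

def train_naive_bayes(samples):
    """samples: list of (segmented_words, category) training pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in samples:
        class_docs[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def classify(words, model):
    """Return the category with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_cat, best_score = None, -math.inf
    for cat in class_docs:
        score = math.log(class_docs[cat] / total_docs)        # log prior
        denom = sum(word_counts[cat].values()) + len(vocab)
        for w in words:
            if w in vocab:                                    # skip unseen words
                score += math.log((word_counts[cat][w] + 1) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```

Working in log space avoids numeric underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a category.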
400. Carrying out public opinion analysis on the text.
Specifically, the public opinion index of the text is calculated from the mapping table and the combination of topics of the text part obtained above, by the following formula: AR = A1Ss _ A2Sn _ A3Sp _ A4S1 _ A5Sf;
where Ss is the weight sum of the sensitive-issue topics appearing in the text, Sn is the weight sum of the negative emotional topics, Sp is the weight sum of the positive emotional topics, S1 is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;
and if the public opinion index is greater than Tr, the text contains no filtering keywords set by the user, and the region involved in the content described by the text matches the region of interest set by the user, the text is regarded as public opinion information of interest to the user, where Tr is a preset minimum threshold for identifying a page as public opinion.
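The index and the three-part decision rule can be sketched as a weighted sum over topic attributes. Because the operators joining the five terms are not legible in this text, the sketch takes signed coefficients as input and leaves their signs and magnitudes to the caller; the attribute names, coefficient values, and topic table below are hypothetical.

```python
def public_opinion_index(topic_ids, topic_table, coeffs):
    """Weighted sum of per-attribute topic weights.

    topic_table: {topic_number: (attribute, weight)}, i.e. the mapping table.
    coeffs: {attribute: signed coefficient}; signs and values are assumptions.
    """
    sums = {attr: 0.0 for attr in coeffs}
    for tid in topic_ids:
        attr, weight = topic_table[tid]
        sums[attr] += weight
    return sum(coeffs[attr] * sums[attr] for attr in coeffs)

def is_public_opinion(text, topic_ids, topic_table, coeffs, tr,
                      filter_words, text_region, user_region):
    """Apply the three conditions: index above Tr, no filter word, region match."""
    index = public_opinion_index(topic_ids, topic_table, coeffs)
    has_filtered = any(word in text for word in filter_words)
    return index > tr and not has_filtered and text_region == user_region
```

For example, with topics weighted 2.0 (sensitive), 1.5 (negative), and 1.0 (positive) and assumed coefficients 1.0, 0.8, and -0.5, the index is 2.0 + 1.2 - 0.5 = 2.7.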
The method matches keywords and key sentence patterns in a text by using an Aho-Corasick (AC) automaton and regular expressions, and represents an article as a set of topics according to the matching results; by setting a weight for each topic and calculating the weight sum of the page, it can analyze and determine quickly and accurately whether the page constitutes public opinion.
The invention replaces the word matching of simple public opinion analysis with topic matching and omits complex steps such as clustering and classification, performing public opinion analysis while maintaining a certain accuracy and thereby achieving rapid analysis.
As a preferred embodiment: S1, it is assumed that the activity coefficient of a text is stable within a given time period T and is proportional to the number of times the text is forwarded in that period. First, the time period T is divided into N small time units t_1, t_2, …, t_n, each time unit having a length of
Figure BDA0002716924410000121
Assuming that the online activity probability of a text follows a binomial distribution, the activity coefficient of text m in the time period T can be expressed as
Figure BDA0002716924410000122
where y_k is a binary variable indicating whether the text published any information in time unit t_k. If text m was forwarded in time unit t_k, then y_k = 1. n|y_k = 1| denotes the number of time units for which y_k = 1.
S2, the forwarding coefficient of the text is another very important factor in the model. The forwarding coefficient of a text is assumed to be only weakly correlated with the number of messages published and the number of friends of the text. As in the estimation of the activity coefficient, assuming that the forwarding probability of a text in a given time period follows a binomial distribution, the forwarding coefficient of text m in the time period T can be expressed as:
Figure BDA0002716924410000123
S3, it is easy to observe that the activity coefficient and forwarding coefficient of a text change over time. In view of this, time is divided into multiple time periods T of length α_1, and these time periods form a set T_N. The activity coefficient and forwarding coefficient of a text are assumed to be relatively stable within each time period. In the calculation, the activity coefficient and the forwarding coefficient of each text are first computed for each time period. The influence between texts can then be obtained by maximum likelihood estimation.
Figure BDA0002716924410000131
The influence coefficient of the text can be obtained by solving the following optimization problem:
Figure BDA0002716924410000132
Solving for the influence coefficient γ_mn gives the following formula:
Figure BDA0002716924410000133
F_mn = 1 denotes that n is a friend of business travel public opinion propagator m, and M_n is the set of all information published by user n; a MATLAB program can therefore be written to solve for the influence coefficient γ_mn.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims (10)

1. A business public opinion analysis method based on script crawler architecture and text analysis is characterized in that: the method comprises the following steps:
acquiring a Chinese text, processing and segmenting the Chinese text to obtain text characteristics;
preprocessing the text features and judging whether the text features are sent to a word bank or not;
the word bank obtains a text category according to the text characteristics;
and carrying out public opinion analysis on the text.
2. The method of claim 1, wherein the Chinese text comprises a long text and a short text; the long text comprises news, blog, and forum text; the short text comprises forum replies and microblogs.
3. The method of claim 2, wherein acquiring the Chinese text and processing and segmenting it to obtain the text features specifically comprises:
initializing parameters: establishing a keyword table to be matched, which comprises a plurality of keywords describing public opinion information and the topic numbers corresponding to the keywords; establishing a key sentence pattern table to be matched, which comprises a plurality of regular expressions describing sentence patterns of public opinion information and the topic numbers of the key sentence patterns; and establishing a mapping table from topic number to topic attribute and topic weight;
reading each keyword to be matched from the keyword table to be matched, adding each keyword to the prefix tree (trie) of the AC automaton, and completing the construction of the trie;
reading the regular expression corresponding to each sentence pattern from the key sentence pattern table to be matched;
reading in an analysis object page, and extracting the text part of the analysis object page;
scanning the text, matching the keywords appearing in the text, counting the occurrences of each keyword, and looking up the topic number corresponding to each keyword in the keyword table to be matched;
dividing the content of the text part into sentences according to punctuation or spaces, deleting the sentences whose number of characters is less than a set length, the set length being a preset minimum sentence length threshold, and performing key sentence pattern matching on the remaining sentences;
and determining the combination of topics of the text part according to the matching results to obtain the text features.
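The keyword scanning step of claim 3 can be illustrated with a minimal Aho-Corasick sketch in Python. This is not the patented implementation: the function names, the example keywords, and the topic numbers are hypothetical, and the sketch only shows how a prefix tree with failure links lets all keywords be counted in a single pass over the text.

```python
from collections import Counter, deque

def build_automaton(keywords):
    """Build the trie, failure links and output sets for the keywords."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    out = [set()]    # keywords recognised on reaching each state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:                     # breadth-first, so shallower links are ready
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out

def count_keywords(text, automaton):
    """Count every keyword occurrence in one pass over the text."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] += 1
    return counts

# Hypothetical keyword table: keyword -> topic number.
keyword_topics = {"delay": 1, "refund": 1, "excellent": 2}
automaton = build_automaton(keyword_topics)
hits = count_keywords("flight delay, then another delay, refund refused", automaton)

# Aggregate keyword counts into per-topic counts.
topic_hits = Counter()
for word, n in hits.items():
    topic_hits[keyword_topics[word]] += n
```

Because matching works character by character, the same sketch applies unchanged to Chinese keywords.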
4. The method of claim 3, wherein extracting the text part of the analysis object page specifically comprises:
judging the type of the page from the original website of the page and the key codes contained in the HTML code of the page, using regular expression matching;
if the page belongs to news or blogs, extracting all paragraphs of the page and treating the page title as a separate paragraph of the text; if the page belongs to a forum, then for each discussion thread, merging the original poster's post with those replies by the original poster whose word count is greater than a first set word count into one text, and analyzing each other reply whose word count is greater than a second set word count as a separate text; if the page belongs to a microblog, analyzing each post whose word count exceeds a set word count as a separate text.
5. The method of claim 3, wherein the key sentence pattern matching specifically comprises:
reading a regular expression from the key sentence pattern table to be matched, and matching the sentence against the regular expression;
if the match succeeds, identifying the sentence as the key sentence pattern corresponding to the regular expression, recording the topic number corresponding to the sentence pattern, and increasing the occurrence count of the sentence pattern by 1; if the match does not succeed, reading the next regular expression from the key sentence pattern table to be matched and matching the sentence against it, until all regular expressions have been tried.
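A minimal Python sketch of the sentence splitting and key sentence pattern matching described in claims 3 and 5 follows. The pattern table, the topic numbers, the delimiter set, and the minimum length of 4 characters are illustrative assumptions, and for brevity the sketch counts matches per topic rather than per sentence pattern.

```python
import re
from collections import Counter

# Assumed delimiter set: common Chinese and ASCII punctuation plus whitespace.
SENTENCE_BREAK = re.compile(r"[。！？；，,.!?;\s]+")

def match_key_sentences(text, pattern_table, min_len=4):
    """Split text into sentences, drop short ones, match the rest in table order."""
    topic_counts = Counter()
    for sentence in SENTENCE_BREAK.split(text):
        if len(sentence) < min_len:      # below the minimum sentence length threshold
            continue
        for pattern, topic in pattern_table:
            if pattern.search(sentence):
                topic_counts[topic] += 1
                break                    # first matching pattern identifies the sentence
    return topic_counts

# Hypothetical key sentence pattern table: (regular expression, topic number).
pattern_table = [
    (re.compile(r"延误.*小时"), 1),   # "delayed ... hours"
    (re.compile(r"态度.*差"), 3),     # "attitude ... bad"
]
text = "航班延误了三个小时。客服态度很差！好"
counts = match_key_sentences(text, pattern_table)
```

Here the one-character fragment "好" is discarded by the length threshold before any pattern is tried.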
6. The method of claim 3, wherein determining the combination of topics of the text part specifically comprises:
for a long text, if the number of occurrences in the text of the keywords or key sentence patterns belonging to a topic is not less than a first set number of times, the text is considered to be related to that topic; for a short text, if the number of occurrences in the text of the keywords or key sentence patterns belonging to a topic is not less than a second set number of times, the text is considered to be related to that topic.
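The threshold rule of claim 6 reduces to a per-topic comparison. The sketch below assumes hypothetical threshold values (3 for long texts, 1 for short texts); the claim itself only requires that two set numbers exist.

```python
def related_topics(topic_counts, is_long_text, long_min=3, short_min=1):
    """Return the topics whose hit count reaches the threshold for this text type."""
    threshold = long_min if is_long_text else short_min
    return {topic for topic, n in topic_counts.items() if n >= threshold}

# Hypothetical per-topic hit counts (keywords plus key sentence patterns).
topic_counts = {1: 3, 2: 1}
long_result = related_topics(topic_counts, is_long_text=True)
short_result = related_topics(topic_counts, is_long_text=False)
```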
7. The method of claim 1, wherein preprocessing the text features and judging whether to send them to the word bank specifically comprises:
reading the text features and the word bank of the classifier, and performing forward maximum matching and reverse maximum matching;
if the forward maximum matching result is consistent with the reverse maximum matching result, determining that the word segmentation is correct, and examining the character strings formed by consecutive single characters left by the segmentation; if such a string is a new word, marking it as a new word and placing it in a new word bank; after segmentation is finished, a text is represented as a vector:
D = {W_1, W_2, …, W_n}
wherein D is a text, and W_1, W_2, …, W_n each represent a word, meaning a meaningful character string consisting of one or more Chinese characters;
and if the forward maximum matching result is inconsistent with the reverse maximum matching result, segmenting the text by an improved hidden Markov segmentation method.
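The bidirectional check in claim 7 can be sketched in Python as below. The word bank contents and the maximum word length of 4 are assumptions for illustration, and the hidden Markov fallback is not implemented: the sketch simply returns None where the claim would hand the text to the improved hidden Markov segmenter.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation from the left, preferring the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:   # single characters always pass
                words.append(piece)
                i += size
                break
    return words

def reverse_max_match(text, lexicon, max_len=4):
    """Greedy segmentation from the right, preferring the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def segment(text, lexicon):
    """Accept the segmentation only when both directions agree."""
    fwd = forward_max_match(text, lexicon)
    rev = reverse_max_match(text, lexicon)
    return fwd if fwd == rev else None   # None: defer to the HMM-based segmenter

# Hypothetical word bank.
lexicon = {"商旅", "舆情", "分析", "研究", "研究生", "生命", "起源"}
agreed = segment("商旅舆情分析", lexicon)
ambiguous = segment("研究生命的起源", lexicon)
```

The second example is a classic ambiguity: forward matching yields 研究生 / 命 / 的 / 起源 while reverse matching yields 研究 / 生命 / 的 / 起源, so the two results disagree.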
8. The method of claim 7, wherein examining the character strings formed by consecutive single characters left by the segmentation, and marking a string as a new word and placing it in the new word bank if it is a new word, specifically comprises:
examining the character strings formed by consecutive single characters left by the segmentation; if two or more consecutive characters each appear on their own within the same short sentence, combining the consecutive characters into a word and putting the word into the new word bank; the new word satisfies the following condition:
Figure FDA0002716924400000041
wherein N_new is the number of occurrences of the new word, N_similar is the number of similar articles, and P is the new word identification threshold; if the number of occurrences of a character group satisfies the condition of the formula, the character group is a new word and is put into the new word matching word bank.
9. The method of claim 1, wherein the word bank obtaining the text category according to the text features specifically comprises:
the word bank comprises a Bayesian classifier, and the text features are classified by the Bayesian classifier:
given a text set: D* = (D_1, D_2, …, D_|D*|);
wherein |D*| is the number of texts in the given text set, and D_i (i = 1, 2, …, |D*|) corresponds to each text;
given a category set: C = (c_1, c_2, …, c_|C|);
wherein |C| is the number of given text categories, and c_i (i = 1, 2, …, |C|) corresponds to each text category;
first, generating a conversion function F for text classification, through which a mapping result is obtained for any text in the given text set: F: D → c_F;
wherein F represents the conversion function and D represents any text of the given text set;
representing document D as a vector: D = (x_1, x_2, …, x_n)
wherein the feature component x_i (i = 1, 2, …, n) denotes the weight of the word W_i in the text, calculated as:
Figure FDA0002716924400000051
in the formula, tf(W_i, D) is the frequency of occurrence of W_i in document D; N is the total number of documents; N_i is the number of documents in which W_i appears;
Figure FDA0002716924400000052
is a normalization factor;
the training set d* of the Bayesian classifier is:
d* = (d_1, d_2, …, d_s)
wherein each training sample d_i (i = 1, 2, …, s) is an (n+1)-dimensional vector, written as:
D = (x_1, x_2, …, x_n, c_i) (c_i ∈ C)
classification means predicting, for a text D = (x_1, x_2, …, x_n) of unknown class, the class to which D belongs;
for a new text X', the class-conditional probability of its belonging to c_i is denoted p(X'|c_i):
Figure FDA0002716924400000061
the category of the text is:
Figure FDA0002716924400000062
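Since the formulas of claim 9 survive here only as figure references, the following Python sketch substitutes textbook forms and should be read as an assumption rather than the claimed formulas: cosine-normalised TF-IDF weights of the form tf · log(N / N_i), and a multinomial naive Bayes classifier with Laplace smoothing that picks the category maximising P(c) · P(X'|c). For simplicity the classifier works on raw word counts rather than on the TF-IDF weights, and all sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vector(doc, corpus):
    """Cosine-normalised TF-IDF weights for one tokenised document."""
    n_docs = len(corpus)
    raw = {}
    for word, freq in Counter(doc).items():
        n_i = sum(1 for d in corpus if word in d)   # documents containing the word
        raw[word] = freq * math.log(n_docs / n_i) if n_i else 0.0
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: (v / norm if norm else 0.0) for w, v in raw.items()}

def train_naive_bayes(samples):
    """samples: list of (tokens, category) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, cat in samples:
        class_docs[cat] += 1
        word_counts[cat].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def classify(tokens, model):
    """Return the category with the highest log posterior."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best, best_score = None, -math.inf
    for cat, n in class_docs.items():
        total_words = sum(word_counts[cat].values())
        score = math.log(n / total_docs)             # log prior
        for w in tokens:
            if w in vocab:                           # ignore unseen words
                score += math.log((word_counts[cat][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best

# Hypothetical training data: segmented texts with category labels.
samples = [
    (["flight", "delay", "refund"], "complaint"),
    (["hotel", "dirty", "refund"], "complaint"),
    (["hotel", "clean", "friendly"], "praise"),
    (["flight", "smooth", "friendly"], "praise"),
]
model = train_naive_bayes(samples)
corpus = [tokens for tokens, _ in samples]
weights = tfidf_vector(corpus[0], corpus)
```

In this toy corpus "delay" occurs in one document and "flight" in two, so "delay" receives the larger weight in the first document.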
10. The method of any one of claims 1 to 9, wherein carrying out public opinion analysis on the text specifically comprises:
calculating a public opinion index of the text by the following formula, based on the mapping table and the combination of topics of the text part obtained above: AR — A1Ss _ A2Sn _ A3Sp _ A4S1_ A5 Sf;
wherein Ss is the weight sum of the sensitive problems appearing in the text, Sn is the weight sum of the negative emotional topics, Sp is the weight sum of the positive emotional topics, S1 is the weight sum of the non-public-opinion topics, Sf is the weight sum of the topics describing overseas situations, A2 is greater than A3, and A3 is greater than AR;
and if the public opinion index is greater than Tr, the text does not contain filtering keywords set by the user, and the area related in the text description content is consistent with the attention area set by the user, the text is regarded as the public opinion information concerned by the user, wherein Tr is a preset minimum threshold value for identifying a certain page as the public opinion.
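The decision rule of claim 10 can be sketched as follows. The operators joining the terms of the index formula are not legible in this text, so the signs used below (sensitive and negative topics raise the index; positive, non-public-opinion and overseas topics lower it) are an assumption, as are the coefficient values, the threshold Tr = 5, and the reading of "consistent with the attention area" as a non-empty intersection of region sets.

```python
# ASSUMPTION: the signs are not recoverable from the source formula.
SIGNS = {"Ss": +1, "Sn": +1, "Sp": -1, "S1": -1, "Sf": -1}

def public_opinion_index(sums, coeffs, signs=SIGNS):
    """Signed, weighted combination of the five topic weight sums."""
    return sum(signs[k] * coeffs[k] * sums[k] for k in sums)

def is_user_relevant(ar, tr, text, filter_keywords, text_regions, watched_regions):
    """Apply the three conditions: index above Tr, no filter keyword, region match."""
    if ar <= tr:
        return False
    if any(k in text for k in filter_keywords):
        return False
    return bool(set(text_regions) & set(watched_regions))

# Hypothetical coefficients A1..A5 (with A2 > A3) and topic weight sums.
coeffs = {"Ss": 5, "Sn": 3, "Sp": 2, "S1": 1, "Sf": 1}
sums = {"Ss": 2.0, "Sn": 1.5, "Sp": 0.5, "S1": 0.0, "Sf": 0.0}
ar = public_opinion_index(sums, coeffs)
flagged = is_user_relevant(ar, 5.0, "flight delay complaint",
                           ["advertisement"], ["Guangzhou"], ["Guangzhou", "Shenzhen"])
```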
CN202011076411.0A 2020-10-10 2020-10-10 Business and travel public opinion analysis method based on script crawler framework and text analysis Pending CN112148936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011076411.0A CN112148936A (en) 2020-10-10 2020-10-10 Business and travel public opinion analysis method based on script crawler framework and text analysis

Publications (1)

Publication Number Publication Date
CN112148936A true CN112148936A (en) 2020-12-29

Family

ID=73952811

Country Status (1)

Country Link
CN (1) CN112148936A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150432A (en) * 2013-03-07 2013-06-12 宁波成电泰克电子信息技术发展有限公司 Method for internet public opinion analysis
CN107045524A (en) * 2016-12-30 2017-08-15 中央民族大学 A kind of method and system of network text public sentiment classification
CN107066585A (en) * 2017-04-17 2017-08-18 济南大学 A kind of probability topic calculates the public sentiment monitoring method and system with matching
CN107908694A (en) * 2017-11-01 2018-04-13 平安科技(深圳)有限公司 Public sentiment clustering method, application server and the computer-readable recording medium of internet news
CN108536667A (en) * 2017-03-06 2018-09-14 中国移动通信集团广东有限公司 Chinese text recognition methods and device
CN108563667A (en) * 2018-01-05 2018-09-21 武汉虹旭信息技术有限责任公司 Hot issue acquisition system based on new word identification and its method
CN109871443A (en) * 2018-12-25 2019-06-11 杭州茂财网络技术有限公司 A kind of short text classification method and device based on book keeping operation scene
CN110990565A (en) * 2019-11-20 2020-04-10 广州商品清算中心股份有限公司 Extensible text analysis system and method for public sentiment analysis
CN111221962A (en) * 2019-11-18 2020-06-02 重庆邮电大学 Text emotion analysis method based on new word expansion and complex sentence pattern expansion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
白书奎等: "一种舆情分析中的文本分类方法", 信息技术, no. 03, 31 December 2013 (2013-12-31), pages 9 - 12 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination