CN116737922A - Tourist online comment fine granularity emotion analysis method and system - Google Patents


Info

Publication number
CN116737922A
CN116737922A
Authority
CN
China
Prior art keywords: emotion, text, data, comment, attribute
Prior art date
Legal status (assumption, not a legal conclusion)
Pending
Application number
CN202310232968.6A
Other languages
Chinese (zh)
Inventor
王金丽
袁泽辉
吕宛青
Current Assignee (the listed assignees may be inaccurate)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (assumption, not a legal conclusion)
Application filed by Yunnan University (YNU)
Priority to CN202310232968.6A
Publication of CN116737922A
Legal status: Pending

Classifications

    • G06F16/35, G06F16/353: Information retrieval of unstructured textual data; clustering/classification into predefined classes
    • G06F16/3346: Query execution using a probabilistic model
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N3/02, G06N3/08: Neural networks; learning methods
    • Y02D10/00: Energy efficient computing


Abstract

The invention discloses a tourist online comment fine granularity emotion analysis method and system. Tourist comments are processed and divided into multiple granularities; multiple attribute-emotion word pairs are extracted, calculated and classified by a model; and finally the emotion polarity corresponding to each attribute is output, obtaining the emotion tendencies of tourists. The advantages of the invention are: massive unstructured comment data are processed to realize semantic relation learning; an index evaluation system of tourist emotion tendency is established, laying a foundation for fine granularity emotion analysis of tourist online comments; and comment data can be automatically extracted for data optimization and processing, finally yielding accurate analysis results, improving analysis efficiency and reducing analysis cost.

Description

Tourist online comment fine granularity emotion analysis method and system
Technical Field
The invention relates to the technical field of intelligent emotion analysis, in particular to a tourist online comment fine granularity emotion analysis method and system based on a pre-training model.
Background
With the rapid development of computer and network technology, the Internet plays an ever-increasing role in people's daily life, study and work. Moreover, with the development of the mobile internet, Internet access has also gone mobile.
Tourists usually post their own subjective comments on the internet about each element of a travel destination. Such evaluations usually carry rich emotional color and subjectivity. Mining the fine-grained emotion tendencies of tourists toward each element of a destination reveals the factors that promote or hinder the destination and the reasons behind good or bad evaluations, provides decision support for tourism operators and managers, assists tourists in choosing travel destinations and consumer products, and improves the image and reputation of the destination.
However, no dedicated system or method for analyzing tourist online comments has yet emerged.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a tourist online comment fine granularity emotion analysis method and system. Tourist comments are processed and divided into multiple granularities; multiple attribute-emotion word pairs are extracted, calculated and classified by a model; and finally the emotion polarity corresponding to each attribute is output, obtaining the emotion tendencies of tourists.
In order to achieve the above object, the present invention adopts the following technical scheme:
A tourist online comment fine granularity emotion analysis method comprises the following steps:
S1: crawl the tourist comment data set corresponding to the travel destination from the web.
S2: preprocess the data set, including: supplementing incomplete data, data cleansing and stop-word removal. Supplementing incomplete data fills in missing values; data cleansing deletes comments that are duplicated or unrelated to the topic; stop-word removal drops high-frequency words that carry no useful meaning.
S3: emotion analysis, comprising the following sub-steps:
S31: attribute classification and labeling. Classify and label the preprocessed data and divide it by dimension, the dimensions comprising: diet, price, entertainment, environment, service and travel experience; then divide each dimension at fine granularity to obtain the objects, entities or attributes evaluated by tourists. Find and mark the emotion words corresponding to each attribute of the text, and judge the emotion tendency expressed for each attribute as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2 respectively.
S32: partition the data set. Randomly divide the data set into a training set, a validation set and a test set at a ratio of 8:1:1, then train and test on the acquired data.
S33: text vectorization. Convert the segmented tokens into word vectors; input the obtained token_ids, attn_mask and seg_ids into the model to obtain the vector representation of each word (hidden_reps) and the vector representation of the first token.
S34: attribute-emotion word extraction. Extract comment text attribute features using the pre-training model's self-attention mechanism.
S35: emotion classification. Pre-train with the MLM and NSP of the pre-training model, perform text classification on the first token vector output by training, then normalize with softmax to obtain the probability of each text class, realizing emotion classification.
Further, partitioning the tourist comment data set includes: a target layer, dimensions, indices and emotion tendency assignment, as shown in the table below.
Each data set contains 6 dimensions: diet, price, entertainment, environment, service and travel experience, and each dimension contains 2 to 3 indices. The emotion tendency corresponding to each index is represented as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2.
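As a minimal sketch (attribute and tendency names are illustrative, not taken from the patent's table), the label assignment described above can be expressed as a simple mapping:

```python
# Label values named in the method: positive 1, negative -1, neutral 0, not mentioned -2.
LABELS = {"positive": 1, "negative": -1, "neutral": 0, "not_mentioned": -2}

def label_attribute(tendency: str) -> int:
    """Map an annotator's emotion-tendency judgment to its label value."""
    return LABELS[tendency]

# Example annotation for one comment: attribute -> judged tendency
annotation = {"diet": "positive", "price": "negative", "service": "not_mentioned"}
encoded = {attr: label_attribute(t) for attr, t in annotation.items()}
```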
Further, the pre-training model comprises: an input layer, a text representation layer, an emotion feature extraction layer and an output layer.
Input layer: the input layer preprocesses the input Chinese text, removes special punctuation and symbols, and then applies word embedding, segment embedding and position embedding through the pre-training model to obtain the coding vector of the input text.
Text representation layer: converts unstructured text that cannot be recognized by a computer into text vectors that can be recognized by the computer.
Emotion feature extraction layer: the processed text representation vector is taken as the input vector and fed into the pre-training model, whose encoder module extracts the features.
Feature extraction uses a self-attention mechanism to perform weight distribution, displaying the importance of words in the text dataset.
Output layer: the vectors obtained by feature extraction are spliced and fused, and a classifier classifies the text features.
Further, the text representation layer proceeds as follows: the pre-training model sets a maximum sequence length for the input text and cuts long text. For a short single sentence, the [CLS] and [SEP] flag bits are added at the head and tail respectively, and insufficient positions are filled with [PAD]. For a long input composed of two sentences, three flag bits are added: [CLS] at the head and a [SEP] after each sentence. If the total length exceeds max_seq_length, the longer sentence is truncated from the tail.
Further, the attention_mask of the vector representation is generated;
then, token_ids are generated by means of a preset dictionary;
finally, the attention_mask and token_ids are input into the trained model for processing, yielding an embedding representation of each word.
Further, emotion classification is performed in the output layer using a Softmax classifier fine-tuned on the pre-training model. A four-way classification is performed for each of the six attributes, using 6 models and classifiers, one per attribute. The Softmax activation function normalizes a numeric vector into a probability distribution vector.
The invention also discloses a tourist online comment fine granularity emotion analysis system that can implement the above tourist online comment fine granularity emotion analysis method, specifically comprising: a data collection and storage module, a data preprocessing module and an emotion analysis module;
Data collection and storage module: crawls the tourist comment data set corresponding to the travel destination from the web.
Data preprocessing module: supplements incomplete data, cleanses the data and removes stop words.
Emotion analysis module: performs attribute classification and labeling, partitions the data set, vectorizes text, extracts attribute-emotion words and classifies emotions, obtaining the probability of each text class and realizing emotion classification.
Compared with the prior art, the invention has the advantages that:
1. facing massive unstructured comment data, general grammatical and semantic knowledge of tourist comments is formed by learning the contextual semantic relations of each comment, and the model can be applied to specific target tasks, realizing semantic relation learning;
2. for massive tourist online comments, an index evaluation system of tourist emotion tendencies is established at multi-dimensional fine granularity, laying a foundation for emotion analysis of tourist online comments;
3. the comment data can be automatically extracted for data optimization and processing, so that an accurate analysis result is finally obtained, the analysis efficiency is improved, and the analysis cost is reduced.
Drawings
FIG. 1 is a diagram of a pre-training model structure in accordance with an embodiment of the present invention;
FIG. 2 is a fine granularity emotion analysis model framework diagram based on pre-training in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of an experiment of an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings and by way of examples in order to make the objects, technical solutions and advantages of the invention more apparent.
1. Definition of the objects of the invention
Fine granularity emotion analysis of tourists at a destination takes as input text data such as tourists' direct emotional expressions about the destination on travel websites, and outputs the tourists' emotion polarity toward each element of the destination, including positive, negative, neutral and the like. A tourist destination involves many factors, such as price, environment, traffic and service; some tourists may be satisfied with some aspects but not others, so tourist comments can contain positive, negative and neutral evaluations.
Take sample 1: "The Nationalities Village is very large and takes four to five hours to visit. The characteristics of each nationality are well presented, the supporting service facilities are complete, there is a lot to eat, the water quality is good, the scenery is average, but the food is expensive and playing is also expensive; it is suitable for visiting alone." In this example, the tourist has positive attitudes toward the Nationalities Village's supporting facilities, food (a lot to eat) and water quality, a neutral attitude toward the scenery, and negative attitudes toward food prices (food is expensive) and entertainment prices (also expensive to play).
The invention aims to process the comment, divide the comment into a plurality of granularities, extract a plurality of attribute-emotion word pairs, calculate and classify the comment through a model, and finally output emotion polarities corresponding to each attribute to acquire emotion tendencies of tourists.
However, these are Chinese comments. Chinese text is complex: it consists of equally spaced characters, a word or sentence is composed of several characters, and problems such as word ambiguity and syntactic dependency exist. The text therefore needs preprocessing, and each character is assigned a corresponding label through sequence and position tags. A POSITIVE label, denoted "1", is assigned where a tourist shows positive emotion toward an attribute or feature of the destination; a NEGATIVE label, denoted "-1", is assigned to comment features showing negative emotion; a NEUTRAL label, denoted "0", is assigned to comment features showing neutral emotion; and a NOTM label, denoted "-2", marks comment features that are not mentioned.
The extracted tourist attribute and fine-grained emotion word pair is represented by a tuple {ATT_nm, SO}, where ATT_nm denotes the set of all attributes, n is the primary index, m is the secondary index, and SO denotes the emotion polarity, SO = {1, 0, -1, -2}. Assume the input scenic-spot online comment is x = {w_1, w_2, ..., w_n}, where w_i (i = 1, 2, ..., n) represents an individual character of the comment text. After processing by the pre-training model, the emotion polarities of the n x m attributes are obtained, expressed as y = {(ATT_ij, 1), (ATT_(i+1)j, 0), ..., (ATT_nm, -1)}.
Given the index system {diet, price, entertainment, environment, service, travel experience}, with primary indices such as "diet", "price" and "environment" and corresponding secondary indices such as "facilities: complete", "diversity: particularly varied", "price level: expensive" and "landscape: average", the corresponding output can be expressed as y = {(A[1,1], 1), (A[2,2], -1), (A[3,j], -2), (A[4,3], 1), (A[4,1], 0), (A[5,j], -2), (A[6,j], -2)}. The tourist emotion tendency is thus {(diet (diversity), positive); (price (price level), negative); (entertainment, not mentioned); (environment (facilities), positive); (environment (landscape), neutral); (service, not mentioned); (travel experience, not mentioned)}. The output result is the tourist's finer-grained emotion tendency.
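Assuming the tuple representation above, a small Python sketch of turning the attribute-polarity pairs into readable emotion tendencies (the dimension and index names are illustrative):

```python
# Polarity values as defined by the labeling scheme of the method.
POLARITY = {1: "positive", -1: "negative", 0: "neutral", -2: "not mentioned"}

# (dimension, secondary index) -> label value, following the worked example.
y = {
    ("diet", "diversity"): 1,
    ("price", "price level"): -1,
    ("entertainment", None): -2,
    ("environment", "facilities"): 1,
    ("environment", "landscape"): 0,
    ("service", None): -2,
    ("travel experience", None): -2,
}

# Translate each labeled attribute into a human-readable tendency.
tendencies = {attr: POLARITY[v] for attr, v in y.items()}
```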
2. Pre-training model
Pre-training obtains, through self-supervised learning on large-scale data, a pre-trained model that is independent of any specific task; it is essentially a transfer learning method. In natural language processing, the pre-training model learns from large-scale data in advance to form a knowledge system containing contextual relations, and then brings this knowledge to a specific task, so that accurate task execution results can be obtained.
By feeding massive comment data into the model, the tourist online comment pre-training model outputs a contextually relevant representation of each comment and implicitly learns the grammatical and semantic knowledge of online comments. The knowledge learned in the open domain is then migrated to downstream tasks, improving their execution efficiency.
The pre-training model is optimized on the basis of the attention mechanism and the Transformer model. Its pre-training tasks are the masked language model (MLM) and next sentence prediction (NSP); together they realize word vector representation and sentence-level semantic feature extraction and mine the logical relations between sentences. The pre-training model uses bidirectional encoding to analyze contextual semantic information in a text segment and combines the attention mechanism to establish the association strength between words, so that multiple texts can be processed efficiently in parallel, the dependency relations and semantic associations between words and their context can be identified, and the problems of long-range dependency and word ambiguity are effectively alleviated. The pre-training model structure is shown in fig. 1.
In fig. 1, E denotes an input, Trm denotes a Transformer encoder, and T denotes the resulting text; the pre-training model is a stack of bidirectional Transformer encoder layers. It consists of an input layer, an encoding layer and an output layer. The input layer includes word embeddings (token embeddings), segment embeddings and position embeddings. Word embedding converts unstructured text the computer cannot understand into structured representations it can; segment embedding separates sentences by inserting the tags [CLS] and [SEP] at the beginning and end of a sentence; position embedding encodes the position of each word to preserve ordering information. The output layer is the word vector sequence obtained by processing the input words; the resulting word vectors can be applied directly to downstream tasks such as text emotion analysis.
3. Fine granularity emotion analysis model based on pre-training
The embodiment of the invention provides a pre-trained fine granularity emotion analysis model fusing fine-grained emotion analysis; the model structure is shown in figure 2.
(1) Input layer (Input context)
The input layer preprocesses the input Chinese text, removes special punctuation and symbols, and then applies word embedding, segment embedding and position embedding through the pre-training model to obtain the coding vector of the input text. For a given user set U and comment set V, any comment text sequence can be expressed as s = {w_1, w_2, ..., w_n}, where n is the total number of comment words. The corresponding word vectors are obtained through the pre-training model. Position embedding is critical: it gives each word vector its position information so the specific position of a word in the sequence can be judged, as shown in formula (1):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (1)

where pos represents the position of the word, i represents the index of the position vector, and d_model represents the dimension of the word vector.
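A dependency-free sketch of the sinusoidal position embedding of formula (1); this assumes the standard Transformer form with sine on even dimensions and cosine on odd dimensions:

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal position embedding: even dims use sin, odd dims use cos."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # The exponent 2i/d_model of formula (1), with i stepping over dim pairs.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
```

Each position thus receives a unique, deterministic vector that the model can add to the word embeddings.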
(2) Text presentation layer (Text vectorization representation)
Text representation converts unstructured text that cannot be recognized by a computer into text vectors that can be recognized by the computer. Traditional text representations such as Word2Vec are prone to problems such as word ambiguity. Here the pre-training model represents the text: the text is pre-trained at word granularity, and the text coding vectors obtained from the input layer are fed into the pre-training model to obtain word vector encodings.
The text representation proceeds as follows: the pre-training model sets a maximum sequence length (max_seq_length) for the input text and cuts long text. For a short single sentence, the [CLS] and [SEP] flag bits are added at the head and tail respectively, and insufficient positions are filled with [PAD]. For a long input composed of two sentences, three flag bits are added: [CLS] at the head and a [SEP] after each sentence. If the length exceeds max_seq_length, the longer sentence is truncated from the tail.
Assume the input sentence is "The service is all in readiness" and the pre-training model's max_seq_length is 10. The tokenized sentence is:

Tokens=[[CLS],The,service,is,all,in,readiness,[SEP],[PAD],[PAD]]

The attention_mask of the vector representation is:

attention_mask=[1,1,1,1,1,1,1,1,0,0]

Then, token_ids are generated by means of a preset dictionary:

token_ids=[101,239,304,534,436,874,738,102,0,0]

Finally, the attention_mask and token_ids are input into the trained model for processing, yielding an embedding representation of each word.
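The tokenization, padding and id-lookup steps above can be sketched in plain Python; the vocabulary below mirrors the illustrative ids of the example and is not a real pre-training-model dictionary:

```python
# Illustrative dictionary matching the worked example; not a real model vocabulary.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "The": 239, "service": 304, "is": 534, "all": 436,
         "in": 874, "readiness": 738}

def encode(sentence: str, max_seq_length: int = 10):
    """Add [CLS]/[SEP], pad with [PAD], and produce attention_mask and token_ids."""
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    tokens = tokens[:max_seq_length]                 # truncate long input
    pad = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * pad                # pad short input
    attention_mask = [1] * (max_seq_length - pad) + [0] * pad
    token_ids = [VOCAB.get(t, 0) for t in tokens]
    return tokens, attention_mask, token_ids

tokens, attention_mask, token_ids = encode("The service is all in readiness")
```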
(3) Emotion characteristics extraction layer (Sentiment feature extraction method)
The processed text representation vector is taken as the input vector and fed into the pre-training model, whose encoder module extracts the features. The accuracy of feature extraction is related to the number of encoder layers: the more layers the encoder has, the more accurate the extracted features.
The feature extraction layer uses a self-attention mechanism for weight distribution to reveal the importance of words in the text data set. The self-attention mechanism is a key component of the Transformer model: it maps a set of queries to corresponding sets of key and value vectors, its core is to compute the semantic relevance between words in a text, and its main function is to strengthen attention and improve feature extraction capability and accuracy. The formulas are as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)          (3)
MultiHead(Q, K, V) = Concat(head_1, ..., head_n) W^O    (4)

where Q, K and V represent the input query, key and value vectors, W_i^Q, W_i^K and W_i^V represent the new matrices obtained after linear transformation, d_k represents the vector dimension, and W^O represents a weight matrix.
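A minimal pure-Python sketch of the scaled dot-product attention of formula (2), evaluated on tiny example vectors (the matrices here are illustrative, not trained projections):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V on nested lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
ctx = attention(Q, K, V)  # one context vector, a weighted mix of the value rows
```

The query attends more strongly to the first key, so the output lies closer to the first value row.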
(4) Output layer (sentiment pooling, sentiment classifier, output context)
The output layer splices and fuses the vectors obtained by feature extraction and classifies the text features with a classifier. Emotion classification here uses a Softmax classifier fine-tuned on the pre-training model. In the embodiment, a four-way classification is performed for each of the six attributes, using 6 models and classifiers, one per attribute. Softmax is an activation function that normalizes a vector of values into a probability distribution vector whose probabilities sum to 1; the Softmax layer is often used together with a cross-entropy loss function. The Softmax formula is as follows:

Softmax(z_i) = e^(z_i) / sum_j e^(z_j)    (5)

where i represents the index of the output node.
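A small sketch of the Softmax normalization used by the output layer, applied to illustrative four-way emotion logits for one attribute:

```python
import math

def softmax(z):
    """Normalize a numeric vector into a probability distribution (sums to 1)."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for one attribute: [positive, negative, neutral, not mentioned]
probs = softmax([2.0, 0.5, 0.1, -1.0])
predicted = max(range(4), key=lambda i: probs[i])   # index of the predicted class
```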
4. Index construction
Tourist emotion tendency evaluation dimension and index
To analyze tourists' fine-grained emotions, the tourist comments are first divided at multiple granularities, comprising a target layer, dimensions and indices, as detailed in table 2:
table 2 guest online comment fine-grained division table
According to table 2, each data set contains 6 dimensions: diet, price, entertainment, environment, service and travel experience, and each dimension contains 2 to 3 indices. The emotion tendency corresponding to each index is represented as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2.
5. On-line comment fine granularity emotion analysis method for tourists based on pre-training model
(1) Data source acquisition:
The data set of this embodiment is derived from online comments on the Yunnan Nationalities Village on Ctrip, at: https://you.ctrip.com/sight/kunming29/2973.html. Data are obtained in four steps: (1) initiate a request: using an http library tool, package the url carrying the Ctrip comments into a request, including the request header, browser identity, etc.; (2) acquire the web page: the server returns response data in json format according to the request initiated by the crawler program, and only the comment text and user text required for the research are saved; (3) extract the target text: key information is acquired by regular-expression matching with the re library, and all comments to be crawled are parsed and stored in a database; (4) change the url parameters and repeat step (2), then parse the comment information by regular-expression matching and store it in the database. In total, 3274 comments are crawled: the training set contains 2620 labeled comments, the validation set contains 327 labeled comments, and the test set contains 327 unlabeled comments.
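Step (3) above extracts comment text by regular-expression matching with the re library; a minimal sketch on a mock response body (the json field names here are assumptions for illustration, not Ctrip's actual format):

```python
import re

# Mock response fragment; "content" is an assumed field name, not Ctrip's real schema.
mock_response = ('{"comments":[{"content":"scenery is beautiful"},'
                 '{"content":"tickets are expensive"}]}')

def extract_comments(payload: str) -> list[str]:
    """Pull comment texts out of a response body with a non-greedy regex."""
    return re.findall(r'"content":"(.*?)"', payload)

comments = extract_comments(mock_response)
```

In practice the crawled payload would be fetched per page and each match stored in the database.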
Tourist fine granularity emotion analysis comprises data acquisition, data preprocessing, text vectorization, feature extraction, emotion classification, model performance comparison, etc., as shown in fig. 3:
(1) Data acquisition. Python crawlers are used to collect the online comments on the Yunnan Nationalities Village on Ctrip as the data set.
(2) Data preprocessing. Text is preprocessed using the python third-party library jieba, including supplementing incomplete data, data cleansing, stop-word removal, etc. Supplementing incomplete data fills in missing values; data cleansing deletes comments that are duplicated or unrelated to the topic; stop-word removal drops high-frequency words that carry no useful meaning.
(3) Emotion analysis. Emotion analysis includes attribute-emotion word extraction, data set partitioning, text vectorization, emotion classification, etc. Specifically: (1) attribute classification and labeling: the preprocessed data are classified and labeled, divided by the 6 dimensions of diet, price, entertainment, environment, service and travel experience, and each dimension is then divided at fine granularity (for example, diet is divided into diversity, taste, quantity, etc.) to obtain the objects, entities or attributes evaluated by tourists. The emotion words corresponding to each attribute of the text are found and marked, and the emotion tendency expressed for each attribute is judged as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2. (2) Data set partitioning: sklearn's train_test_split() is called to randomly divide the data set into a training set, a validation set and a test set at a ratio of 8:1:1, and the acquired data are trained and tested. (3) Text vectorization: the segmented tokens are converted to ids and transformed into word vectors; the resulting token_ids, attn_mask and seg_ids are input into the model to obtain the vector representation of each word (hidden_reps) and the vector representation of the first token ([CLS]). (4) Attribute-emotion word extraction: comment text attribute features are extracted using the pre-training model's self-attention mechanism. (5) Emotion classification: pre-training is performed with the MLM and NSP of the pre-training model, text classification is performed on the first token ([CLS]) vector output by training, and softmax normalization then yields the probability of each text class, realizing emotion classification.
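The 8:1:1 partition in sub-step (2) is described as a call to sklearn's train_test_split(); a dependency-free sketch of the same split, sized to reproduce the 2620/327/327 counts reported above:

```python
import random

def split_dataset(data, seed=42):
    """Shuffle and split into training/validation/test at roughly 8:1:1."""
    data = list(data)
    random.Random(seed).shuffle(data)       # deterministic shuffle for the sketch
    n = len(data)
    n_val = n_test = round(n * 0.1)         # one part validation, one part test
    n_train = n - n_val - n_test            # remaining eight parts for training
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_dataset(range(3274))
```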
(4) Statistical analysis and performance comparison. Statistical analysis resamples with the Bootstrap method, and performance comparison judges model quality using precision P, recall R, F1 value, AUC, etc. as evaluation indices. (1) Bootstrap resampling: according to the experimental results, the distribution proportion of the data set is readjusted and the sample size is enlarged on the basis of the existing samples. Because too few comments are labeled neutral or negative for some attributes, the training and test sets contain little such data and the results are not ideal, so resampling is needed to enlarge the proportion of neutral and negative samples for those attributes. (2) Model performance comparison: with BiLSTM and textCNN as baselines, the Precision, Recall, F1 value and AUC of the three models are compared to judge their classification effect.
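The precision P, recall R and F1 evaluation indices named above can be sketched as follows (the example labels are illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class from true vs. predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

# Illustrative labels using the scheme 1/-1/0 for positive/negative/neutral.
P, R, F1 = precision_recall_f1([1, 1, -1, 0, 1], [1, -1, -1, 0, 1])
```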
6. Examples. The original texts are as follows:
1. Original text: A bad scenic spot. Visitors are made to get off at a distant parking lot and take the scenic area's electric shuttle, which charges 25 per guest. The ticket is 175 for only a few stones; the pricing amounts to nothing but charging a high ticket price. The service staff have a bad attitude. There will be no second visit.
2. Original text: The ticket is worth buying; the customs, ways of life, and so on of other ethnic groups are reflected, there is singing and dancing everywhere, the performances are rich, and various local handmade products and costumes can be seen, all well worth watching. There are few dining choices, so it is best to bring your own food.
3. Original text: The scenic area is large; eating there is expensive, and so is the entertainment. But the environment is clean, and the grassland and flowers are well maintained. It is suitable for visiting alone. It mainly introduces the ethnic minorities of Yunnan and can broaden one's knowledge.
The final output results of these examples are shown in the following table:
In another embodiment of the present invention, a tourist online comment fine-granularity emotion analysis system is provided. The system can be used to implement the above tourist online comment fine-granularity emotion analysis method and specifically includes: a data collection and storage module, a data preprocessing module, and an emotion analysis module.
Data collection and storage module: crawls the tourist comment data set corresponding to the travel destination from the network.
Data preprocessing module: supplements incomplete data in the data set, performs data cleaning, and removes stop words.
Emotion analysis module: performs attribute classification and labeling, divides the data set, vectorizes the text, extracts attribute-emotion words, and carries out emotion classification, obtaining the probability of each text class and thereby realizing emotion classification.
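The softmax normalization that the emotion analysis module relies on to turn classifier outputs into class probabilities can be sketched in a few lines; the logits below are hypothetical values for the four emotion classes:

```python
# Minimal softmax sketch: normalizes a numerical vector into a
# probability distribution, as used for the 4-way emotion classes.
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for one comment: positive, negative, neutral, not mentioned.
probs = softmax([2.0, 0.5, 0.1, -1.0])
# probs sums to 1; the largest logit receives the largest probability.
```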
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention and that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (6)

1. A tourist online comment fine-granularity emotion analysis method, characterized by comprising the following steps:
S1: crawling the tourist comment data set corresponding to the travel destination from the network;
S2: preprocessing the data set, including: supplementing incomplete data, data cleaning, and stop word removal; supplementing incomplete data means filling in missing data; data cleaning means deleting comments that appear repeatedly or are unrelated to the subject; stop word removal means removing frequently occurring words that carry no useful meaning;
S3: emotion analysis, comprising the following sub-steps:
S31: attribute classification and labeling; classifying and labeling the preprocessed data and dividing it by dimension, the dimensions comprising: diet, price, entertainment, environment, service, and travel experience, and dividing each dimension at fine granularity to obtain the objects, entities, or attributes of tourist evaluation; finding and labeling the emotion words corresponding to each attribute of the text, and judging the emotion tendency expressed for each attribute as positive, negative, neutral, or not mentioned, with label values 1, -1, 0, and -2;
S32: dividing the data set; randomly dividing the data set into a training set, a test set, and a validation set in the ratio 8:1:1; training and testing on the acquired data sets;
S33: text vectorization; converting the segmented tokens into word vectors, and inputting the obtained token_ids, attn_mask, and seg_ids into the model to obtain the vector representation of each word (hidden_reps) and the vector representation of the first token;
S34: attribute-emotion word extraction; extracting comment text attribute features using the pre-training model and a self-attention mechanism;
S35: emotion classification; pre-training with the MLM and NSP tasks of the pre-training model, performing text classification on the vector of the first token output by the model, and then normalizing with softmax to obtain the probability of each text class, thereby realizing emotion classification.
2. The tourist online comment fine-granularity emotion analysis method according to claim 1, characterized in that: the tourist comment data set is partitioned into a target layer, dimensions, indices, and emotion tendency assignments, as shown in the following table:
each data set contains 6 dimensions: diet, price, entertainment, environment, service, and travel experience, each dimension containing 2 to 3 indices; the emotion tendency corresponding to each index is represented as positive, negative, neutral, or not mentioned, with label values 1, -1, 0, and -2 respectively.
3. The tourist online comment fine-granularity emotion analysis method according to claim 1, characterized in that the pre-training model comprises: an input layer, a text representation layer, an emotion feature extraction layer, and an output layer;
input layer: preprocesses the input Chinese text, removing special punctuation and symbols, and then applies word embedding, segment embedding, and position embedding through the pre-training model to obtain the encoding vector of the input text;
text representation layer: converts unstructured text that the computer cannot recognize into text vectors that the computer can process;
emotion feature extraction layer: takes the processed text representation vector as input to the pre-training model, and uses the encoder module of the pre-training model for feature extraction;
the feature extraction uses a self-attention mechanism to assign weights, revealing the importance of words in the text data set;
output layer: splices and fuses the vectors obtained from feature extraction of the text, and classifies the text features using a classifier.
4. The tourist online comment fine-granularity emotion analysis method according to claim 3, characterized in that the text representation layer is processed as follows: the pre-training model sets a maximum sequence length for the input text and truncates long texts; for a short single sentence, [CLS] and [SEP] flag bits are added at the head and tail of the sentence respectively, and insufficient positions are filled with [PAD]; for a long sentence composed of two sentences, the three flag bits [CLS], [SEP], [SEP] are added respectively; if the length of the long sentence exceeds max_seq_length, the longer sentence is truncated from the tail;
an attention_mask of the vector representation is then generated;
next, token_ids are generated by means of a preset dictionary;
finally, the attention_mask and token_ids are input into the trained model for processing, yielding an embedding representation of each word.
5. The tourist online comment fine-granularity emotion analysis method according to claim 3, characterized in that: emotion classification is carried out in the output layer by a softmax classifier fine-tuned from the pre-training model; a four-way classification is performed on each of the six attributes, with 6 models and classifiers, one per attribute; the Softmax activation function normalizes a numerical vector into a probability distribution vector.
6. A tourist online comment fine-granularity emotion analysis system, operable to implement the tourist online comment fine-granularity emotion analysis method of any one of claims 1 to 5, characterized by comprising: a data collection and storage module, a data preprocessing module, and an emotion analysis module;
the data collection and storage module: crawls the tourist comment data set corresponding to the travel destination from the network;
the data preprocessing module: supplements incomplete data in the data set, performs data cleaning, and removes stop words;
the emotion analysis module: performs attribute classification and labeling, divides the data set, vectorizes the text, extracts attribute-emotion words, and carries out emotion classification, obtaining the probability of each text class and thereby realizing emotion classification.
CN202310232968.6A 2023-03-10 2023-03-10 Tourist online comment fine granularity emotion analysis method and system Pending CN116737922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310232968.6A CN116737922A (en) 2023-03-10 2023-03-10 Tourist online comment fine granularity emotion analysis method and system

Publications (1)

Publication Number Publication Date
CN116737922A true CN116737922A (en) 2023-09-12

Family

ID=87903269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310232968.6A Pending CN116737922A (en) 2023-03-10 2023-03-10 Tourist online comment fine granularity emotion analysis method and system

Country Status (1)

Country Link
CN (1) CN116737922A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858973A (en) * 2019-02-18 2019-06-07 成都中科大旗软件有限公司 A kind of analysis method of regional tourism industry development
CN111078894A (en) * 2019-12-17 2020-04-28 中国科学院遥感与数字地球研究所 Scenic spot evaluation knowledge base construction method based on metaphor topic mining
CN111310474A (en) * 2020-01-20 2020-06-19 桂林电子科技大学 Online course comment sentiment analysis method based on activation-pooling enhanced BERT model
CN112597306A (en) * 2020-12-24 2021-04-02 电子科技大学 Travel comment suggestion mining method based on BERT
CN114896969A (en) * 2022-05-12 2022-08-12 南京优慧信安科技有限公司 Method for extracting aspect words based on deep learning
CN115129807A (en) * 2022-04-06 2022-09-30 国家计算机网络与信息安全管理中心 Fine-grained classification method and system for social media topic comments based on self-attention
CN115577072A (en) * 2022-10-09 2023-01-06 辽宁工程技术大学 Short text sentiment analysis method based on deep learning
CN115630653A (en) * 2022-11-02 2023-01-20 合肥学院 Network popular language emotion analysis method based on BERT and BilSTM


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131161A (en) * 2023-10-24 2023-11-28 北京社会管理职业学院(民政部培训中心) Electric wheelchair user demand extraction method and system and electronic equipment
CN117390141A (en) * 2023-12-11 2024-01-12 江西农业大学 Agricultural socialization service quality user evaluation data analysis method
CN117390141B (en) * 2023-12-11 2024-03-08 江西农业大学 Agricultural socialization service quality user evaluation data analysis method

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN109885672B (en) Question-answering type intelligent retrieval system and method for online education
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN110390018A (en) A kind of social networks comment generation method based on LSTM
CN113505204B (en) Recall model training method, search recall device and computer equipment
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN108108468A (en) A kind of short text sentiment analysis method and apparatus based on concept and text emotion
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN114238573B (en) Text countercheck sample-based information pushing method and device
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN113239142A (en) Trigger-word-free event detection method fused with syntactic information
CN111143507A (en) Reading understanding method based on composite problems
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN114491023A (en) Text processing method and device, electronic equipment and storage medium
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN116562291A (en) Chinese nested named entity recognition method based on boundary detection
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN115357711A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN114840680A (en) Entity relationship joint extraction method, device, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination