CN116737922A - Tourist online comment fine granularity emotion analysis method and system - Google Patents


Info

Publication number
CN116737922A
CN116737922A
Authority
CN
China
Prior art keywords: emotion, text, data, comment, attribute
Prior art date
Legal status (assumption, not a legal conclusion)
Pending
Application number
CN202310232968.6A
Other languages
Chinese (zh)
Inventor
王金丽
袁泽辉
吕宛青
Current Assignee (the listed assignees may be inaccurate)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (assumption, not a legal conclusion)
Application filed by Yunnan University (YNU)
Priority to CN202310232968.6A
Publication of CN116737922A
Legal status: Pending

Classifications

    • G06F16/35, G06F16/353: Information retrieval of unstructured textual data; clustering/classification into predefined classes
    • G06F16/3346: Query execution using a probabilistic model
    • G06F40/30: Handling natural language data; semantic analysis
    • G06N3/02, G06N3/08: Neural networks; learning methods
    • Y02D10/00: Energy efficient computing


Abstract

The invention discloses a tourist online comment fine granularity emotion analysis method and system. Tourist comments are processed and divided into multiple granularities; multiple attribute-emotion word pairs are extracted, calculated and classified by a model; and finally the emotion polarity corresponding to each attribute is output, obtaining the emotion tendencies of tourists. The advantages of the invention are: massive unstructured comment data are processed to realize semantic relation learning; an index evaluation system of tourist emotion tendency is established, laying a foundation for fine granularity emotion analysis of tourist online comments; and comment data can be automatically extracted for data optimization and processing, finally yielding accurate analysis results, improving analysis efficiency and reducing analysis cost.

Description

Tourist online comment fine granularity emotion analysis method and system
Technical Field
The invention relates to the technical field of intelligent emotion analysis, in particular to a tourist online comment fine granularity emotion analysis method and system based on a pre-training model.
Background
With the rapid development of computer and network technology, the Internet plays an ever-increasing role in people's daily life, study and work. Moreover, with the development of the mobile internet, Internet access has also gone mobile.
Tourists usually post their own subjective comments on the internet about each element of a travel destination. Such evaluations usually carry rich emotional color and subjectivity. Mining the fine-grained emotion tendencies of tourists toward each element of a destination reveals the factors that promote or hinder the destination and the reasons behind good or bad evaluations, provides decision support for tourism operators and managers, assists tourists in choosing travel destinations and consumer products, and improves the image and reputation of the destination.
However, no dedicated system or method for analyzing tourist online comments has yet emerged.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a tourist online comment fine granularity emotion analysis method and system. Tourist comments are processed and divided into multiple granularities; multiple attribute-emotion word pairs are extracted, calculated and classified by a model; and finally the emotion polarity corresponding to each attribute is output, obtaining the emotion tendencies of tourists.
In order to achieve the above object, the present invention adopts the following technical scheme:
A tourist online comment fine granularity emotion analysis method comprises the following steps:
S1: crawl the tourist comment data set corresponding to the travel destination from the web.
S2: preprocess the data set, including: supplementing incomplete data, data cleansing and stop-word removal. Supplementing incomplete data fills in missing values; data cleansing deletes comments that are duplicated or unrelated to the topic; stop-word removal drops high-frequency words that carry no useful meaning.
S3: emotion analysis, comprising the following sub-steps:
S31: attribute classification and labeling. Classify and label the preprocessed data and divide it by dimension, the dimensions comprising: diet, price, entertainment, environment, service and travel experience; then divide each dimension at fine granularity to obtain the objects, entities or attributes evaluated by tourists. Find and mark the emotion words corresponding to each attribute of the text, and judge the emotion tendency expressed for each attribute as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2 respectively.
S32: partition the data set. Randomly divide the data set into a training set, a validation set and a test set at a ratio of 8:1:1, then train and test on the acquired data.
S33: text vectorization. Convert the segmented tokens into word vectors; input the obtained token_ids, attn_mask and seg_ids into the model to obtain the vector representation of each word (hidden_reps) and the vector representation of the first token.
S34: attribute-emotion word extraction. Extract comment text attribute features using the pre-training model's self-attention mechanism.
S35: emotion classification. Pre-train with the MLM and NSP of the pre-training model, perform text classification on the first token vector output by training, then normalize with softmax to obtain the probability of each text class, realizing emotion classification.
Further, partitioning the tourist comment data set includes: a target layer, dimensions, indices and emotion tendency assignment, as shown in the table below.
Each data set contains 6 dimensions: diet, price, entertainment, environment, service and travel experience, and each dimension contains 2 to 3 indices. The emotion tendency corresponding to each index is represented as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2.
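As a minimal sketch (attribute and tendency names are illustrative, not taken from the patent's table), the label assignment described above can be expressed as a simple mapping:

```python
# Label values named in the method: positive 1, negative -1, neutral 0, not mentioned -2.
LABELS = {"positive": 1, "negative": -1, "neutral": 0, "not_mentioned": -2}

def label_attribute(tendency: str) -> int:
    """Map an annotator's emotion-tendency judgment to its label value."""
    return LABELS[tendency]

# Example annotation for one comment: attribute -> judged tendency
annotation = {"diet": "positive", "price": "negative", "service": "not_mentioned"}
encoded = {attr: label_attribute(t) for attr, t in annotation.items()}
```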
Further, the pre-training model comprises: an input layer, a text representation layer, an emotion feature extraction layer and an output layer.
Input layer: the input layer preprocesses the input Chinese text, removes special punctuation and symbols, and then applies word embedding, segment embedding and position embedding through the pre-training model to obtain the coding vector of the input text.
Text representation layer: converts unstructured text that cannot be recognized by a computer into text vectors that can be recognized by the computer.
Emotion feature extraction layer: the processed text representation vector is taken as the input vector and fed into the pre-training model, whose encoder module extracts the features.
Feature extraction uses a self-attention mechanism to perform weight distribution, displaying the importance of words in the text dataset.
Output layer: the vectors obtained by feature extraction are spliced and fused, and a classifier classifies the text features.
Further, the text representation layer proceeds as follows: the pre-training model sets a maximum sequence length for the input text and cuts long text. For a short single sentence, the [CLS] and [SEP] flag bits are added at the head and tail respectively, and insufficient positions are filled with [PAD]. For a long input composed of two sentences, three flag bits are added: [CLS] at the head and a [SEP] after each sentence. If the total length exceeds max_seq_length, the longer sentence is truncated from the tail.
Further, the attention_mask of the vector representation is generated;
then, token_ids are generated by means of a preset dictionary;
finally, the attention_mask and token_ids are input into the trained model for processing, yielding an embedding representation of each word.
Further, emotion classification is performed in the output layer using a Softmax classifier fine-tuned on the pre-training model. A four-way classification is performed for each of the six attributes, using 6 models and classifiers, one per attribute. The Softmax activation function normalizes a numeric vector into a probability distribution vector.
The invention also discloses a tourist online comment fine granularity emotion analysis system that can implement the above tourist online comment fine granularity emotion analysis method, specifically comprising: a data collection and storage module, a data preprocessing module and an emotion analysis module;
Data collection and storage module: crawls the tourist comment data set corresponding to the travel destination from the web.
Data preprocessing module: supplements incomplete data, cleanses the data and removes stop words.
Emotion analysis module: performs attribute classification and labeling, partitions the data set, vectorizes text, extracts attribute-emotion words and classifies emotions, obtaining the probability of each text class and realizing emotion classification.
Compared with the prior art, the invention has the advantages that:
1. facing massive unstructured comment data, general grammatical and semantic knowledge of tourist comments is formed by learning the contextual semantic relations of each comment, and the model can be applied to specific target tasks, realizing semantic relation learning;
2. for massive tourist online comments, an index evaluation system of tourist emotion tendencies is established at multi-dimensional fine granularity, laying a foundation for emotion analysis of tourist online comments;
3. the comment data can be automatically extracted for data optimization and processing, so that an accurate analysis result is finally obtained, the analysis efficiency is improved, and the analysis cost is reduced.
Drawings
FIG. 1 is a diagram of a pre-training model structure in accordance with an embodiment of the present invention;
FIG. 2 is a fine granularity emotion analysis model framework diagram based on pre-training in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of an experiment of an embodiment of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the accompanying drawings and by way of examples in order to make the objects, technical solutions and advantages of the invention more apparent.
1. Definition of the objects of the invention
Fine granularity emotion analysis of tourists at a destination takes as input text data such as tourists' direct emotional expressions about the destination on travel websites, and outputs the tourists' emotion polarity toward each element of the destination, including positive, negative, neutral and the like. A tourist destination involves many factors, such as price, environment, traffic and service; some tourists may be satisfied with some aspects but not others, so tourist comments can contain positive, negative and neutral evaluations.
Take sample 1: "The Nationalities Village is very large and takes four to five hours to visit. The characteristics of each nationality are well presented, the supporting service facilities are complete, there is a lot to eat, the water quality is good, the scenery is average, but the food is expensive and playing is also expensive; it is suitable for visiting alone." In this example, the tourist has positive attitudes toward the Nationalities Village's supporting facilities, food (a lot to eat) and water quality, a neutral attitude toward the scenery, and negative attitudes toward food prices (food is expensive) and entertainment prices (also expensive to play).
The invention aims to process the comment, divide the comment into a plurality of granularities, extract a plurality of attribute-emotion word pairs, calculate and classify the comment through a model, and finally output emotion polarities corresponding to each attribute to acquire emotion tendencies of tourists.
However, these are Chinese comments. Chinese text is complex: it consists of equally spaced characters, a word or sentence is composed of several characters, and problems such as word ambiguity and syntactic dependency exist. The text therefore needs preprocessing, and each character is assigned a corresponding label through sequence and position tags. A POSITIVE label, denoted "1", is assigned where a tourist shows positive emotion toward an attribute or feature of the destination; a NEGATIVE label, denoted "-1", is assigned to comment features showing negative emotion; a NEUTRAL label, denoted "0", is assigned to comment features showing neutral emotion; and a NOTM label, denoted "-2", marks comment features that are not mentioned.
The extracted tourist attribute and fine-grained emotion word pair is represented by a tuple {ATT_nm, SO}, where ATT_nm denotes the set of all attributes, n is the primary index, m is the secondary index, and SO denotes the emotion polarity, SO = {1, 0, -1, -2}. Assume the input scenic-spot online comment is x = {w_1, w_2, ..., w_n}, where w_i (i = 1, 2, ..., n) represents an individual character of the comment text. After processing by the pre-training model, the emotion polarities of the n x m attributes are obtained, expressed as y = {(ATT_ij, 1), (ATT_(i+1)j, 0), ..., (ATT_nm, -1)}.
Given the index system {diet, price, entertainment, environment, service, travel experience}, with primary indices such as "diet", "price" and "environment" and corresponding secondary indices such as "facilities: complete", "diversity: particularly varied", "price level: expensive" and "landscape: average", the corresponding output can be expressed as y = {(A[1,1], 1), (A[2,2], -1), (A[3,j], -2), (A[4,3], 1), (A[4,1], 0), (A[5,j], -2), (A[6,j], -2)}. The tourist emotion tendency is thus {(diet (diversity), positive); (price (price level), negative); (entertainment, not mentioned); (environment (facilities), positive); (environment (landscape), neutral); (service, not mentioned); (travel experience, not mentioned)}. The output result is the tourist's finer-grained emotion tendency.
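Assuming the tuple representation above, a small Python sketch of turning the attribute-polarity pairs into readable emotion tendencies (the dimension and index names are illustrative):

```python
# Polarity values as defined by the labeling scheme of the method.
POLARITY = {1: "positive", -1: "negative", 0: "neutral", -2: "not mentioned"}

# (dimension, secondary index) -> label value, following the worked example.
y = {
    ("diet", "diversity"): 1,
    ("price", "price level"): -1,
    ("entertainment", None): -2,
    ("environment", "facilities"): 1,
    ("environment", "landscape"): 0,
    ("service", None): -2,
    ("travel experience", None): -2,
}

# Translate each labeled attribute into a human-readable tendency.
tendencies = {attr: POLARITY[v] for attr, v in y.items()}
```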
2. Pre-training model
Pre-training obtains, through self-supervised learning on large-scale data, a pre-trained model that is independent of any specific task; it is essentially a transfer learning method. In natural language processing, the pre-training model learns from large-scale data in advance to form a knowledge system containing contextual relations, and then brings this knowledge to a specific task, so that accurate task execution results can be obtained.
By feeding massive comment data into the model, the tourist online comment pre-training model outputs a contextually relevant representation of each comment and implicitly learns the grammatical and semantic knowledge of online comments. The knowledge learned in the open domain is then migrated to downstream tasks, improving their execution efficiency.
The pre-training model is optimized on the basis of the attention mechanism and the Transformer model. Its pre-training tasks are the masked language model (MLM) and next sentence prediction (NSP); together they realize word vector representation and sentence-level semantic feature extraction and mine the logical relations between sentences. The pre-training model uses bidirectional encoding to analyze contextual semantic information in a text segment and combines the attention mechanism to establish the association strength between words, so that multiple texts can be processed efficiently in parallel, the dependency relations and semantic associations between words and their context can be identified, and the problems of long-range dependency and word ambiguity are effectively alleviated. The pre-training model structure is shown in fig. 1.
In fig. 1, E denotes an input, Trm denotes a Transformer encoder, and T denotes the resulting text; the pre-training model is a stack of bidirectional Transformer encoder layers. It consists of an input layer, an encoding layer and an output layer. The input layer includes word embeddings (token embeddings), segment embeddings and position embeddings. Word embedding converts unstructured text the computer cannot understand into structured representations it can; segment embedding separates sentences by inserting the tags [CLS] and [SEP] at the beginning and end of a sentence; position embedding encodes the position of each word to preserve ordering information. The output layer is the word vector sequence obtained by processing the input words; the resulting word vectors can be applied directly to downstream tasks such as text emotion analysis.
3. Fine granularity emotion analysis model based on pre-training
The embodiment of the invention provides a pre-trained fine granularity emotion analysis model fusing fine-grained emotion analysis; the model structure is shown in figure 2.
(1) Input layer (Input context)
The input layer preprocesses the input Chinese text, removes special punctuation and symbols, and then applies word embedding, segment embedding and position embedding through the pre-training model to obtain the coding vector of the input text. For a given user set U and comment set V, any comment text sequence can be expressed as s = {w_1, w_2, ..., w_n}, where n is the total number of comment words. The corresponding word vectors are obtained through the pre-training model. Position embedding is critical: it gives each word vector its position information so the specific position of a word in the sequence can be judged, as shown in formula (1):

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (1)

where pos represents the position of the word, i represents the index of the position vector, and d_model represents the dimension of the word vector.
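A dependency-free sketch of the sinusoidal position embedding of formula (1); this assumes the standard Transformer form with sine on even dimensions and cosine on odd dimensions:

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal position embedding: even dims use sin, odd dims use cos."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # The exponent 2i/d_model of formula (1), with i stepping over dim pairs.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
```

Each position thus receives a unique, deterministic vector that the model can add to the word embeddings.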
(2) Text presentation layer (Text vectorization representation)
Text representation converts unstructured text that cannot be recognized by a computer into text vectors that can be recognized by the computer. Traditional text representations such as Word2Vec are prone to problems such as word ambiguity. Here the pre-training model represents the text: the text is pre-trained at word granularity, and the text coding vectors obtained from the input layer are fed into the pre-training model to obtain word vector encodings.
The text representation proceeds as follows: the pre-training model sets a maximum sequence length (max_seq_length) for the input text and cuts long text. For a short single sentence, the [CLS] and [SEP] flag bits are added at the head and tail respectively, and insufficient positions are filled with [PAD]. For a long input composed of two sentences, three flag bits are added: [CLS] at the head and a [SEP] after each sentence. If the length exceeds max_seq_length, the longer sentence is truncated from the tail.
Assume the input sentence is "The service is all in readiness" and the pre-training model's max_seq_length is 10. The tokenized sentence is:

Tokens=[[CLS],The,service,is,all,in,readiness,[SEP],[PAD],[PAD]]

The attention_mask of the vector representation is:

attention_mask=[1,1,1,1,1,1,1,1,0,0]

Then, token_ids are generated by means of a preset dictionary:

token_ids=[101,239,304,534,436,874,738,102,0,0]

Finally, the attention_mask and token_ids are input into the trained model for processing, yielding an embedding representation of each word.
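The tokenization, padding and id-lookup steps above can be sketched in plain Python; the vocabulary below mirrors the illustrative ids of the example and is not a real pre-training-model dictionary:

```python
# Illustrative dictionary matching the worked example; not a real model vocabulary.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
         "The": 239, "service": 304, "is": 534, "all": 436,
         "in": 874, "readiness": 738}

def encode(sentence: str, max_seq_length: int = 10):
    """Add [CLS]/[SEP], pad with [PAD], and produce attention_mask and token_ids."""
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    tokens = tokens[:max_seq_length]                 # truncate long input
    pad = max_seq_length - len(tokens)
    tokens = tokens + ["[PAD]"] * pad                # pad short input
    attention_mask = [1] * (max_seq_length - pad) + [0] * pad
    token_ids = [VOCAB.get(t, 0) for t in tokens]
    return tokens, attention_mask, token_ids

tokens, attention_mask, token_ids = encode("The service is all in readiness")
```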
(3) Emotion characteristics extraction layer (Sentiment feature extraction method)
The processed text representation vector is taken as the input vector and fed into the pre-training model, whose encoder module extracts the features. The accuracy of feature extraction is related to the number of encoder layers: the more layers the encoder has, the more accurate the extracted features.
The feature extraction layer uses a self-attention mechanism for weight distribution to reveal the importance of words in the text data set. The self-attention mechanism is a key component of the Transformer model: it maps a set of queries to corresponding sets of key and value vectors, its core is to compute the semantic relevance between words in a text, and its main function is to strengthen attention and improve feature extraction capability and accuracy. The formulas are as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V    (2)
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)          (3)
MultiHead(Q, K, V) = Concat(head_1, ..., head_n) W^O    (4)

where Q, K and V represent the input query, key and value vectors, W_i^Q, W_i^K and W_i^V represent the new matrices obtained after linear transformation, d_k represents the vector dimension, and W^O represents a weight matrix.
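A minimal pure-Python sketch of the scaled dot-product attention of formula (2), evaluated on tiny example vectors (the matrices here are illustrative, not trained projections):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V on nested lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
ctx = attention(Q, K, V)  # one context vector, a weighted mix of the value rows
```

The query attends more strongly to the first key, so the output lies closer to the first value row.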
(4) Output layer (sentiment pooling, sentiment classifier, output context)
The output layer splices and fuses the vectors obtained by feature extraction and classifies the text features with a classifier. Emotion classification here uses a Softmax classifier fine-tuned on the pre-training model. In the embodiment, a four-way classification is performed for each of the six attributes, using 6 models and classifiers, one per attribute. Softmax is an activation function that normalizes a vector of values into a probability distribution vector whose probabilities sum to 1; the Softmax layer is often used together with a cross-entropy loss function. The Softmax formula is as follows:

Softmax(z_i) = e^(z_i) / sum_j e^(z_j)    (5)

where i represents the index of the output node.
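A small sketch of the Softmax normalization used by the output layer, applied to illustrative four-way emotion logits for one attribute:

```python
import math

def softmax(z):
    """Normalize a numeric vector into a probability distribution (sums to 1)."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for one attribute: [positive, negative, neutral, not mentioned]
probs = softmax([2.0, 0.5, 0.1, -1.0])
predicted = max(range(4), key=lambda i: probs[i])   # index of the predicted class
```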
4. Index construction
Tourist emotion tendency evaluation dimension and index
To analyze tourists' fine-grained emotions, the tourist comments are first divided at multiple granularities, comprising a target layer, dimensions and indices, as detailed in table 2:
table 2 guest online comment fine-grained division table
According to table 2, each data set contains 6 dimensions: diet, price, entertainment, environment, service and travel experience, and each dimension contains 2 to 3 indices. The emotion tendency corresponding to each index is represented as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2.
5. On-line comment fine granularity emotion analysis method for tourists based on pre-training model
(1) Data source acquisition:
The data set of this embodiment is derived from online comments on the Yunnan Nationalities Village on Ctrip, at: https://you.ctrip.com/sight/kunming29/2973.html. Data are obtained in four steps: (1) initiate a request: using an http library tool, package the url carrying the Ctrip comments into a request, including the request header, browser identity, etc.; (2) acquire the web page: the server returns response data in json format according to the request initiated by the crawler program, and only the comment text and user text required for the research are saved; (3) extract the target text: key information is acquired by regular-expression matching with the re library, and all comments to be crawled are parsed and stored in a database; (4) change the url parameters and repeat step (2), then parse the comment information by regular-expression matching and store it in the database. In total, 3274 comments are crawled: the training set contains 2620 labeled comments, the validation set contains 327 labeled comments, and the test set contains 327 unlabeled comments.
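Step (3) above extracts comment text by regular-expression matching with the re library; a minimal sketch on a mock response body (the json field names here are assumptions for illustration, not Ctrip's actual format):

```python
import re

# Mock response fragment; "content" is an assumed field name, not Ctrip's real schema.
mock_response = ('{"comments":[{"content":"scenery is beautiful"},'
                 '{"content":"tickets are expensive"}]}')

def extract_comments(payload: str) -> list[str]:
    """Pull comment texts out of a response body with a non-greedy regex."""
    return re.findall(r'"content":"(.*?)"', payload)

comments = extract_comments(mock_response)
```

In practice the crawled payload would be fetched per page and each match stored in the database.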
Tourist fine granularity emotion analysis comprises data acquisition, data preprocessing, text vectorization, feature extraction, emotion classification, model performance comparison, etc., as shown in fig. 3:
(1) Data acquisition. Python crawlers are used to collect the online comments on the Yunnan Nationalities Village on Ctrip as the data set.
(2) Data preprocessing. Text is preprocessed using the python third-party library jieba, including supplementing incomplete data, data cleansing, stop-word removal, etc. Supplementing incomplete data fills in missing values; data cleansing deletes comments that are duplicated or unrelated to the topic; stop-word removal drops high-frequency words that carry no useful meaning.
(3) Emotion analysis. Emotion analysis includes attribute-emotion word extraction, data set partitioning, text vectorization, emotion classification, etc. Specifically: (1) attribute classification and labeling: the preprocessed data are classified and labeled, divided by the 6 dimensions of diet, price, entertainment, environment, service and travel experience, and each dimension is then divided at fine granularity (for example, diet is divided into diversity, taste, quantity, etc.) to obtain the objects, entities or attributes evaluated by tourists. The emotion words corresponding to each attribute of the text are found and marked, and the emotion tendency expressed for each attribute is judged as positive, negative, neutral or not mentioned, with label values 1, -1, 0 and -2. (2) Data set partitioning: sklearn's train_test_split() is called to randomly divide the data set into a training set, a validation set and a test set at a ratio of 8:1:1, and the acquired data are trained and tested. (3) Text vectorization: the segmented tokens are converted to ids and transformed into word vectors; the resulting token_ids, attn_mask and seg_ids are input into the model to obtain the vector representation of each word (hidden_reps) and the vector representation of the first token ([CLS]). (4) Attribute-emotion word extraction: comment text attribute features are extracted using the pre-training model's self-attention mechanism. (5) Emotion classification: pre-training is performed with the MLM and NSP of the pre-training model, text classification is performed on the first token ([CLS]) vector output by training, and softmax normalization then yields the probability of each text class, realizing emotion classification.
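The 8:1:1 partition in sub-step (2) is described as a call to sklearn's train_test_split(); a dependency-free sketch of the same split, sized to reproduce the 2620/327/327 counts reported above:

```python
import random

def split_dataset(data, seed=42):
    """Shuffle and split into training/validation/test at roughly 8:1:1."""
    data = list(data)
    random.Random(seed).shuffle(data)       # deterministic shuffle for the sketch
    n = len(data)
    n_val = n_test = round(n * 0.1)         # one part validation, one part test
    n_train = n - n_val - n_test            # remaining eight parts for training
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train, val, test = split_dataset(range(3274))
```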
(4) Statistical analysis and performance comparison. Statistical analysis resamples with the Bootstrap method, and performance comparison judges model quality using precision P, recall R, F1 value, AUC, etc. as evaluation indices. (1) Bootstrap resampling: according to the experimental results, the distribution proportion of the data set is readjusted and the sample size is enlarged on the basis of the existing samples. Because too few comments are labeled neutral or negative for some attributes, the training and test sets contain little such data and the results are not ideal, so resampling is needed to enlarge the proportion of neutral and negative samples for those attributes. (2) Model performance comparison: with BiLSTM and textCNN as baselines, the Precision, Recall, F1 value and AUC of the three models are compared to judge their classification effect.
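The precision P, recall R and F1 evaluation indices named above can be sketched as follows (the example labels are illustrative):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class from true vs. predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    P = tp / (tp + fp) if tp + fp else 0.0
    R = tp / (tp + fn) if tp + fn else 0.0
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

# Illustrative labels using the scheme 1/-1/0 for positive/negative/neutral.
P, R, F1 = precision_recall_f1([1, 1, -1, 0, 1], [1, -1, -1, 0, 1])
```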
6. Examples. The original texts are as follows:
1. Original text: A bad scenic spot. Visitors are made to get off at a distant parking lot and take the scenic area's electric shuttle, which charges 25 per guest. The ticket is 175 for only a few stones; the pricing amounts to nothing but charging a high ticket price. The service staff have a bad attitude. There will be no second visit.
2. Original text: The ticket is worth buying; the customs, ways of life, and so on of other ethnic groups are reflected, there is singing and dancing everywhere, the performances are rich, and various local handmade products and costumes can be seen, all well worth watching. There are few dining choices, so it is best to bring your own food.
3. Original text: The scenic area is large; eating there is expensive, and so is the entertainment. But the environment is clean, and the grassland and flowers are well maintained. It is suitable for visiting alone. It mainly introduces the ethnic minorities of Yunnan and can broaden one's knowledge.
The final output results of these examples are shown in the following table:
In another embodiment of the present invention, a tourist online comment fine-granularity emotion analysis system is provided. The system can be used to implement the above tourist online comment fine-granularity emotion analysis method and specifically includes: a data collection and storage module, a data preprocessing module, and an emotion analysis module.
Data collection and storage module: crawls the tourist comment data set corresponding to the travel destination from the network.
Data preprocessing module: supplements incomplete data in the data set, performs data cleaning, and removes stop words.
Emotion analysis module: performs attribute classification and labeling, divides the data set, vectorizes the text, extracts attribute-emotion words, and carries out emotion classification, obtaining the probability of each text class and thereby realizing emotion classification.
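The softmax normalization that the emotion analysis module relies on to turn classifier outputs into class probabilities can be sketched in a few lines; the logits below are hypothetical values for the four emotion classes:

```python
# Minimal softmax sketch: normalizes a numerical vector into a
# probability distribution, as used for the 4-way emotion classes.
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for one comment: positive, negative, neutral, not mentioned.
probs = softmax([2.0, 0.5, 0.1, -1.0])
# probs sums to 1; the largest logit receives the largest probability.
```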
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those of ordinary skill in the art will appreciate that the embodiments described herein are intended to aid the reader in understanding the practice of the invention and that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims (6)

1. A tourist online comment fine-granularity emotion analysis method, characterized by comprising the following steps:
S1: crawling the tourist comment data set corresponding to the travel destination from the network;
S2: preprocessing the data set, including: supplementing incomplete data, data cleaning, and stop word removal; supplementing incomplete data means filling in missing data; data cleaning means deleting comments that appear repeatedly or are unrelated to the subject; stop word removal means removing frequently occurring words that carry no useful meaning;
S3: emotion analysis, comprising the following sub-steps:
S31: attribute classification and labeling; classifying and labeling the preprocessed data and dividing it by dimension, the dimensions comprising: diet, price, entertainment, environment, service, and travel experience, and dividing each dimension at fine granularity to obtain the objects, entities, or attributes of tourist evaluation; finding and labeling the emotion words corresponding to each attribute of the text, and judging the emotion tendency expressed for each attribute as positive, negative, neutral, or not mentioned, with label values 1, -1, 0, and -2;
S32: dividing the data set; randomly dividing the data set into a training set, a test set, and a validation set in the ratio 8:1:1; training and testing on the acquired data sets;
S33: text vectorization; converting the segmented tokens into word vectors, and inputting the obtained token_ids, attn_mask, and seg_ids into the model to obtain the vector representation of each word (hidden_reps) and the vector representation of the first token;
S34: attribute-emotion word extraction; extracting comment text attribute features using the pre-training model and a self-attention mechanism;
S35: emotion classification; pre-training with the MLM and NSP tasks of the pre-training model, performing text classification on the vector of the first token output by the model, and then normalizing with softmax to obtain the probability of each text class, thereby realizing emotion classification.
2. The tourist online comment fine-granularity emotion analysis method according to claim 1, characterized in that: the tourist comment data set is partitioned into a target layer, dimensions, indices, and emotion tendency assignments, as shown in the following table:
each data set contains 6 dimensions: diet, price, entertainment, environment, service, and travel experience, each dimension containing 2 to 3 indices; the emotion tendency corresponding to each index is represented as positive, negative, neutral, or not mentioned, with label values 1, -1, 0, and -2 respectively.
3. The tourist online comment fine-granularity emotion analysis method according to claim 1, characterized in that the pre-training model comprises: an input layer, a text representation layer, an emotion feature extraction layer, and an output layer;
input layer: preprocesses the input Chinese text, removing special punctuation and symbols, and then applies word embedding, segment embedding, and position embedding through the pre-training model to obtain the encoding vector of the input text;
text representation layer: converts unstructured text that the computer cannot recognize into text vectors that the computer can process;
emotion feature extraction layer: takes the processed text representation vector as input to the pre-training model, and uses the encoder module of the pre-training model for feature extraction;
the feature extraction uses a self-attention mechanism to assign weights, revealing the importance of words in the text data set;
output layer: splices and fuses the vectors obtained from feature extraction of the text, and classifies the text features using a classifier.
4. The tourist online comment fine-granularity emotion analysis method according to claim 3, characterized in that the text representation layer is processed as follows: the pre-training model sets a maximum sequence length for the input text and truncates long texts; for a short single sentence, [CLS] and [SEP] flag bits are added at the head and tail of the sentence respectively, and insufficient positions are filled with [PAD]; for a long sentence composed of two sentences, the three flag bits [CLS], [SEP], [SEP] are added respectively; if the length of the long sentence exceeds max_seq_length, the longer sentence is truncated from the tail;
an attention_mask of the vector representation is then generated;
next, token_ids are generated by means of a preset dictionary;
finally, the attention_mask and token_ids are input into the trained model for processing, yielding an embedding representation of each word.
5. The tourist online comment fine-granularity emotion analysis method according to claim 3, characterized in that: emotion classification is carried out in the output layer by a softmax classifier fine-tuned from the pre-training model; a four-way classification is performed on each of the six attributes, with 6 models and classifiers, one per attribute; the Softmax activation function normalizes a numerical vector into a probability distribution vector.
6. A tourist online comment fine-granularity emotion analysis system, operable to implement the tourist online comment fine-granularity emotion analysis method of any one of claims 1 to 5, characterized by comprising: a data collection and storage module, a data preprocessing module, and an emotion analysis module;
the data collection and storage module: crawls the tourist comment data set corresponding to the travel destination from the network;
the data preprocessing module: supplements incomplete data in the data set, performs data cleaning, and removes stop words;
the emotion analysis module: performs attribute classification and labeling, divides the data set, vectorizes the text, extracts attribute-emotion words, and carries out emotion classification, obtaining the probability of each text class and thereby realizing emotion classification.
CN202310232968.6A 2023-03-10 2023-03-10 Tourist online comment fine granularity emotion analysis method and system Pending CN116737922A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310232968.6A CN116737922A (en) 2023-03-10 2023-03-10 Tourist online comment fine granularity emotion analysis method and system

Publications (1)

Publication Number Publication Date
CN116737922A true CN116737922A (en) 2023-09-12

Family

ID=87903269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310232968.6A Pending CN116737922A (en) 2023-03-10 2023-03-10 Tourist online comment fine granularity emotion analysis method and system

Country Status (1)

Country Link
CN (1) CN116737922A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858973A (en) * 2019-02-18 2019-06-07 成都中科大旗软件有限公司 A kind of analysis method of regional tourism industry development
CN111078894A (en) * 2019-12-17 2020-04-28 中国科学院遥感与数字地球研究所 Scenic spot evaluation knowledge base construction method based on metaphor topic mining
CN111310474A (en) * 2020-01-20 2020-06-19 桂林电子科技大学 Online course comment sentiment analysis method based on activation-pooling enhanced BERT model
CN112597306A (en) * 2020-12-24 2021-04-02 电子科技大学 Travel comment suggestion mining method based on BERT
CN114896969A (en) * 2022-05-12 2022-08-12 南京优慧信安科技有限公司 Method for extracting aspect words based on deep learning
CN115129807A (en) * 2022-04-06 2022-09-30 国家计算机网络与信息安全管理中心 Fine-grained classification method and system for social media topic comments based on self-attention
CN115577072A (en) * 2022-10-09 2023-01-06 辽宁工程技术大学 Short text sentiment analysis method based on deep learning
CN115630653A (en) * 2022-11-02 2023-01-20 合肥学院 Network popular language emotion analysis method based on BERT and BilSTM


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131161A (en) * 2023-10-24 2023-11-28 北京社会管理职业学院(民政部培训中心) Electric wheelchair user demand extraction method and system and electronic equipment
CN117390141A (en) * 2023-12-11 2024-01-12 江西农业大学 Agricultural socialization service quality user evaluation data analysis method
CN117390141B (en) * 2023-12-11 2024-03-08 江西农业大学 Agricultural socialization service quality user evaluation data analysis method

Similar Documents

Publication Publication Date Title
CN110134757B (en) Event argument role extraction method based on multi-head attention mechanism
CN109885672B (en) Question-answering type intelligent retrieval system and method for online education
CN109062893B (en) Commodity name identification method based on full-text attention mechanism
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
CN110390018A (en) A kind of social networks comment generation method based on LSTM
CN113505204B (en) Recall model training method, search recall device and computer equipment
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
CN108108468A (en) A kind of short text sentiment analysis method and apparatus based on concept and text emotion
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN114238573B (en) Text countercheck sample-based information pushing method and device
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN113239142A (en) Trigger-word-free event detection method fused with syntactic information
CN111143507A (en) Reading understanding method based on composite problems
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN114491023A (en) Text processing method and device, electronic equipment and storage medium
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN116629258B (en) Structured analysis method and system for judicial document based on complex information item data
CN111666375A (en) Matching method of text similarity, electronic equipment and computer readable medium
CN116562291A (en) Chinese nested named entity recognition method based on boundary detection
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN115357711A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN114840680A (en) Entity relationship joint extraction method, device, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination