CN111858886B - Object and viewpoint extraction system for airport comments

Info

Publication number: CN111858886B
Application number: CN202010666697.1A
Authority: CN
Inventors: 张日崇; 李肖杨; 孙凯; 胡志元
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2020-07-13
Filing date: 2020-07-13
Publication date: 2022-05-31
Anticipated expiration: 2040-07-13
Also published as: CN111858886A

Abstract

The invention relates to an object and viewpoint extraction system for airport comments and belongs to the field of natural language processing. The system's logical architecture comprises a data input module, a data preprocessing and data dividing module, a data enhancement module, a comment object extraction module, a comment content extraction module, an object and content matching module, and a comment result output module. An improved BiLSTM-CRF-based model extracts comment objects and comment content from Chinese text. This reduces the labor cost of data annotation for emotion classification, expands the label system so that new comment objects receive attention, presents the emotional tendency toward a specific comment object in a standardized form, and finally outputs the standardized comment matching result.

Description

Object and viewpoint extraction system for airport comments

Technical Field

The invention relates to the field of natural language processing, in particular to an object and viewpoint extraction system for airport comments.

Background

Comment object extraction is a basic task in sentiment analysis and opinion mining, and is a key problem for fine-grained sentiment analysis. Its goal is to identify and extract the objects evaluated in a text, which are usually nouns or noun phrases. Comment object extraction can be divided into explicit and implicit extraction: in explicit extraction the comment object appears directly in the comment, whereas an implicitly extracted object does not appear explicitly in the comment. Three kinds of methods are generally used to solve the problem: rule-based methods, linear statistics-based methods, and deep learning-based methods. In recent years, deep learning approaches have tended to perform better in many sentiment analysis tasks. Existing deep learning methods generally treat comment object extraction as a sequence labeling problem and obtain the evaluated object by labeling the text sequence.

Most current extraction models target English, and the extracted objects generally come from general-domain data; no comment object extraction model exists for a specific domain, in particular the aviation domain. Unlike everyday comment objects, comment objects in the aviation domain are terms and phrases with professional characteristics. Meanwhile, the other words in such comments are usually colloquial and lack a complete syntactic structure. These factors make comment object extraction in the aviation domain challenging.

Disclosure of Invention

The technical scheme of the invention aims to realize an object and viewpoint extraction system for airport comments, whose system logic architecture comprises a data input module, a data preprocessing and data dividing module, a data enhancement module, a comment object extraction module, a comment content extraction module, an object and content matching module, and a comment result output module;

the data input module collects external flight comments and airport comments and inputs the comment data to the data preprocessing and data dividing module, which comprises a data preprocessing step and a data dividing step; the data preprocessing step matches keywords of an existing label system against the input unlabeled flight comment data and uses the matched keywords as comment object extraction labels, performs word segmentation on the airport comment data, screens out the nouns and noun phrases, then obtains the corresponding extraction labels through manual filtering and modification, and manually deletes abnormal data from the comment data; the data dividing step tries a plurality of division methods and selects the data combination with the lowest label repetition rate;

the data enhancement module performs synonym replacement and deduplication on the data and labels, and then performs data enhancement with the EDA algorithm to obtain a new label set;

the comment object extraction module extracts the comment object by using an improved extraction method based on a BiLSTM-CRF model and sends the obtained comment object to the object and content matching module; the comment content extraction module matches the comment text against an emotion dictionary formed by emotion words to obtain comment content with an emotional tendency and sends the comment content to the object and content matching module; the object and content matching module first filters the extracted comment objects by part of speech, retaining the comment objects that are nouns or verbal nouns, then, in each short sentence, splices the sentiment words extracted by the comment object extraction module with the comment object, and finally checks whether the spliced result appears in the comment; if it does, the spliced result is sent to the comment result output module as the final extraction result, and if it does not, the comment object alone is sent to the comment result output module as the result;

and the comment result output module is used for outputting the spliced comment result.

In the data dividing step, the label combination with the lowest label repetition rate is selected as follows: a target repetition rate of 30% and a target training set data volume are preset; when the current repetition rate is lower than the preset repetition rate, comment data corresponding to one randomly selected word is added to the test set and the repetition rate is recalculated; when the current repetition rate is higher than the preset repetition rate, the sentences corresponding to less frequent words are moved to the test set so that those words do not appear in the training set; this is repeated until the preset test set size is reached; the whole process is repeated 10 times, and the run with the lowest final repetition rate is selected as the division result; the preset repetition rate is set to 50%, 40%, 30%, or 20%.

The data enhancement step of the data enhancement module is realized by adopting an EDA algorithm, and the EDA algorithm adopts 4 random strategies to carry out data enhancement: synonym replacement, random insertion, random exchange, random deletion.

The improved comment object extraction module modifies the feature input part and the auxiliary dictionary of the BiLSTM-CRF model. In the feature input part, character vectors of Chinese characters are used, and a BERT pre-trained model is used to embed them. The position and part-of-speech features comprise two features: first, the position of a character within its word, marked with {B, M, E, S} labels by using an NLP tool; second, the part-of-speech feature, where the part of speech of the word to which each character belongs is taken as the part-of-speech feature of that character; the position and part-of-speech features pass through a bidirectional LSTM. The dictionary features are dictionary matching features based on 4-grams: the existing corpus is segmented into words, n-gram combinations are formed from the nouns obtained by segmentation, and the resulting nouns and noun phrases are added to a dictionary; for each character, the dictionary feature records whether the 4-gram combinations before and after the character appear in the dictionary, and the resulting 8-dimensional vector is the dictionary feature. The three features are concatenated, fed into a bidirectional LSTM layer, and then passed through a CRF layer to obtain the final result.

The emotion dictionary of the comment content extraction module is divided into a positive emotion dictionary, a negative emotion dictionary, and an adverb dictionary. The comment text is matched as follows: first, the whole sentence is split into several short sentences at punctuation marks; then, for each short sentence, the corresponding emotion words are matched from the positive and negative emotion dictionaries; finally, the adverbs immediately before and after each emotion word are found according to the adverb dictionary, and together they form the comment content.

The technical effects are as follows:

The invention extracts the comment objects of user comments in the application scenario of flights and airports. The extracted comment objects serve two main purposes: first, they reduce the labor cost of data labeling for emotion classification; second, new labels that differ from the existing label system can be found among the extracted comment objects, so that the label system is expanded and new comment objects receive attention. In actual business, airlines and airports pay close attention to a customer's emotional tendency toward a particular comment object. Therefore, after the comment object is extracted, the related viewpoints are extracted and matched to obtain the user's complete comment.

As a system, the invention achieves the following three technical effects:

First, there is no system for extracting review objects and review contents in the field of aviation, and it is an urgent need of airlines to develop such a system. By extracting the objects of the comments of the flights and the airports, the method can help the airlines to know the attention points and the requirements of the users, and further analyze the main opinions of the passengers. Secondly, the system is developed, so that the annotation of the related data set of emotion analysis can be assisted, and the labor cost is saved. In the comment emotion analysis task, comment objects and emotion polarities need to be labeled for a large number of comment texts. The comment object extraction system can automatically extract the comment object, saves time for marking emotion analysis, and is not limited by a set label system. Finally, the tag extraction system can discover new comment objects, not limited to extracting comment objects that appear in the training data. By extracting the comment objects from the new comment texts, new labels are often obtained, and the new labels can reflect new problems and can also be used for enriching the existing label system. Such a system has a positive guiding effect on airlines to improve the quality of service in a timely manner.

Drawings

FIG. 1: Overall structure

FIG. 2: Extraction model structure

Detailed Description

To achieve the above purpose, the system logic architecture comprises a data input module, a data preprocessing and data dividing module, a data enhancement module, a comment object extraction module, a comment content extraction module, an object and content matching module, and a comment result output module. On the data side, the lack of labels is solved first; then, to address the small variety of labels and to avoid overfitting, the data undergoes enhancement processing such as synonym replacement, deduplication, and EDA (Easy Data Augmentation). The overall structure of the model is shown in FIG. 1.

Data preprocessing step

The comment data of the invention comes from flight comments and airport comments: 30000 flight comments and 2000 airport comments, none of which is labeled initially. The labels of the flight comments are obtained by keyword matching, that is, the 187 labels of the existing label system are matched against the 30000 flight comments, and the matched labels are used as the comment object extraction labels. The labels of the airport comments are obtained by segmenting the comments into words, screening out the nouns and noun phrases, and then manually filtering and modifying them to obtain the corresponding extraction labels. Both labeling methods, keyword matching and word segmentation plus manual filtering, are used because fully manual labeling is too costly.

Abnormal data is then removed from the 32000 comments: (1) comments containing no Chinese text are deleted; (2) comments that match no label are deleted; (3) emoticons are removed; (4) garbled symbols are removed. After data cleaning, 19926 items of data remain, with 163 distinct labels in total.
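The keyword-matching and cleaning steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `match_labels` and `clean_and_label` are hypothetical, the emoji ranges are a rough approximation, and "garbled symbols" is interpreted here only as the Unicode replacement character.

```python
import re

# Basic CJK Unified Ideographs range, used to test for Chinese text.
CJK = re.compile(r"[\u4e00-\u9fff]")
# Rough emoji/pictograph ranges (an assumption, not an exhaustive list).
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def match_labels(comment, label_system):
    """Return every label of the existing label system that occurs in the comment."""
    return [label for label in label_system if label in comment]


def clean_and_label(comments, label_system):
    """Drop abnormal comments and attach keyword-matched labels to the rest."""
    kept = []
    for text in comments:
        text = EMOJI.sub("", text)         # (3) remove emoticons
        text = text.replace("\ufffd", "")  # (4) remove replacement-character debris
        if not CJK.search(text):           # (1) no Chinese text
            continue
        labels = match_labels(text, label_system)
        if not labels:                     # (2) no matching label
            continue
        kept.append((text, labels))
    return kept
```

For example, with the hypothetical label system `["餐食", "座椅"]`, a comment that mentions neither label is dropped, while a comment containing "餐食" is kept with that label attached.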

Data partitioning step

Because one objective of label extraction in the invention is to discover new labels, the extraction labels of the training set and the test set should overlap as little as possible, so that the test set measures how the model performs at extracting new labels. The data set is therefore divided by selecting the data combination with the lowest label repetition rate.

When the data is divided, several division methods are tried and the one with the lowest label repetition rate is selected. Specifically, a target repetition rate of 30% and a target training set data volume are preset first. If the current repetition rate is lower than the preset repetition rate, comment data corresponding to one randomly selected word is added to the test set and the repetition rate is recalculated; if the current repetition rate is higher than the preset repetition rate, the sentences corresponding to less frequent words are moved to the test set so that those words no longer appear in the training set. This is repeated until the target test set size is reached. The whole process is repeated 10 times, and the run with the lowest final repetition rate is selected as the division result. The preset repetition rate was decreased from 50% to 20% in steps of 10%, and experiments show that a preset rate of 30% gives the best division for the current comment data. In the end, the flight comment data is divided into 15896 training items and 4030 test items, and the label repetition rate is reduced to 33%. For the 2000 airport comments, word segmentation and manual screening are used to remove special comments without labels, which yields 1418 comments and 708 new labels that are added to the training set.
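The division procedure can be illustrated with the following Python sketch. It rests on several assumptions that the description does not state: each comment carries exactly one label, the label repetition rate is defined as the share of distinct test-set labels that also occur in the training set, and the names `repetition_rate`, `split_once`, and `best_split` are hypothetical.

```python
import random
from collections import defaultdict


def repetition_rate(train, test):
    """Share of distinct test labels that also occur in the training set."""
    test_labels = {label for _, label in test}
    if not test_labels:
        return 0.0
    train_labels = {label for _, label in train}
    return len(test_labels & train_labels) / len(test_labels)


def split_once(data, test_size, target_rate, rng):
    """One greedy division run over (comment, label) pairs."""
    train, test = list(data), []
    while len(test) < test_size and train:
        if repetition_rate(train, test) < target_rate:
            # Below target: move one random comment; its label may stay in both sets.
            moved = [rng.choice(train)]
        else:
            # At or above target: move every comment of the rarest label,
            # so that label no longer appears in the training set.
            by_label = defaultdict(list)
            for item in train:
                by_label[item[1]].append(item)
            rarest = min(by_label, key=lambda label: len(by_label[label]))
            moved = by_label[rarest]
        for item in moved:
            train.remove(item)
            test.append(item)
    return train, test


def best_split(data, test_size, target_rate=0.30, runs=10, seed=0):
    """Repeat the division and keep the run with the lowest repetition rate."""
    rng = random.Random(seed)
    splits = [split_once(data, test_size, target_rate, rng) for _ in range(runs)]
    return min(splits, key=lambda split: repetition_rate(*split))
```

Because a whole label group is moved at once, the test set may slightly exceed `test_size`; a production version would need a rule for that case, which the description leaves open.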

Data enhancement module

After the data preprocessing and data dividing steps, every item of data has corresponding extraction labels. However, owing to the limitations of the label matching method above, the roughly 20000 flight comments correspond to only 163 labels. Compared with the amount of training data, the number of labels is too small, especially for the flight comments, which tends to cause overfitting and makes it difficult to extract new labels.

First, to address the insufficient number of labels, synonym replacement is applied to the data and labels. Specifically, a Chinese synonym dictionary is used to replace the labels, and the occurrences of those labels in the comment text, with synonyms according to the proportions in the synonym dictionary, which enriches the variety of labels. The 163 labels are thereby expanded to 395 labels.

Second, because the number of comments is too large, the data is deduplicated, that is, the amount of data corresponding to each label is limited. About 4 training items are kept for each label.
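The two steps above, label synonym replacement and limiting the data per label, might look like the following Python sketch. It assumes one label per comment and picks uniformly among a label and its synonyms, which is a simplification of the "proportions in the synonym dictionary" mentioned above; the synonym dictionary format and both function names are hypothetical.

```python
import random
from collections import defaultdict


def replace_label_synonyms(samples, synonyms, rng):
    """Swap a label for itself or a synonym, in both the label and the comment text."""
    out = []
    for text, label in samples:
        new_label = rng.choice([label] + synonyms.get(label, []))
        out.append((text.replace(label, new_label), new_label))
    return out


def cap_per_label(samples, max_per_label=4):
    """Keep at most max_per_label comments for each label, in input order."""
    kept, counts = [], defaultdict(int)
    for text, label in samples:
        if counts[label] < max_per_label:
            counts[label] += 1
            kept.append((text, label))
    return kept
```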

Finally, because the texts in the data set are all short and simply structured, the EDA (Easy Data Augmentation) algorithm proposed in 2019 is selected for data enhancement. The algorithm has been shown to significantly improve the performance of natural language processing models on small data sets and to reduce overfitting. The purpose of the EDA algorithm is to generate new text with semantics similar to the existing text, and it uses 4 random strategies for data enhancement: (1) synonym replacement, in which several non-stop words are randomly selected from the text and replaced with their synonyms; (2) random insertion, in which a non-stop word is randomly chosen from the text, one of its synonyms is obtained and inserted at a random position in the sentence, and the process is repeated several times; (3) random swap, in which two words are randomly selected from the text and their positions are exchanged, repeated several times; (4) random deletion, in which each word in the sentence is removed with a fixed probability.
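The four EDA strategies listed above can be sketched in Python as follows. This is an illustrative sketch on pre-segmented word lists, assuming a hypothetical synonym dictionary that maps a word to a list of synonyms; it is not the reference EDA implementation.

```python
import random


def synonym_replacement(words, synonyms, n, rng, stop_words=frozenset()):
    """Replace up to n non-stop words that have synonyms."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in stop_words and synonyms.get(w)]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def random_insertion(words, synonyms, n, rng, stop_words=frozenset()):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w not in stop_words and synonyms.get(w)]
        if not candidates:
            break
        synonym = rng.choice(synonyms[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(list(words))]
```

Each function returns a new list and leaves its input unchanged, so the four strategies can be applied independently to the same source sentence.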

After the above processing, the final training set contains 2396 items of data and the test set contains 1440 items, with 1016 labels in total; the label repetition rate between the training set and the test set is 33%.

Comment object extraction module

After data enhancement, training data suitable for comment object extraction is obtained. For comment object extraction, the invention uses the extraction method based on the BiLSTM-CRF model proposed by Yanzeng Li et al. in 2018 and modifies its feature input part and auxiliary dictionary. The overall structure of the model is as follows.

Most existing extraction models target English and are word-based. In Chinese, the character is the basic unit that carries meaning, so this model is character-based. In the feature input part, the first feature is the character vector of each Chinese character. To improve model performance, the invention uses a BERT pre-trained model to embed the character vectors.

The position and part-of-speech features comprise two features. The first is the position of the character within its word, marked with {B, M, E, S} labels (begin, middle, end, single) obtained from an NLP tool. The second is the part of speech; because part of speech is originally a word-level property, the part of speech of the word to which each character belongs is used as the part-of-speech feature of that character. The position and part-of-speech features pass through a bidirectional LSTM.
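As an illustration of the position and part-of-speech features, the following Python sketch derives a (character, position tag, part of speech) triple for every character. It assumes the input has already been segmented and tagged by some NLP tool; the function name `char_features` and the tag strings "n", "d", and "a" are hypothetical.

```python
def char_features(segmented):
    """Map (word, pos) pairs to per-character (char, BMES tag, pos) triples."""
    feats = []
    for word, pos in segmented:
        if len(word) == 1:
            tags = ["S"]  # single-character word
        else:
            tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        # Every character inherits the part of speech of its word.
        feats.extend((ch, tag, pos) for ch, tag in zip(word, tags))
    return feats
```

In a full model, each tag and part of speech would then be mapped to an index and embedded before entering the bidirectional LSTM; that step is omitted here.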

The dictionary features are dictionary matching features based on 4-grams, and they depend on a predefined extraction dictionary. In the invention, the existing corpus is segmented into words, n-gram combinations are formed from the nouns obtained by segmentation, and the resulting nouns and noun phrases are added to the dictionary. For each character, the dictionary feature records whether the 4-gram combinations before and after the character appear in the dictionary. The resulting 8-dimensional vector is the dictionary feature.
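One possible reading of the 8-dimensional dictionary feature is sketched below in Python. The description does not spell out the exact n-gram templates, so this sketch assumes that, for each character, the spans extending 1 to 4 characters to the left and 1 to 4 characters to the right (each including the character itself) are looked up in the dictionary, giving 4 × 2 = 8 binary values. The function name `dictionary_features` is hypothetical.

```python
def dictionary_features(text, dictionary, max_context=4):
    """Return one 8-dimensional 0/1 vector per character of text."""
    feats = []
    for i in range(len(text)):
        vec = []
        for k in range(1, max_context + 1):
            # Span ending at character i with k characters of left context.
            before = text[i - k:i + 1] if i - k >= 0 else ""
            # Span starting at character i with k characters of right context.
            after = text[i:i + k + 1] if i + k < len(text) else ""
            vec.append(int(before in dictionary))
            vec.append(int(after in dictionary))
        feats.append(vec)
    return feats
```

With a hypothetical dictionary `{"登机口", "登机"}` and the text "登机口很远", the first character matches the two right-hand spans "登机" and "登机口", so its vector is `[0, 1, 0, 1, 0, 0, 0, 0]`.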

The three features are concatenated, fed into a bidirectional LSTM layer, and then passed through a CRF layer to obtain the final result.

Comment content extraction module

In practice, airlines care not only about the comment object but also about the comment content that corresponds to it. Common emotion words are usually fixed, such as "good" and "bad". By matching the comment text against an emotion dictionary formed from these emotion words, the comment content with an emotional tendency can be obtained.

Specifically, the emotion dictionary is divided into a positive emotion dictionary, a negative emotion dictionary, and an adverb dictionary. First, the whole sentence is split into several short sentences at punctuation marks. Then, for each short sentence, the corresponding emotion words are matched from the positive and negative emotion dictionaries. Finally, the adverbs immediately before and after each emotion word are found according to the adverb dictionary, and together they form the comment content.
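The dictionary-matching procedure above can be sketched in Python as follows. The dictionaries in the usage note are hypothetical toy examples, the function name `extract_comment_content` is an assumption, and the sketch only finds the first occurrence of each emotion word in a short sentence.

```python
import re

# Punctuation marks used to split a comment into short sentences.
CLAUSE_BREAK = re.compile(r"[，。！？；、,.!?;]")


def _longest_first(words):
    """Sort deterministically so longer dictionary entries are tried first."""
    return sorted(words, key=lambda w: (-len(w), w))


def extract_comment_content(comment, positive, negative, adverbs):
    """Return, for each short sentence, the emotion phrases found in it."""
    results = []
    for clause in CLAUSE_BREAK.split(comment):
        found, taken = [], []
        for word in _longest_first(positive | negative):
            start = clause.find(word)
            # Skip words that begin inside an already matched emotion word.
            if start == -1 or any(s <= start < e for s, e in taken):
                continue
            end = start + len(word)
            taken.append((start, end))
            for adv in _longest_first(adverbs):
                if clause.endswith(adv, 0, start):  # adverb directly before
                    start -= len(adv)
                    break
            for adv in _longest_first(adverbs):
                if clause.startswith(adv, end):     # adverb directly after
                    end += len(adv)
                    break
            found.append((start, clause[start:end]))
        results.append([phrase for _, phrase in sorted(found)])
    return results
```

For instance, with positive words `{"不错", "棒"}`, negative words `{"差"}`, and adverbs `{"还", "特别"}`, the comment "餐食还不错，机长技术特别棒" yields one phrase per short sentence: "还不错" and "特别棒".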

Comments	Comment content
Good meal, good blank sister face value	Can, good luck
Driving skill special stick of captain	Extraordinary bar

Table 1: Comment content extraction example

Comment object and comment content matching module

Through observation of the comment text, it can be found that if a comment object has relatively definite comment content, the comment content tends to appear near the corresponding comment object. In consideration of the characteristics, after the comment content and the comment object are respectively extracted, the comment content and the comment object are matched to obtain a complete comment.

Specifically, the extracted comment objects are first filtered by part of speech, and only the comment objects that are nouns or verbal nouns are retained. Then, in each short sentence, the sentiment words extracted by the comment object extraction model are spliced with the comment objects in that short sentence. Finally, the spliced result is checked against the comment. If it appears in the comment, it is taken as the final extraction result; if it does not, the comment object probably has no corresponding emotion word, so the comment object alone is output as the result.

Comments	Comment content	Comment object	Matching results
Good meal, good blank sister face value	Can, good luck	Diet and empty sister face value	Good food, good blank and good face value
Driving skill special stick of captain	Extraordinary bar	Driving technique	Driving skill extraordinary bar

Table 2: Comment content and comment object matching examples
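The matching step can be sketched in Python as follows. This is a minimal sketch under assumptions: comment objects arrive as (text, part of speech) pairs grouped by short sentence, the part-of-speech tags "n" (noun) and "vn" (verbal noun) follow a common Chinese tagging convention that the description does not name, and the function name `match_objects_with_content` is hypothetical.

```python
def match_objects_with_content(clauses, objects, contents, allowed_pos=("n", "vn")):
    """Splice each retained comment object with an emotion phrase from its short sentence.

    clauses:  list of short sentences
    objects:  per-sentence lists of (object text, part of speech)
    contents: per-sentence lists of extracted emotion phrases
    """
    results = []
    for clause, objs, phrases in zip(clauses, objects, contents):
        for obj, pos in objs:
            if pos not in allowed_pos:  # part-of-speech filtering
                continue
            # Keep the spliced text only if it literally appears in the sentence.
            spliced = next((obj + p for p in phrases if obj + p in clause), None)
            results.append(spliced if spliced is not None else obj)
    return results
```

When no spliced string is found in the short sentence, the bare comment object is returned, which mirrors the fallback described above.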

Claims

1. An object and viewpoint extraction system for airport reviews, characterized in that: the system logic architecture comprises a data input module, a data preprocessing and data dividing module, a data enhancement module, a comment object extraction module, a comment content extraction module, an object and content matching module and a comment result output module;

the data input module is used for collecting and inputting external flight comments and airport comments, and inputting the comment data to the data preprocessing and data dividing module, wherein the data preprocessing and data dividing module comprises two steps of data preprocessing and data dividing; the data preprocessing step matches keywords of an existing label system against the input unlabeled flight comment data and takes the matched keywords as comment object extraction labels, performs word segmentation on the airport comment data, screens out nouns and noun phrases therein, then obtains corresponding extraction labels through manual filtering and modification, and deletes abnormal data in the comment data manually; the data dividing step tries a plurality of division modes and selects the data combination with the lowest label repetition rate;

the comment object extraction module extracts a comment object by using an improved extraction method based on a BiLSTM-CRF model and sends the obtained comment object to the object and content matching module; the comment content extraction module is used for matching comment texts by utilizing an emotion dictionary formed by emotion words to obtain comment contents with emotion tendencies and sending the comment contents to the object and content matching module; the object and content matching module firstly performs part-of-speech screening on the extracted comment objects, retains the comment objects whose part of speech is noun or verbal noun, then splices the sentiment words extracted by the comment object extraction module with the comment object in each short sentence, and finally checks whether the spliced result appears in the comment; if so, the spliced result is sent to the comment result output module as the final result of extraction, and if not, the comment object is directly sent to the comment result output module as the result;

the comment result output module is used for outputting the spliced comment result;

The improved comment object extraction module modifies the feature input part and the auxiliary dictionary based on a BiLSTM-CRF model: in the feature input part, word vectors of Chinese characters are used, and a BERT pre-training model is used for embedding the word vectors; the position and part-of-speech characteristics simultaneously comprise two characteristics: firstly, the position of a character in a word is marked with characteristics by using { B, M, E, S } labels and an NLP tool, secondly, the part-of-speech characteristics, the part-of-speech of a word to which each character belongs is taken as the part-of-speech characteristics of the character, and the position and the part-of-speech characteristics pass through a bidirectional LSTM; the dictionary features are based on 4-gram dictionary matching features, the existing linguistic data are subjected to word segmentation, n-gram combinations are carried out on nouns obtained by word segmentation, the obtained nouns and noun phrases are added into a dictionary, for each character, the dictionary features judge whether the combination of the 4-gram before and after the character appears in the dictionary, and the obtained 8-dimensional vector is the dictionary feature; and splicing the position, the part-of-speech characteristic and the dictionary characteristic of the character in the word, inputting the characteristic into a bidirectional LSTM layer, and then passing through a CRF layer to obtain a final result.

2. The system for extracting objects and viewpoints of airport reviews as claimed in claim 1, wherein: the method for selecting the label combination with the minimum label repetition rate in the multiple modes in the data dividing step is specifically that a target repetition rate of 30% and the data volume of a target training set are preset, comment data corresponding to one word are randomly selected to be added into a test set when the current repetition rate is smaller than the preset repetition rate, and the repetition rate is recalculated; if the current repetition rate is greater than the preset repetition rate, sentences corresponding to the words with less frequency are taken out and added into the test set, the words are ensured not to appear in the training set, and the process is continuously repeated until the number of the preset test set is reached; the whole process is repeated for 10 times, one time with the minimum final repetition rate is selected as a dividing result, and the preset repetition rate is set to be 50% or 40% or 30% or 20%.

3. The system for extracting objects and viewpoints of airport reviews as claimed in claim 2, wherein: the data enhancement step of the data enhancement module is realized by adopting an EDA algorithm, and the EDA algorithm adopts 4 random strategies to enhance data: synonym replacement, random insertion, random exchange, random deletion.

4. The system for extracting objects and viewpoints of airport reviews as claimed in claim 3, wherein: