CN114490937A

CN114490937A - Comment analysis method and device based on semantic perception

Info

Publication number: CN114490937A
Application number: CN202210079218.5A
Authority: CN
Inventors: 王亚文; 王俊杰; 石琳; 王青
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2022-01-24
Filing date: 2022-01-24
Publication date: 2022-05-13

Abstract

The invention discloses a comment analysis method and device based on semantic perception, which comprises the following steps: collecting comment texts of the target application; dividing each of the comment texts into at least one sentence; extracting comment attributes of the comment text; splicing the vector of each word in the sentence with the vector of the comment attribute; based on the splicing result, the probability vector of the word BIO label is calculated to obtain the defect characteristics in the comment text; and clustering the defect characteristics to obtain a comment analysis result. The method models the defect characteristic extraction task into named entity identification, and improves the accuracy of phrase identification by introducing a defect characteristic identification model with comment attributes.

Description

Comment analysis method and device based on semantic perception

Technical Field

The invention belongs to the technical field of computers, relates to technologies such as requirements engineering and natural language processing, and particularly relates to a comment analysis method and device based on semantic perception.

Background

Mobile application (App) development has been active for over a decade, producing millions of Apps for a wide variety of tasks such as shopping, banking, and social interaction. These mobile applications have become increasingly indispensable in people's daily lives, and their importance has prompted development teams to try to understand users' new requirements and defect reports in order to carry out quality assurance and software maintenance activities.

Users often write comments about the mobile applications they use on platforms such as the Apple App Store and Google Play. These comments are typically short texts that can provide App developers with valuable information such as user experience, bug reports, and requests for new functionality. A full understanding of these comments helps developers improve application quality and user satisfaction. However, manually browsing and analyzing every user comment to gather the useful feedback is very time consuming, especially for popular applications that may receive hundreds of comments each day.

In recent years, automated techniques for mining App reviews have attracted much attention. Researchers have defined many tasks from different perspectives, such as topic discovery and key phrase extraction, that help reduce the effort required to understand and analyze application reviews. However, the topic discovery task is primarily used to identify the topics or aspects (e.g., compatibility, updates, network) involved in user reviews, so developers still have no way of knowing which specific functions of the App a user is complaining about. The key phrase extraction task, on the other hand, mainly relies on heuristic-based techniques (such as part-of-speech templates, syntax parse trees, and semantic dependency graphs) to extract the target phrases. These techniques cannot sufficiently understand the semantics of the comments, so their accuracy is unsatisfactory.

The technology related to the invention comprises Named Entity Recognition (NER) technology and language model pre-training technology.

1) NER is a classical sequence labeling task in Natural Language Processing (NLP). Given a sequence of words, NER aims to predict whether each word belongs to a named entity, e.g., a person's name, an organization's name, or a location. The NER task can be solved by linear statistical models, such as Maximum Entropy Markov Models, Hidden Markov Models, and Conditional Random Fields (CRF). Deep learning based techniques for the NER task typically use deep neural networks to capture sentence semantics and a CRF layer to learn sentence-level tagging rules. Typical neural network architectures include convolutional neural networks combined with CRF (Conv-CRF), long short-term memory networks combined with CRF (LSTM-CRF), and bidirectional LSTM networks combined with CRF (BiLSTM-CRF). Thanks to its bidirectional structure, the BiLSTM-CRF model can capture forward and backward information of an input sequence simultaneously, and it generally performs better than Conv-CRF and LSTM-CRF.

2) Language model pre-training techniques have been shown to be effective in improving many NLP tasks. BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based representation model that relies on pre-training: it is first trained on a raw corpus and then fine-tuned for downstream tasks (e.g., NER). Performance can be further improved by using BERT instead of BiLSTM (the resulting model is abbreviated as BERT-CRF). Through fine-tuning, the BERT-CRF model benefits from the performance gains that language models pre-trained on large general-purpose corpora bring.

Disclosure of Invention

In order to overcome the defects of the prior art and make better use of App comments, the invention provides a comment analysis method and device based on semantic perception. The method extracts, clusters, and visualizes the defect-related phrases describing software features that App users mention in comments (abbreviated as defect characteristics), so as to reveal which specific functions of a mobile application users are dissatisfied with, help App developers understand the problems users care about, locate the problematic modules, and carry out subsequent defect repair.

The technical scheme of the invention comprises the following steps:

a comment analysis method based on semantic perception comprises the following steps:

collecting comment texts of a target object, and dividing each comment text into at least one sentence;

extracting comment attributes of the comment text;

splicing the vector of each word in the sentence with the vector of the comment attribute, and calculating the probability vector of the word BIO label based on the splicing result to obtain the defect characteristics in the comment text containing the defect characteristics;

and clustering the defect characteristics to obtain a comment analysis result.

Further, the target object includes a mobile phone App.

Further, after each comment text is divided into at least one sentence, the sentences are preprocessed; the preprocessing comprises: converting words to lowercase, lemmatizing words with spaCy, correcting spelling errors, replacing digits with a first special symbol, and replacing target object names with a second special symbol.

Further, the first special symbol is <number>, and the second special symbol is <appname>.

Further, the comment attributes include: the category of the target object and the sentiment of the comment description.

Further, the defect characteristics are obtained based on a defect characteristic identification model, wherein the defect characteristic identification model comprises: a BERT model, a Dropout layer, an embedding layer, a multi-layer perceptron and a CRF layer, wherein the BERT model is used for obtaining a vector of each word in the sentence, the Dropout layer is used for avoiding overfitting when the defect characteristic identification model is trained, the embedding layer is used for obtaining a vector of the comment attribute, the multi-layer perceptron is used for calculating probability vectors of word BIO labels, and the CRF layer is used for obtaining defect characteristics in the comment text.

Further, the loss function used to train the defect characteristic identification model comprises: an emission score and a transition score.

Further, the clustering the defect characteristics includes:

1) acquiring a defect characteristic vector of the defect characteristic;

2) constructing a weighted undirected graph based on the defect characteristic vector, wherein the nodes of the weighted undirected graph are the defect characteristics, and the edges of the weighted undirected graph are constructed by comparing the similarity of the defect characteristic vector between any two nodes;

3) performing the Chinese Whispers algorithm on the weighted undirected graph to cluster the defect characteristics.

Further, the visual view of the comment analysis result comprises a bubble chart, wherein the y-axis of the bubble chart represents the name of the target object, the x-axis represents the id of the cluster, and the bubble size is defined as the ratio of the number of defect characteristics of target object a in cluster c to the total number of defect characteristics of target object a.

A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the above method when executed.

An electronic device comprising a memory and a processor, wherein the memory stores a program that performs the above described method.

Compared with the prior art, the invention has the technical advantages that:

the defect characteristic extraction task is modeled as a Named Entity Recognition (NER) task, namely, the defect characteristics are regarded as named entities, and a defect characteristic recognition model is designed to solve the problem.

The textual description of the review is encoded using a BERT model, and the defect characteristic phrases are identified using a CRF model. In particular, the invention introduces additional comment attributes (i.e., App category c and the sentiment s of the comment description) into the CRF model to further improve the accuracy of phrase identification.

A graph-based clustering method is designed, and extracted defect characteristics can be clustered according to semantic relations among phrases.

Drawings

FIG. 1 is an overall flow diagram of the present invention.

FIG. 2 is a diagram of a defect feature identification model.

FIG. 3 shows the visualization results for social Apps.

FIG. 4 shows the visualization results for communication Apps.

FIG. 5 shows the visualization results for finance Apps.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely specific embodiments of the present invention, rather than all embodiments. All other embodiments that a person skilled in the art can derive from the embodiments of the invention without creative effort fall within the protection scope of the present invention.

The comment analysis method is a requirements understanding technique: it automatically extracts defect characteristics with a defect characteristic identification model, clusters them with a graph-based clustering method, and uses a visualization method to present the distribution of the defect characteristics across the clusters. As shown in FIG. 1, the main steps comprise:

1. Data preprocessing: App user comments are crawled from an online App store and preprocessed to obtain the cleaned text descriptions of the user comments and the related comment attributes (namely App category c and the sentiment s of the comment description).

2. Defect characteristic extraction: a defect characteristic recognition model is constructed and trained so as to automatically extract defect characteristics with fine granularity. The model combines the comment text description acquired in the step 1 and two comment attributes (namely c and s) as input, so as to better model the semantics of the comment and improve the phrase extraction performance of the traditional BERT-CRF model.

3. Defect characteristic clustering: a graph-based clustering method is designed, and the defect characteristics extracted in the step 2 can be clustered according to semantic relations among the defect characteristics so as to summarize common aspects of the defect characteristics.

4. Visualization: a visualization view is provided to show the distribution of the defect characteristics after clustering and to compare the differences in defect characteristics between apps in order to better understand which functions of apps are unsatisfactory to the user.

It is easy to understand that the invention is not only applicable to App applications, but also applicable to solving the problem of phrase extraction in other similar fields.

The technology disclosed by the invention is explained in detail below for App applications:

1. Data preprocessing

a) Text data cleansing

Original App comment texts are usually submitted from a mobile device. Because they are typed on a phone keyboard, they often contain a large number of noisy words, such as repeated words, misspelled words, and abbreviations. In addition, following the practice of other CRF-based methods, the present invention takes each sentence of the comment text as an input unit. Specifically, punctuation marks are matched with a regular expression and each comment text is divided into sentences. The invention then filters out all non-English sentences using the tool langid and cleans the data with the following steps to address the noisy words:

Lowercasing: the present invention converts all words in the comment description to lowercase.

Lemmatization: the invention uses spaCy to lemmatize words so as to reduce the influence of inflectional variation.

Spelling correction: the present invention uses the general-purpose spelling correction tool contained in Ekphrasis to correct spelling errors.

Formatting: the present invention replaces all numbers with the special symbol "<number>". Furthermore, the present invention builds a list of all application names crawled from the Google Play store and replaces the application names appearing in the comment text with the uniform special symbol "<appname>", so that the BERT model treats them uniformly.
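The cleaning steps above can be illustrated with a minimal Python sketch. It covers only sentence splitting, lowercasing, and the two placeholder substitutions; language filtering, lemmatization, and spelling correction are omitted because they depend on the external tools named in the text. The function names, the splitting regular expression, and the sample app list are assumptions made for illustration and are not taken from the patent.

```python
import re

NUMBER_TOKEN = "<number>"
APPNAME_TOKEN = "<appname>"


def split_sentences(review):
    """Split a review into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [part for part in parts if part]


def clean_sentence(sentence, app_names):
    """Lowercase a sentence, then replace app names and numbers with placeholders."""
    text = sentence.lower()
    # Replace longer names first so that a multi-word name wins over its prefix.
    for name in sorted(app_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        text = re.sub(pattern, APPNAME_TOKEN, text)
    # Replace integers and simple decimals with the number placeholder.
    text = re.sub(r"\d+(?:\.\d+)?", NUMBER_TOKEN, text)
    return text


def preprocess(review, app_names):
    """Return the cleaned sentences of one review."""
    return [clean_sentence(s, app_names) for s in split_sentences(review)]


if __name__ == "__main__":
    # Hypothetical review text and app list.
    print(preprocess("Instagram crashed 3 times today. I can't send a video!",
                     ["Instagram"]))
```

App names are substituted before numbers so that a name containing digits is replaced as a whole rather than being broken up by the number rule.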

b) Comment attribute acquisition

Some attributes related to comments and App categories are helpful for extracting defect characteristics, and the invention utilizes two attributes, namely App category c and comment description emotion s, and acquires the attributes in the data preprocessing stage. The present invention includes the App category because of differences in the functionality and themes involved with apps from different categories. Further, a review description of a negative emotion is more likely to contain a defect characteristic than a review description of a positive emotion. Therefore, the present invention takes the emotional tendency described by the comment as the second attribute used in the model of the present invention.

App categories are retrieved from the Google Play store. To obtain the sentiment of each comment sentence, the invention uses SentiStrength-SE, a tool designed for the software engineering domain that determines the positive and negative sentiment strength of short texts. SentiStrength-SE assigns each sentence a positive integer score ranging from 1 (not positive) to 5 (very positive) and a negative integer score ranging from -1 (not negative) to -5 (very negative). Two scores are used because psychological studies have shown that humans process positive and negative emotions in parallel. If the absolute value of the negative score multiplied by 1.5 is greater than the positive score, the invention assigns the sentence a negative sentiment score; otherwise, the sentence is assigned a positive sentiment score.
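The rule for combining the two scores can be written as a few lines of Python. This is a sketch under one assumption: the text says only that a negative or a positive sentiment score is assigned, so returning the corresponding raw score (which yields the ten possible values -5 to -1 and 1 to 5) is an interpretation. The function name is hypothetical, and the scores are taken as given inputs rather than computed by the sentiment tool.

```python
def sentence_sentiment(positive, negative):
    """Combine a positive score (1..5) and a negative score (-5..-1) into one value.

    The negative score wins when its absolute value times 1.5 exceeds the
    positive score; otherwise the positive score is kept.
    """
    if not (1 <= positive <= 5 and -5 <= negative <= -1):
        raise ValueError("expected positive in 1..5 and negative in -5..-1")
    if abs(negative) * 1.5 > positive:
        return negative
    return positive
```

The factor 1.5 biases the decision toward the negative side: a sentence scored (2, -2) is treated as negative, while (3, -2) is treated as positive because 3.0 is not greater than 3.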

2. Defect feature extraction

The invention models the defect feature extraction task as a Named Entity Recognition (NER) task, i.e., the defect features are treated as named entities, and the problem is solved by using the commonly used CRF technology. To better capture the semantics of App reviews, the present invention encodes the textual description of the reviews using the BERT model. In addition, the invention adds additional comment attributes (namely App category c and emotional tendency s described by comments) into the CRF model so as to further improve the accuracy of defect characteristic identification. Referring to the settings of other NER tasks, the present invention tags each comment sentence using the BIO tag format, wherein,

B-tag (Beginning): the word is the beginning of the target phrase.

I-tag (Inside): the word is in the target phrase, but not the beginning of the phrase.

O-tag (Outside): the word is outside the target phrase.

The BIO-tagged review statements are input into the defect feature identification model for further processing.
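To make the BIO format concrete, the following sketch recovers the target phrases from a tagged sentence. The function name and the sample sentence are hypothetical. A stray I-tag with no preceding B-tag is treated as lying outside any phrase, which is one possible convention and not something the text specifies.

```python
def extract_phrases(tokens, tags):
    """Return the phrases marked by B/I tags in a BIO-tagged token sequence."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new phrase starts; close the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or an "I" that does not continue an open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


if __name__ == "__main__":
    # Hypothetical tokenized sentence and its tags.
    tokens = ["app", "fails", "to", "send", "a", "video", "again"]
    tags = ["O", "O", "O", "B", "I", "I", "O"]
    print(extract_phrases(tokens, tags))
```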

FIG. 2 shows the detailed structure of the defect characteristic identification model proposed by the present invention. Since App comments are short texts and involve a relatively small vocabulary, the present invention uses the pre-trained model BERT_BASE and fine-tunes it; the model has 12 layers, 768 hidden dimensions, and 12 attention heads. Each input sentence begins with the special start token [CLS] and is represented as a sequence of 128 tokens. For sentences that are not long enough, the invention pads the token sequence to a length of 128 with the special token [PAD]. From BERT, the present invention obtains n vectors (n being the length of the input sentence sequence), one for each word in the input sentence, each vector (denoted as v) having 768 dimensions. Following common practice, the output of BERT is fed into a Dropout layer to avoid overfitting during model training.

The invention then combines the comment attributes with the text vector (v) to jointly capture the implied semantics of the comment sentence. Since the previously extracted comment attributes (c and s) are discrete values, the present invention first feeds them into the embedding layer, which converts them into continuous vectors (denoted as h_c and h_s). Taking the attribute s as an example, s may take 10 values (-5 to -1 and 1 to 5); the embedding layer represents each value with a continuous vector, and these vectors are trained jointly with the other parameters of the model. Then, the invention concatenates h_c, h_s, and v to obtain one vector for each input word, and the concatenated vector is fed into a Multi-layer Perceptron (MLP). The MLP computes a probability vector (denoted as p) over the BIO tags for each word. Finally, p is fed into the CRF layer, and the most probable tag sequence of the input sequence is determined with the Viterbi algorithm. The defect characteristics are obtained from the predicted tag sequence. For example, if the output tag sequence for an input comment sentence is "<O><O><O><O><B><I><I><O><O><O>", then according to the definition of the BIO format the extracted defect characteristic is the three-word phrase at positions 5 to 7, in this example "send a video".
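The decoding step in the CRF layer can be illustrated with a small pure-Python version of the Viterbi algorithm. This is a sketch and not the patent's implementation: the per-word scores stand in for the MLP output p, the transition table stands in for the CRF parameters, and all numbers in the example are made up. Start and end transitions are left out for brevity.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]   score of tag j for the word at position t
    transitions[i][j] score of moving from tag i to tag j
    """
    n_tags = len(tags)
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_tags):
            prev = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to the first word.
    path = [max(range(n_tags), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return [tags[j] for j in path]


if __name__ == "__main__":
    tags = ["B", "I", "O"]
    # Made-up transition scores: only the move O -> I is heavily penalised.
    transitions = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -10000.0, 0.0]]
    emissions = [[0.1, 0.2, 2.0], [1.0, 1.2, 0.5], [0.1, 2.0, 0.3], [0.0, 0.1, 2.0]]
    print(viterbi(emissions, transitions, tags))
```

In the example, picking the best tag for each word independently would give O, I, I, O, which is not a valid BIO sequence because an I-tag follows an O-tag. Decoding over the whole sequence with the transition scores yields O, B, I, O instead, which shows why a CRF layer is placed on top of the per-word scores.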

The loss function of the model should measure the likelihood of the entire sequence of true tags, rather than the likelihood of the true tag of each word in the sequence, so the commonly used cross-entropy loss does not fit this case. The loss function used by the present invention comprises two parts, the emission score and the transition score, and is calculated as:

$$ s([x]_1^T, [l]_1^T, \tilde{\theta}) = \sum_{t=1}^{T} \left( [A]_{[l]_{t-1},[l]_t} + [f_\theta]_{[l]_t,t} \right) $$

where $[x]_1^T$ represents a sentence sequence of length T, $[l]_1^T$ represents the tag sequence of the sentence, and $[f_\theta]_{[l]_t,t}$ is the emission score, which is the output of the MLP with parameters θ. $[A]_{i,j}$ is the transition score, calculated from the parameters of the CRF layer, which models the transition from the i-th state to the j-th state. $\tilde{\theta} = \theta \cup \{[A]_{i,j}\ \forall i,j\}$ denotes the new parameters of the whole network. The loss of a sentence sequence $[x]_1^T$ with its corresponding tag sequence $[l]_1^T$ is the sum of the emission scores and the transition scores.
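A small numeric sketch may help to read the score described above. The first function sums the emission and transition scores along one tag sequence. The second turns that score into a negative log-likelihood by normalising over all possible tag sequences, which is the usual way such a score is used for CRF training but is an assumption here, since the text only states that the loss is built from the two kinds of scores. The transition into the first tag is omitted, the normalisation is done by brute force (feasible only for toy inputs), and all names and numbers are illustrative.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, tag_ids):
    """Sum of emission and transition scores along one tag sequence."""
    score = emissions[0][tag_ids[0]]
    for t in range(1, len(tag_ids)):
        score += transitions[tag_ids[t - 1]][tag_ids[t]] + emissions[t][tag_ids[t]]
    return score


def negative_log_likelihood(emissions, transitions, gold_tag_ids):
    """-log p(gold tags | sentence), normalised over every possible tag sequence."""
    n_tags = len(transitions)
    scores = [
        sequence_score(emissions, transitions, seq)
        for seq in product(range(n_tags), repeat=len(emissions))
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_z - sequence_score(emissions, transitions, gold_tag_ids)


if __name__ == "__main__":
    emissions = [[1.0, 0.0], [0.0, 2.0]]      # two words, two tags
    transitions = [[0.5, -1.0], [0.0, 0.3]]
    print(sequence_score(emissions, transitions, [0, 1]))
    print(negative_log_likelihood(emissions, transitions, [0, 1]))
```

In practice the normaliser is computed with the forward algorithm rather than by enumeration, since the number of tag sequences grows exponentially with the sentence length.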

When training the model, the invention uses a greedy strategy to tune the hyper-parameters for optimal performance. Specifically, given a hyper-parameter P and its candidate values {v₁, v₂, ..., v_n}, n iterations are performed, one per candidate value, and the value that yields the best performance is selected as the tuned value of P. Through this tuning, the learning rate of the model is set to 10^-4. The Adam algorithm is chosen as the optimizer. To accelerate training, the invention uses the mini-batch technique with a batch size of 32. The drop rate of the Dropout layer is set to 0.1, which means that 10% of the neurons in the network are randomly masked to avoid overfitting. The defect characteristic identification model is implemented with Transformers, an open-source PyTorch library for natural language understanding and natural language generation.
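The greedy tuning strategy can be sketched as a loop that tunes one hyper-parameter at a time while holding the others fixed. The function name, the search space, and the toy scoring function below are hypothetical; in a real run the scoring function would train the model with the given configuration and return a validation metric.

```python
def greedy_tune(search_space, evaluate, defaults):
    """Tune hyper-parameters one at a time, keeping the best value found for each.

    search_space maps a hyper-parameter name to its candidate values,
    evaluate(config) returns a score where higher is better,
    defaults gives the starting value of every hyper-parameter.
    """
    best_config = dict(defaults)
    for name, candidates in search_space.items():
        best_value, best_score = None, float("-inf")
        for value in candidates:  # one iteration per candidate value
            trial = dict(best_config, **{name: value})
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = value, score
        best_config[name] = best_value
    return best_config


if __name__ == "__main__":
    def toy_score(config):
        # Stand-in for "train the model and measure validation performance".
        return (-abs(config["learning_rate"] - 1e-4) * 1000
                - abs(config["batch_size"] - 32) / 100
                - abs(config["dropout"] - 0.1))

    space = {
        "learning_rate": [1e-3, 1e-4, 1e-5],
        "batch_size": [16, 32, 64],
        "dropout": [0.1, 0.3, 0.5],
    }
    start = {"learning_rate": 1e-3, "batch_size": 16, "dropout": 0.5}
    print(greedy_tune(space, toy_score, start))
```

This approach evaluates the sum of the candidate counts rather than their product, at the cost of possibly missing interactions between hyper-parameters that a full grid search would find.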

3. Defect feature clustering

Some of the defect characteristics extracted in step 2 may be different in expression but semantically similar, and this step clusters the extracted defect characteristics based on semantic relationships in order to provide an aggregated view of the defect characteristics. The traditional topic model uses a statistical technique (such as Gibbs sampling) based on word co-occurrence patterns, but the technique is not suitable for short texts (such as defect characteristics in the present invention), because the model has difficulty in capturing the co-occurrence patterns from the short texts, and therefore the model should consider semantic information of the defect characteristics during clustering. Furthermore, these topic models require a specified number of clusters/topics, but this parameter is difficult to determine in advance in the context of the present invention. In order to meet the challenges, the invention designs a graph-based clustering method which utilizes the semantic relation of defect characteristics and can accurately cluster the defect characteristics. The method comprises the following specific steps:

a) vectorization: the extracted defect characteristics are converted into a 512-dimensional vector using a Universal Sentence Encoder (USE). USE is a Transformer-based sentence embedding model that can capture rich semantic information of sentences and has proven to be more efficient than the traditionally used word embedding model on many tasks.

b) Graph construction: A weighted undirected graph is constructed in which each defect characteristic serves as a node and the cosine similarity score between the USE vectors of two defect characteristics serves as the weight between the corresponding nodes. An edge is added between two nodes only if their similarity score exceeds a certain threshold. The threshold is a hyper-parameter that must be given in advance; it measures the semantic relevance between defect characteristics, and a higher threshold results in more cohesive clusters. The present invention sets the threshold to 0.5 by tuning on the training data.

c) Graph clustering: The Chinese Whispers (CW) algorithm, an efficient graph clustering algorithm, is run on the weighted undirected graph constructed above to cluster the defect characteristics.

The graph-based clustering method can group semantically similar defect characteristics into the same topic. The clustering method is implemented in Python on top of open-source implementations of USE and CW.
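Steps b) and c) can be sketched in pure Python as follows. The sketch assumes that the phrase vectors are already available; the two-dimensional toy vectors in the example merely stand in for the 512-dimensional USE embeddings. The Chinese Whispers loop is a simplified version written for illustration (fixed number of iterations, seeded shuffling, ties broken by insertion order) and is not the open-source implementation mentioned in the text.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(vectors, threshold=0.5):
    """Weighted undirected graph as an adjacency dict: edges[i][j] = similarity."""
    edges = {i: {} for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            similarity = cosine(vectors[i], vectors[j])
            if similarity > threshold:
                edges[i][j] = similarity
                edges[j][i] = similarity
    return edges


def chinese_whispers(edges, iterations=20, seed=0):
    """Give each node the label with the largest total edge weight among its neighbours."""
    rng = random.Random(seed)
    labels = {node: node for node in edges}  # every node starts in its own cluster
    nodes = list(edges)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for node in nodes:
            if not edges[node]:
                continue  # an isolated node keeps its own label
            totals = {}
            for neighbour, weight in edges[node].items():
                label = labels[neighbour]
                totals[label] = totals.get(label, 0.0) + weight
            labels[node] = max(totals, key=totals.get)
    return labels


if __name__ == "__main__":
    toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
    print(chinese_whispers(build_graph(toy_vectors, threshold=0.5)))
```

Because labels spread only along edges, the number of clusters does not have to be chosen in advance; it follows from the threshold and the structure of the graph.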

4. Visualization

In order to demonstrate the clustering results of the defect characteristics across multiple Apps more intuitively, the invention provides a visual view in the form of a bubble chart, as shown in FIGS. 3-5. The y-axis represents the names of the Apps under study, and the x-axis represents the id of each cluster. The bubble size of application a in cluster c (denoted as s_{a,c}) is defined as the ratio of the number of defect characteristics of application a in cluster c to the total number of defect characteristics of application a, i.e., the percentage of all defect characteristics of application a that belong to cluster c. The aim is to remove the effect of the absolute number of defect characteristics on the bubble size and to focus only on their relative share.
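The bubble size is a simple ratio, as the following sketch shows. The input format (a list of app and cluster pairs, one per extracted defect characteristic) and the function name are assumptions made for illustration.

```python
from collections import Counter


def bubble_sizes(assignments):
    """Map (app, cluster) to the share of the app's defect characteristics in that cluster.

    assignments is a list of (app, cluster_id) pairs, one per defect characteristic.
    """
    per_app = Counter(app for app, _ in assignments)
    per_cell = Counter(assignments)
    return {
        (app, cluster): count / per_app[app]
        for (app, cluster), count in per_cell.items()
    }


if __name__ == "__main__":
    # Hypothetical apps "A" and "B" with cluster ids 0, 1 and 2.
    pairs = [("A", 0), ("A", 0), ("A", 1), ("A", 2), ("B", 1)]
    print(bubble_sizes(pairs))
```

For each app the sizes sum to 1, so an app with few extracted defect characteristics is drawn on the same scale as an app with many.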

In addition, when the cursor hovers over a bubble, the view will also show the details of this cluster (as shown in FIG. 5), including the cluster name, the number of defect characteristics, and an example of the corresponding defect characteristics and review description. For cluster names, the present invention first finds the most frequently occurring nouns or verbs (denoted as w) from all the defect characteristics contained in the cluster. The invention then counts the number of defect features that contain w and takes the phrase with the highest frequency of occurrence as the cluster name (i.e., the defect feature that is most representative of the cluster). By comparing the relative sizes of the bubbles, the distribution situation of the defect characteristics in different apps can be intuitively known, so that a developer can know which functions of the apps are not satisfied by a user.
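The cluster naming rule can be approximated with two frequency counts. The text selects the most frequent noun or verb, which requires part-of-speech tagging; the sketch below replaces that step with a small stop-word list, which is a deliberate simplification and an assumption. The function name, the stop-word list, and the sample phrases are hypothetical.

```python
from collections import Counter

# Crude stand-in for keeping only nouns and verbs (assumption, not from the patent).
STOP_WORDS = {"a", "an", "the", "to", "my", "of", "in", "on", "for"}


def cluster_name(phrases):
    """Pick the most frequent phrase among those containing the most frequent word w."""
    word_counts = Counter(
        word
        for phrase in phrases
        for word in set(phrase.split())  # count each word once per phrase
        if word not in STOP_WORDS
    )
    w = word_counts.most_common(1)[0][0]
    phrase_counts = Counter(p for p in phrases if w in p.split())
    return phrase_counts.most_common(1)[0][0]


if __name__ == "__main__":
    cluster = ["send a video", "send a video", "send video",
               "upload photo", "send message"]
    print(cluster_name(cluster))
```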

Experimental data

Table 1 details the statistics of the annotated data set D₁. D₁ contains comment data from 6 Apps in three categories (2 in each category), submitted between August 2019 and January 2020. For each App, approximately 550 reviews (about 1,500 sentences) were randomly drawn, totaling 3,426 reviews and 8,788 sentences. The comments were then manually labeled to obtain ground-truth labels for verifying the performance of the method. To ensure the accuracy of the labeling, two authors independently labeled the comments of the same App, i.e., marked the start and end positions of each target phrase (defect characteristic) in each comment sentence. A third author then compared the two sets of annotations, and the three met to discuss the differences and determine the final annotation. All six Apps followed the same labeling procedure. Cohen's Kappa between the two annotators was 0.78 for the first labeled App (Instagram) and 0.86 for the last labeled App (Chase Mobile). After two rounds of labeling, consensus was reached on every annotation, yielding a total of 3,090 defect characteristics.

TABLE 1 Annotated data set D₁

Baseline method for defect extraction comparison:

BiLSTM-CRF: a commonly used algorithm in sequence tagging tasks (e.g., NER). It is a deep learning-based technique in which a BiLSTM captures sentence semantics and a CRF layer learns sentence-level tagging rules.

KEFE: a recent method for identifying key functions from App comments. Key functions in that work are functions that are highly correlated with App scores. KEFE first uses a pattern-based filter to obtain candidate phrases and then uses a BERT-based classifier to identify function-related phrases. Since its patterns are designed for Chinese, we replace the filter with the patterns in SAFE to process English comments.

Caspar: a method for extracting and synthesizing mini-stories about App problems reported by users in App comments. We take its first step (i.e., event extraction) as a baseline. An event in that method is a phrase that is rooted in a verb and contains other attributes related to the verb. In particular, Caspar employs pattern-based and grammar-based NLP techniques to extract phrases or clauses. We used the implementation provided with the original paper.

SAFE: a method that extracts defect characteristics from comments using 18 part-of-speech patterns. For example, the 'verb-adjective-noun' pattern can extract defect characteristics such as 'delete old mail'. We implemented all 18 patterns with the NLP toolkit NLTK to extract phrases.

Evaluation indexes are as follows:

The present invention uses three common metrics, Precision, Recall, and F1-Score, to evaluate its performance in defect characteristic extraction. A phrase extracted from a comment sentence is considered correctly extracted if it is identical to the ground-truth phrase.

The three metrics are calculated as follows:

Precision is the ratio of the number of correctly extracted phrases to the total number of extracted phrases.

Recall is the ratio of the number of correctly extracted phrases to the total number of actual phrases.

F1-Score is the harmonic mean of Precision and Recall.
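The three metrics can be computed with exact phrase matching as follows. Representing each extracted phrase as a pair of sentence id and phrase text is an assumption made so that identical phrases in different sentences are counted separately; the function name and the sample data are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Exact-match metrics over sets of (sentence_id, phrase) pairs."""
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    predicted = {(0, "send a video"), (1, "log in"), (2, "dark mode")}
    actual = {(0, "send a video"), (1, "log in"),
              (2, "switch to dark mode"), (3, "upload photo")}
    print(precision_recall_f1(predicted, actual))
```

In the example, "dark mode" does not count as correct because it is not identical to the ground-truth phrase "switch to dark mode", which illustrates how strict the exact-match criterion is.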

The experimental results are as follows:

Finally, it should be noted that the described embodiments are only some embodiments of the present application and not all embodiments. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application.

Claims

1. A comment analysis method based on semantic perception comprises the following steps:

collecting comment texts of a target application, and dividing each comment text into at least one sentence;

extracting comment attributes of the comment text;

splicing the vector of each word in the sentence with the vector of the comment attribute, and calculating the probability vector of the word BIO label based on the splicing result to obtain the defect characteristic of the comment text containing the defect characteristic;

and clustering the defect characteristics to obtain a comment analysis result.

2. The method of claim 1, wherein the target application comprises a mobile phone App.

3. The method of claim 1, wherein after each of the comment texts is divided into at least one sentence, the sentences are preprocessed; the preprocessing comprises: converting words to lowercase, lemmatizing words with spaCy, correcting spelling errors, replacing digits with a first special symbol, and replacing the name of the target application with a second special symbol.

4. The method of claim 3, wherein the first special symbol is <number> and the second special symbol is <appname>.

5. The method of claim 1, wherein the comment attributes comprise: the category of the target application and the sentiment of the comment description.

6. The method of claim 1, wherein the defect characteristics are derived based on a defect characteristic identification model, wherein the defect characteristic identification model comprises: a BERT model, a Dropout layer, an embedding layer, a multi-layer perceptron and a CRF layer, wherein the BERT model is used for obtaining a vector of each word in the sentence, the Dropout layer is used for avoiding overfitting when the defect characteristic identification model is trained, the embedding layer is used for obtaining a vector of the comment attribute, the multi-layer perceptron is used for calculating probability vectors of word BIO labels, and the CRF layer is used for obtaining defect characteristics in the comment text.

7. The method of claim 6, wherein the loss function used to train the defect characteristic identification model comprises: an emission score and a transition score.

8. The method of claim 1, wherein said clustering said defect characteristics comprises:

1) acquiring a defect characteristic vector of the defect characteristic;

2) constructing a weighted undirected graph based on the defect characteristic vector, wherein the nodes of the weighted undirected graph are the defect characteristics, and the edges of the weighted undirected graph are constructed by comparing the similarity of the defect characteristic vectors between any two nodes;

3) performing the Chinese Whispers algorithm on the weighted undirected graph to cluster the defect characteristics.

9. The method of claim 1, wherein the visual view of the comment analysis result comprises a bubble chart, wherein the y-axis of the bubble chart represents the name of the target application, the x-axis represents the id of the cluster, and the bubble size is defined as the ratio of the number of defect characteristics of target application a in cluster c to the total number of defect characteristics of target application a.

10. An electronic apparatus comprising a memory having a computer program stored therein and a processor arranged to execute the computer program to perform the method according to any of claims 1-9.