CN115905508A

CN115905508A - Comment text-oriented summarization method and device

Info

Publication number: CN115905508A
Application number: CN202211347612.9A
Authority: CN
Inventors: 王亚文; 王俊杰; 王青
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2023-04-04

Abstract

The invention discloses a comment text-oriented summarization method and device. The method comprises the following steps: preprocessing the comment subject and the comment texts; calculating the content similarity between the comment subject and each comment sentence in a comment text; calculating the argumentation relationship of each comment sentence in the comment text; calculating the sentence content similarity between comment sentences in the comment text; calculating the centrality of each comment sentence based on the content similarity, the argumentation relationship and the sentence content similarity; and selecting the comment sentence with the highest centrality as the content summary of the comment text. The invention helps the relevant personnel better understand the discussion related to the comment subject.

Description

Comment text-oriented summarization method and device

Technical Field

The invention belongs to the field of computer technology, relates to technologies such as requirements engineering and natural language processing, and particularly relates to a comment text-oriented summarization method and device.

Background

The success of a software system depends on the level and quality of service it provides to its users. Requirements elicitation is the practice of identifying and collecting requirements from the stakeholders of the system under development, and it has a significant impact on the overall quality of the requirements engineering process. Content contributed by users of open source platforms is an important source of knowledge and offers broader prospects for improving software quality. Compared with traditional methods such as user surveys or interviews, using this information can greatly improve the efficiency of requirements elicitation.

Functional requirements in the issue tracking systems of open source software (e.g., GitHub Issue Tracker, Google Code Issue Tracker, and Bugzilla) are the most common source of requirements knowledge that open source platform users provide. After a user submits a functional requirement, the stakeholders of the open source software follow up on the issue and discuss the target function in comments. Such discussions often become lengthy and hard to follow as comments accumulate, especially for complex functions that may have a significant impact on the project. Furthermore, stakeholders of open source software often have different backgrounds and need in-depth discussion before reaching consensus on the target function, which exacerbates information overload in functional requirement discussions. A previous study of 82 GitHub projects showed that each project has on average 170 issue reports (including functional requirements), each issue report contains more than 20 comments, and on average 10 participants are involved in the discussion. In a typical functional requirement discussion, different stakeholders express their stance (support or opposition) on a functional requirement through detailed comments. Functional requirements on GitHub are usually tagged with a label such as "vote required", asking stakeholders for their opinion on whether to accept the functional requirement. To vote effectively, stakeholders often need to review all previous discussion, distinguish comments for and against the function (stance detection), and summarize each side's views (opinion summarization), a process that is time-consuming and laborious.

The technologies related to the invention include stance detection and text summarization.

1) The stance detection task aims to automatically identify the stance or attitude (support, opposition, neutrality, etc.) that an author expresses in a text toward a particular proposition, topic or target. This task differs from sentiment analysis, which classifies text by polarity (positive, negative or neutral). Sentiment polarity is often expressed explicitly in the text, whereas the stance held toward a topic is often more abstract and may not be mentioned directly. A text sometimes expresses support for the target topic while conveying negative sentiment. Previous studies have shown that the correlation between stance and sentiment measures is only 60%.

2) For text summarization, the invention adopts an extractive method, which uses the centrality of a sentence to decide whether the sentence should be included in the summary. In a conventional graph-based ranking algorithm, a document D consisting of a sequence of n sentences $s_1, s_2, \dots, s_n$ is usually first represented as a graph G = (V, E), where V is the vertex set of the graph and represents the sentences in the document, and E is the edge set of the graph and represents the relevance between sentences in the document. The weight $e_{ij}$ of the edge between the node pair $\langle v_i, v_j \rangle$ is typically a measure of similarity between the two sentences (e.g., the cosine similarity between their vector representations). The centrality of sentence $s_i$ can be defined as:

$$\mathrm{centrality}(s_i) = \sum_{j \in \{1, \dots, n\},\, j \neq i} e_{ij}$$

Such algorithms select the most important sentences (i.e., those with the highest centrality) as the summary, on the basic assumption that the more important a sentence is, the more similar it is to the other sentences in the text.
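The conventional graph-based ranking described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy similarity matrix are hypothetical, and the matrix is assumed to hold precomputed pairwise sentence similarities.

```python
from typing import List, Sequence


def degree_centrality(sim: Sequence[Sequence[float]]) -> List[float]:
    """Centrality of sentence i = sum of its edge weights to all other sentences."""
    n = len(sim)
    return [sum(sim[i][j] for j in range(n) if j != i) for i in range(n)]


def pick_summary(sentences: Sequence[str], sim: Sequence[Sequence[float]]) -> str:
    """Return the sentence with the highest centrality."""
    scores = degree_centrality(sim)
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best]


# Toy example: sentence 1 is the most similar to the others.
toy_sim = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
toy_sentences = ["first sentence", "second sentence", "third sentence"]
```

With the toy data, `pick_summary(toy_sentences, toy_sim)` selects the second sentence, because its similarities to the other two sentences sum to the largest value.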

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a comment text-oriented summarization method and device.

The technical content of the invention comprises:

A comment text-oriented summarization method, the method comprising:

preprocessing the comment subject and the comment texts, wherein the comment subject includes a functional requirement;

calculating the semantic similarity between the comment subject and each comment sentence in a comment text;

calculating the argumentation relationship of each comment sentence in the comment text, wherein the argumentation relationship represents the functional role and contribution of the comment sentence in the comment text;

calculating the sentence content similarity between comment sentences in the comment text;

calculating the centrality of each comment sentence based on the semantic similarity, the argumentation relationship and the sentence content similarity;

and selecting the comment sentence with the highest centrality as the content summary of the comment text.

Further, the preprocessing of the comment subject includes:

filtering out comment texts that carry insufficient information;

deleting HTML tags and unrecognizable garbled characters in the comment text;

using the tool spaCy to lower-case and lemmatize all words in the comment text;

replacing words that carry insufficient information with special symbols, wherein the special symbols include <ref>, <code>, and <url>.

Further, the calculating of the semantic similarity between the comment subject and each comment sentence in the comment text includes:

encoding each comment sentence and the comment subject into USE vectors using the Universal Sentence Encoder (USE);

and obtaining the semantic similarity between the comment subject and each comment sentence from the cosine similarity score between the USE vectors.

Further, the calculating of the argumentation relationship of each comment sentence in the comment text includes:

pre-training a DiSA model on open source data provided by existing research;

constructing a training data set from a GitHub data set to fine-tune the pre-trained DiSA model;

and inputting the comment sentence into the fine-tuned DiSA model to obtain the argumentation relationship of the comment sentence.

Further, the calculating of the sentence content similarity between the comment sentences in the comment text includes:

using a BERT pre-trained model to encode each comment sentence $s_i$ into a continuous vector representation;

and calculating the similarity between the vector representations and normalizing it to obtain the sentence content similarity between the comment sentences.

Further, the centrality of each comment sentence is calculated from the following quantities: $s_i$ denotes the i-th comment sentence, λ denotes a weight, n denotes the number of comment sentences, $e_{ij}$ denotes the sentence content similarity, $sr_i$ denotes the semantic similarity, and $ar_i$ denotes the argumentation relationship.

Further, after the comment sentence with the highest centrality is selected as the content summary of the comment text, the method further includes:

calculating the stance of the comment text, wherein the calculating of the stance of the comment text comprises:

extracting the reply relationships between the comment texts based on the preprocessed comment texts;

for the reply relationship of each comment text, generating a vector representation of the reply relationship based on the reviewer roles of the comment text, of the most recent parent comment text and of the most recent child comment text;

calculating a vector representation of the comment text, and concatenating it with the vector representation of the corresponding reply relationship to obtain the final vector representation of the comment text;

classifying the final vector representation to obtain the stance of the comment text;

and forming, based on the stances and the content summaries of the comment texts, summary lists of the different stances on the comment subject.

Further, the extracting of the reply relationships between the comment texts based on the preprocessed comment texts includes:

extracting explicit reply relationships between the preprocessed comment texts by matching specific patterns with regular expressions;

performing conversation disentanglement on the preprocessed comment texts using the irc-disentanglement model to extract implicit reply relationships between the comment texts;

in the case that at least one explicit reply relationship exists between a comment text and other comment texts, using the explicit reply relationship closest in time as the reply relationship of the comment text;

in the case that no explicit reply relationship exists between a comment text and other comment texts, using the implicit reply relationship as the reply relationship of the comment text;

in the case that neither an explicit nor an implicit reply relationship exists between a comment text and other comment texts, determining that the comment text has no reply relationship with other comment texts.

Further, the generating, for the reply relationship of each comment text, of a vector representation of the reply relationship based on the reviewer roles of the comment text, of the most recent parent comment text and of the most recent child comment text includes:

obtaining the reviewer roles of the comment text, of the most recent parent comment text and of the most recent child comment text, wherein the reviewer roles include: the publisher of the functional requirement, a member of the open source software project, a contributor to the open source software project, a collaborator of the open source software project, an ordinary user without a role, or a missing reply relationship;

vectorizing the reviewer roles of the comment text, the most recent parent comment text and the most recent child comment text to obtain the vector representation $r$ of the current reviewer role, the vector representation $r_p$ of the parent reviewer role and the vector representation $r_c$ of the child reviewer role;

concatenating the vector representations $r$, $r_p$ and $r_c$ to obtain the vector representation of the reply relationship.

A comment text-oriented summarization device, the device comprising:

a preprocessing module, used for preprocessing the comment subject and the comment texts, wherein the comment subject includes a functional requirement;

a semantic similarity calculation module, used for calculating the semantic similarity between the comment subject and each comment sentence in a comment text;

an argumentation relationship calculation module, used for calculating the argumentation relationship of each comment sentence in the comment text, wherein the argumentation relationship represents the functional role and contribution of the comment sentence in the comment text;

a sentence content similarity calculation module, used for calculating the sentence content similarity between the comment sentences in the comment text;

a centrality calculation module, used for calculating the centrality of each comment sentence based on the semantic similarity, the argumentation relationship and the sentence content similarity;

and a content summary acquisition module, used for selecting the comment sentence with the highest centrality as the content summary of the comment text.

A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the comment text-oriented summarization method.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the comment text-oriented summarization method as described above.

A computer program product which, when run on a computer device, causes the computer device to perform the above comment text oriented summarization method.

Compared with the prior art, the invention has the following technical advantages:

1) The comment subject described in natural language (the functional requirement and its related comments) is preprocessed.

2) An extractive summarization method based on graph ranking is designed: the sentence with the highest centrality is selected as the summary of each comment, and the method further incorporates the semantic relevance between each sentence in the current comment and the functional requirement, as well as the argumentation relationships among the comment sentences, to better adjust the centrality of summary sentences.

3) The stance of each comment text is obtained by extracting explicit and implicit reply relationships between comments, and the summaries of the supporting comments and of the opposing comments are combined to obtain a separate summary for each of the two stances.

Drawings

FIG. 1 is an overall flow chart of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely specific embodiments of the present invention, rather than all embodiments.

The invention aims to provide a comment text-oriented summarization method that detects the stance (support or opposition) of each comment on the function proposed in the comment subject and summarizes the views of the different stances.

The following describes the technical solution of the invention in detail, taking functional requirements in the issue tracking system of open source software as an example. It should be understood, however, that the implementation of the invention is not limited to functional requirements in such issue tracking systems, and that summarization methods for comment text in all fields fall within the protection scope of the invention.

The comment text-oriented summarization method of the invention, as shown in fig. 1, comprises the following steps.

Step 1: data pre-processing

The original functional requirements and related comments on GitHub are written in natural language and submitted by stakeholders with different backgrounds. As a result, words and sentences that carry little information appear frequently in the text, such as repeated quotations of earlier comments, code fragments and HTML tags, so the invention needs to preprocess the functional requirements and related comments.

First, the invention filters out uninformative comments, such as bot comments and picture/GIF comments, because they do not contribute to the expression of opinions. The invention also deletes HTML tags and unrecognizable garbled characters in the text. Second, the invention normalizes noisy words by using the tool spaCy to lower-case and lemmatize all words in the text, which reduces the influence of word inflection. Third, the invention replaces words that carry insufficient information, and that may confuse the model during training, with special symbols: quoted text is replaced with the symbol "<ref>", code snippets with the symbol "<code>", hyperlink addresses with the symbol "<url>", and so on.
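As a rough illustration of the cleaning steps above, the following Python sketch applies them with regular expressions only. The patterns, the `preprocess` name, and the rule that a comment consisting solely of placeholders is dropped are assumptions made for this example and are not details given in the description; lemmatization is omitted to keep the sketch free of third-party dependencies.

```python
import re

# Placeholder symbols named in the description.
REF, CODE, URL = "<ref>", "<code>", "<url>"


def preprocess(comment: str) -> str:
    """Clean one raw comment; return "" if nothing informative remains (assumed rule)."""
    text = comment
    # Replace fenced and inline code first so that their content is not touched later.
    text = re.sub(r"```.*?```", f" {CODE} ", text, flags=re.DOTALL)
    text = re.sub(r"`[^`\n]+`", f" {CODE} ", text)
    # Markdown quotations (lines starting with '>') are treated as quoted text.
    text = re.sub(r"(?m)^[ \t]*>.*$", f" {REF} ", text)
    # Hyperlink addresses.
    text = re.sub(r"https?://\S+", f" {URL} ", text)
    # Remove remaining HTML tags while keeping the three placeholders.
    text = re.sub(r"<(?!/?(?:ref|code|url)>)[^<>]+>", " ", text)
    # Lower-case and collapse whitespace.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Drop comments that contain nothing but placeholders.
    words = [w for w in text.split() if w not in (REF, CODE, URL)]
    return text if words else ""
```

For instance, a comment made up of a quoted line, an inline code span, a link and some bold HTML text is reduced to lower-cased words interleaved with `<ref>`, `<code>` and `<url>`, while a comment that contains only a code block comes back empty and can be filtered out.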

Step 2: content summarization

The method performs content summarization in three steps: semantic relevance acquisition, argumentation relationship extraction, and graph-based summary extraction. The invention treats the task as an extractive summarization task, i.e., the most informative sentence is selected as the summary. Specifically, the semantic relevance between each sentence in the comment and the functional requirement is obtained first, the argumentation relationships of the sentences in the comment are extracted, and both are then combined with the text information of each sentence in the comment and incorporated into a graph-based ranking model to extract the summary sentence.

1) Semantic relatedness acquisition

Unlike the general summarization task, content summarization here involves both the comment content to be summarized and the functional requirement, which to a certain extent determines the topic of discussion. The summary should therefore be highly relevant to the proposed functional requirement, i.e., sentences that deviate from the topic of the functional requirement are unlikely to be summary sentences. In view of this, the invention obtains the semantic relevance of each sentence in the comment to the functional description and incorporates it into the summary extraction model to improve performance.

Specifically, the invention first encodes each comment sentence and the functional description into a 512-dimensional vector using the Universal Sentence Encoder (USE). USE is a Transformer-based sentence embedding model that captures rich semantic information and has proven more effective than traditional word embedding models in many applications. The cosine similarity score between the two USE vectors is then taken as the semantic relevance between the comment sentence and the functional requirement.
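The relevance score itself is a plain cosine similarity, which the following Python sketch computes for already-encoded vectors. The short toy vectors stand in for the 512-dimensional USE embeddings, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_relevance(requirement_vec: Sequence[float],
                       sentence_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Relevance sr_i of every comment sentence to the functional requirement."""
    return [cosine_similarity(requirement_vec, s) for s in sentence_vecs]
```

A sentence whose embedding points in the same direction as the requirement embedding scores 1, and an orthogonal one scores 0, so off-topic sentences contribute little to the later centrality computation.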

2) Extraction of argumentation relationships

Argumentation relationships measure the functional role and contribution of each sentence within the whole document. A typical classification of argumentation relationships comprises MajorClaim, Claim, and Premise, where MajorClaim represents the main opinion and Premise provides support for the validity of a Claim. The invention observes that summary sentences typically share similar position patterns with some argumentation relationships, e.g., summary sentences tend to appear at the beginning or end of a paragraph, similar to the position pattern of MajorClaim, and the invention therefore takes argumentation relationships into account.

Specifically, the invention applies the DiSA model to identify the argumentation relationship of each comment sentence, i.e., MajorClaim, Claim, Premise, or Other. DiSA is designed as a neural network with self-attention that uses three sentence positions (a local position, a paragraph position and a global position) for position encoding, so it can fully capture the position characteristics of argumentation relationships (it is particularly effective for MajorClaim, which generally conveys the main idea of a text). The invention first pre-trains DiSA on open source data provided by existing research, and then constructs a data set of about 800 comments from an open source GitHub data set to fine-tune DiSA. The invention uses the fine-tuned model to obtain the probability that a sentence's argumentation relationship is predicted as MajorClaim, which is used to improve summarization performance.

3) Graph-based summary extraction

For each sentence in the comment, the invention has obtained its semantic relevance to the functional requirement (denoted sr) and its argumentation relationship within the comment (denoted ar). The invention then constructs a graph for the comment, in which the nodes are the sentences of the comment and the edges carry the similarity between two sentences. Specifically, the invention uses a BERT pre-trained model to encode sentences into continuous vector representations. After normalizing the similarities following the practice of previous research to reduce the influence of absolute values, the invention obtains the similarity score $e_{ij}$ between sentences $s_i$ and $s_j$. Finally, the centrality of sentence $s_i$ is defined in terms of three quantities: the invention regards $\sum_{j \neq i} e_{ij}$ as the semantic similarity of $s_i$ to the other sentences in the comment and $sr_i$ as the semantic relevance of $s_i$ to the functional requirement, and λ represents the weight of the two similarities. The final centrality is the product of the argumentation relationship probability $ar_i$ and the sum of the two similarities, and the sentence with the highest centrality becomes the summary of the current comment.

The hyper-parameter λ is tuned experimentally on a validation set consisting of comments and their corresponding summaries. The invention implements the content summarization module in PyTorch on the basis of the open source software PacSum.
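To make the combination of the three signals concrete, the Python sketch below scores each sentence as `ar_i * (lam * sum_{j != i} e_ij + (1 - lam) * sr_i)`. This form is an assumption: the text only states that the centrality is the product of the argumentation probability and a λ-weighted sum of the two similarities, so the convex combination used here is one plausible reading, not the confirmed formula. The similarity matrix is assumed to be normalized already, and all names are hypothetical.

```python
from typing import List, Sequence


def centrality_scores(sim: Sequence[Sequence[float]],
                      sr: Sequence[float],
                      ar: Sequence[float],
                      lam: float) -> List[float]:
    """Assumed form: ar_i * (lam * sum_{j != i} e_ij + (1 - lam) * sr_i)."""
    n = len(sim)
    scores = []
    for i in range(n):
        # Similarity of sentence i to the other sentences of the same comment.
        inter_sentence = sum(sim[i][j] for j in range(n) if j != i)
        # Weighted sum of the two similarities, scaled by P(MajorClaim).
        scores.append(ar[i] * (lam * inter_sentence + (1.0 - lam) * sr[i]))
    return scores


def select_summary_index(scores: Sequence[float]) -> int:
    """Index of the sentence with the highest centrality."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Under this reading, a sentence that is very similar to its neighbours can still lose to a sentence that is more relevant to the functional requirement and more likely to be a MajorClaim, which is the correction to plain degree centrality that the description aims for.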

Step 3: stance detection

The method performs stance detection in two steps: reply relationship extraction and polarity classification. Specifically, the reply relationships between comments are extracted first and combined with the text information of the comments, and the stance polarity (support or opposition) of each comment is then classified by fine-tuning a BERT model.

1) Reply relationship extraction

During the discussion of a functional requirement, the reply relationships between comments often indicate context switches in the discussion. Some comments have an explicit reply relationship, such as a direct mention of someone; stakeholders of open source software may also reply to others' comments implicitly when discussing functional requirements. For these two cases, the invention uses pattern matching to extract explicit reply relationships and a conversation disentanglement model to predict implicit reply relationships.

a) Pattern matching to extract explicit reply relationships: the invention extracts explicit reply relationships by matching specific patterns with regular expressions. For example, the invention recognizes the symbol "@" or the prefix symbol of quoted text, and extracts explicit reply relationships by matching the user name or the quoted text.
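The pattern matching step could look roughly like the Python sketch below. It assumes that comments are stored in chronological order, that mentions look like `@username`, and that quoted text is marked with a leading `>` as in Markdown; the `Comment` class, both regular expressions and the function name are illustrative choices rather than details taken from the description.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Comment:
    author: str
    body: str


MENTION = re.compile(r"@([A-Za-z0-9-]+)")      # assumed user-name pattern
QUOTE = re.compile(r"(?m)^>[ \t]?(.+)$")       # assumed quotation prefix


def explicit_parent(comments: List[Comment], idx: int) -> Optional[int]:
    """Index of the closest earlier comment that comments[idx] explicitly replies to."""
    body = comments[idx].body
    mentioned = {m.lower() for m in MENTION.findall(body)}
    quoted = [q.strip().lower() for q in QUOTE.findall(body)]
    # Walk backwards so that the match closest in time wins.
    for j in range(idx - 1, -1, -1):
        earlier = comments[j]
        if earlier.author.lower() in mentioned:
            return j
        if any(q and q in earlier.body.lower() for q in quoted):
            return j
    return None
```

Scanning from the newest earlier comment backwards means that, when several earlier comments match, the one closest in time is returned.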

b) Conversation disentanglement model to predict implicit reply relationships: conversation disentanglement is a common task in the natural language processing community that identifies the individual conversations in a message stream. Following past practice, comments within one disentangled conversation are considered to have a reply relationship. The invention uses the irc-disentanglement model for conversation disentanglement. The model represents a sentence by averaging the GloVe vectors of its words and uses a feed-forward network over textual features (e.g., word overlap and context) to predict the message being replied to. The model was pre-trained on 77,563 messages from the Ubuntu and Linux channels of the online communication platform Internet Relay Chat (IRC). In the scenario of the invention, all comments of a functional requirement are taken as input, and for each comment the pre-trained irc-disentanglement model predicts the comment that the current comment most likely replies to. These reply relationships, even if not 100% accurate, provide additional clues to the model for stance detection.

Of the two kinds of reply relationship, the extracted explicit reply relationship has the higher priority: when it conflicts with the extracted implicit reply relationship, the explicit relationship is retained and the implicit relationship is discarded.
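The priority rule can be written as a small resolution function. In this Python sketch, comments are assumed to be indexed chronologically, `explicit_candidates` holds the indices matched by pattern matching, and `implicit_parent` is the prediction of the disentanglement model (or `None`); the names are hypothetical.

```python
from typing import Optional, Sequence


def resolve_parent(idx: int,
                   explicit_candidates: Sequence[int],
                   implicit_parent: Optional[int]) -> Optional[int]:
    """Pick the parent of comment idx: explicit beats implicit, closest in time wins."""
    earlier = [j for j in explicit_candidates if j < idx]
    if earlier:
        # Several explicit matches: keep the one closest in time.
        return max(earlier)
    # No explicit relationship: fall back to the implicit one (may be None).
    return implicit_parent
```

A return value of `None` corresponds to a comment that has no reply relationship with any other comment.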

2) Polarity classification

The invention incorporates the reply relationships of the comment text into the BERT model used to classify the polarity of each comment. After the reply relationships are extracted, each comment has a parent comment (i.e., the comment that the current comment replies to) and a child comment (i.e., a comment that replies to the current comment); the first and last comments are special cases that have only a child comment or only a parent comment, respectively. In addition, the invention observes that the stance of a comment is highly related to the reviewer's role (e.g., contributor, collaborator), and that a change of stance is usually accompanied by a change of reviewer role. The invention therefore uses reviewer roles to represent the reply relationship: the reviewer roles of the current comment and of its most recent (temporally and spatially closest) parent and child comments represent the reply relationship of the comment. Reviewer roles are divided into the following: author, member, contributor, collaborator, none. The invention additionally adds a special label Null to mark a missing reply relationship. Reviewer roles are discrete values that are converted into continuous vectors by feeding them into an embedding layer ($r$ denotes the current reviewer role, $r_p$ the parent reviewer role, and $r_c$ the child reviewer role). The embedding layer represents each value with a continuous vector and is trained jointly with the whole model.
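Before the embedding layer, each comment needs a triple of discrete role identifiers (current, parent, child). The Python sketch below builds these triples from a list of roles and a list of parent indices. It assumes chronological indexing and takes the earliest reply as the closest child, which is an interpretation of "most recent" made for this example; the identifier mapping and function name are likewise hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Five roles from the description plus the special label for a missing relationship.
ROLES = ["author", "member", "contributor", "collaborator", "none", "null"]
ROLE_ID = {role: i for i, role in enumerate(ROLES)}


def role_triples(roles: Sequence[str],
                 parents: Sequence[Optional[int]]) -> List[Tuple[int, int, int]]:
    """Return (r, r_p, r_c) role ids per comment; 'null' marks a missing parent or child."""
    n = len(roles)
    first_child: List[Optional[int]] = [None] * n
    for i, p in enumerate(parents):
        # The first reply in time is taken as the closest child (assumption).
        if p is not None and first_child[p] is None:
            first_child[p] = i
    triples = []
    for i in range(n):
        p, c = parents[i], first_child[i]
        triples.append((
            ROLE_ID[roles[i]],
            ROLE_ID[roles[p]] if p is not None else ROLE_ID["null"],
            ROLE_ID[roles[c]] if c is not None else ROLE_ID["null"],
        ))
    return triples
```

Each identifier in a triple would then be looked up in a trainable embedding table, giving the three role vectors that describe the reply relationship of the comment.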

For each comment, the invention first feeds the comment text into the BERT model to obtain a vector (denoted $v$) that embeds the semantics of the comment. The invention then concatenates $v$ with the role vectors $r$, $r_p$ and $r_c$ to obtain a vector (denoted $v'$) that embeds both the semantics and the reply relationship of the current comment. Finally, the concatenated vector $v'$ is fed in turn into a dropout layer, to avoid overfitting, and a fully connected layer, which computes the probability vector over the polarity labels (i.e., support/neutral/oppose). Since this is a classification task, the invention uses the commonly used cross entropy as the loss function, which is defined as:

$$L(\theta) = -\sum_{i} y_i \log p_i(\theta)$$

where θ denotes the parameters of the neural network, and $y_i$ and $p_i(\theta)$ are the true polarity label and the predicted probability distribution of comment i under parameters θ, respectively.
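The classification head can be illustrated numerically without a deep learning framework. The Python sketch below concatenates the comment vector with the three role vectors, turns raw scores into probabilities with a softmax, and evaluates the cross entropy averaged over comments (a summed variant differs only by a constant factor). The label order support = 0, neutral = 1, oppose = 2 and all function names are assumptions made for this example; dropout and the fully connected layer are left out.

```python
import math
from typing import List, Sequence


def concat(v: Sequence[float], r: Sequence[float],
           r_p: Sequence[float], r_c: Sequence[float]) -> List[float]:
    """Join the comment vector and the three role vectors into one feature vector."""
    return list(v) + list(r) + list(r_p) + list(r_c)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the polarity labels."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(prob_rows: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the true label of each comment."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)
```

When the true labels are one-hot, only the probability assigned to the correct polarity enters the loss, so a confident correct prediction drives it toward zero.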

The invention uses a greedy strategy to tune the hyper-parameters of the model for optimal performance. Specifically, given a hyper-parameter P and its candidate values, the invention performs n iterations of automatic tuning and selects the value that yields the best performance as the tuned value of P. After tuning, the learning rate is set to $10^{-3}$; the optimizer is the Adam algorithm; mini-batch training with a batch size of 32 is used to accelerate the training process; and the dropout rate is set to 0.5, which means that 50% of the neurons are randomly masked to avoid overfitting. The invention implements the stance detection module in PyTorch.
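One way to read the greedy strategy is as coordinate-wise search: tune one hyper-parameter at a time while the others stay fixed, and keep the best value found. The Python sketch below follows that reading; the `evaluate` callback (which would train and validate the model), the dictionary layout and the function name are assumptions, and each candidate evaluation plays the role of one tuning iteration.

```python
from typing import Any, Callable, Dict, Sequence


def greedy_tune(candidates: Dict[str, Sequence[Any]],
                defaults: Dict[str, Any],
                evaluate: Callable[[Dict[str, Any]], float]) -> Dict[str, Any]:
    """Tune hyper-parameters one by one, keeping the value with the best score."""
    config = dict(defaults)
    for name, values in candidates.items():
        best_value, best_score = config[name], evaluate(config)
        for value in values:
            trial = {**config, name: value}
            score = evaluate(trial)  # higher is better, e.g. validation F1
            if score > best_score:
                best_value, best_score = value, score
        # Freeze the best value before moving on to the next hyper-parameter.
        config[name] = best_value
    return config
```

This search is cheap because the number of evaluations grows with the sum, not the product, of the candidate list sizes, at the cost of possibly missing interactions between hyper-parameters.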

Step 4: opinion summarization

Based on the content summaries obtained in step 2 and the stances obtained in step 3, the invention automatically obtains the stance (support or opposition) of each comment on the function proposed in the functional requirement and groups the summarized views by stance to form summary lists for the different stances.
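The final grouping is simple bookkeeping, as the Python sketch below shows. It assumes one summary sentence and one stance label per comment, with the label strings "support", "oppose" and "neutral" chosen for this example; neutral comments are left out of both lists.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_stance_summaries(summaries: Sequence[str],
                           stances: Sequence[str]) -> Dict[str, List[str]]:
    """Group per-comment summary sentences into one list per stance."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for sentence, stance in zip(summaries, stances):
        if stance in ("support", "oppose"):
            grouped[stance].append(sentence)
    return {"support": grouped["support"], "oppose": grouped["oppose"]}
```

The two resulting lists preserve the chronological order of the comments, so a reader can follow how each side's arguments developed.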

In summary, the final output of this embodiment is a summary of the supporting and opposing views in the discussion of a functional requirement, which helps stakeholders of the open source software system decide whether to accept the requested function or improvement.

The invention also discloses a comment text-oriented summarization device, which comprises: a preprocessing module, a content similarity calculation module, an argumentation relationship calculation module, a sentence content similarity calculation module, a centrality calculation module, a content summary acquisition module and an opinion summary generation module.

the content similarity calculation module is used for calculating the content similarity between the comment subject and each comment sentence in the comment text;

the argumentation relationship calculation module is used for calculating the argumentation relationship of each comment sentence in the comment text, wherein the argumentation relationship represents the functional role and contribution of the comment sentence in the comment text;

the centrality calculation module is used for calculating the centrality of each comment sentence based on the content similarity, the argumentation relationship and the sentence content similarity;

For the specific execution process, beneficial effects, etc. of the device modules, please refer to the description of the above method embodiment, which is not repeated here.

In an exemplary embodiment, there is also provided a computer device, which includes a memory and a processor, where the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the comment text-oriented summarization method.

In an exemplary embodiment, there is also provided a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the comment text-oriented summarization method as described above.

In an exemplary embodiment, there is also provided a computer program product which, when run on a computer device, causes the computer device to perform the comment text oriented summarization method described above.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for summarizing a comment text, the method comprising:

calculating the centrality of each comment sentence based on the semantic similarity, the argumentation relationship and the sentence content similarity;

2. The method of claim 1, wherein the preprocessing of the comment subject comprises:

filtering out comment texts that carry insufficient information;

deleting HTML tags and unrecognizable garbled characters in the comment text;

3. The method of claim 1, wherein the calculating semantic similarities between the comment subject and comment sentences in the comment text comprises:

4. The method of claim 1, wherein the calculating of the argumentation relationship of each comment sentence in the comment text comprises:

constructing a training data set from a GitHub data set to fine-tune the pre-trained DiSA model;

5. The method of claim 1, wherein said calculating sentence-content similarity between comment sentences in the comment text comprises:

6. The method of claim 1, wherein the centrality of each comment sentence is

7. The method of claim 1, wherein after the selecting of the comment sentence with the highest centrality as the content summary of the comment text, the method further comprises:

for the reply relationship of each comment text, generating a vector representation of the reply relationship based on the reviewer roles of the comment text, of the most recent parent comment text and of the most recent child comment text;

classifying the final vector representation to obtain the stance of the comment text;

8. The method of claim 7, wherein extracting the reply relationship between the comment texts based on the preprocessed comment texts comprises:

extracting explicit reply relationships between the preprocessed comment texts by matching specific patterns with regular expressions;

performing conversation disentanglement on the preprocessed comment texts using the irc-disentanglement model to extract implicit reply relationships between the comment texts;

9. The method of claim 7, wherein the generating, for the reply relationship of each comment text, of a vector representation of the reply relationship based on the reviewer roles of the comment text, of the most recent parent comment text and of the most recent child comment text comprises:

obtaining the reviewer roles of the comment text, of the most recent parent comment text and of the most recent child comment text, wherein the reviewer roles include: the publisher of the functional requirement, a member of the open source software project, a contributor to the open source software project, a collaborator of the open source software project, an ordinary user without a role, or a missing reply relationship;

vectorizing the reviewer roles of the comment text, the most recent parent comment text and the most recent child comment text to obtain the vector representation $r$ of the current reviewer role, the vector representation $r_p$ of the parent reviewer role and the vector representation $r_c$ of the child reviewer role;

10. An apparatus for summarizing a comment text, the apparatus comprising:

the preprocessing module is used for preprocessing the comment subjects and the comment texts; wherein the comment subject includes: functional requirements;

the argumentation relationship calculation module is used for calculating the argumentation relationship of each comment sentence in the comment text, wherein the argumentation relationship represents the functional role and contribution of the comment sentence in the comment text;

the centrality calculation module is used for calculating the centrality of each comment sentence based on the semantic similarity, the argumentation relationship and the sentence content similarity;