CN111930930A - Abstractive review summary generation method based on commodity aspect alignment - Google Patents
Abstractive review summary generation method based on commodity aspect alignment
- Publication number
- CN111930930A (application number CN202010663601.6A)
- Authority
- CN
- China
- Prior art keywords
- comment
- abstract
- commodity
- comments
- review
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0641—Shopping interfaces
- G06Q30/0643—Graphical representation of items or shoppers
Abstract
The invention discloses an abstractive review summary generation method based on commodity aspect alignment, which comprises the following steps: (1) obtaining commodity review data, grouping the reviews within a commodity, screening low-quality reviews and high-quality reviews, and constructing the screened reviews into a multi-review summary dataset by means of aspect alignment; (2) establishing a sequence model comprising an encoder, a decoder, and an attention mechanism based on a recurrent neural network, with an additional aspect-based attention mechanism; (3) training the sequence model on the multi-review summary dataset until the model converges; (4) performing the review summary generation task with the trained model, which automatically generates a summary after the reviews of a commodity are input. With this method, (review set, summary) pairs can be efficiently constructed for neural network training, greatly reducing the cost of manual annotation, and the trained model can generate high-quality multi-review summaries.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to an abstractive review summary generation method based on commodity aspect alignment.
Background
Review systems are intended to help users make better purchase decisions when shopping online, and have become an important part of an active e-commerce environment. However, when the number of reviews is large, a user cannot efficiently process the historical reviews, especially since most reviews lack key information. Moreover, the reviews of e-commerce products are often short, receive few likes, and cover few commodity aspects (Aspect). Therefore, a multi-review summarization system is needed to help users efficiently use multiple reviews and digest the most relevant information.
Some previous work was extractive, focusing on predicting the overall rating of an entity or estimating the ratings of different product features. Abstractive methods may be more suitable for summarizing review text, because in the multi-document setting an extractive review summary can be too verbose or biased toward certain sources. Previous abstractive multi-review summarization work employed unsupervised approaches that reduce multi-review summarization to selecting a subset of the best phrases and then generating the summary with natural language generation (NLG). The scarcity of training data consisting of manually written review summaries is the bottleneck of multi-review summarization, so it is difficult to define a supervised learning paradigm for large-scale evaluation of emerging methods. This has led researchers to employ complex preprocessing methods and predefined rules to summarize reviews.
Modern neural-network-based models depend to a large extent on the quality of the available training data, and have achieved state-of-the-art performance in abstractive text summarization. Such models typically employ an encoder-decoder structure plus attention mechanisms. For example, "A Neural Attention Model for Abstractive Sentence Summarization", published at the natural language processing conference EMNLP 2015 (pp. 379-389), proposed an attention-based summarization model, and "Get To The Point: Summarization with Pointer-Generator Networks", published at the computational linguistics conference ACL 2017, proposed a generation model based on a pointer mechanism, greatly improving performance on the task.
These methods improve automatic summarization performance using datasets such as Gigaword, the New York Times dataset, and the CNN/Daily Mail corpus. However, datasets for multi-review summarization are few and very costly to build. Moreover, such models do not consider the most critical element of commodity reviews, namely commodity aspects (Aspect), such as quality, logistics, and customer service.
Disclosure of Invention
The invention provides an abstractive review summary generation method based on commodity aspect alignment, which can build training data from large-scale unsupervised review corpora and generate high-quality multi-review summaries in terms of fluency, diversity, informativeness, and the like.
An abstract review summary generation method based on commodity aspect alignment comprises the following steps:
(1) obtaining commodity review data, grouping the reviews within a commodity, screening low-quality reviews and high-quality reviews, and constructing the screened reviews into a multi-review summary dataset by means of aspect alignment;
(2) establishing a sequence model, wherein the sequence model comprises an encoder, a decoder and an attention mechanism based on a recurrent neural network; simultaneously adding an attention mechanism based on aspects;
(3) training the sequence model by using the multi-comment abstract data set until the model converges;
(4) performing the review summary generation task with the trained model, which automatically generates a summary after the reviews of a commodity are input.
In the present invention, a commodity aspect (Aspect) refers to an attribute such as quality, logistics, or customer service mentioned in the reviews.
The method first constructs, from large-scale unsupervised commodity review corpora and based on aspect alignment, (review set, summary) pairs used for training an end-to-end neural summarization model, and then builds a neural network model with an encoder-decoder, an attention mechanism, and an aspect attention mechanism; after training on these data, the model can directly generate review summaries. The method reduces the cost of manually annotating an end-to-end review summary dataset, and the trained model can generate multi-review summaries of high quality in terms of fluency, diversity, informativeness, and the like.
The specific process of the step (1) is as follows:
(1-1) collecting review data of a commodity, pruning automatically generated low-quality reviews (reviews regarded as meaningless according to certain heuristic rules), and deleting high-frequency reviews that appear more than 20 times under the same product;
(1-2) running an aspect extractor on the remaining review data and deleting reviews that do not cover any predefined aspect, thereby obtaining, for each product, reviews covering different aspects; these reviews convey the user's opinion on one or more aspects of the product;
(1-3) for each product, first finding high-quality reviews with more than 10 likes and covering more than 3 aspects; then, for each corresponding aspect, finding 10-40 low-quality reviews to form a low-quality review set, where a low-quality review has fewer than 1 like and covers only one aspect;
(1-4) repeating the above steps to generate multiple (review set, summary) pairs, each consisting of a low-quality review set and the corresponding high-quality review, as the review summary dataset.
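For illustration, the screening and pairing in steps (1-1) to (1-4) can be sketched as follows. The like-count and aspect-count thresholds follow the text; the `Review` structure and function names are hypothetical, and the per-aspect grouping is simplified to a single pooled set of low-quality reviews whose aspects are covered by the high-quality review.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Review:
    text: str
    likes: int
    aspects: frozenset  # aspects found by the aspect extractor (hypothetical field)

def build_pairs(reviews_by_product: Dict[str, List[Review]]
                ) -> List[Tuple[List[Review], Review]]:
    """Build (low-quality review set, high-quality summary review) pairs."""
    pairs = []
    for _, reviews in reviews_by_product.items():
        # step (1-2): drop reviews that cover no predefined aspect
        reviews = [r for r in reviews if r.aspects]
        # step (1-3): high-quality = more than 10 likes and more than 3 aspects
        highs = [r for r in reviews if r.likes > 10 and len(r.aspects) > 3]
        for h in highs:
            # low-quality = fewer than 1 like, exactly one aspect,
            # and that aspect covered by the high-quality review
            lows = [r for r in reviews
                    if r.likes < 1 and len(r.aspects) == 1
                    and r.aspects <= h.aspects]
            if 10 <= len(lows) <= 40:  # keep sets of 10-40 low-quality reviews
                pairs.append((lows, h))
    return pairs
```

The high-quality review then serves as the reference summary for its aligned low-quality review set during training.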
In step (2), the sequence model mainly comprises an encoder and a decoder, with long short-term memory (LSTM) networks as its main parameterized components; the recurrent neural network in the encoder is a bidirectional LSTM.
The sequence model adds an attention mechanism between the generated words and the original texts of the review set, and an attention mechanism between the generated words and the aspect (Aspect) vectors.
In the step (2), the working process of the sequence model is as follows:
(2-1) the encoder encodes each word of each comment in the comment set into a vector, and the vector generated for each comment generates a corresponding comment vector through a self-attention mechanism;
(2-2) randomly initializing an aspect vector for each predefined aspect;
(2-3) applying co-attention (Co-Attention) between the set of review vectors and the set of aspect vectors;
(2-4) at each decoding step, using the hidden vector h_t generated by the decoder to perform attention-weighted summation over the vector of each word of each review in the review set, the co-attended review vectors, and the co-attended aspect vectors, integrating them into a context vector c_t; the hidden vector h_t and the context vector c_t are then combined by a linear function and fed into a softmax function to obtain the probability distribution of the predicted sequence.
In step (2-4), the formula of the softmax function is:
P(y_t | y_{<t}, x) = softmax(W_p c_t + W_q h_t + b_p)
where W_p, W_q, and b_p are parameters to be trained, and y_t is the t-th word output by the decoder.
In step (3), an Adam optimizer is used to train a sequence model on the multi-review summary dataset.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a commodity-aspect-alignment-based method that can efficiently construct (review set, summary) pairs for neural network training, greatly reducing the cost of manual annotation and yielding a higher-quality abstractive multi-review summary dataset.
2. The invention fully utilizes commodity information to improve the traditional encoder-decoder-plus-attention neural text summarization model, making it better suited to the multi-review summarization setting; the trained model can generate multi-review summaries of high quality in terms of fluency, diversity, informativeness, and the like.
Drawings
FIG. 1 is a schematic diagram of the overall structure of a multi-comment summary sequence model based on aspect alignment in the present invention.
Detailed Description
The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.
As shown in fig. 1, a method for generating abstract review summary based on commodity aspect alignment includes the following steps:
S01, automatically constructing a large-scale abstractive multi-review summary dataset.
This embodiment constructs a multi-review summary dataset starting from 17,936,900,612 product reviews under the category of "clothing".
(1) Reviews not related to the "clothing" category are screened out, and some automatically generated reviews, regarded as meaningless by heuristic rules, are pruned. Reviews that appear more than 20 times under the same product are deleted. After screening, a total of 246,585,986 reviews were obtained.
(2) An aspect extractor is run on these reviews, and reviews that do not cover any predefined aspect are removed. At this stage, 153,702,854 reviews remain, accounting for approximately 62.33% of the available reviews; these convey the user's opinion on one or more aspects of a product.
(3) Reviews containing aspects are grouped by product ID. For each product, first find the "high-quality reviews" (more than 10 likes, covering more than 3 aspects), and then, for the corresponding aspects, find a set of 10-40 "low-quality reviews" (fewer than 1 like, covering 1 aspect). Finally, 58,418 (review set, summary) pairs were obtained.
S02, training a sequence model of the multi-comment abstract.
Given a review list R = {r^(1), r^(2), ..., r^(m)}, the model of the invention aims to generate an output summary y = {w_1, w_2, ..., w_k}. First, each review is encoded directly, and an unordered set is obtained to represent the reviews. Since reviews are sequential data of variable length, long short-term memory networks are used to learn their feature representations. Encoding the reviews as an unordered set addresses the differences between the review and article summarization tasks, as well as the importance of aspect alignment between reviews and summaries. After the context is obtained, the summary is generated with a neural decoder.
(1) Comment coding
Let R = {r^(1), r^(2), ..., r^(m)} be the input review list; the goal is to encode R as S = {s^(1), s^(2), ..., s^(m)}.
For a review r^(i) with n_i words, it is expressed as a sequence of word vectors x^(i) = (x_1^(i), x_2^(i), ..., x_{n_i}^(i)).
A single review is encoded with a long short-term memory network (LSTM) to capture the dependencies between adjacent words.
This yields a sequence of hidden vectors H^(i) = (h_1^(i), h_2^(i), ..., h_{n_i}^(i)); then, for each H^(i), the corresponding review vector s^(i) is obtained through a layer of structured self-attention.
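As a minimal illustration of this pooling step, a simplified single-head self-attention that collapses a review's hidden-state sequence H^(i) into one review vector s^(i) might look as follows. The scoring vector `w` is a toy stand-in for the structured self-attention parameters, and the BiLSTM that produces H is omitted.

```python
import math

def self_attend(H, w):
    """H: list of hidden vectors; w: scoring vector.
    Returns s = sum_j alpha_j * h_j with alpha = softmax of dot(w, h_j)."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in H]
    m = max(scores)
    es = [math.exp(x - m) for x in scores]
    z = sum(es)
    alphas = [e / z for e in es]
    dim = len(H[0])
    return [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(dim)]
```

When all hidden states are identical, the attention weights are uniform and s^(i) simply equals that shared vector, which is a quick sanity check on the pooling.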
(2) Aspect alignment
Given the aspect list, each aspect vector is initialized as the average of the word vectors associated with that aspect.
To take advantage of the links between different aspects, the aspect vectors are passed through a layer of fully-connected neural network to obtain the encoded hidden aspect vectors A = {a_1, a_2, ..., a_k}:
a_j = W_a v_j + b_a
where v_j denotes the initial (averaged) vector of the j-th aspect, W_a is of shape h × d_e (h is the size of the hidden state, d_e is the word embedding size), and b_a is the bias term.
The aspect-alignment information of the i-th review is obtained by applying an attention mechanism over A, with the alignment score between review i and aspect j computed as:
e_{i,j} = cosine(s^(i), a_j)
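The cosine-based aspect attention can be sketched as follows. The softmax normalization and weighted summation are assumptions about how the scores e_{i,j} are used, since the text only gives the scoring function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def aspect_attend(s_i, aspects):
    """Weighted sum of aspect vectors, weighted by softmax of cosine scores
    e_{i,j} = cosine(s_i, a_j)."""
    scores = [cosine(s_i, a) for a in aspects]
    m = max(scores)
    es = [math.exp(x - m) for x in scores]
    z = sum(es)
    alphas = [e / z for e in es]
    dim = len(aspects[0])
    return [sum(al * a[d] for al, a in zip(alphas, aspects))
            for d in range(dim)]
```

Aspects well aligned with the review vector receive higher weight, so the aspect summary leans toward the aspects the review actually discusses.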
(3) Neural summary generator
The main network of the generator is also a long short-term memory network (LSTM).
The encoded outputs of each word of each review are concatenated to form a single vector sequence.
At each step, the decoder hidden state h_t attends over this word-vector sequence, the review vectors, and the aspect vectors, yielding 3 context vectors; concatenating these with h_t gives the final output context vector h_{out,t}.
This vector is fed into a softmax function to obtain the probability distribution of the predicted sequence; the softmax formula is:
P(y_t | y_{<t}, x) = softmax(W_p h_{out,t} + b_p)
in the training process, a cross entropy loss function is adopted and optimized by using an Adam optimizer.
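The training objective can be sketched as token-level cross-entropy over the decoder's predicted distributions; the Adam update itself is left to an optimizer library and omitted here.

```python
import math

def cross_entropy(pred_dists, target_ids):
    """Average negative log-likelihood of the gold summary tokens.
    pred_dists: one probability distribution per output position;
    target_ids: index of the gold token at each position."""
    nll = 0.0
    for dist, t in zip(pred_dists, target_ids):
        nll -= math.log(dist[t])
    return nll / len(target_ids)
```

Minimizing this loss pushes the distributions P(y_t | y_{<t}, x) toward the gold high-quality review at every decoding step.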
S03, evaluating the performance of the sequence model on the data set.
The performance of the sequence model is evaluated by (1) the summary generation metric ROUGE, (2) aspect accuracy, and (3) comparison with 3 baseline models, which are:
i) arbitrarily concatenating three different low-quality reviews;
ii) Seq-to-Seq + attn, the simple encoder-decoder + attention model mentioned in the background;
iii) Pointer-generator, the abstractive text summarization model published at ACL 2017 and mentioned in the background, which uses a pointer mechanism to "copy" words from the input document, producing summaries of higher quality.
The final comparison is shown in table 1:
TABLE 1
The models were also evaluated manually on i) whether the summary appears AI-generated, ii) whether it is fluent, and iii) whether it is useful for understanding the commodity; the results are shown in Table 2:
TABLE 2
It can be seen that the model of the invention achieves a substantial improvement in both automatic and manual evaluation.
The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.
Claims (7)
1. An abstract review summary generation method based on commodity aspect alignment is characterized by comprising the following steps:
(1) obtaining commodity review data, grouping the reviews within a commodity, screening low-quality reviews and high-quality reviews, and constructing the screened reviews into a multi-review summary dataset by means of aspect alignment;
(2) establishing a sequence model, wherein the sequence model comprises an encoder, a decoder and an attention mechanism based on a recurrent neural network; simultaneously adding an attention mechanism based on aspects;
(3) training the sequence model by using the multi-comment abstract data set until the model converges;
(4) performing the review summary generation task with the trained model, which automatically generates a summary after the reviews of a commodity are input.
2. The method for generating abstract review summary based on commodity aspect alignment according to claim 1, wherein the concrete process of step (1) is as follows:
(1-1) collecting review data of a commodity, pruning some automatically generated low-quality reviews, and deleting high-frequency reviews that appear more than 20 times under the same product;
(1-2) running an Aspect extractor on these remaining comment data and deleting comments that do not cover any predefined aspects; thereby obtaining reviews of each product containing different aspects;
(1-3) for each product, first finding high-quality reviews with more than 10 likes and covering more than 3 aspects; then, for each corresponding aspect, finding 10-40 low-quality reviews to form a low-quality review set, where a low-quality review has fewer than 1 like and covers only one aspect;
(1-4) repeating the above steps to generate multiple (review set, summary) pairs, each consisting of a low-quality review set and the corresponding high-quality review, as the review summary dataset.
3. The method for generating abstract review summary based on commodity aspect alignment according to claim 1, wherein in the step (2), the recurrent neural network in the encoder is a bidirectional long short-term memory network.
4. The method for generating abstract review summary based on commodity Aspect alignment of claim 1, wherein in step (2), the sequence model adds an attention mechanism for generating words and original text of the review set and an attention mechanism for generating words and Aspect vectors.
5. The method for generating abstract review summary based on commodity aspect alignment according to claim 1, wherein in the step (2), the sequence model works as follows:
(2-1) the encoder encodes each word of each comment in the comment set into a vector, and the vector generated for each comment generates a corresponding comment vector through a self-attention mechanism;
(2-2) randomly initializing an aspect vector for each predefined aspect;
(2-3) applying co-attention between the set of review vectors and the set of aspect vectors;
(2-4) at each decoding step, using the hidden vector h_t generated by the decoder to perform attention-weighted summation over the vector of each word of each review in the review set, the co-attended review vectors, and the co-attended aspect vectors, integrating them into a context vector c_t; the hidden vector h_t and the context vector c_t are then combined by a linear function and fed into a softmax function to obtain the probability distribution of the predicted sequence.
6. The method for generating abstract review summary based on commodity aspect alignment according to claim 5, wherein in the step (2-4), the formula of the softmax function is:
P(y_t | y_{<t}, x) = softmax(W_p c_t + W_q h_t + b_p)
where W_p, W_q, and b_p are parameters to be trained, and y_t is the t-th word output by the decoder.
7. The method for generating abstract review summary based on commodity aspect alignment according to claim 1, wherein in step (3), an Adam optimizer is used to train a sequence model on the multi-review summary data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010663601.6A CN111930930B (en) | 2020-07-10 | 2020-07-10 | Abstract comment abstract generation method based on commodity aspect alignment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010663601.6A CN111930930B (en) | 2020-07-10 | 2020-07-10 | Abstract comment abstract generation method based on commodity aspect alignment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111930930A true CN111930930A (en) | 2020-11-13 |
CN111930930B CN111930930B (en) | 2022-09-23 |
Family
ID=73312434
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010663601.6A Active CN111930930B (en) | 2020-07-10 | 2020-07-10 | Abstract comment abstract generation method based on commodity aspect alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111930930B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541343A (en) * | 2020-12-03 | 2021-03-23 | 昆明理工大学 | Semi-supervised counterstudy cross-language abstract generation method based on word alignment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
-
2020
- 2020-07-10 CN CN202010663601.6A patent/CN111930930B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
Non-Patent Citations (1)
Title |
---|
Yu Chuanming et al.: "Research on query-oriented opinion summarization models: using Debatepedia as the data source", Journal of the China Society for Scientific and Technical Information *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541343A (en) * | 2020-12-03 | 2021-03-23 | 昆明理工大学 | Semi-supervised counterstudy cross-language abstract generation method based on word alignment |
CN112541343B (en) * | 2020-12-03 | 2022-06-14 | 昆明理工大学 | Semi-supervised counterstudy cross-language abstract generation method based on word alignment |
Also Published As
Publication number | Publication date |
---|---|
CN111930930B (en) | 2022-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shrestha et al. | Deep learning sentiment analysis of amazon. com reviews and ratings | |
CN109325112A (en) | A kind of across language sentiment analysis method and apparatus based on emoji | |
CN105183833A (en) | User model based microblogging text recommendation method and recommendation apparatus thereof | |
CN107341145A (en) | A kind of user feeling analysis method based on deep learning | |
CN111178053B (en) | Text generation method for generating abstract extraction by combining semantics and text structure | |
CN111079409A (en) | Emotion classification method by using context and aspect memory information | |
Zhu et al. | Prompt-learning for short text classification | |
CN112016002A (en) | Mixed recommendation method integrating comment text level attention and time factors | |
CN113821635A (en) | Text abstract generation method and system for financial field | |
Upadhya et al. | Deep neural network models for question classification in community question-answering forums | |
CN113934835B (en) | Retrieval type reply dialogue method and system combining keywords and semantic understanding representation | |
CN111930930B (en) | Abstract comment abstract generation method based on commodity aspect alignment | |
CN113220964B (en) | Viewpoint mining method based on short text in network message field | |
CN109241272A (en) | A kind of Chinese text abstraction generating method, computer-readable storage media and computer equipment | |
Guo et al. | Multimodal news recommendation based on deep reinforcement learning | |
Wu et al. | PESI: Personalized Explanation recommendation with Sentiment Inconsistency between ratings and reviews | |
Yu et al. | Multi-module Fusion Relevance Attention Network for Multi-label Text Classification. | |
Chen et al. | Understanding emojis for financial sentiment analysis | |
CN114022233A (en) | Novel commodity recommendation method | |
Raja et al. | Deep learning-based sentiment analysis of Trip Advisor reviews | |
Singh | Stockgram: deep learning model for digitizing financial communications via Natural Language generation | |
CN117951390B (en) | Personalized content recommendation method and system based on large language model | |
Shiri et al. | Meme it Up: Patterns of Emoji Usage on Twitter | |
Ha et al. | Unsupervised Sentence Embeddings for Answer Summarization in Non-factoid CQA | |
Feng et al. | Negative Examples Sampling Based on Factorization Machines for OCCF |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |