CN115879669A - Comment score prediction method and device, electronic equipment and storage medium - Google Patents

Comment score prediction method and device, electronic equipment and storage medium

Info

Publication number
CN115879669A
CN115879669A
Authority
CN
China
Prior art keywords
vector
comment
data
text
product
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211515339.6A
Other languages
Chinese (zh)
Inventor
郑天翔
江炅辉
徐灯明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN202211515339.6A
Publication of CN115879669A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a comment score prediction method and device, an electronic device and a storage medium. The method obtains user data, product data and a comment vector, and determines a hidden vector for each of the three, where the hidden vectors capture the association relations among users, products and comments. The three hidden vectors are concatenated into a spliced vector and input into a prediction model, which is trained with training spliced vectors labeled with training comment scores as training data. Inside the prediction model, the spliced vector first passes through a ReLU activation function and finally through a Softmax activation function before the result is output, so that the model comprehensively considers the influence of the comment text, the user and the product on the comment score and obtains an authenticity score for the comment text. The predicted authenticity score enables tourists to extract information of reference value from large numbers of comments and avoid being misled by false comments.

Description

Comment score prediction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of neural network models, and in particular to a comment score prediction method and device, an electronic device and a storage medium.
Background
Nowadays, tourists pay ever more attention to the travel experience and, to reduce the risk and uncertainty of a trip, search for information in advance and make travel plans according to online reviews of travel products and destinations. However, because anyone can publish comments through the internet, the comments presented to tourists are often of mixed quality. On the one hand, the scores of online reviews are susceptible to manipulation by abnormal means, and some reviews or scores do not reflect the real evaluation of a travel product. On the other hand, because comments are posted on many different platforms, the amount of comment data is huge, and it is often difficult for most tourists to extract genuine, useful evaluations and scores from thousands of online comments or to judge clearly whether a selected travel product or destination is good or bad.
Thus, identifying genuine evaluations and scores is important for helping tourists make travel plans.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for predicting a review score, an electronic device, and a storage medium, which are capable of identifying a real and effective evaluation or score.
One aspect of the embodiments of the present invention provides a method for predicting a review score, including:
acquiring preprocessed data, wherein the preprocessed data comprise user data, product data and a comment vector, and the user data and the product data are integers;
inputting the user data and the product data into respective corresponding embedded layers to obtain a user hidden vector corresponding to the user data and a product hidden vector corresponding to the product data;
inputting the comment vector into a Tiny-BERT pre-training model to obtain a comment hidden vector corresponding to the comment vector;
splicing the user hidden vector, the product hidden vector and the comment hidden vector to obtain a spliced vector;
inputting the spliced vector into a prediction model to obtain an authenticity score of the comment text corresponding to the comment vector, wherein the prediction model is trained with training spliced vectors labeled with training comment scores as training data, and the activation functions of the prediction model comprise a ReLU activation function and a Softmax activation function.
Preferably, the processing procedure of the preprocessed data includes:
respectively encoding and constructing a user character string of a user identifier and a product character string of a product identifier by using a Term Frequency algorithm to obtain user data corresponding to the user character string and product data corresponding to the product character string;
and converting the comment text into a comment vector by using a BERT pre-training algorithm, wherein the comment vector comprises a word vector, a text vector and a position vector, and the comment text is an English text.
Preferably, the converting the comment text into a comment vector by using a BERT pre-training algorithm, where the comment vector includes a word vector, a text vector, and a position vector, and the comment text is an english text, and includes:
removing characters other than text characters from the comment text, and converting the comment text words to lowercase to obtain a target comment text;
dividing the target comment text into a plurality of subwords by adopting a greedy longest priority algorithm, and determining codes corresponding to the subwords according to a vocabulary preset in the BERT pre-training algorithm;
and forming the codes into word vectors, and determining position vectors and text vectors according to the word vectors.
Preferably, the splicing of the user implicit vector, the product implicit vector and the comment implicit vector to obtain a spliced vector includes:
and splicing the user implicit vector, the product implicit vector and the comment implicit vector based on PyTorch to obtain a spliced vector.
Another aspect of the embodiments of the present invention further provides a device for predicting a review score, including:
the data acquisition unit is used for acquiring preprocessed data, wherein the preprocessed data comprise user data, product data and comment vectors, and the user data and the product data are integers;
a first vector determining unit, configured to input the user data and the product data into respective corresponding embedding layers, respectively, to obtain a user implicit vector corresponding to the user data and a product implicit vector corresponding to the product data;
the second vector determining unit is used for inputting the comment vector to a Tiny-BERT pre-training model to obtain a comment hidden vector corresponding to the comment vector;
the vector splicing unit is used for splicing the user implicit vector, the product implicit vector and the comment implicit vector to obtain a spliced vector;
and the score prediction unit is used for inputting the spliced vector into a prediction model to obtain the authenticity score of the comment text corresponding to the comment vector, wherein the prediction model is trained with training spliced vectors labeled with training comment scores as training data, and the activation functions of the prediction model comprise a ReLU activation function and a Softmax activation function.
Preferably, the data acquisition unit includes:
the code construction unit is used for respectively carrying out code construction on a user character string of a user identifier and a product character string of a product identifier by using a Term Frequency algorithm to obtain user data corresponding to the user character string and product data corresponding to the product character string;
the comment text conversion unit is used for converting the comment text into a comment vector by using a BERT pre-training algorithm, wherein the comment vector comprises a word vector, a text vector and a position vector, and the comment text is an English text.
Preferably, the comment text conversion unit includes:
the first vector conversion unit is used for removing characters other than text characters from the comment text and converting the comment text words to lowercase to obtain a target comment text;
the second vector conversion unit is used for dividing the target comment text into a plurality of subwords by adopting a greedy longest priority algorithm, and determining codes corresponding to the subwords according to a vocabulary table preset in the BERT pre-training algorithm;
and the third vector conversion unit is used for forming the codes into word vectors and determining position vectors and text vectors according to the word vectors.
Preferably, the vector stitching unit includes:
and the vector splicing subunit is used for splicing the user implicit vector, the product implicit vector and the comment implicit vector based on PyTorch to obtain a spliced vector.
Another aspect of the embodiments of the present invention further provides an electronic device, which includes a processor and a memory;
the memory is used for storing programs;
the processor executes the program to realize the method.
Another aspect of the embodiments of the present invention also provides a computer-readable storage medium storing a program, which is executed by a processor to implement the above-mentioned method.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
According to the method, user data, product data and a comment vector are obtained, and a hidden vector is determined for each of the three, where the hidden vectors capture the association relations among users, products and comments. The three hidden vectors are concatenated into a spliced vector and input into a prediction model that is trained with training spliced vectors labeled with training comment scores as training data. Inside the prediction model, the spliced vector first passes through a ReLU activation function and finally through a Softmax activation function before the result is output, so that the prediction model comprehensively considers the influence of the three dimensions of comment text, user and product on the comment score and obtains an authenticity score of the comment text. Based on the predicted authenticity score, a tourist can quickly judge whether each comment evaluates the product genuinely and objectively, and malicious evaluations of the product can also be identified, so that the tourist can obtain information of reference value from the many comments and avoid being misled by false ones.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for predicting comment scores according to an embodiment of the present invention;
fig. 2 is a diagram illustrating a curve of a ReLU activation function according to an embodiment of the present invention;
fig. 3 is a diagram illustrating a curve of a Softmax activation function according to an embodiment of the present invention;
FIG. 4 is a flowchart of an example of comment score prediction according to an embodiment of the present invention;
FIG. 5 is a block diagram of an exemplary model for predictive review scoring according to an embodiment of the present invention;
fig. 6 is a block diagram of a prediction apparatus for comment scoring according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Referring to fig. 1, an embodiment of the present invention provides a method for predicting a review score, which specifically includes the following steps:
step S100: the method comprises the steps of obtaining preprocessing data, wherein the preprocessing data comprise user data, product data and comment vectors, and the user data and the product data are integers.
Specifically, the user data may be a user ID and the product data a product ID. Both are originally unique character strings, and strings of this type are converted to integers before entering the prediction model. The comment vector may be a vector determined from the comment text.
Step S110: and inputting the user data and the product data into the corresponding embedding layers respectively to obtain a user implicit vector corresponding to the user data and a product implicit vector corresponding to the product data.
Specifically, the embedding layer (Embedding Layer) corresponding to the user data learns a hidden-vector representation of the user data. A hidden vector can carry different meanings in different contexts; in this embodiment it is a vector reflecting user relationships. Likewise, the embedding layer corresponding to the product data learns a hidden-vector representation of the product. The two embedding layers map the unique user identifier User and the unique product identifier Product into a low-dimensional space.
In deep learning, embedding refers to mapping a high-dimensional space, whose dimension is the number of possible inputs, into a dense vector space of much lower dimension. In theory, a well-trained embedding maps inputs with similar meanings or roles to nearby points in the embedding space, so this embodiment uses the embedding representations as the vector representations of users and products in order to capture similarity relations among them.
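For illustration, a minimal PyTorch sketch of the two embedding layers is given below; the vocabulary sizes and embedding dimension are assumptions for illustration, not values from the original disclosure:

```python
import torch
import torch.nn as nn

# Assumed sizes, for illustration only.
NUM_USERS, NUM_PRODUCTS, EMBED_DIM = 10000, 5000, 64

# One embedding layer per identifier type; each maps an integer id
# to a dense hidden vector in a low-dimensional space.
user_embedding = nn.Embedding(NUM_USERS, EMBED_DIM)
product_embedding = nn.Embedding(NUM_PRODUCTS, EMBED_DIM)

user_ids = torch.tensor([3, 17])            # integer user data
product_ids = torch.tensor([42, 7])         # integer product data

e_user = user_embedding(user_ids)           # user hidden vectors, shape (2, 64)
e_product = product_embedding(product_ids)  # product hidden vectors, shape (2, 64)
```

The embedding weights are learned jointly with the rest of the model during training, so ids that behave similarly end up with nearby vectors.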
Step S120: and inputting the comment vector into a Tiny-BERT pre-training model to obtain a comment hidden vector corresponding to the comment vector.
Specifically, Tiny-BERT can learn the association between each word and the other words in a text, and the correlations among words, on the basis of a multi-headed self-attention mechanism (Multi-headed Self-attention). It expresses the relevance among words explicitly in the form of word vectors, so that the context information of the sentence is embedded in each word vector and the word vector of each word is given dynamically according to its context, yielding higher-quality text representations while also allowing parallel processing.
Therefore, the comment hidden vector obtained through the Tiny-BERT pre-training model contains the contextual relevance of the comment text, which is beneficial for predicting the authenticity of the comment.
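As one possible realization (a sketch, not the original implementation), the comment hidden vector can be obtained with the HuggingFace transformers library; the checkpoint name below is an assumption, and the hidden state of the leading [CLS] token is taken as the comment representation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; any TinyBERT-style model with a BERT interface would do.
CKPT = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
tiny_bert = AutoModel.from_pretrained(CKPT)

enc = tokenizer("this s unaffable", return_tensors="pt",
                truncation=True, max_length=512)
with torch.no_grad():
    out = tiny_bert(**enc)

# The [CLS] hidden state carries the contextual information of the whole
# comment and serves as the comment hidden vector.
e_review = out.last_hidden_state[:, 0, :]   # shape (1, hidden_size)
```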
Step S130: and splicing the user implicit vector, the product implicit vector and the comment implicit vector to obtain a spliced vector.
Specifically, the model structure is implemented on the PyTorch framework. After the hidden vectors of the comment, the user and the product are obtained, the three are concatenated into {E_BERT(Review), E_u(User), E_p(Product)}, realizing the fusion of multiple sources of information. PyTorch is an open-source Python machine learning library developed on the basis of Torch that can be used for applications such as natural language processing.
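The fusion itself is a plain concatenation along the feature dimension; a self-contained sketch with assumed dimensions (312 for the Tiny-BERT comment vector, 64 each for the user and product vectors):

```python
import torch

# Stand-ins for the three hidden vectors (assumed dimensions).
e_review = torch.randn(2, 312)   # comment hidden vector from Tiny-BERT
e_user = torch.randn(2, 64)      # user hidden vector from its embedding layer
e_product = torch.randn(2, 64)   # product hidden vector from its embedding layer

spliced = torch.cat([e_review, e_user, e_product], dim=1)
print(spliced.shape)             # torch.Size([2, 440]), i.e. 312 + 64 + 64
```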
Step S140: inputting the spliced vector into a prediction model to obtain the authenticity score of the comment text corresponding to the comment vector, wherein the prediction model is trained with training spliced vectors labeled with training comment scores as training data, and the activation functions of the prediction model comprise a ReLU activation function and a Softmax activation function.
Referring to fig. 2, the ReLU activation function is a simple calculation: if the input is greater than 0, it returns the input value directly; if the input is 0 or less, it returns 0. For values greater than zero the function is linear, which means it retains many of the desirable properties of a linear activation function when a neural network is trained with back-propagation. Nevertheless, it is a non-linear function, since negative inputs are always output as zero. Because the function is linear over half of the input domain and non-linear over the other half, it is called a piece-wise linear function.
Referring to FIG. 3, Softmax is the activation function used in the last layer of a multi-class problem, where class membership must be decided among more than two class labels. For an arbitrary real vector of length K, Softmax compresses it into a real vector of the same length whose values lie in the range (0, 1) and whose elements sum to 1. Softmax differs from the ordinary max function: max outputs only the largest value, whereas Softmax ensures that smaller values receive smaller probabilities rather than being discarded outright.
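A short numeric check of both activations (the input values here are arbitrary):

```python
import torch

x = torch.tensor([-1.5, 0.0, 2.0])
print(torch.relu(x))     # tensor([0., 0., 2.]): negative inputs clipped to zero
p = torch.softmax(x, dim=0)
print(p, p.sum())        # values in (0, 1) that sum to 1; larger input, larger probability
```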
Through learning patterns over the three types of hidden vectors, the prediction model outputs a probability for each candidate score and selects the score with the maximum probability as the predicted score.
After the spliced vector passes through the two activation functions of the prediction model, the model outputs a comment score for the object of the spliced vector. This score characterizes the authenticity of the comment: the higher the score, the more reliable and genuine the comment.
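Consistent with this description, a hedged sketch of the prediction head follows: a linear layer, ReLU, a second linear layer projecting onto the score classes, then Softmax, with the highest-probability class taken as the predicted score. The layer sizes and the number of score classes (here five, e.g. 1-5) are assumptions:

```python
import torch
import torch.nn as nn

SPLICED_DIM, HIDDEN_DIM, NUM_SCORES = 440, 128, 5   # assumed sizes

prediction_head = nn.Sequential(
    nn.Linear(SPLICED_DIM, HIDDEN_DIM),
    nn.ReLU(),                        # piece-wise linear activation
    nn.Linear(HIDDEN_DIM, NUM_SCORES),
    nn.Softmax(dim=1),                # probability per candidate score
)

spliced = torch.randn(1, SPLICED_DIM)        # stand-in for a spliced vector
probs = prediction_head(spliced)
predicted_score = probs.argmax(dim=1) + 1    # class with maximum probability
print(predicted_score)
```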
In some embodiments of the present invention, step S100 acquires the preprocessed data; the preprocessing procedure is described below.
Specifically, the following may be included:
s1, coding and constructing a user character string of a user identification and a product character string of a product identification by using a Term Frequency algorithm respectively to obtain user data corresponding to the user character string and product data corresponding to the product character string.
Specifically, the user data and the product data may be encoded using a TF (Term Frequency) algorithm. Because the number of users and products is huge and strongly affects the performance of the prediction model, numbering by frequency makes it convenient to remove users or products with little comment data. First, the number of occurrences of each user or product id is counted; the users or products are then numbered in descending order of occurrence frequency, and after renumbering, the character-string data become integer data.
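A minimal sketch of this frequency-based renumbering (the helper name and sample ids are hypothetical):

```python
from collections import Counter

def tf_encode(id_strings):
    """Map id strings to integers in descending order of occurrence frequency."""
    counts = Counter(id_strings)
    # Most frequent id -> 0, next most frequent -> 1, and so on.
    ranking = {s: i for i, (s, _) in enumerate(counts.most_common())}
    return [ranking[s] for s in id_strings]

user_ids = ["u42", "u7", "u42", "u42", "u99", "u7"]
print(tf_encode(user_ids))   # [0, 1, 0, 0, 2, 1]
```

Because rarely-seen ids receive the largest numbers, users or products with little comment data can then be dropped by a simple threshold on the integer id.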
S2, converting the comment text into a comment vector by using a BERT pre-training algorithm, wherein the comment vector comprises a word vector, a text vector and a position vector, and the comment text is an English text.
Specifically, the process of determining the comment vector may include the following:
and S21, removing the other characters except the text characters in the comment text, and lowercase the comment text words to obtain the target comment text.
And S22, dividing the target comment text into a plurality of subwords by adopting a greedy longest priority algorithm, and determining codes corresponding to the subwords according to a vocabulary preset in the BERT pre-training algorithm.
And S23, forming the codes into word vectors, and determining position vectors and text vectors according to the word vectors.
In order to describe the above step S2 more clearly, the general procedure is given first, followed by a specific worked example:
(1) Replacement characters, control characters, and extraneous whitespace and symbols are removed.
(2) Text words are converted to lowercase.
(3) Whitespace at both ends of the text is removed, and the text is segmented into words based on spaces and punctuation marks.
(4) For each word, according to the vocabulary of the pre-trained BERT model, a greedy longest-priority algorithm divides the word as far as possible into several subwords (subwords), and every subword other than the first is prefixed with "##". For example, "unaffable" is divided into {"un", "##aff", "##able"}.
(5) The subwords are replaced with their corresponding codes according to the vocabulary (provided by the pre-trained BERT model) to form the word vector.
(6) The code of the reserved word "[CLS]" is added at the head of the word vector and the code of the reserved word "[SEP]" at the tail.
(7) The word vector is clipped to the required length: because the memory consumption of the BERT model grows rapidly with the input length, the input must be clipped accordingly. This embodiment uses 512 as the maximum word-vector length.
(8) Finally, the text vector and the position vector are generated from the word vector. The position vector is the sequence 0 to n, such as [0, 1, 2, …, n], where n is the word-vector length minus 1; it provides BERT with ordering information (BERT cannot obtain position information from the word vector itself and therefore needs this additional input). The text vector marks paragraphs: for example, [0,0,0,0,0] indicates that the corresponding words are in the same paragraph, while [0,0,1,1,1] indicates that they span different paragraphs. This vector enables BERT to distinguish different paragraphs of text.
As a worked example, assume the comment text is "This's unaffable". The preprocessing steps then proceed as follows:
1) Remove special characters: "This s unaffable"
2) Convert text words to lowercase: "this s unaffable"
3) Strip whitespace at both ends and segment on spaces and punctuation: ["this", "s", "unaffable"]
4) Divide each word, as far as possible, into subwords: ["this", "s", "un", "##aff", "##able"]
5) Replace the subwords with their corresponding codes to form the word vector:
[2024, 1056, 4896, 10355, 3086]
6) Add the code of the reserved word "[CLS]" at the head and the code of "[SEP]" at the tail:
[102, 2024, 1056, 4896, 10355, 3086, 103]
7) Clip the word vector to the required length (unchanged here, since it is shorter than 512):
[102, 2024, 1056, 4896, 10355, 3086, 103]
8) Generate the text vector and the position vector:
[102, 2024, 1056, 4896, 10355, 3086, 103] (word vector)
[0, 0, 0, 0, 0, 0, 0] (text vector)
[0, 1, 2, 3, 4, 5, 6] (position vector)
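These eight steps mirror what a standard BERT WordPiece tokenizer performs. As a hedged cross-check (not the original implementation), the HuggingFace BertTokenizer reproduces the same subword split, although the exact integer codes depend on the vocabulary of the chosen checkpoint and may differ from those shown above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Step 1 (removing non-text characters) is assumed done beforehand,
# as in the worked example: "This's unaffable" -> "this s unaffable".
enc = tokenizer("this s unaffable",
                truncation=True, max_length=512,   # step 7: clip to 512
                return_token_type_ids=True)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'this', 's', 'un', '##aff', '##able', '[SEP]']  (steps 4 and 6)
print(enc["input_ids"])        # word vector of integer codes (step 5)
print(enc["token_type_ids"])   # text vector: all zeros for one paragraph (step 8)
```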
The comment-score prediction process of the prediction model of the present invention is described in detail below with reference to figs. 4 and 5.
Specifically, a user ID character string is obtained as the user data, a product ID character string as the product data, and the character string of the comment text as the comment data. The user data and the product data are converted into integer-type data by the TF algorithm, and the comment data is converted into a comment vector by the BERT pre-training algorithm. The integer-type user data and product data are input into their Embedding layers to obtain the user hidden vector and the product hidden vector respectively, and the comment vector is input into the Tiny-BERT pre-training model to obtain the comment hidden vector. The three hidden vectors are then spliced and input into the prediction model, which finally outputs the score of the comment text as a reference recommendation degree for the authenticity of the comment.
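Pulling the pieces together, a compact sketch of the overall flow of figs. 4 and 5 follows, under the same assumptions as the earlier snippets (checkpoint name, layer sizes, five score classes):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ReviewScorePredictor(nn.Module):
    """Hedged sketch of the fused model; all sizes are assumptions."""

    def __init__(self, num_users, num_products, embed_dim=64,
                 ckpt="huawei-noah/TinyBERT_General_4L_312D",
                 hidden_dim=128, num_scores=5):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, embed_dim)
        self.prod_emb = nn.Embedding(num_products, embed_dim)
        self.tiny_bert = AutoModel.from_pretrained(ckpt)
        bert_dim = self.tiny_bert.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(bert_dim + 2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_scores),
            nn.Softmax(dim=1),
        )

    def forward(self, user_ids, product_ids, input_ids, attention_mask):
        # Comment hidden vector: [CLS] state of the Tiny-BERT output.
        e_review = self.tiny_bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0, :]
        e_user = self.user_emb(user_ids)      # user hidden vector
        e_prod = self.prod_emb(product_ids)   # product hidden vector
        spliced = torch.cat([e_review, e_user, e_prod], dim=1)
        return self.head(spliced)             # probability per score class
```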
Referring to fig. 6, an embodiment of the present invention provides a device for predicting a review score, including:
the data acquisition unit is used for acquiring preprocessed data, wherein the preprocessed data comprise user data, product data and comment vectors, and the user data and the product data are integers;
a first vector determining unit, configured to input the user data and the product data into respective corresponding embedded layers, respectively, to obtain a user hidden vector corresponding to the user data and a product hidden vector corresponding to the product data;
the second vector determining unit is used for inputting the comment vector to a Tiny-BERT pre-training model to obtain a comment implicit vector corresponding to the comment vector;
the vector splicing unit is used for splicing the user implicit vector, the product implicit vector and the comment implicit vector to obtain a spliced vector;
and the score prediction unit is used for inputting the spliced vector into a prediction model to obtain the authenticity score of the comment text corresponding to the comment vector, wherein the prediction model is trained with training spliced vectors labeled with training comment scores as training data, and the activation functions of the prediction model comprise a ReLU activation function and a Softmax activation function.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those of ordinary skill in the art will be able to practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for predicting a review score, comprising:
acquiring preprocessed data, wherein the preprocessed data comprise user data, product data and a comment vector, and the user data and the product data are integers;
inputting the user data and the product data into respective corresponding embedding layers respectively to obtain a user implicit vector corresponding to the user data and a product implicit vector corresponding to the product data;
inputting the comment vector into a Tiny-BERT pre-training model to obtain a comment hidden vector corresponding to the comment vector;
splicing the user hidden vector, the product hidden vector and the comment hidden vector to obtain a spliced vector;
inputting the spliced vector into a prediction model to obtain the authenticity score of the comment text corresponding to the comment vector, wherein the prediction model is trained with training spliced vectors labeled with training comment scores as training data, and the activation functions of the prediction model comprise a ReLU activation function and a Softmax activation function.
2. The method for predicting comment score as claimed in claim 1, wherein the process of preprocessing the data comprises:
respectively encoding and constructing a user character string of a user identifier and a product character string of a product identifier by using a Term Frequency algorithm to obtain user data corresponding to the user character string and product data corresponding to the product character string;
and converting the comment text into a comment vector by using a BERT pre-training algorithm, wherein the comment vector comprises a word vector, a text vector and a position vector, and the comment text is an English text.
3. The method of predicting comment scoring as claimed in claim 2, wherein said transforming comment text into a comment vector using a BERT pre-training algorithm, said comment vector comprising a word vector, a text vector and a position vector, said comment text being in english text, comprises:
removing characters other than text characters from the comment text, and converting the comment text words to lowercase to obtain a target comment text;
dividing the target comment text into a plurality of sub-words by adopting a greedy longest priority algorithm, and determining codes corresponding to the sub-words according to a vocabulary preset in the BERT pre-training algorithm;
and forming the codes into word vectors, and determining position vectors and text vectors according to the word vectors.
4. The method for predicting comment score as claimed in claim 1, wherein the concatenating the user implicit vector, the product implicit vector and the comment implicit vector to obtain a concatenated vector comprises:
and splicing the user implicit vector, the product implicit vector and the comment implicit vector based on PyTorch to obtain a spliced vector.
5. A prediction apparatus for comment scoring, comprising:
the data acquisition unit is used for acquiring preprocessed data, wherein the preprocessed data comprise user data, product data and comment vectors, and the user data and the product data are integers;
a first vector determining unit, configured to input the user data and the product data into respective corresponding embedded layers, respectively, to obtain a user hidden vector corresponding to the user data and a product hidden vector corresponding to the product data;
the second vector determining unit is used for inputting the comment vector to a Tiny-BERT pre-training model to obtain a comment hidden vector corresponding to the comment vector;
the vector splicing unit is used for splicing the user implicit vector, the product implicit vector and the comment implicit vector to obtain a spliced vector;
and the score prediction unit is used for inputting the spliced vector into a prediction model to obtain the authenticity score of the comment text corresponding to the comment vector, wherein the prediction model is trained with training spliced vectors labeled with training comment scores as training data, and the activation functions of the prediction model comprise a ReLU activation function and a Softmax activation function.
6. The apparatus for predicting a review score as set forth in claim 5, wherein the data obtaining unit includes:
the code construction unit is used for respectively carrying out code construction on a user character string of a user identifier and a product character string of a product identifier by using a Term Frequency algorithm to obtain user data corresponding to the user character string and product data corresponding to the product character string;
the comment text conversion unit is used for converting the comment text into a comment vector by using a BERT pre-training algorithm, wherein the comment vector comprises a word vector, a text vector and a position vector, and the comment text is an English text.
7. The apparatus for predicting a review score as claimed in claim 6, wherein the comment text conversion unit comprises:
the first vector conversion unit is used for removing characters other than text characters from the comment text and converting the comment text words to lowercase to obtain a target comment text;
the second vector conversion unit is used for dividing the target comment text into a plurality of subwords by adopting a greedy longest priority algorithm, and determining codes corresponding to the subwords according to a vocabulary table preset in the BERT pre-training algorithm;
and the third vector conversion unit is used for forming the codes into word vectors and determining position vectors and text vectors according to the word vectors.
8. The apparatus for predicting a review score as claimed in claim 5, wherein the vector stitching unit comprises:
and the vector splicing subunit is used for splicing the user implicit vector, the product implicit vector and the comment implicit vector based on PyTorch to obtain a spliced vector.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1 to 4.
10. A computer-readable storage medium, characterized in that the storage medium stores a program which is executed by a processor to implement the method according to any one of claims 1 to 4.
CN202211515339.6A 2022-11-30 2022-11-30 Comment score prediction method and device, electronic equipment and storage medium Pending CN115879669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211515339.6A CN115879669A (en) 2022-11-30 2022-11-30 Comment score prediction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211515339.6A CN115879669A (en) 2022-11-30 2022-11-30 Comment score prediction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115879669A 2023-03-31

Family

ID=85764750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211515339.6A Pending CN115879669A (en) 2022-11-30 2022-11-30 Comment score prediction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115879669A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116932853A (en) * 2023-07-25 2023-10-24 重庆邮电大学 User demand acquisition method based on APP comment data


Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
Rastogi et al. Weighting finite-state transductions with neural context
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
US20090182554A1 (en) Text analysis method
CN111738016A (en) Multi-intention recognition method and related equipment
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN112926345B (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN116151132B (en) Intelligent code completion method, system and storage medium for programming learning scene
US11645447B2 (en) Encoding textual information for text analysis
CN113064973A (en) Text classification method, device, equipment and storage medium
CN114528418B (en) Text processing method, system and storage medium
Kwon et al. Handling out-of-vocabulary problem in hangeul word embeddings
CN114611520A (en) Text abstract generating method
CN115879669A (en) Comment score prediction method and device, electronic equipment and storage medium
KR102206781B1 (en) Method of fake news evaluation based on knowledge-based inference, recording medium and apparatus for performing the method
CN113705222B (en) Training method and device for slot identification model and slot filling method and device
CN113934834A (en) Question matching method, device, equipment and storage medium
Alfaro-Contreras et al. Optical music recognition for homophonic scores with neural networks and synthetic music generation
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN116909435A (en) Data processing method and device, electronic equipment and storage medium
CN111353295A (en) Sequence labeling method and device, storage medium and computer equipment
CN115906855A (en) Word information fused Chinese address named entity recognition method and device
CN115563253A (en) Multi-task event extraction method and device based on question answering
CN115017906A (en) Method, device and storage medium for identifying entities in text
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination