CN117453919A

CN117453919A - Comment analysis report generation method, device and storage medium based on large language model

Info

Publication number: CN117453919A
Application number: CN202311508418.9A
Authority: CN
Inventors: 王明涛; 邹雨轩
Original assignee: Shanghai Zhongyin Culture Communication Co ltd
Current assignee: Shanghai Zhongyin Culture Communication Co ltd
Priority date: 2023-11-14
Filing date: 2023-11-14
Publication date: 2024-01-26

Abstract

The invention relates to a comment analysis report generation method, device and storage medium based on a large language model. Suitable context prompts are designed for a large language model, which performs preprocessing and feature processing on comments, such as emotion judgment and attribute classification, so that comment data are divided into different attribute spaces and the probability of misclassification is reduced. The summarization capability of the large model is then applied to provide viewpoints directly in natural language, so that users can clearly understand market demand and feedback. Compared with general comment analysis and automatic extraction of comment viewpoints, the method is more accurate.

Description

Comment analysis report generation method, device and storage medium based on large language model

Technical Field

The invention relates to the technical field of data analysis, in particular to a comment analysis report generation method, device and storage medium based on a large language model.

Background

Currently, with the development of the advertising industry, the center of gravity of brand marketing is gradually shifting from traditional offline channels and traditional media to online platforms (such as Xiaohongshu, Douyin, "learning likes" and the like), and marketing now takes the form of notes, image-and-text posts or videos. These carriers bring huge traffic and opportunities to brands. At the same time, the new mode breaks through the passive mode of traditional advertising and directly promotes interaction between advertisements and target customers: customers express their views on brands or products by liking and commenting on works. Comments are therefore a very important window through which brands learn about customer needs.

At present, conventional natural language models are severely limited in processing comments. Comments are short texts that carry little information on their own and depend strongly on context. The usual processing breaks a sentence into phrases by word segmentation and then performs emotion analysis, topic analysis and the like by phrase analysis or vectorization, so the analysis results are clearly inaccurate. Given this situation, more effective and accurate comment analysis is needed to help brands mine the needs of target customers from comments, thereby providing strong support for brand building and product promotion.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a comment analysis report generation method, device and storage medium based on a large language model. By applying a large language model, the method adds context to text processing; it designs different product-dimension labels to classify comments, clusters similar comments by vectorization and clustering, summarizes comments of the same category with the large language model, and presents the results as a BI report, so that brands can gain insight from multiple dimensions.

The above object of the present invention is achieved by the following technical solutions:

A comment analysis report generation method based on a large language model comprises the following steps:

step S1, collecting note and comment data: notes about the target brand published on the platform are crawled periodically by crawler technology, and the corresponding comment data are then crawled in turn for each crawled note, yielding a note data set and a comment data set;

step S2, extracting the topics of the notes from step S1: a note describes and promotes certain characteristics of a brand or product, so each note can have one topic or several topics;

the plain text of each note on the platform is extracted: if the note is presented as text, the plain-text note is used directly; if the note is presented as video, its speech is converted into plain text by speech recognition; the plain text is then input into the large language model together with the designed prompt to obtain the topic set of each note;

step S3, cleaning the comment data: the comment data set crawled in step S1 is cleaned, and meaningless content such as single punctuation marks is filtered out to obtain a cleaned, valid comment data set;

step S4, extracting attributes of the comment data processed in step S3: the topics from step S2 are input as the context of the large language model, and a prompt is designed to extract the attributes of each comment, yielding the attribute labels of each comment;

step S5, classifying attribute labels according to the processing result of step S4;

step S6, extracting the viewpoints under the same label;

step S7, classifying and summarizing based on natural language processing: using the cluster labels obtained in step S6, the comment texts with the same numeric label are aggregated, spliced into one long text with separators, and input into the large language model for summarization, with the summary length kept within a certain range, to obtain the viewpoint summary under the same label;

step S8, generating the comment report: the comment processing results obtained through the above steps are persisted into a database, and a BI dashboard is produced.
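The cleaning in step S3 can be illustrated with a minimal Python sketch. The source only says that meaningless characters such as single punctuation marks are filtered; treating every comment made up solely of punctuation, symbols (which includes emoji) or whitespace as meaningless is an assumption of this sketch, and the function names are hypothetical.

```python
import unicodedata


def is_meaningless(comment: str) -> bool:
    """Return True for empty comments or comments with no letters or digits.

    Assumption: a comment is meaningless when every character falls in the
    Unicode categories P (punctuation), S (symbol) or Z (separator).
    """
    stripped = comment.strip()
    if not stripped:
        return True
    return all(unicodedata.category(ch)[0] in {"P", "S", "Z"} for ch in stripped)


def clean_comments(comments):
    """Keep (comment_id, text) pairs that carry content, trimming whitespace."""
    return [(cid, text.strip()) for cid, text in comments if not is_meaningless(text)]


raw = [
    ("c1", "。"),
    ("c2", "   "),
    ("c3", "Great texture, will buy again!"),
    ("c4", "???"),
    ("c5", "好用"),
]
valid = clean_comments(raw)
print(valid)
```

A stricter or looser rule (for example keeping emoji-only comments as an emotion signal) could be substituted without changing the rest of the pipeline.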

As a further technical scheme of the invention: in step S1, according to the brand or product type to be analyzed, published notes and comments are crawled in a compliant manner by crawler technology on the corresponding platform. Notes and comments are identified by uuids of different lengths: the note id is recorded as note_id and the comment id as comment_id. One note_id corresponds to a plurality of comment_ids, and one comment_id corresponds to exactly one note_id. A note set and a comment set that correspond to each other are thus obtained.

As a further technical scheme of the invention: in step S2, a large language model is used to extract topics from the notes. A large language model with a long context is selected and a reasonable instruction format, or prompt, is designed. The prompt contains the brand name or product name as context and follows prompt design principles; the plain-text note is merged into the prompt for topic extraction. The number of extracted topics is set to be no more than n (n can be adjusted according to actual needs), so each note yields at most n topics. All target notes are processed in the same way to obtain the topic set of each note.
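The one-to-many relationship between note_id and comment_id can be sketched as a small in-memory data model. The class names, the helper functions and the concrete id lengths (32 and 24 hexadecimal characters) are assumptions for illustration; the source only states that the two kinds of uuid differ in length.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Comment:
    comment_id: str
    note_id: str  # each comment points back to exactly one note
    text: str


@dataclass
class Note:
    note_id: str
    text: str
    comment_ids: list = field(default_factory=list)  # one note, many comments


def add_note(notes: dict, text: str) -> str:
    """Register a note under a 32-character id (length is an assumption)."""
    note_id = uuid.uuid4().hex
    notes[note_id] = Note(note_id, text)
    return note_id


def add_comment(notes: dict, comments: dict, note_id: str, text: str) -> str:
    """Attach a comment to an existing note under a 24-character id (assumption)."""
    if note_id not in notes:
        raise KeyError(f"unknown note_id: {note_id}")
    comment_id = uuid.uuid4().hex[:24]
    comments[comment_id] = Comment(comment_id, note_id, text)
    notes[note_id].comment_ids.append(comment_id)
    return comment_id
```

Keeping the back-reference on each comment makes it straightforward to process comments note by note in the later steps.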

As a further technical scheme of the invention: in step S4, attribute extraction on the comment data processed in step S3 mainly extracts the following attributes:

1. emotional attribute (positive, negative, neutral);

2. product attribute (e.g., product efficacy, product composition, product effect, product usage, product purchase, product price, product branding, product packaging and appearance, product quality and durability, product service);

3. keywords.

As a further technical scheme of the invention: in step S4, the LLM is used to extract the attributes of the comment data (emotion attribute, product attribute, topic relevance, keywords). Extraction is performed note by note, using the topic set of the note obtained in step S2 and the comment set corresponding to that note from step S3. The prompt is designed so that results are obtained step by step. The prompt design differs between models, but the overall framework is the same and comprises the following parts: 1) setting a role; 2) providing a context; 3) giving the instruction; 4) setting an input format; 5) setting an output format.
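The five-part prompt framework described above can be sketched as a plain string builder. All wording inside the prompt, the function name and the `###` delimiter are assumptions made for illustration; the source fixes only the five parts, not their text.

```python
def build_attribute_prompt(product, topics, comments, delimiter="###"):
    """Assemble a prompt with role, context, instruction, input and output format."""
    # 1) role
    role = f"You are an experienced product manager with a deep understanding of {product}."
    # 2) context: the topics of the note the comments belong to
    context = "Note topics: " + "; ".join(topics)
    # 3) instruction, decomposed into steps
    instruction = (
        "For each comment, work step by step: "
        "1) judge the emotion (positive, negative, neutral); "
        "2) assign one product attribute; "
        "3) state whether the comment is related to the note topics; "
        "4) extract keywords."
    )
    # 4) input format
    input_format = f"Comments are given one per line between {delimiter} markers."
    # 5) output format
    output_format = (
        "Return a JSON list of objects with the keys "
        '"emotion", "product_attribute", "topic_related", "keywords".'
    )
    # Remove the delimiter from user text so a comment cannot close the block early.
    body = "\n".join(c.replace(delimiter, "") for c in comments)
    data = f"{delimiter}\n{body}\n{delimiter}"
    return "\n\n".join([role, context, instruction, input_format, output_format, data])


prompt = build_attribute_prompt(
    "a face cream", ["hydration", "price"], ["Love it", "Too expensive"]
)
print(prompt)
```

Placing the comments last, inside explicit delimiters, keeps the instructions separate from untrusted input.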

As a further technical scheme of the invention: in step S5, attribute labels are classified according to the processing result of step S4, and the specific implementation is as follows:

the Cartesian product of the emotion attribute and the product attribute forms the label combinations (e.g., positive-product efficacy); each comment is assigned one such label, and comments with the same label are aggregated into one category to obtain the comment set of each label.

As a further technical scheme of the invention: in step S6, extracting the viewpoints under the same label specifically comprises the following: the comment sets under the various labels are obtained in step S4, and each comment set is processed as follows: each comment is encoded with shibing624/text2vec-base-chinese to obtain a vector of dimension 768; a BERTopic clustering model then adaptively clusters the comment vectors within each classified set, with the minimum classification set to 1; and each comment receives an adaptive label expressed as a number.

As a further technical scheme of the invention: in step S8, the report dimensions mainly comprise the emotion distribution of the comments (from step S4), the distribution of the comments over product attributes (from step S4), the joint distribution of the comments over product attributes and emotion attributes (from step S5), the comment viewpoint insights and their proportions (from step S7), and the word cloud (from step S4).
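Step S8 persists the processed comments into a database before the BI dashboard is built. A minimal sketch of that persistence, assuming SQLite as the store, is given below; the table name, column names and row layout are hypothetical and not taken from the source.

```python
import json
import sqlite3


def persist(conn, rows):
    """Create the result table if needed and upsert one row per comment."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS comment_result (
               comment_id TEXT PRIMARY KEY,
               note_id TEXT NOT NULL,
               emotion TEXT,
               product_attribute TEXT,
               cluster INTEGER,
               keywords TEXT
           )"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO comment_result VALUES (?, ?, ?, ?, ?, ?)",
        [
            (
                r["comment_id"],
                r["note_id"],
                r["emotion"],
                r["product_attribute"],
                r["cluster"],
                json.dumps(r["keywords"], ensure_ascii=False),  # list stored as JSON text
            )
            for r in rows
        ],
    )
    conn.commit()


def emotion_distribution(conn):
    """Example dashboard query: number of comments per emotion."""
    cursor = conn.execute(
        "SELECT emotion, COUNT(*) FROM comment_result GROUP BY emotion"
    )
    return dict(cursor.fetchall())
```

A BI tool would typically read from such a table directly; the query function only shows that the stored rows are enough to derive the report dimensions.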

A comment analysis report generating apparatus based on a large language model, comprising:

a memory for storing a computer program;

and a processor for implementing the comment analysis report generating method based on the large language model when executing the computer program stored in the memory.

A computer-readable storage medium storing a computer program which, when executed by a processor, implements the comment analysis report generating method based on a large language model as described above.

The invention discloses a comment analysis report generation method, device and storage medium based on a large language model. Target notes and the comments on those notes are obtained by crawler technology, and the strong context-processing capability of the large language model is applied to solve the problem that traditional machine learning cannot handle short texts that depend on a contextual scenario well. Traditional methods also generally rely on word segmentation, which breaks the integrity of sentences and prevents accurate understanding of the underlying meaning of comments (for example, special sentences such as irony). The invention instead designs suitable context prompts for the large language model to carry out preprocessing and feature processing such as emotion judgment and attribute classification, so that comment data are divided into different attribute spaces and the probability of misclassification is reduced. Using the labels as available context information, comment vectorization and cluster analysis are then applied to mine the comments within specific attributes, so that users' views of the product are mined more accurately from the comments. Finally, the summarization capability of the large model provides viewpoints directly in natural language, so that users can clearly understand market demand and feedback. Compared with general comment analysis and automatic extraction of comment viewpoints, the method is more accurate.

Drawings

FIG. 1 is a flow chart of the comment analysis of the present invention.

FIG. 2 is a schematic diagram of comment classification and viewpoint extraction according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application; it is apparent that the described embodiments are only a part of the embodiments of the present application, not all of the embodiments, and all other embodiments obtained by a person having ordinary skill in the art without making creative efforts based on the embodiments in the present application are within the scope of protection of the present application.

Embodiment one:

Referring to FIG. 1, the comment analysis report generation method based on a large language model disclosed by the invention comprises the following steps:

step S6, extracting the viewpoints under the same label;

In step S1, according to the brand or product type to be analyzed, published notes and comments are crawled in a compliant manner by crawler technology on the corresponding platform. Notes and comments are identified by uuids of different lengths: the note id is recorded as note_id and the comment id as comment_id. One note_id corresponds to a plurality of comment_ids, and one comment_id corresponds to exactly one note_id. A note set and a comment set that correspond to each other are thus obtained.

In step S2, a large language model is used to extract topics from the notes. A large language model with a long context is selected and a reasonable instruction format, or prompt, is designed. The prompt contains the brand name or product name as context and follows prompt design principles; the plain-text note is merged into the prompt for topic extraction. The number of extracted topics is set to be no more than n (n can be adjusted according to actual needs), so each note yields at most n topics. All target notes are processed in the same way to obtain the topic set of each note.

In step S4, attribute extraction on the comment data processed in step S3 mainly extracts the following attributes:
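The topic extraction of step S2, including the upper bound n on the number of topics, can be sketched as follows. The model is passed in as a plain callable `llm(prompt) -> str`, which is an assumption standing in for whatever long-context model is used; the prompt wording, the JSON output convention and the function names are likewise hypothetical.

```python
import json


def build_topic_prompt(brand: str, note_text: str, n: int = 3) -> str:
    """Merge the plain-text note into a prompt that names the brand as context."""
    return (
        f"You are analysing a note about {brand}.\n"
        f"Extract at most {n} topics from the note between the ### markers.\n"
        "Return a JSON list of short topic strings.\n"
        f"###\n{note_text}\n###"
    )


def parse_topics(model_output: str, n: int = 3) -> list:
    """Parse the model's JSON list, drop blanks and duplicates, keep at most n."""
    topics = json.loads(model_output)
    if not isinstance(topics, list):
        raise ValueError("expected a JSON list of topics")
    unique = []
    for topic in topics:
        topic = str(topic).strip()
        if topic and topic not in unique:
            unique.append(topic)
    return unique[:n]  # enforce the limit even if the model ignores it


def extract_topics(llm, brand: str, note_text: str, n: int = 3) -> list:
    """Run one note through the (hypothetical) model callable."""
    return parse_topics(llm(build_topic_prompt(brand, note_text, n)), n)
```

Enforcing the limit again after parsing means the topic set stays within n even when the model returns more items than requested.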

1. emotional attributes (positive, negative, neutral);

3. keywords.

Further, in step S4, the LLM is used to extract the attributes of the comment data (emotion attribute, product attribute, topic relevance, keywords). Extraction is performed note by note, using the topic set of the note obtained in step S2 and the comment set corresponding to that note from step S3. The prompt is designed so that results are obtained step by step. The prompt design differs between models, but the overall framework is the same and comprises the following parts: 1) setting a role; 2) providing a context; 3) giving the instruction; 4) setting an input format; 5) setting an output format;

in practice, the role can be set as an experienced product manager with a deep understanding of a certain product (the product to be analyzed); the context is the topics of the note; because several attributes are extracted, the instruction decomposes the task into steps, for example, the input comment is processed as follows: first, the emotion polarity is judged, with some adjustments added to the judgment, for example, a comment inquiring about the product is judged as positive; second, the product attribute is assigned, with the product attributes defined, for example, as:

1) Product efficacy: the product can meet my needs and expectations;

2) Product composition: the ingredients of the product are safe and harmless and meet my preferences and requirements;

3) Product effect: an obvious effect or improvement can be seen after using the product;

4) Product usage: the product is convenient and easy to use;

5) Product purchase: purchase channels and promotional information for the product;

6) Product price: the price of the product is reasonable and the product offers good value for money;

7) Product brand reputation: the brand has a good reputation and word of mouth, good user reviews and high trust;

8) Product packaging and appearance: the packaging and appearance of the product are attractive and suit my taste;

9) Product quality and durability: the product is of good quality and is not damaged after long-term use;

10) Product service: good after-sales service, return policy, door-to-door installation and warranty service are provided after purchase;

11) Other: anything not covered by points 1 to 10;

the input format uses "#" characters or other separators to delimit the data input, so that the input does not interfere with the instructions; the output is defined in JSON format, and the output of each step is associated with a key, which facilitates data processing and validation. With the prompt set in this way, the comments are merged into the prompt and input into the LLM to obtain the multi-dimensional labels of each comment, namely (emotion dimension, product attribute, whether related to the topic, keyword set).
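Because each step's output is tied to a JSON key, the model's answer can be checked mechanically before it is used. The sketch below assumes one JSON object per comment; the key names and the short attribute identifiers are hypothetical stand-ins for the emotion classes and the eleven product attributes listed above.

```python
import json

EMOTIONS = {"positive", "negative", "neutral"}
PRODUCT_ATTRIBUTES = {
    "efficacy", "composition", "effect", "usage", "purchase", "price",
    "brand reputation", "packaging and appearance", "quality and durability",
    "service", "other",
}
REQUIRED_KEYS = {"emotion", "product_attribute", "topic_related", "keywords"}


def parse_label(raw: str) -> dict:
    """Parse and validate one comment's label; raise ValueError on bad output."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["emotion"] not in EMOTIONS:
        raise ValueError(f"unknown emotion: {data['emotion']!r}")
    if data["product_attribute"] not in PRODUCT_ATTRIBUTES:
        raise ValueError(f"unknown product attribute: {data['product_attribute']!r}")
    if not isinstance(data["topic_related"], bool):
        raise ValueError("topic_related must be true or false")
    keywords = data["keywords"]
    if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
        raise ValueError("keywords must be a list of strings")
    # Keep only the expected keys so extra model output is discarded.
    return {key: data[key] for key in sorted(REQUIRED_KEYS)}
```

Rejecting malformed answers at this point lets the caller retry the comment instead of passing an invalid label to the classification step.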

In step S5, attribute labels are classified according to the processing result of step S4.

Specifically, the comments are classified along the product dimension and the emotion dimension as follows: the Cartesian product of the product dimension and the emotion dimension is taken (e.g., positive-product efficacy, positive-product purchase). The product dimension uses the 11 classes from step S4 and the emotion dimension uses the three classes positive, negative and neutral, so the attribute combinations form 33 classes. All comments are divided into these 33 subclasses, and each comment belongs to exactly one class.
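The 3 x 11 = 33 label combinations and the grouping of comments under them can be sketched with the standard library. The short attribute identifiers and the dictionary layout of a labelled comment are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product

EMOTIONS = ["positive", "negative", "neutral"]
PRODUCT_ATTRIBUTES = [
    "efficacy", "composition", "effect", "usage", "purchase", "price",
    "brand reputation", "packaging and appearance", "quality and durability",
    "service", "other",
]

# Cartesian product of the two dimensions: one combined label per pair.
LABELS = [f"{emotion}-{attribute}" for emotion, attribute in product(EMOTIONS, PRODUCT_ATTRIBUTES)]
LABEL_SET = set(LABELS)


def group_by_label(labelled_comments):
    """Map each combined label to the ids of the comments that carry it."""
    groups = defaultdict(list)
    for comment in labelled_comments:
        label = f"{comment['emotion']}-{comment['product_attribute']}"
        if label not in LABEL_SET:
            raise ValueError(f"unknown label: {label}")
        groups[label].append(comment["comment_id"])  # exactly one label per comment
    return dict(groups)


print(len(LABELS))
```

Since every comment has one emotion and one product attribute, the grouping partitions the comment set: no comment appears under two labels.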

Referring to FIG. 2, in step S6, extracting the viewpoints under the same label specifically comprises the following: the comment sets under the various labels are obtained in step S4, and each comment set is processed as follows: each comment is encoded with shibing624/text2vec-base-chinese to obtain a vector of dimension 768; a BERTopic clustering model then adaptively clusters the comment vectors within each classified set, with the minimum classification set to 1; and each comment receives an adaptive label expressed as a number.

Specifically, after the classification in step S5, the comments are drilled down and clustered further. In a specific embodiment, the comments of one class from step S5 are taken, each comment is vectorized with Sentence Transformers using the shibing624/text2vec-base-chinese encoding model, and the resulting vectors are clustered with a BERTopic clustering model to obtain a category label.

In step S7, each comment has two labels with no hierarchical relationship: the first is the product-dimension and emotion-dimension label from step S5, and the other is the cluster label from step S6. Comments that share the same labels from step S5 and step S6 are aggregated, spliced together with separators, and input into the LLM, so that the LLM summarizes the comments expressing the same viewpoint.
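The per-class flow of step S6 (vectorize every comment in one class, cluster the vectors, attach a numeric label) can be sketched with pluggable components. To stay runnable without model downloads, the sketch swaps in two toy stand-ins that are not the techniques named in the source: a character-count embedding instead of a sentence encoder, and a greedy cosine-threshold grouping instead of BERTopic. All function names and the 0.8 threshold are assumptions.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: character counts (NOT a sentence encoder)."""
    return Counter(text.lower().replace(" ", ""))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_cluster(vectors, threshold=0.8):
    """Stand-in clustering: join the first cluster whose seed is similar enough."""
    seeds, labels = [], []
    for vector in vectors:
        for index, seed in enumerate(seeds):
            if cosine(vector, seed) >= threshold:
                labels.append(index)
                break
        else:  # no similar seed found, so this vector starts a new cluster
            seeds.append(vector)
            labels.append(len(seeds) - 1)
    return labels


def cluster_within_labels(groups, embed=toy_embed, cluster=greedy_cluster):
    """For each step-S5 class, pair every comment with a numeric cluster label."""
    result = {}
    for label, comments in groups.items():
        vectors = [embed(text) for text in comments]
        result[label] = list(zip(comments, cluster(vectors)))
    return result
```

In a real pipeline the `embed` and `cluster` arguments would be replaced by wrappers around the encoder and clustering model the source names; only the surrounding per-class loop is meant to carry over.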
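The aggregation and summarization of step S7 can be sketched as follows. The model is again a plain callable `llm(prompt) -> str`, which is an assumption; the separator, the prompt wording, the record layout and the hard truncation used to keep the summary within a length limit are also choices of this sketch rather than details from the source.

```python
from collections import defaultdict

SEPARATOR = "\n---\n"


def aggregate(comments):
    """Join the texts that share both the step-S5 label and the step-S6 cluster."""
    groups = defaultdict(list)
    for comment in comments:
        groups[(comment["label"], comment["cluster"])].append(comment["text"])
    return {key: SEPARATOR.join(texts) for key, texts in groups.items()}


def summarize_groups(comments, llm, max_chars=50):
    """Ask the (hypothetical) model for one viewpoint summary per group."""
    summaries = {}
    for key, long_text in aggregate(comments).items():
        prompt = (
            f"Summarize the shared viewpoint of the comments below in at most "
            f"{max_chars} characters. Comments are separated by '---'.\n\n{long_text}"
        )
        # Truncation is a safety net in case the model exceeds the requested length.
        summaries[key] = llm(prompt)[:max_chars].strip()
    return summaries
```

Keying the groups on the pair (label, cluster) reflects that the two labels are independent: the same cluster number under different step-S5 labels denotes different viewpoints.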

In step S8, the report dimensions mainly comprise the emotion distribution of the comments (from step S4), the distribution of the comments over product attributes (from step S4), the joint distribution of the comments over product attributes and emotion attributes (from step S5), the comment viewpoint insights and their proportions (from step S7), and the word cloud (from step S4).
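The report dimensions listed above reduce to simple counts over the processed comments. A minimal sketch, assuming each processed comment is a dictionary with the hypothetical keys `emotion`, `product_attribute`, `viewpoint` and `keywords`:

```python
from collections import Counter


def build_report(comments):
    """Compute the report dimensions from fully processed comments."""
    total = len(comments)
    emotion = Counter(c["emotion"] for c in comments)
    attribute = Counter(c["product_attribute"] for c in comments)
    joint = Counter(f"{c['emotion']}-{c['product_attribute']}" for c in comments)
    viewpoint = Counter(c["viewpoint"] for c in comments)
    keywords = Counter(k for c in comments for k in c["keywords"])
    return {
        "emotion_distribution": dict(emotion),
        "attribute_distribution": dict(attribute),
        "joint_distribution": dict(joint),
        # share of all comments that express each summarized viewpoint
        "viewpoint_share": {v: n / total for v, n in viewpoint.items()},
        # keyword frequencies, the usual input of a word cloud
        "word_cloud": dict(keywords),
    }
```

Each entry of the returned dictionary corresponds to one chart or table on the dashboard.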

Embodiment two:

the invention also discloses a comment analysis report generating device based on a large language model, comprising:

a memory for storing a computer program;

a processor, configured to implement the comment analysis report generating method based on the large language model according to the first embodiment when executing the computer program stored in the memory.

Embodiment three:

the invention also discloses a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the comment analysis report generating method based on the large language model according to the first embodiment can be realized when the computer program is executed by a processor.

The computer-readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification, and all processes or units of any method or apparatus so disclosed, may be employed, except that at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, any of the claimed embodiments can be used in any combination.

Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in accordance with embodiments of the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present invention may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. Several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order. These words may be interpreted as names.

The implementation principle of the invention is as follows: the invention discloses a comment analysis report generation method, device and storage medium based on a large language model. Target notes and the comments on those notes are obtained by crawler technology, and the strong context-processing capability of the large language model is applied to solve the problem that traditional machine learning cannot handle short texts that depend on a contextual scenario well. Traditional methods also generally rely on word segmentation, which breaks the integrity of sentences and prevents accurate understanding of the underlying meaning of comments (for example, special sentences such as irony). The invention instead designs suitable context prompts for the large language model to carry out preprocessing and feature processing such as emotion judgment and attribute classification, so that comment data are divided into different attribute spaces and the probability of misclassification is reduced. Using the labels as available context information, comment vectorization and cluster analysis are then applied to mine the comments within specific attributes, so that users' views of the product are mined more accurately from the comments. Finally, the summarization capability of the large model provides viewpoints directly in natural language, so that users can clearly understand market demand and feedback. Compared with general comment analysis and automatic extraction of comment viewpoints, the method is more accurate.

The embodiments described above are all preferred embodiments of the present invention and are not intended to limit its scope of protection; therefore, all equivalent changes made according to the structure, shape and principle of the invention shall fall within the scope of protection of the invention.

Claims

1. A comment analysis report generation method based on a large language model is characterized by comprising the following steps:

step S6, extracting the viewpoints under the same label;

2. The comment analysis report generating method based on the large language model according to claim 1, wherein in step S1, according to the brand or product type to be analyzed, published notes and comments are crawled in a compliant manner by crawler technology on the corresponding platform; notes and comments are identified by uuids of different lengths, the note id being recorded as note_id and the comment id as comment_id; one note_id corresponds to a plurality of comment_ids, and one comment_id corresponds to only one note_id; and a note set and a comment set that correspond to each other are obtained.

3. The comment analysis report generating method based on the large language model according to claim 1, wherein in step S2, a large language model is used to extract topics from the notes; a large language model with a long context is selected and a reasonable instruction format, or prompt, is designed; the prompt contains the brand name or product name as context and follows prompt design principles; the plain-text note is merged into the prompt for topic extraction; the number of extracted topics is set to be no more than n (n can be adjusted according to actual needs), so that each note yields at most n topics; and all target notes are processed in the same way to obtain the topic set of each note.

4. The comment analysis report generating method based on the large language model according to claim 1, wherein in step S4, attribute extraction on the comment data processed in step S3 mainly extracts the following attributes:

1) Emotional attribute (positive, negative, neutral);

2) Product attribute (e.g., product efficacy, product composition, product effect, product usage, product purchase, product price, product branding, product packaging and appearance, product quality and durability, product service);

3) Keywords.

5. The comment analysis report generating method based on the large language model according to claim 1, wherein in step S4, the LLM is used to extract the attributes of the comment data (emotion attribute, product attribute, topic relevance, keywords); extraction is performed note by note, using the topic set of the note obtained in step S2 and the comment set corresponding to that note from step S3; the prompt is designed so that results are obtained step by step; and the prompt design differs between models, but the overall framework is the same and comprises the following parts: 1) setting a role; 2) providing a context; 3) giving the instruction; 4) setting an input format; 5) setting an output format.

6. The comment analysis report generating method based on the large language model according to claim 1, wherein in the step S5, attribute tag classification is performed according to the processing result of the step S4, and the specific implementation method is as follows:

7. The comment analysis report generating method based on the large language model according to claim 1, wherein in step S6, extracting the viewpoints under the same label specifically comprises: obtaining the comment sets under the various labels in step S4 and processing each comment set as follows: encoding each comment with shibing624/text2vec-base-chinese to obtain a vector of dimension 768; adaptively clustering the comment vectors within each classified set with a BERTopic clustering model, with the minimum classification set to 1; and obtaining for each comment an adaptive label expressed as a number.

8. The comment analysis report generating method based on the large language model according to claim 1, wherein in step S8, the report dimensions mainly comprise the emotion distribution of the comments (from step S4), the distribution of the comments over product attributes (from step S4), the joint distribution of the comments over product attributes and emotion attributes (from step S5), the comment viewpoint insights and their proportions (from step S7), and the word cloud (from step S4).

9. A comment analysis report generating apparatus based on a large language model, characterized by comprising:

a memory for storing a computer program;

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the comment analysis report generating method based on a large language model as described above.