WO2023016267A1

WO2023016267A1 - Spam comment identification method and apparatus, and device and medium

Info

Publication number: WO2023016267A1
Application number: PCT/CN2022/108563
Authority: WO
Inventors: 邓冰娜; 谢永恒; 火一莽; 郭子剑
Original assignee: 北京锐安科技有限公司
Priority date: 2021-08-12
Filing date: 2022-07-28
Publication date: 2023-02-16
Also published as: CN113656580A

Abstract

Disclosed in the embodiments of the present application are a spam comment identification method and apparatus, and a device and a medium. The method comprises: acquiring a plurality of comments to be identified and a subject word set, which correspond to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each subject word among a plurality of subject words in the subject word set; identifying, according to a similarity calculation result, candidate spam comments and normal comments from the plurality of comments to be identified; when it is determined that there are comments to be identified, which have not been successfully identified, among the plurality of comments to be identified, and according to the identified normal comments, performing at least one of the following operations on the subject word set to obtain a new subject word set to serve as the subject word set: performing subject word augmentation on the subject word set, updating the calculation weights of the plurality of subject words in the subject word set, and updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified; and returning to execute the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each of the subject words in the subject word set, until all the comments to be identified are successfully identified.

Description

Method, device, equipment and medium for identifying spam comments

This application claims the priority of the Chinese patent application with application number 202110925078.4 submitted to the China Patent Office on August 12, 2021, the entire content of which is incorporated herein by reference.

Technical field

The embodiments of the present application relate to the technical field of big data mining, for example, to a method, device, equipment and medium for identifying spam comments.

Background

With the rapid development of Internet technology, the comment information on the Internet is growing explosively. How to filter the comment information on the Internet and identify spam comments has become an urgent problem to be solved.

In related technologies, for network spam comments, methods for preventing and identifying spam comments are mainly divided into two categories: manual identification methods and automatic identification methods. Among them, the automatic identification method can be further divided into classification identification method based on training set and identification method based on similarity.

However, manual identification can only screen newly published comments and filter out the spam among them; it can do nothing about spam comments that have already been published. It also requires continuous manual maintenance, which is inconvenient, and spammers can use a variety of proxy methods to evade the filtering mechanism. As for classification based on a training set, comments on the Internet are updated quickly and their feature words change considerably, so the training samples must change accordingly for the classifier to identify spam comments accurately. Whenever the training samples change, the feature items must be reselected and their weights recalculated and extracted, which seriously reduces the operating efficiency of the system.

Summary of the invention

The embodiments of the present application provide a method, apparatus, device and medium for identifying spam comments, so as to realize automatic identification of spam comments in Internet comment information.

The embodiment of the present application provides a method for identifying spam comments, including:

acquiring a plurality of comments to be identified and a subject term set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;

identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;

if it is determined that the plurality of comments to be identified include comments that have not been successfully identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and updating the plurality of comments to be identified to the comments that have not been successfully identified;

returning to the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject term in the subject term set, until all the comments to be identified are successfully identified.

The embodiment of the present application also provides a spam comment identification device, the device includes:

a similarity calculation module, configured to acquire a plurality of comments to be identified and a subject term set corresponding to a target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;

a comment identification module, configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;

a subject term set updating module, configured to, if it is determined that the plurality of comments to be identified include comments that have not been successfully identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to the comments that have not been successfully identified;

a comment success identification module, configured to return to calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject term in the subject term set, until all the comments to be identified are successfully identified.

The embodiment of the present application also provides a computer device, and the computer device includes:

one or more processors;

a storage device, configured to store one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for identifying spam comments described in any embodiment of the present application.

The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, wherein when the program is executed by a processor, the method for identifying spam comments as described in any embodiment of the present application is implemented.

Description of drawings

Fig. 1 is a flowchart of a method for identifying spam comments in Embodiment 1 of the present application;

Fig. 2 is a flowchart of a method for identifying spam comments in Embodiment 2 of the present application;

Fig. 3a is a flow chart of a method for identifying spam comments in Embodiment 3 of the present application;

Fig. 3b is an overall block diagram of a spam comment identification method in Embodiment 3 of the present application;

Fig. 4 is a schematic structural diagram of an identification device for spam comments in Embodiment 4 of the present application;

Fig. 5 is a schematic structural diagram of a computer device in Embodiment 5 of the present application.

Detailed description

The application will be described below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, but not to limit the present application. In addition, it should be noted that, for the convenience of description, only some structures related to the present application are shown in the drawings but not all structures.

Embodiment one

Fig. 1 is the flowchart of the method for identifying spam comments provided by Embodiment 1 of the present application. This embodiment is applicable to the situation of identifying spam comments in Internet comment information, and the method can be executed by an identification device for spam comments. The device can be implemented in the form of hardware and/or software, and can generally be integrated into a computer device with a function of identifying spam comments, for example, a terminal device or a server, etc. The method specifically includes the following steps:

S110. Acquire a plurality of comments to be identified and a subject term set corresponding to the target article, and calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set.

The comments to be identified refer to Internet comment information that corresponds to the target article and needs to be identified. The subject term set refers to a term set composed of a plurality of subject terms corresponding to the target article.

Exemplarily, the calculation weight of each subject term can be calculated according to the formula 1+log10(1+n), wherein, n represents the number of times the subject term appears in the target article.
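As a minimal sketch of the example weight formula 1+log10(1+n), assuming the target article has already been segmented into a list of tokens (the function name `term_weight` and the sample tokens are hypothetical and are not taken from the application):

```python
import math
from collections import Counter


def term_weight(n: int) -> float:
    """Calculation weight 1 + log10(1 + n) for a term that occurs n times."""
    return 1.0 + math.log10(1 + n)


# Hypothetical, already-segmented target article (illustrative data only).
article_tokens = ["battery", "phone", "battery", "screen", "battery"]
counts = Counter(article_tokens)

# Map each term of the article to its calculation weight.
weights = {term: term_weight(n) for term, n in counts.items()}
```

A term that never occurs (n = 0) still receives the weight 1, and the weight grows only logarithmically with the number of occurrences.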

S120. According to the similarity calculation result, identify candidate spam comments and normal comments from the plurality of comments to be identified.

Spam comments are comments with low similarity to the target article, that is, comments that are not strongly related to it. Candidate spam comments are comments whose similarity to the target article is initially judged to be low and whose category must be confirmed in a later step. Normal comments are comments with high similarity to the target article, that is, comments that are strongly related to it.

According to the calculation result of the similarity between each to-be-recognized comment and the target article, multiple to-be-recognized comments can be classified to distinguish candidate spam comments from normal comments.

In an optional implementation of this embodiment, comments to be identified whose similarity calculation results are greater than or equal to an upper preset threshold (for example, 90%) can be directly determined as normal comments; comments to be identified whose similarity calculation results are less than or equal to a lower preset threshold (for example, 5%) are directly determined as candidate spam comments; and comments to be identified whose similarity calculation results fall between the two thresholds (for example, between 5% and 90%) are determined as comments that have not been successfully identified.
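The three-way decision described here can be sketched as follows. The function name `classify` and the label strings are hypothetical; the default thresholds are the example values of 5% and 90% given in the text.

```python
def classify(similarity: float, low: float = 0.05, high: float = 0.90) -> str:
    """Map a similarity score in [0, 1] to one of three categories."""
    if similarity >= high:
        return "normal"
    if similarity <= low:
        return "candidate_spam"
    # Strictly between the two thresholds: not yet successfully identified.
    return "unidentified"
```

Only the comments labelled "unidentified" are carried into the next round.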

In the embodiment, the candidate spam comments may be directly determined as spam comments, or a secondary screening may be performed on the candidate spam comments, which is not limited in this embodiment.

S130. If it is determined that the plurality of comments to be identified include comments that have not been successfully identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and update the plurality of comments to be identified to the comments that have not been successfully identified.

Comments that have not been successfully identified are a third category of comments that are neither candidate spam comments nor normal comments. Expanding the subject term set means adding newly selected subject terms to it. The selection rules for new subject terms can be set according to actual needs; for example, synonyms of the subject terms in the target article can be used as new subject terms, which is not limited in this embodiment.

S140. Return to calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject term in the subject term set, until all the comments to be identified are successfully identified.

It is worth noting that the comments to be identified in this step are the comments that have not been successfully identified. Using the calculation weights of all subject terms in the expanded subject term set, the similarity between each such comment and the target article is recalculated in order to complete their identification. If some comments still cannot be identified after the subject term set has been expanded once, the subject term set is expanded again, until all the comments to be identified are successfully identified.

In the technical solution of this embodiment, the similarity between each comment to be identified and the target article is calculated using the calculation weight of each of the plurality of subject terms in the subject term set, and candidate spam comments and normal comments are identified among the comments to be identified according to the similarity calculation results. If some comments have not been successfully identified, the subject term set is updated and those comments are identified again, until all the comments to be identified have been identified. This realizes multi-round automatic identification of spam comments in Internet comment information and improves the identification effect.
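Steps S110 to S140 can be sketched as a loop. This is an illustration under stated assumptions, not the applicant's implementation: the similarity function and the subject term update are passed in as hypothetical callables, the thresholds default to the example values above, and `max_rounds` is a safety cap added here that the text does not mention.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]


def identify_all(
    comments: List[str],
    weights: Weights,
    similarity: Callable[[str, Weights], float],
    update_terms: Callable[[Weights, List[str]], Weights],
    low: float = 0.05,
    high: float = 0.90,
    max_rounds: int = 10,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (normal, candidate_spam, still_pending) comments."""
    normal: List[str] = []
    candidate_spam: List[str] = []
    pending = list(comments)
    for _ in range(max_rounds):  # cap is an assumption, not part of the text
        if not pending:
            break
        still_pending: List[str] = []
        for comment in pending:
            score = similarity(comment, weights)  # S110
            if score >= high:                     # S120
                normal.append(comment)
            elif score <= low:
                candidate_spam.append(comment)
            else:
                still_pending.append(comment)
        pending = still_pending
        if pending:
            # S130: expand and/or re-weight the subject terms from normal comments.
            weights = update_terms(weights, normal)
        # S140: the next iteration re-scores only the pending comments.
    return normal, candidate_spam, pending
```

The third returned list is empty when every comment was identified within `max_rounds` rounds.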

Embodiment two

Fig. 2 is a flowchart of the method for identifying spam comments provided by Embodiment 2 of the present application. This embodiment refines the above embodiment, wherein acquiring the plurality of comments to be identified corresponding to the target article includes: acquiring all comments corresponding to the target article, and matching each comment against a dictionary of commonly used Internet words; and obtaining candidate spam comments, candidate normal comments and unidentifiable comments according to the matching result, and determining the candidate normal comments and the unidentifiable comments as the plurality of comments to be identified.

Optionally, there are a plurality of identified candidate spam comments; after all the comments to be identified are successfully identified, the method further includes: filtering the plurality of identified candidate spam comments, and identifying each candidate spam comment as a spam comment or a normal comment according to the filtering result.

As shown in Figure 2, the method includes the following steps:

S210. Obtain all comments corresponding to the target article, and match each comment with a dictionary of commonly used words on the Internet.

Commonly used Internet words are conventional characters, words or phrases that appear frequently on the Internet, for example "top", "refueling", "support", "sofa", "boredom", "soy sauce", "occupying a seat" and "pouring water" (these are literal translations of Chinese Internet slang). The dictionary of commonly used Internet words is a lexicon containing such words.

S220. Obtain candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching result, and determine the candidate normal comments and the unrecognizable comments as the plurality of comments to be identified.

Candidate spam comments are short spam comments; candidate normal comments are short normal comments; unidentifiable comments are all non-short comments.

In an optional implementation, after word segmentation, part-of-speech filtering, de-duplication and stop-word removal are performed on all comments corresponding to the target article, the length L of each comment is calculated and a threshold T (for example, 5≤T≤8) is set to evaluate the comment length. When L<T, the comment is a short comment, and the set of short comments is defined as ShorD; when L≥T, the comment is a non-short comment, and the set of non-short comments is LongD. Each comment in the set ShorD is searched for and matched against the words in the lexicon of commonly used Internet words; the number of matched normal Internet words is recorded as num1, and the number of matched spam Internet words is recorded as num2. If num1≥num2, the comment is marked as a candidate normal comment; if num1<num2, the comment is marked as a candidate spam comment.
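The length-based pre-classification can be sketched as follows, assuming each comment has already been preprocessed into a token list and that the lexicon is available as two word sets. The function name, the label strings and the way the lexicon is split into normal and spam words are all hypothetical.

```python
from typing import List, Set


def preclassify(
    tokens: List[str],
    normal_words: Set[str],
    spam_words: Set[str],
    threshold: int = 5,  # T; the text gives 5 <= T <= 8 as an example range
) -> str:
    """Label one preprocessed comment by its length L and lexicon matches."""
    if len(tokens) >= threshold:
        # L >= T: non-short comment (set LongD), left for similarity calculation.
        return "unidentifiable"
    num1 = sum(1 for t in tokens if t in normal_words)  # matched normal Internet words
    num2 = sum(1 for t in tokens if t in spam_words)    # matched spam Internet words
    return "candidate_normal" if num1 >= num2 else "candidate_spam"
```

Comments labelled "candidate_normal" or "unidentifiable" then form the comments to be identified in the similarity rounds.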

Short comments made up of commonly used Internet words contain essentially no words related to the content of the target article, so identifying their category by text similarity is bound to work poorly. Therefore, this embodiment takes comment length into account: short comments are first identified using the lexicon of commonly used Internet words, and the unidentifiable comments are then identified by text similarity, so that spam comments can be identified whether a comment is short or not.

S230. Acquire a plurality of comments to be identified and a subject term set corresponding to the target article, and calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set.

S240. According to the similarity calculation result, identify candidate spam comments and normal comments among the plurality of comments to be identified.

S250. If it is determined that the plurality of comments to be identified include comments that have not been successfully identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and update the plurality of comments to be identified to the comments that have not been successfully identified.

S260. Return to calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject term in the subject term set, until all the comments to be identified are successfully identified.

For details not explained in this embodiment, please refer to the foregoing embodiments, and details are not repeated here.

S270. Filter the multiple identified candidate spam comments, and identify each identified candidate spam comment as a spam comment or a normal comment according to the filtering result.

In the technical solution of this embodiment, all comments corresponding to the target article are matched against the dictionary of commonly used Internet words to obtain candidate spam comments, candidate normal comments and unidentifiable comments; the candidate normal comments and unidentifiable comments are determined as the comments to be identified and are identified by text similarity calculation; and after all the comments to be identified have been successfully identified, secondary filtering is performed on the identified candidate spam comments. In this way spam comments can be identified among both short and non-short comments, automatic identification of spam comments in Internet comment information is realized, and the identification effect is improved.

Secondary filtering means filtering the candidate spam comments again using the commonly used Internet words and the subject terms, by comparing the proportion of normal words and the proportion of spam words in the total vocabulary of each candidate spam comment. When the proportion of normal words in the total vocabulary of a candidate spam comment is greater than or equal to a threshold, the comment is considered a normal comment; when the proportion is less than the threshold, the comment is considered a spam comment. This reduces the possibility of normal comments being identified as spam comments. The threshold can be set according to actual requirements, which is not limited in this embodiment.
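A minimal sketch of the secondary filtering decision follows. It assumes that "normal words" are the union of the normal Internet words and the subject terms, that the threshold is 0.5, and that an empty comment is treated as spam; none of these three choices is specified in the text, and the function name is hypothetical.

```python
from typing import List, Set


def secondary_filter(
    tokens: List[str],
    normal_words: Set[str],   # assumed: normal Internet words plus subject terms
    threshold: float = 0.5,   # assumed value; the text leaves it configurable
) -> str:
    """Re-label one candidate spam comment as 'normal' or 'spam'."""
    if not tokens:
        return "spam"  # assumption: an empty comment carries no normal words
    ratio = sum(1 for t in tokens if t in normal_words) / len(tokens)
    return "normal" if ratio >= threshold else "spam"
```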

Embodiment three

Fig. 3a is a flowchart of the method for identifying spam comments provided by Embodiment 3 of the present application, and Fig. 3b is an overall block diagram of the method for identifying spam comments provided by Embodiment 3 of the present application. This embodiment refines the above embodiments, wherein identifying the candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result includes: obtaining the similarity calculation result corresponding to each comment to be identified; if the similarity calculation result corresponding to a comment to be identified is less than or equal to a first threshold, determining that the comment is a candidate spam comment; if the similarity calculation result corresponding to a comment to be identified is greater than or equal to a second threshold, determining that the comment is a normal comment; and if the similarity calculation result corresponding to a comment to be identified is greater than the first threshold and less than the second threshold, determining that the comment has not been successfully identified.

As shown in Figure 3a, the method includes the following steps:

S310. Obtain all the comments corresponding to the target article, and match each comment with the dictionary of commonly used words on the Internet.

S320. Obtain candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching result, and determine the candidate normal comments and the unrecognizable comments as the plurality of comments to be identified.

S330. Acquire a plurality of comments to be identified and a subject term set corresponding to the target article, and calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set.

Optionally, the calculation weight of each subject term includes the weight of that subject term in the target article. Calculating the similarity between each comment to be identified and the target article P according to the calculation weight of each of the plurality of subject terms in the subject term set includes:

calculating the similarity between each comment to be identified and the target article using the following formula:

[formula not reproduced in this text]

where C_k represents the vector of the k-th comment to be identified, P represents the vector of the target article, n is the dimension of the vectors, w_i represents the weight of subject term i in the target article, w_ik represents the weight of subject term i in the k-th comment to be identified, and S_i represents the semantic information between words. In the first round of similarity calculation between the comments to be identified and the target article, S_i is 1; in the remaining rounds, S_i is given by:

[formula not reproduced in this text]

where Sim(P_i, C_i'k) is the similarity score between subject term i in the target article and its synonym i' in the k-th comment to be identified; if the two are the same word, the value is 1. A second factor in the expression for S_i (also not reproduced in this text) denotes the word-form similarity, where LenP is the number of subject terms in the target article and Same(P, C_k) is the number of subject terms of the target article, or synonyms of those subject terms, that appear in the k-th comment to be identified. Because the value of this factor is not greater than 1, multiplying by it would reduce the value of the whole formula and affect the similarity score, so a smoothing factor of 0.5 is added to the formula.

In this embodiment, n is the dimension of the vector C_k of the k-th comment to be identified and of the vector P of the target article, and n is numerically equal to the number of subject terms.
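The exact formula is not available in this text, so the following is only a sketch of one plausible shape: a cosine similarity over the subject term weights in which each numerator term is scaled by S_i. The placement of S_i in the numerator is an assumption, as are the function name and the handling of all-zero vectors; how S_i itself is computed in later rounds is deliberately left to the caller.

```python
import math
from typing import Optional, Sequence


def weighted_cosine(
    w_article: Sequence[float],          # w_i: weights of the subject terms in P
    w_comment: Sequence[float],          # w_ik: weights of the same terms in C_k
    s: Optional[Sequence[float]] = None, # S_i; None means the first round (all 1)
) -> float:
    """Cosine similarity with a per-term semantic factor (assumed form)."""
    n = len(w_article)
    if len(w_comment) != n:
        raise ValueError("both vectors must have one entry per subject term")
    if s is None:
        s = [1.0] * n
    numerator = sum(wi * wik * si for wi, wik, si in zip(w_article, w_comment, s))
    denominator = math.sqrt(sum(wi * wi for wi in w_article)) * math.sqrt(
        sum(wik * wik for wik in w_comment)
    )
    # Assumption: a comment sharing no subject term scores 0 rather than failing.
    return numerator / denominator if denominator else 0.0
```

With all S_i equal to 1 this reduces to the ordinary cosine similarity between the two weight vectors.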

In addition, in the remaining rounds, to make up for the inability of the traditional similarity method to recognize synonyms and to improve the similarity score between the comment to be identified and the target article, this embodiment proposes calculating the similarity using an improved cosine similarity formula based on inter-word similarity, word position information and word-form similarity. The improved formula is as follows:

[formula not reproduced in this text]

where Similarity'(P, C_k) is the improved similarity between the k-th comment to be identified C_k and the target article P, w'_i = w_i * L(t), w'_ik = w_ik * L(t), and L(t) represents the position of subject term i in the target article.

S340. Obtain a similarity calculation result corresponding to each comment to be identified.

S350. If it is determined that the similarity calculation result corresponding to each comment to be identified is less than or equal to the first threshold, determine that each comment to be identified is a candidate spam comment.

The first threshold refers to a preset value used to evaluate the comment to be identified as a candidate spam comment. The first threshold may be set according to specific actual requirements, which is not limited in this embodiment.

S360. If it is determined that the similarity calculation result corresponding to each comment to be identified is greater than or equal to a second threshold, determine that each comment to be identified is a normal comment.

The second threshold refers to a preset value used to evaluate the comment to be identified as a normal comment. The second threshold may be set according to specific actual requirements, which is not limited in this embodiment.

S370. If it is determined that the similarity calculation result corresponding to each comment to be identified is greater than the first threshold and less than the second threshold, determine that each comment to be identified has not been successfully identified.

Exemplarily, the first threshold is set to a value b and the second threshold is set to a value a. If the similarity calculation result corresponding to a comment to be identified is less than or equal to b, the comment is determined to be a candidate spam comment; if the result is greater than or equal to a, the comment is determined to be a normal comment; and if the result is greater than b and less than a, the comment is determined not to have been successfully identified.

S380. If it is determined that the plurality of comments to be identified include comments that have not been successfully identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and update the plurality of comments to be identified to the comments that have not been successfully identified.

Optionally, expanding the subject term set according to the identified normal comments includes: obtaining the high-frequency terms included in the identified normal comments, adding them to the subject term set as new subject terms, and setting weights for the newly added subject terms; and counting the frequency of occurrence in the target article of each subject term included in the new subject term set, and, according to the frequency of occurrence, adding to the subject term set, as new terms, the co-occurrence words obtained by matching in a co-occurrence word library that are associated with at least one term in the new subject term set, and setting weights for the newly added terms.

The weight of a high-frequency term is adjusted according to the following formula:

[formula not reproduced in this text]

where Weight'(t_r) is the adjusted weight of the high-frequency term; t_r is a term that appears in the normal comments; T(t_r) is the weight of term t_r in the normal comments, calculated as 1+log10(1+n_k); T(k) is the number of normal comments that include term t_r after K rounds of similarity comparison; and N(k) is the total number of normal comments after K rounds of similarity comparison.

Calculating the weights of the terms in the normal comments intuitively reflects the weight that those terms carry with respect to the target article, so that terms with higher weights can be added to the subject term set.
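Because the formula itself is missing from this text, the sketch below assumes the simplest combination of the quantities that are defined: Weight'(t_r) = T(t_r) × T(k)/N(k), with n_k taken to be the number of occurrences of t_r in the normal comments. Both the product form and that reading of n_k are assumptions, and the function name is hypothetical.

```python
import math
from typing import List


def high_freq_weight(term: str, normal_comments: List[List[str]]) -> float:
    """Assumed form: (1 + log10(1 + n_k)) * T(k) / N(k)."""
    n_k = sum(tokens.count(term) for tokens in normal_comments)   # occurrences of the term
    t_k = sum(1 for tokens in normal_comments if term in tokens)  # T(k): comments containing it
    n_total = len(normal_comments)                                # N(k): all normal comments
    if n_total == 0:
        return 0.0  # assumption: no normal comments yet means no evidence
    return (1 + math.log10(1 + n_k)) * t_k / n_total
```

Under this assumed form, a term that occurs often but only in a single comment is weighted lower than one spread across many normal comments.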

Optionally, updating the calculation weights of the plurality of subject terms in the subject term set according to the identified normal comments includes:

updating the calculation weight of each subject term in the subject term set using the following formula:

[formula not reproduced in this text]

where Weight'(i) is the updated calculation weight of term i; term i is a subject term in the subject term set of the target article, and i' is a synonym of term i appearing in the normal comments; n_p is the number of occurrences of term i in the target article; n_k is the number of times term i appears in the normal comments after K rounds of similarity comparison; T(k) is the number of normal comments that include term i after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; N_i' is the set of synonyms of term i; Weight(i') is the weight of synonym i' of term i across all normal comments; Sim(i, i') is the similarity score between term i and synonym i'; and μ is an adjustment factor greater than 0 that controls how strongly the weights and similarities of the synonyms in the synonym set of term i influence the weight of term i.

Here, 1 + log(1 + n_p + n_k) represents the word frequency of the word. The 1 inside the logarithm prevents the logarithm from evaluating to zero: n_p may well be 0, and if n_k is then 1, log(n_p + n_k) would be 0. In addition, because the value of the logarithm is generally less than 1, it would lower the value of the whole formula and could adversely affect the subsequent comment classification, so in this embodiment 1 is also added in front of the logarithm.

In the formula, T(k)/N(k) indicates the ratio of the number of normal comments in which word i appears to the total number of normal comments. A high word frequency alone is not sufficient; it also matters whether the word is evenly distributed across the comments in which it appears. The larger T(k) is, the better: it means that the word is more evenly distributed within this class and that many commenters are discussing the issue. Because T(k)/N(k) is a number no greater than 1, multiplying by it reduces the weight of high-frequency subject words and thereby their negative impact on classification. To a certain extent it also reduces the influence that high-frequency subject words in fake comments have on comment classification, lowering the similarity score of fake comments.

If a synonym of a subject word appears frequently in other comments, it means that what commenters are discussing is related both to this subject word and to the content of the article. The calculation weight of such a subject word should therefore be increased by adding a term that carries the weight information of its synonyms, that is, the part of the formula involving μ, Weight(i') and Sim(i, i').

In an optional embodiment, the calculation weight of the synonym i' of the term i may be adjusted according to the following formula, so as to realize the adjustment of the calculation weight of the subject term i.

Here, i' is a synonym, appearing in the comments, of word i in the subject word set; T(i') is the weight of word i' in the normal comments, calculated as 1 + log10(1 + n_k); T(k) is the number of normal comments containing word i' after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; T(i_p) is the weight of the synonym i_p of word i', where word i_p is a word in the subject word set before the weight adjustment; and N_p is the set of synonyms of word i', obtained from the subject word set of the target article.

By continuously adjusting the calculation weights of the subject words, the above technical solution makes the identification results for the comments to be identified more accurate and reliable.
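The weight update described above can be illustrated with a short sketch. The formula itself is not reproduced in this text, so the way the three described parts are combined here (word frequency term multiplied by T(k)/N(k), plus μ times the sum of Weight(i') × Sim(i, i') over the synonym set) is an assumption inferred from the explanation, as are the base-10 logarithm, the function name `updated_weight` and the default value of `mu`.

```python
import math
from typing import Iterable, Tuple


def updated_weight(
    n_p: int,
    n_k: int,
    t_k: int,
    n_total: int,
    synonyms: Iterable[Tuple[float, float]] = (),
    mu: float = 0.5,  # assumption: the source only requires mu > 0
) -> float:
    """Hypothetical reconstruction of Weight'(i) from the textual description.

    n_p      -- occurrences of word i in the target article
    n_k      -- occurrences of word i in the normal comments so far
    t_k      -- number of normal comments containing word i
    n_total  -- total number of normal comments, N(k)
    synonyms -- (Weight(i'), Sim(i, i')) pairs for the synonyms of word i
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if mu <= 0:
        raise ValueError("mu must be greater than 0")
    # Word frequency term: 1 + log(1 + n_p + n_k); base 10 is an assumption.
    frequency = 1.0 + math.log10(1 + n_p + n_k)
    # Share of normal comments that contain the word, T(k) / N(k).
    spread = t_k / n_total
    # Contribution of synonyms seen in the normal comments.
    synonym_term = mu * sum(weight * sim for weight, sim in synonyms)
    return frequency * spread + synonym_term
```

With n_p = 9, n_k = 0, T(k) = 5 and N(k) = 10, the frequency term is 2 and the ratio is 0.5, so the sketch yields 1.0 before any synonym contribution is added.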

S390. Return to the step of calculating the similarity between each of the multiple comments to be identified and the target article according to the calculation weight of each subject word in the subject word set, until all the comments to be identified are successfully identified.

S3100. Filter the multiple identified candidate spam comments, and identify each identified candidate spam comment as a spam comment or a normal comment according to the filtering result.

In the technical solution of this embodiment of the present application, all the comments corresponding to the target article are matched against a dictionary of commonly used network words to obtain candidate spam comments, candidate normal comments and unrecognizable comments, and the candidate normal comments and the unrecognizable comments are determined to be the comments to be identified. With a first threshold and a second threshold set, a text similarity calculation method is used to identify the comments to be identified, and after all of them have been successfully identified, secondary filtering is performed on the candidate spam comments. In this way, spam comments can be identified regardless of whether a comment is short, automatic identification of spam comments in Internet comment information is realized, and the identification effect for spam comments is improved.
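The round-by-round identification summarized above can be sketched as a simple loop. This is a minimal illustration, not the implementation of the embodiment: the function name `identify_in_rounds`, the callbacks `score` and `update_subject_words`, the toy scoring rule and the `max_rounds` safety cap are all assumptions introduced for the example, and the dictionary matching and secondary filtering steps are left out.

```python
from typing import Callable, Dict, List, Tuple


def identify_in_rounds(
    comments: List[str],
    score: Callable[[str], float],
    update_subject_words: Callable[[List[str]], None],
    first_threshold: float,
    second_threshold: float,
    max_rounds: int = 10,  # assumption: safety cap, not part of the source
) -> Tuple[List[str], List[str], List[str]]:
    """Return (candidate_spam, normal, still_unidentified)."""
    candidate_spam: List[str] = []
    normal: List[str] = []
    pending = list(comments)
    for _ in range(max_rounds):
        if not pending:
            break
        unresolved: List[str] = []
        for comment in pending:
            s = score(comment)
            if s <= first_threshold:
                candidate_spam.append(comment)
            elif s >= second_threshold:
                normal.append(comment)
            else:
                unresolved.append(comment)  # not successfully identified yet
        if unresolved:
            # Expand / re-weight the subject words from the normal comments.
            update_subject_words(normal)
        pending = unresolved
    return candidate_spam, normal, pending


# Toy usage: the score is the mean subject word weight of a comment's words.
weights: Dict[str, float] = {"battery": 1.0}


def toy_score(comment: str) -> float:
    words = comment.split()
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def toy_update(normal_comments: List[str]) -> None:
    for comment in normal_comments:
        for word in comment.split():
            weights.setdefault(word, 0.6)  # hypothetical weight for new words


spam, normal, left = identify_in_rounds(
    ["battery life", "battery life good", "buy cheap pills"],
    toy_score, toy_update, first_threshold=0.1, second_threshold=0.5,
)
```

In this toy run, "battery life good" falls between the two thresholds in the first round and is only classified as normal in the second round, after "life" has been added to the subject words from the first normal comment.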

Embodiment Four

FIG. 4 is a schematic structural diagram of an apparatus for identifying spam comments provided in Embodiment 4 of the present application. The apparatus can implement the method for identifying spam comments described in the above embodiments, and can be implemented in software and/or hardware. As shown in FIG. 4, the apparatus for identifying spam comments includes a similarity calculation module 410, a comment identification module 420, a subject word set update module 430 and a comment success identification module 440.

The similarity calculation module 410 is configured to obtain a plurality of comments to be identified and a subject word set corresponding to the target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject words in the subject word set. The comment identification module 420 is configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result. The subject word set update module 430 is configured to, if it is determined that comments that have not been successfully identified exist among the plurality of comments to be identified, perform at least one of the following operations on the subject word set according to the identified normal comments to obtain a new subject word set as the subject word set: performing subject word expansion on the subject word set, and updating the calculation weights of the plurality of subject words in the subject word set; and to update the plurality of comments to be identified to the comments to be identified that have not been successfully identified. The comment success identification module 440 is configured to return to the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject word in the subject word set, until all the comments to be identified are successfully identified.

In the technical solution of this embodiment of the present application, the similarity between each comment to be identified and the target article is calculated using the calculation weight of each subject word in the subject word set, and candidate spam comments and normal comments are identified among the multiple comments to be identified according to the similarity calculation result. If it is determined that some comments to be identified have not been successfully identified, the subject word set is updated and those comments are identified again, until the category of every comment to be identified has been determined. This realizes multi-round automatic identification of spam comments in Internet comment information and improves the identification effect for spam comments.

Optionally, the similarity calculation module 410 is configured to obtain all comments corresponding to the target article and match each comment against the dictionary of commonly used network words; and, according to the matching result, to obtain candidate spam comments, candidate normal comments and unrecognizable comments, and determine the candidate normal comments and the unrecognizable comments as the plurality of comments to be identified.

Optionally, there are multiple identified candidate spam comments; the apparatus for identifying spam comments further includes a secondary filtering module, which is configured to, after all comments to be identified are successfully identified, filter the multiple identified candidate spam comments and identify each identified candidate spam comment as a spam comment or a normal comment according to the filtering result.

Optionally, the comment identification module 420 is configured to obtain the similarity calculation result corresponding to each comment to be identified; if the similarity calculation result corresponding to a comment to be identified is less than or equal to a first threshold, determine that the comment is a candidate spam comment; if the similarity calculation result corresponding to a comment to be identified is greater than or equal to a second threshold, determine that the comment is a normal comment; and if the similarity calculation result of a comment to be identified is greater than the first threshold and less than the second threshold, determine that the comment has not been successfully identified.

Optionally, the calculation weight of each subject term includes: the weight of each subject term in the target article; the similarity calculation module 410 is set to use the formula

to calculate the similarity between each comment to be identified and the target article. Here, C_k represents the vector of the k-th comment to be identified, P represents the vector of the target article, n is the dimension of the vectors, w_i represents the weight of subject word i in the target article, w_ik represents the weight of subject word i in the k-th comment to be identified, and S_i represents the semantic information between words. When the similarity between the comments to be identified and the target article is calculated in the first round, S_i is 1, and in the remaining rounds

Sim(P_i, C_i'k) represents the similarity score between subject word i in the k-th comment to be identified and the synonym i' of subject word i in the target article,

indicates the word form similarity, where LenP is the number of subject words in the target article, and Same(P, C_k) is the number of subject words of the target article, or synonyms of those subject words, that appear in the k-th comment to be identified.
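The similarity calculation can be sketched as follows. The formula is not reproduced in this text, so the cosine-style form below (each product w_i × w_ik scaled by S_i and normalized by the two vector norms) is an assumption consistent with the symbol definitions rather than the formula of the embodiment; the ratio Same(P, C_k) / LenP used for word form similarity is likewise an assumed reading, and the function names are hypothetical.

```python
import math
from typing import Optional, Sequence


def weighted_similarity(
    article_weights: Sequence[float],   # w_i, weight of subject word i in P
    comment_weights: Sequence[float],   # w_ik, weight of subject word i in C_k
    semantic: Optional[Sequence[float]] = None,  # S_i; all 1 in the first round
) -> float:
    """Assumed cosine-style similarity between comment C_k and article P."""
    n = len(article_weights)
    if len(comment_weights) != n:
        raise ValueError("vectors must have the same dimension n")
    if semantic is None:
        semantic = [1.0] * n
    if len(semantic) != n:
        raise ValueError("semantic factors must have dimension n")
    numerator = sum(
        w * wk * s for w, wk, s in zip(article_weights, comment_weights, semantic)
    )
    norm_p = math.sqrt(sum(w * w for w in article_weights))
    norm_c = math.sqrt(sum(wk * wk for wk in comment_weights))
    if norm_p == 0.0 or norm_c == 0.0:
        return 0.0  # a comment sharing no subject words has no similarity
    return numerator / (norm_p * norm_c)


def word_form_similarity(same_count: int, len_p: int) -> float:
    """Assumed ratio Same(P, C_k) / LenP."""
    return same_count / len_p if len_p > 0 else 0.0
```

Under these assumptions, a first-round call with identical weight vectors returns 1, and scaling every S_i by 0.5 in a later round halves the score.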

Optionally, the subject word set update module 430 is configured to obtain the high-frequency words included in the identified normal comments, add the high-frequency words to the subject word set as new subject words, and set weights for the newly added subject words; and to count the frequency of occurrence, in the target article, of each subject word included in the new subject word set, obtain, by matching in a co-occurrence word thesaurus according to the frequency of occurrence, co-occurrence words associated with at least one subject word in the new subject word set, add these co-occurrence words to the subject word set as new subject words, and set weights for the newly added subject words.
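The two expansion steps can be sketched as below. The text does not specify how "high-frequency" is decided, how the frequency of occurrence selects entries in the co-occurrence thesaurus, or which weight new subject words receive, so `min_count`, `min_article_freq`, `new_weight`, the dictionary-based thesaurus and the function name `expand_subject_words` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Sequence


def expand_subject_words(
    subject_weights: Mapping[str, float],
    normal_comments: Iterable[Sequence[str]],   # tokenized normal comments
    article_tokens: Sequence[str],              # tokenized target article
    cooccurrence: Mapping[str, List[str]],      # hypothetical thesaurus
    min_count: int = 2,         # assumed cut-off for "high-frequency"
    min_article_freq: int = 1,  # assumed cut-off for thesaurus lookup
    new_weight: float = 1.0,    # assumed initial weight of new subject words
) -> Dict[str, float]:
    expanded = dict(subject_weights)

    # Step 1: add high-frequency words from the identified normal comments.
    counts = Counter(tok for comment in normal_comments for tok in comment)
    for word, count in counts.items():
        if count >= min_count and word not in expanded:
            expanded[word] = new_weight

    # Step 2: for subject words frequent enough in the article,
    # add their co-occurrence words from the thesaurus.
    article_counts = Counter(article_tokens)
    for word in list(expanded):
        if article_counts[word] >= min_article_freq:
            for co_word in cooccurrence.get(word, ()):
                expanded.setdefault(co_word, new_weight)
    return expanded


result = expand_subject_words(
    {"battery": 2.0},
    [["battery", "charger"], ["charger", "fast"]],
    ["battery", "battery", "phone"],
    {"battery": ["power"], "charger": ["cable"]},
)
```

In this example "charger" is added because it appears twice in the normal comments, and "power" is added as a co-occurrence word of "battery"; "cable" is not added because "charger" does not occur in the article itself.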

Optionally, the subject term set update module 430 is set to use the formula

to update the calculation weight of each subject word in the subject word set. Here, Weight'(i) is the updated calculation weight of word i; word i is a subject word in the subject word set of the target article; the synonym i' is a synonym of word i that appears in the normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in the normal comments after K rounds of similarity comparison; T(k) is the number of normal comments that include word i after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of the synonym i' of word i in all normal comments; Sim(i, i') is the similarity score between word i and synonym i'; and μ is an adjustment factor greater than 0.

The spam comment identification device provided in the embodiment of the present application can execute the spam comment identification method provided in any embodiment of the present application, and has corresponding functional modules for executing the method.

Embodiment Five

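FIG. 5 is a schematic structural diagram of a computer device provided in Embodiment 5 of the present application. As shown in FIG. 5, the computer device includes a processor 510, a memory 520, an input device 530 and an output device 540. The number of processors 510 in the computer device may be one or more, and one processor 510 is taken as an example in FIG. 5. The processor 510, the memory 520, the input device 530 and the output device 540 may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 5.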

The memory 520, as a computer-readable storage medium, can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for identifying spam comments in the embodiments of the present application (for example, the similarity calculation module 410, the comment identification module 420, the subject word set update module 430 and the comment success identification module 440 in the apparatus for identifying spam comments). By running the software programs, instructions and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the computer device, that is, implements the above-mentioned method for identifying spam comments.

The method includes: obtaining a plurality of comments to be identified and a subject word set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject words in the subject word set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that comments that have not been successfully identified exist among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject word set to obtain a new subject word set as the subject word set: performing subject word expansion on the subject word set, and updating the calculation weights of the plurality of subject words in the subject word set; and updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified; and returning to the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject word in the subject word set, until all the comments to be identified are successfully identified.

The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices. In some examples, the memory 520 may further include memory located remotely from the processor 510, and these remote memories may be connected to the computer device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 530 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output device 540 may include a display device such as a display screen.

Embodiment Six

Embodiment 6 of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to implement a method for identifying spam comments. The method includes: obtaining a plurality of comments to be identified and a subject word set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject words in the subject word set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that comments that have not been successfully identified exist among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject word set to obtain a new subject word set as the subject word set: performing subject word expansion on the subject word set, and updating the calculation weights of the plurality of subject words in the subject word set; and updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified; and returning to the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject word in the subject word set, until all the comments to be identified are successfully identified.

Of course, in the storage medium containing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above, and can also perform related operations in the method for identifying spam comments provided in any embodiment of the present application.

From the above description of the implementations, those skilled in the art can clearly understand that the present application can be implemented by software together with the necessary general-purpose hardware, and of course can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (FLASH), a hard disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the multiple embodiments of the present application.

It should be noted that in the embodiment of the device for identifying spam comments above, the multiple units and modules included are only divided according to functional logic, but are not limited to the above-mentioned divisions, as long as the corresponding functions can be realized; In addition, the specific name of each functional unit is only for the convenience of distinguishing each other, and is not used to limit the protection scope of the present application.

Claims

A method for identifying spam comments, comprising:

Obtaining a plurality of comments to be identified and a subject word set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of a plurality of subject words in the subject word set;

Identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;

When it is determined that comments that have not been successfully identified exist among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject word set to obtain a new subject word set as the subject word set: performing subject word expansion on the subject word set, and updating the calculation weights of the plurality of subject words in the subject word set; and updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified;

Returning to the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject word in the subject word set, until all the comments to be identified are successfully identified.
The method according to claim 1, wherein obtaining a plurality of comments to be identified corresponding to the target article comprises:

Obtaining all the comments corresponding to the target article, and matching each comment against a dictionary of commonly used network words;

According to the matching result, candidate spam comments, candidate normal comments, and unrecognizable comments are obtained, and the candidate normal comments and the unrecognizable comments are determined as the plurality of comments to be identified.
The method according to claim 2, wherein there are multiple identified candidate spam comments;

After all the comments to be identified are successfully identified, the method further comprises:

The plurality of identified candidate spam comments are filtered, and each identified candidate spam comment is identified as a spam comment or a normal comment according to the filtering result.
The method according to any one of claims 1-3, wherein identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result comprises:

Obtain the similarity calculation result corresponding to each comment to be identified;

In the case where the similarity calculation result corresponding to each comment to be identified is less than or equal to a first threshold, determine that each comment to be identified is a candidate spam comment;

In the case where the similarity calculation result corresponding to each comment to be identified is greater than or equal to a second threshold, it is determined that each comment to be identified is a normal comment;

In a case where the similarity calculation result corresponding to each comment to be identified is greater than the first threshold and less than the second threshold, it is determined that each comment to be identified has not been successfully identified.
The method according to any one of claims 1-3, wherein the calculation weight of each subject term comprises: the weight of each subject term in the target article;

Calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject words in the subject word set comprises:

use the formula
Calculate the similarity between each comment to be identified and the target article;

Here, C_k represents the vector of the k-th comment to be identified, P represents the vector of the target article, n is the dimension of the vectors, w_i represents the weight of subject word i in the target article, w_ik represents the weight of subject word i in the k-th comment to be identified, and S_i represents the semantic information between words; in the first round of similarity calculation between the comments to be identified and the target article, S_i is 1, and in the remaining rounds
Sim(P_i, C_i'k) represents the similarity score between subject word i in the k-th comment to be identified and the synonym i' of subject word i in the target article,
indicates the word form similarity, where LenP is the number of subject words in the target article, and Same(P, C_k) is the number of subject words of the target article, or synonyms of those subject words, that appear in the k-th comment to be identified.
The method according to any one of claims 1-3, wherein performing subject word expansion on the subject word set according to the identified normal comments comprises:

Obtaining the high-frequency words included in the identified normal comments, adding the high-frequency words to the subject word set as new subject words, and setting weights for the newly added subject words;

Counting the frequency of occurrence, in the target article, of each subject word included in the new subject word set, obtaining, by matching in a co-occurrence word thesaurus according to the frequency of occurrence, co-occurrence words associated with at least one subject word in the new subject word set, adding the co-occurrence words to the subject word set as new subject words, and setting weights for the newly added subject words.
The method according to any one of claims 1-3, wherein, according to the identified normal comments, updating the calculation weights of the plurality of subject terms in the subject term set includes:

use the formula
Update the calculation weight of each subject term in the subject term set;

Here, Weight'(i) is the updated calculation weight of word i; word i is a subject word in the subject word set of the target article; the synonym i' is a synonym of word i that appears in the normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in the normal comments after K rounds of similarity comparison; T(k) is the number of normal comments that include word i after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of the synonym i' of word i in all normal comments; Sim(i, i') is the similarity score between word i and synonym i'; and μ is an adjustment factor greater than 0.
An apparatus for identifying spam comments, comprising:

A similarity calculation module, configured to obtain a plurality of comments to be identified and a subject word set corresponding to a target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of a plurality of subject words in the subject word set;

A comment identification module, configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;

A subject word set update module, configured to, when it is determined that comments that have not been successfully identified exist among the plurality of comments to be identified, perform, according to the identified normal comments, at least one of the following operations on the subject word set to obtain a new subject word set as the subject word set: performing subject word expansion on the subject word set, and updating the calculation weights of the plurality of subject words in the subject word set; and to update the plurality of comments to be identified to the comments to be identified that have not been successfully identified;

A comment success identification module, configured to return to the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject word in the subject word set, until all the comments to be identified are successfully identified.
A computer device comprising:

at least one processor;

A storage device configured to store at least one program, wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the method for identifying spam comments according to any one of claims 1-7.
A computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for identifying spam comments according to any one of claims 1-7 is implemented.